%7|Tue Sep 6 17:08:12 2022.168|RECV|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/3: Received InitProducerIdResponse (v1, 16 bytes, CorrId 4, rtt 20.68ms)
%7|Tue Sep 6 17:08:12 2022.168|GETPID|rdkafka#producer-1| [thrd:main]: TxnCoordinator/3: Failed to acquire PID: Broker: Producer attempted a transactional operation in an invalid state
%7|Tue Sep 6 17:08:12 2022.168|IDEMPSTATE|rdkafka#producer-1| [thrd:main]: Idempotent producer state change WaitPID -> RequestPID
%4|Tue Sep 6 17:08:12 2022.168|GETPID|rdkafka#producer-1| [thrd:main]: Failed to acquire transactional PID from broker TxnCoordinator/3: Broker: Producer attempted a transactional operation in an invalid state: retrying
%7|Tue Sep 6 17:08:12 2022.168|TXN|rdkafka#producer-1| [thrd:main]: Starting PID FSM timer: Failed to acquire transactional PID from broker TxnCoordinator/3: Broker: Producer attempted a transactional operation in an invalid state
The producer is unable to write anything, because the broker keeps recording the following warning:
Sep 6 14:46:15 redpanda-5 rpk[278920]: WARN 2022-09-06 16:46:15,934 [shard 0] tx - tx_gateway_frontend.cc:776 - got error cluster::tx_errc:19 on rolling previous tx.id={librdkafka_transactions_example_16807} with status=3
Sep 6 14:46:16 redpanda-5 rpk[278920]: WARN 2022-09-06 16:46:16,470 [shard 0] tx - tx_gateway_frontend.cc:776 - got error cluster::tx_errc:19 on rolling previous tx.id={librdkafka_transactions_example_16807} with status=3
Sep 6 14:46:16 redpanda-5 rpk[278920]: WARN 2022-09-06 16:46:16,982 [shard 0] tx - tx_gateway_frontend.cc:776 - got error cluster::tx_errc:19 on rolling previous tx.id={librdkafka_transactions_example_16807} with status=3
Sep 6 14:46:17 redpanda-5 rpk[278920]: WARN 2022-09-06 16:46:17,519 [shard 0] tx - tx_gateway_frontend.cc:776 - got error cluster::tx_errc:19 on rolling previous tx.id={librdkafka_transactions_example_16807} with status=3
Sep 6 14:46:18 redpanda-5 rpk[278920]: WARN 2022-09-06 16:46:18,041 [shard 0] tx - tx_gateway_frontend.cc:776 - got error cluster::tx_errc:19 on rolling previous tx.id={librdkafka_transactions_example_16807} with status=3
Sep 6 14:46:18 redpanda-5 rpk[278920]: WARN 2022-09-06 16:46:18,568 [shard 0] tx - tx_gateway_frontend.cc:776 - got error cluster::tx_errc:19 on rolling previous tx.id={librdkafka_transactions_example_16807} with status=3
Adjusting the expiration to 5 minutes doesn't help abort this transaction either; the lock to purge it is acquired, but the group is never purged:
PROPERTY PRIOR NEW
transactional_id_expiration_ms 604800000 300000
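(For reference, a change like this would normally be applied with `rpk cluster config set transactional_id_expiration_ms 300000`; the exact command used isn't shown here, only the resulting PRIOR/NEW values.)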
Sep 6 15:00:17 redpanda-4 rpk[278488]: TRACE 2022-09-06 17:00:17,015 [shard 1] tx - tx_gateway_frontend.cc:44 - got_lock name:expire_old_tx, tx_id:{librdkafka_transactions_example_16807}
Sep 6 15:00:17 redpanda-4 rpk[278488]: TRACE 2022-09-06 17:00:17,015 [shard 1] tx - tx_gateway_frontend.cc:48 - released_lock name:expire_old_tx, tx_id:{librdkafka_transactions_example_16807}
Sep 6 15:02:47 redpanda-4 rpk[278488]: TRACE 2022-09-06 17:02:47,019 [shard 1] tx - tx_gateway_frontend.cc:44 - got_lock name:expire_old_tx, tx_id:{librdkafka_transactions_example_16807}
Sep 6 15:02:47 redpanda-4 rpk[278488]: TRACE 2022-09-06 17:02:47,019 [shard 1] tx - tx_gateway_frontend.cc:48 - released_lock name:expire_old_tx, tx_id:{librdkafka_transactions_example_16807}
Sep 6 15:05:17 redpanda-4 rpk[278488]: TRACE 2022-09-06 17:05:17,026 [shard 1] tx - tx_gateway_frontend.cc:44 - got_lock name:expire_old_tx, tx_id:{librdkafka_transactions_example_16807}
Sep 6 15:05:17 redpanda-4 rpk[278488]: TRACE 2022-09-06 17:05:17,026 [shard 1] tx - tx_gateway_frontend.cc:48 - released_lock name:expire_old_tx, tx_id:{librdkafka_transactions_example_16807}
Sep 6 15:07:47 redpanda-4 rpk[278488]: TRACE 2022-09-06 17:07:47,031 [shard 1] tx - tx_gateway_frontend.cc:44 - got_lock name:expire_old_tx, tx_id:{librdkafka_transactions_example_16807}
Sep 6 15:07:47 redpanda-4 rpk[278488]: TRACE 2022-09-06 17:07:47,032 [shard 1] tx - tx_gateway_frontend.cc:48 - released_lock name:expire_old_tx, tx_id:{librdkafka_transactions_example_16807}
EDIT: the only way I managed to clean this up was to wipe the entire cluster's data; the problem reproduced again, so I created https://github.com/redpanda-data/redpanda/issues/6327
Steps to reproduce
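The transactional.id in the logs suggests the client is librdkafka's transactions example (a consume-transform-produce loop). A minimal sketch of one iteration of such a loop is shown below, with names and error handling simplified; it is illustrative only, not the exact reproduction code. The debug log that follows captures one such iteration, including the send_offsets_to_transaction and commit_transaction calls.

```c
/* Minimal sketch of one transaction iteration, assuming an already
 * created transactional producer and consumer (illustrative only,
 * not the exact reproduction code). */
#include <librdkafka/rdkafka.h>
#include <stdio.h>

static int run_one_transaction(rd_kafka_t *producer, rd_kafka_t *consumer,
                               rd_kafka_topic_partition_list_t *offsets) {
        rd_kafka_error_t *error;
        rd_kafka_consumer_group_metadata_t *cgmd;
        rd_kafka_resp_err_t err;

        /* Txn state Ready -> InTransaction */
        error = rd_kafka_begin_transaction(producer);
        if (error)
                goto fail;

        /* Produce the transformed message(s) inside the transaction. */
        err = rd_kafka_producev(producer,
                                RD_KAFKA_V_TOPIC("transaction-output"),
                                RD_KAFKA_V_VALUE("payload", 7),
                                RD_KAFKA_V_MSGFLAGS(RD_KAFKA_MSG_F_COPY),
                                RD_KAFKA_V_END);
        if (err)
                fprintf(stderr, "produce failed: %s\n", rd_kafka_err2str(err));

        /* "Transactional API called: send_offsets_to_transaction" */
        cgmd = rd_kafka_consumer_group_metadata(consumer);
        error = rd_kafka_send_offsets_to_transaction(producer, offsets, cgmd, -1);
        rd_kafka_consumer_group_metadata_destroy(cgmd);
        if (error)
                goto fail;

        /* "Transactional API called: commit_transaction" - librdkafka
         * first flushes outstanding messages, then sends EndTxn. */
        error = rd_kafka_commit_transaction(producer, -1);
        if (error)
                goto fail;

        return 0;

fail:
        fprintf(stderr, "transaction failed: %s\n",
                rd_kafka_error_string(error));
        rd_kafka_error_destroy(error);
        return -1;
}
```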
%7|Wed Sep 7 09:31:50 2022.294|TOPPAR|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: 10.142.65.91:9092/3: transaction-output [2] 1 message(s) in xmit queue (1 added from partition queue)
%7|Wed Sep 7 09:31:50 2022.294|TXNAPI|rdkafka#producer-1| [thrd:app]: Transactional API called: send_offsets_to_transaction (in txn state InTransaction, idemp state Assigned)
%7|Wed Sep 7 09:31:50 2022.294|SEND|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/4: Sent AddOffsetsToTxnRequest (v0, 99 bytes @ 0, CorrId 105)
%7|Wed Sep 7 09:31:50 2022.299|TOPPAR|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: 10.142.65.91:9092/3: transaction-output [2] 1 message(s) in xmit queue (0 added from partition queue)
%7|Wed Sep 7 09:31:50 2022.299|PRODUCE|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: 10.142.65.91:9092/3: transaction-output [2]: Produce MessageSet with 1 message(s) (71 bytes, ApiVersion 7, MsgVersion 2, MsgId 628, BaseSeq 627, PID{Id:1,Epoch:4}, uncompressed)
%7|Wed Sep 7 09:31:50 2022.299|SEND|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: 10.142.65.91:9092/3: Sent ProduceRequest (v7, 173 bytes @ 0, CorrId 50)
%7|Wed Sep 7 09:31:50 2022.307|RECV|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/4: Received AddOffsetsToTxnResponse (v0, 6 bytes, CorrId 105, rtt 12.20ms)
%7|Wed Sep 7 09:31:50 2022.307|ADDOFFSETS|rdkafka#producer-1| [thrd:main]: AddOffsetsToTxn response from TxnCoordinator/4: NO_ERROR ()
%7|Wed Sep 7 09:31:50 2022.307|SEND|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: 10.142.65.91:9092/3: Sent TxnOffsetCommitRequest (v3, 192 bytes @ 0, CorrId 51)
%7|Wed Sep 7 09:31:50 2022.308|RECV|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: 10.142.65.91:9092/3: Received ProduceResponse (v7, 62 bytes, CorrId 50, rtt 8.87ms)
%7|Wed Sep 7 09:31:50 2022.308|REQERR|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: 10.142.65.91:9092/3: ProduceRequest failed: Broker: Not leader for partition: actions Refresh,MsgNotPersisted
%7|Wed Sep 7 09:31:50 2022.308|MSGSET|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: 10.142.65.91:9092/3: transaction-output [2]: MessageSet with 1 message(s) (MsgId 628, BaseSeq 627) encountered error: Broker: Not leader for partition (actions Refresh,MsgNotPersisted)
%7|Wed Sep 7 09:31:50 2022.308|BROKERUA|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: transaction-output [2]: broker unavailable: produce: Broker: Not leader for partition
%7|Wed Sep 7 09:31:50 2022.308|FASTQUERY|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: Starting fast leader query
%7|Wed Sep 7 09:31:50 2022.308|DRAIN|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: transaction-output [2] beginning partition drain: drain before retrying
%7|Wed Sep 7 09:31:50 2022.308|TOPPAR|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: 10.142.65.91:9092/3: transaction-output [2] all in-flight requests drained from queue
%7|Wed Sep 7 09:31:50 2022.308|TOPPAR|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: 10.142.65.91:9092/3: transaction-output [2] 1 message(s) in xmit queue (1 added from partition queue)
%7|Wed Sep 7 09:31:50 2022.318|RECV|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: 10.142.65.91:9092/3: Received TxnOffsetCommitResponse (v3, 38 bytes, CorrId 51, rtt 10.90ms)
%7|Wed Sep 7 09:31:50 2022.318|TOPPAR|rdkafka#producer-1| [thrd:10.142.65.91:9092/3]: 10.142.65.91:9092/3: transaction-output [2] 1 message(s) in xmit queue (0 added from partition queue)
%7|Wed Sep 7 09:31:50 2022.318|TXNAPI|rdkafka#producer-1| [thrd:app]: Transactional API called: commit_transaction (begin) (in txn state InTransaction, idemp state Assigned)
%7|Wed Sep 7 09:31:50 2022.318|TXNSTATE|rdkafka#producer-1| [thrd:main]: Transaction state change InTransaction -> BeginCommit
%7|Wed Sep 7 09:31:50 2022.318|TXNCOMMIT|rdkafka#producer-1| [thrd:app]: Flushing 58 outstanding message(s) prior to commit
%7|Wed Sep 7 09:31:50 2022.318|WAKEUP|rdkafka#producer-1| [thrd:app]: 10.142.65.91:9092/3: Wake-up: flushing
%7|Wed Sep 7 09:31:50 2022.318|WAKEUP|rdkafka#producer-1| [thrd:app]: 10.142.65.124:9092/1: Wake-up: flushing
%7|Wed Sep 7 09:31:50 2022.318|WAKEUP|rdkafka#producer-1| [thrd:app]: 10.142.65.152:9092/4: Wake-up: flushing
%7|Wed Sep 7 09:31:50 2022.318|WAKEUP|rdkafka#producer-1| [thrd:app]: 10.142.65.14:9092/0: Wake-up: flushing
%7|Wed Sep 7 09:31:50 2022.318|WAKEUP|rdkafka#producer-1| [thrd:app]: TxnCoordinator/4: Wake-up: flushing
A topic partition migration then happens:
%7|Wed Sep 7 09:31:50 2022.573|METADATA|rdkafka#producer-1| [thrd:main]: 10.142.65.91:9092/3: Topic #0/1: transaction-output with 6 partitions
%7|Wed Sep 7 09:31:50 2022.573|METADATA|rdkafka#producer-1| [thrd:main]: Topic transaction-output partition 0 Leader 1
%7|Wed Sep 7 09:31:50 2022.573|METADATA|rdkafka#producer-1| [thrd:main]: Topic transaction-output partition 1 Leader 1
%7|Wed Sep 7 09:31:50 2022.573|METADATA|rdkafka#producer-1| [thrd:main]: Topic transaction-output partition 2 Leader 2
%7|Wed Sep 7 09:31:50 2022.573|TOPICUPD|rdkafka#producer-1| [thrd:main]: Topic transaction-output [2]: migrating from broker 3 to 2 (leader is 2): leader updated
%7|Wed Sep 7 09:31:50 2022.573|BRKDELGT|rdkafka#producer-1| [thrd:main]: transaction-output [2]: delegate to broker 10.142.65.197:9092/2 (rktp 0x14f834e00, term 0, ref 4)
%7|Wed Sep 7 09:31:50 2022.573|BRKDELGT|rdkafka#producer-1| [thrd:main]: transaction-output [2]: no longer delegated to broker 10.142.65.91:9092/3
%7|Wed Sep 7 09:31:50 2022.573|BRKDELGT|rdkafka#producer-1| [thrd:main]: transaction-output [2]: delegating to broker 10.142.65.197:9092/2 for partition with 0 messages (0 bytes) queued
%7|Wed Sep 7 09:31:50 2022.573|BRKMIGR|rdkafka#producer-1| [thrd:main]: Migrating topic transaction-output [2] 0x14f834e00 from 10.142.65.91:9092/3 to 10.142.65.197:9092/2 (sending PARTITION_LEAVE to 10.142.65.91:9092/3)
%7|Wed Sep 7 09:31:50 2022.573|METADATA|rdkafka#producer-1| [thrd:main]: Topic transaction-output partition 3 Leader 0
%7|Wed Sep 7 09:31:50 2022.573|METADATA|rdkafka#producer-1| [thrd:main]: Topic transaction-output partition 4 Leader 4
%7|Wed Sep 7 09:31:50 2022.573|METADATA|rdkafka#producer-1| [thrd:main]: Topic transaction-output partition 5 Leader 3
%7|Wed Sep 7 09:31:50 2022.636|REQERR|rdkafka#producer-1| [thrd:10.142.65.197:9092/2]: 10.142.65.197:9092/2: ProduceRequest failed: Broker: Producer attempted an operation with an old epoch: explicit actions Permanent,MsgNotPersisted
%7|Wed Sep 7 09:31:50 2022.636|MSGSET|rdkafka#producer-1| [thrd:10.142.65.197:9092/2]: 10.142.65.197:9092/2: transaction-output [2]: MessageSet with 1 message(s) (MsgId 628, BaseSeq 627) encountered error: Broker: Producer attempted an operation with an old epoch (actions Permanent,MsgNotPersisted)
%1|Wed Sep 7 09:31:50 2022.636|TXNERR|rdkafka#producer-1| [thrd:10.142.65.197:9092/2]: Fatal transaction error: ProduceRequest for transaction-output [2] with 1 message(s) failed: Broker: Producer attempted an operation with an old epoch (broker 2 PID{Id:1,Epoch:4}, base seq 627): transactional producer fenced by newer producer instance (_FENCED)
%7|Wed Sep 7 09:31:50 2022.636|FATAL|rdkafka#producer-1| [thrd:10.142.65.197:9092/2]: Fatal error: Local: This instance has been fenced by a newer instance: ProduceRequest for transaction-output [2] with 1 message(s) failed: Broker: Producer attempted an operation with an old epoch (broker 2 PID{Id:1,Epoch:4}, base seq 627): transactional producer fenced by newer producer instance
%3|Wed Sep 7 09:31:50 2022.636|ERROR|rdkafka#producer-1| [thrd:10.142.65.197:9092/2]: Fatal error: Local: This instance has been fenced by a newer instance: ProduceRequest for transaction-output [2] with 1 message(s) failed: Broker: Producer attempted an operation with an old epoch (broker 2 PID{Id:1,Epoch:4}, base seq 627): transactional producer fenced by newer producer instance
%7|Wed Sep 7 09:31:50 2022.636|TXNAPI|rdkafka#producer-1| [thrd:app]: Transactional API called: commit_transaction (in txn state FatalError, idemp state FatalError)
%7|Wed Sep 7 09:31:50 2022.636|PURGE|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/4: Purging queues with flags queue,non-blocking
%7|Wed Sep 7 09:31:50 2022.636|TXN|rdkafka#producer-1| [thrd:10.142.65.197:9092/2]: Starting PID FSM timer (fire immediately): Drain done
%7|Wed Sep 7 09:31:50 2022.636|PURGE|rdkafka#producer-1| [thrd:10.142.65.124:9092/1]: transaction-output [0]: purging queues (purge_flags 0x5, include xmit_msgq)
%7|Wed Sep 7 09:31:50 2022.636|PURGE|rdkafka#producer-1| [thrd:10.142.65.124:9092/1]: transaction-output [1]: purging queues (purge_flags 0x5, include xmit_msgq)
%7|Wed Sep 7 09:31:50 2022.636|PURGEQ|rdkafka#producer-1| [thrd:10.142.65.124:9092/1]: 10.142.65.124:9092/1: Purged 0 message(s) from 0 partition(s)
%7|Wed Sep 7 09:31:50 2022.636|PURGEQ|rdkafka#producer-1| [thrd:main]: Purged 0 message(s) from 0 UA-partition(s)
%7|Wed Sep 7 09:31:50 2022.636|PURGE|rdkafka#producer-1| [thrd:10.142.65.197:9092/2]: 10.142.65.197:9092/2: Purging queues with flags queue,non-blocking
%7|Wed Sep 7 09:31:50 2022.636|PURGEQ|rdkafka#producer-1| [thrd:10.142.65.197:9092/2]: 10.142.65.197:9092/2: Purged 0 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
%7|Wed Sep 7 09:31:50 2022.637|PURGE|rdkafka#producer-1| [thrd:10.142.65.197:9092/2]: transaction-output [2]: purging queues (purge_flags 0x5, include xmit_msgq)
%7|Wed Sep 7 09:31:50 2022.636|PURGE|rdkafka#producer-1| [thrd:10.142.65.152:9092/4]: 10.142.65.152:9092/4: Purging queues with flags queue,non-blocking
FATAL ERROR: Failed to commit transaction: _FENCED: ProduceRequest for transaction-output [2] with 1 message(s) failed: Broker: Producer attempted an operation with an old epoch (broker 2 PID{Id:1,Epoch:4}, base seq 627): transactional producer fenced by newer producer instance
%7|Wed Sep 7 09:31:50 2022.637|PURGEQ|rdkafka#producer-1| [thrd:10.142.65.197:9092/2]: 10.142.65.197:9092/2: Purged 0 message(s) from 0 partition(s)
%7|Wed Sep 7 09:31:50 2022.636|PURGE|rdkafka#producer-1| [thrd:10.142.65.14:9092/bootstrap]: 10.142.65.14:9092/0: Purging queues with flags queue,non-blocking
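The FATAL ERROR the application prints matches librdkafka's documented error classes for commit_transaction(): a _FENCED error is fatal, so it can neither be retried nor aborted, and the only recovery is to destroy the producer instance and create a new one. A rough sketch of that classification (assumed handling, not the reproduction code):

```c
#include <librdkafka/rdkafka.h>
#include <stdio.h>

/* Rough sketch: classify the error returned by commit_transaction().
 * The run in the log above hits the fatal (_FENCED) branch. */
static void commit_and_handle(rd_kafka_t *producer) {
        rd_kafka_error_t *error = rd_kafka_commit_transaction(producer, -1);

        if (!error)
                return; /* committed */

        if (rd_kafka_error_is_retriable(error)) {
                /* Transient: the commit may simply be retried. */
                rd_kafka_error_destroy(error);
        } else if (rd_kafka_error_txn_requires_abort(error)) {
                /* Abortable: roll back and start a new transaction. */
                rd_kafka_error_destroy(error);
                error = rd_kafka_abort_transaction(producer, -1);
                if (error)
                        rd_kafka_error_destroy(error);
        } else {
                /* Fatal, e.g. _FENCED ("fenced by newer producer
                 * instance"): this producer handle is unusable and must
                 * be destroyed; recovery requires a new producer. */
                fprintf(stderr, "FATAL ERROR: Failed to commit transaction: %s\n",
                        rd_kafka_error_string(error));
                rd_kafka_error_destroy(error);
                rd_kafka_destroy(producer);
        }
}
```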
At the same time, a leadership transfer happened on the broker side:
Sep 7 07:31:50 redpanda-4 rpk[290228]: TRACE 2022-09-07 09:31:50,009 [shard 0] raft - [group_id:28, {kafka/transaction-output/2}] consensus.cc:1474 - Vote request: {node_id: {id: {2}, revision: {140}}, target_node_id{id: {3}, revision: {140}}, group: {28}, term:{2}, prev_log_index:{640}, prev_log_term: {1}, leadership_xfer: 1}
Sep 7 07:31:50 redpanda-4 rpk[290228]: INFO 2022-09-07 09:31:50,009 [shard 0] raft - [group_id:28, {kafka/transaction-output/2}] consensus.cc:1566 - Received vote request with larger term from node {id: {2}, revision: {140}}, received 2, current 1
Sep 7 07:31:50 redpanda-4 rpk[290228]: TRACE 2022-09-07 09:31:50,009 [shard 1] raft - [group_id:4, {kafka/__consumer_offsets/3}] consensus.cc:1474 - Vote request: {node_id: {id: {2}, revision: {84}}, target_node_id{id: {3}, revision: {56}}, group: {4}, term:{7}, prev_log_index:{19}, prev_log_term: {6}, leadership_xfer: 1}
Sep 7 07:31:50 redpanda-4 rpk[290228]: INFO 2022-09-07 09:31:50,009 [shard 1] raft - [group_id:4, {kafka/__consumer_offsets/3}] consensus.cc:1566 - Received vote request with larger term from node {id: {2}, revision: {84}}, received 7, current 6
Attached logs: transactions_20220907_0921_after_enabling_rack.txt, redpanda_cluster_log.txt.gz
Sounds like https://github.com/redpanda-data/redpanda/issues/6213 to me. @bharathv wdyt?
Yes, it seems like leadership changes are contributing to this (part of the problem). Some state can be lost if a leadership change happens on the tx coordinator or data partitions (#6213, #6214) while transactions are in progress, resulting in odd errors like unknown_server. The situation is probably exacerbated by the leadership balancer, which moves leaders around the cluster. We are working on changes to improve the behavior in such cases and provide a better user experience.
@bharathv where are we with this?
I think this can be closed. @freef4ll, agree?
Yes, I think it can be closed, as it should be fixed by https://github.com/redpanda-data/redpanda/pull/7061 and https://github.com/redpanda-data/redpanda/pull/7695
Who is this for and what problem do they have today?
Rack awareness with transactions currently appears not to work. Debug logs generated by librdkafka for a cluster running `enable_transactions: true` and `enable_rack_awareness: true` show transactions working more flakily than without rack awareness.
Steps to reproduce: (see above)
Why is solving this problem impactful?
Distributed applications will use both features together to guarantee the correctness and availability of the data they store.