Closed: Aerdayne closed this 3 months ago
I can confirm that a subsequent dispatch after a raised error indeed causes an issue on the next dispatch, though I am getting a different error: `out_of_order_sequence_number`. Nonetheless, the state is indeed leaking out of the failed operation.
FWIW, on my end the listener detects:

```
E, [2024-06-06T12:39:41.215598 #74403] ERROR -- : rdkafka: [thrd:127.0.0.1:9092/1]: Current transaction failed in state InTransaction: skipped sequence numbers (OUT_OF_ORDER_SEQUENCE_NUMBER, requires epoch bump)
```

although it does not raise it.
Yes, I can now repro both things. This is indeed not raised because it is intermediate and goes via the events listener. I also get the error you mentioned. Let me reply to you on that:
> in this case when the message passes the local WaterDrop payload size validation due to a misconfiguration
Well, it is on you ;) There were reasons for me to introduce this check.
> Or should the producers be discarded instead?
Since it is working, I can assume it would continue to work, though I do not like a solution that always retries operations of this nature. I think what needs to happen here is a retry conditional on the failure of the last transaction, since transactional state does not have race conditions. For now it should be granular and limited to the queue purge that needs to happen on broker errors.
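To illustrate the "granular, conditional retry" idea, here is a minimal sketch. Note that `PurgeQueueError` and `with_purge_retry` are illustrative stand-ins for this discussion, not the actual WaterDrop API: only the narrow purge-queue case is retried, and everything else is surfaced immediately.

```ruby
# Illustrative stand-in for the purge-queue failure raised after a
# broker error; in real code this would be matched against the
# Rdkafka::RdkafkaError code instead.
class PurgeQueueError < StandardError; end

# Retries the block at most `max_attempts` times, but only when the
# failure is the purge-queue case. Any other error is re-raised
# immediately, so retries stay limited to this one scenario.
def with_purge_retry(max_attempts: 2)
  attempts = 0
  begin
    yield
  rescue PurgeQueueError
    attempts += 1
    raise if attempts >= max_attempts
    retry
  end
end
```

Usage would look like wrapping the transactional dispatch in `with_purge_retry { producer.transaction { ... } }`, so one purge-induced failure is absorbed while unrelated errors still propagate.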
OK, I know what is happening. The PID is updated because of the broker error and on the next request it needs to be bumped. In-flight requests need to be drained for that to happen, hence the `purge_queue`. In theory, after the bump things should work; however, the general recommendation from the librdkafka code is to treat those errors as fatal and reload the producer. What that means is that any broker error that bubbles up should cause a client reload.
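A rough sketch of that "treat as fatal, reload the client" strategy, using stand-ins: `BrokerError`, `ReloadingProducer`, and the factory block are hypothetical names for this discussion; a real reload would recreate the underlying rdkafka client.

```ruby
# Illustrative stand-in for a broker-side error bubbling up.
class BrokerError < StandardError; end

# Wraps a client so that any broker error discards the current instance
# and builds a fresh one before the error is surfaced to the caller.
class ReloadingProducer
  def initialize(&factory)
    @factory = factory
    @client = factory.call
  end

  def with_client
    yield @client
  rescue BrokerError
    @client.close if @client.respond_to?(:close)
    @client = @factory.call
    raise
  end
end
```

The caller still sees the error, but the next operation runs against a fresh client, so no poisoned transactional state (like the stuck PID above) survives the failure.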
@Aerdayne here's my proposition for how to fix this: https://github.com/karafka/waterdrop/pull/503
It works well. Can you confirm? (I'll add relevant specs soon)
Confirming that the fix works. Thanks!
It seems that when a transactional producer fails to produce messages (in this case, when a message passes the local WaterDrop payload size validation due to a misconfiguration and is then rejected by rdkafka's message size validation), any subsequent usage of that producer needs to be wrapped in a retry, as it will always raise a

```
#<Rdkafka::RdkafkaError: Local: Purged in queue (purge_queue)>
```

exception. Is this the intended behaviour? If so, then even if there are multiple transactional producers wrapped in a connection pool, as the documentation recommends, would this mean that all message producing should be wrapped in a retry? Or should the producers be discarded instead?
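For the "discarded instead" alternative, a minimal sketch of what that could look like with a pool: `ProducerPool` here is a hypothetical stand-in (not the connection_pool gem or the WaterDrop API) whose only point is that a producer which raised is replaced rather than returned for reuse.

```ruby
# Hypothetical pool that discards a producer on failure instead of
# retrying with it. On error, a fresh instance takes the failed
# producer's slot before the error propagates.
class ProducerPool
  def initialize(size, &factory)
    @factory = factory
    @pool = Array.new(size) { factory.call }
  end

  def with
    producer = @pool.pop
    yield producer
  rescue StandardError
    # Discard the failed producer; build a replacement for its slot.
    producer = @factory.call
    raise
  ensure
    @pool.push(producer)
  end
end
```

This trades the cost of rebuilding a producer (and its transactional session) for never reusing an instance whose state may have leaked out of a failed operation.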
Here's a reproduction script that should be run against the current main branch.
OS: Darwin Kernel Version 22.6.0
librdkafka version: 2.1.1_1
Kafka was run locally via WaterDrop's compose file.
Output (consumer output omitted, the final message will obviously be printed out):