hikui closed this issue 1 year ago
Your understanding of the workflow is correct. So far I haven't encountered this problem, even though we process over 20 M messages a day on the platform where we use erlkaf. But it's true that all our events are designed so that handling one takes no more than a few seconds.
I don't have much time at the moment to look into this, but consumer_queue_cleanup is also called when Erlang releases the state of the killed process. So even if nothing leaks, the cleanup might run after the rebalancing has already taken place, and that might lead to this problem.
I wonder whether changing the code to kill the process with a normal reason would fix your problem.
Merged in master.
erlkaf version: 2.0.8
Hi Silviu,
Thanks for your work.
We are currently using erlkaf in our server. However, I noticed that sometimes, when a partition rebalance is triggered, some partition consumers get stuck. This happens when a partition consumer is doing heavy work that takes a long time.
After digging deeper into erlkaf, I found that when `:revoke_partition` is received by the consumer group, it tries to stop all partition consumers. It does so by sending a stop message to each partition consumer and waiting up to 5s. Since we have a heavy consumer running a task that takes longer than 5s, that consumer is then force-killed. When this happens, `consumer_queue_cleanup` is not run. I suspect this may be related to the stuck-partition issue. Not sure if my understanding is correct.
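
To make the flow described above concrete, here is a minimal, self-contained sketch of the "send stop, wait, then force kill" pattern. This is not erlkaf's actual code: the module name `stop_sketch`, the functions `start_worker/1` and `stop_with_timeout/2`, and the short timeouts (used instead of 5s so the demo finishes quickly) are all assumptions made for illustration. The worker only checks its mailbox after finishing its current task, which is roughly how a busy consumer would behave.

```erlang
-module(stop_sketch).
-export([start_worker/1, stop_with_timeout/2, demo/0]).

%% Hypothetical worker: simulates handling one message that takes WorkMs
%% milliseconds. It only reads its mailbox once the work is done, so a
%% stop request cannot interrupt the work in progress.
start_worker(WorkMs) ->
    spawn(fun() ->
        timer:sleep(WorkMs),
        receive
            stop ->
                %% Stand-in for the cleanup step that a graceful stop runs.
                io:format("worker ~p: cleanup ran~n", [self()])
        end
    end).

%% Ask Pid to stop, wait up to TimeoutMs, then fall back to a hard kill.
stop_with_timeout(Pid, TimeoutMs) ->
    Ref = erlang:monitor(process, Pid),
    Pid ! stop,
    receive
        {'DOWN', Ref, process, Pid, _Reason} ->
            graceful
    after TimeoutMs ->
        %% A kill signal cannot be trapped, so the worker's own cleanup
        %% code never gets a chance to run.
        exit(Pid, kill),
        receive
            {'DOWN', Ref, process, Pid, killed} -> killed;
            %% The worker finished by itself just before the kill arrived.
            {'DOWN', Ref, process, Pid, _Other} -> graceful
        end
    end.

%% A fast worker stops gracefully; a slow one hits the timeout and is killed.
demo() ->
    Fast = stop_with_timeout(start_worker(10), 500),
    Slow = stop_with_timeout(start_worker(2000), 100),
    {Fast, Slow}.
```

Calling `stop_sketch:demo()` returns `{graceful, killed}`, and the "cleanup ran" line is printed only for the fast worker. The slow worker corresponds to the heavy consumer in this report: it is killed while still busy, so whatever cleanup it would have done on a normal stop is skipped.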