Potential consumer queue cleanup issue?

hikui commented 1 year ago

erlkaf version: 2.0.8

Hi Silviu,

Thanks for your work.

We are currently using erlkaf in our server. However, I noticed sometimes when partition rebalance is triggered, some partition consumers may get stuck. This happens when a partition consumer is doing heavy work that takes long time.

After dig deeper in erlkaf, I found that when :revoke_partition is received by the consumer group, it tries to stop all partition consumers. It does so by sending a stop message to each partition consumer and wait for 5s. Now that we have a heavy consumer that is doing a task longer than 5s, it will then be force killed. When this happens, consumer_queue_cleanup is not run. I suspect it may be related to partition stuck issue.

Not sure if my understanding is correct.

silviucpp commented 1 year ago

Your understanding on the working flow it's correct. So far I didn't encounter this problem even if I process over 20 M messages a day on the platform where we are using erlkaf. But it's true that all our events are designed in a way that doesn't take more than few seconds.

I don't have too much time in this moment to look into this but calling of consumer_queue_cleanup it's taking also place when the erlang is releasing the state of the killed process. And even if it doesn't leak this might get called after the re-balancing is taking place and might lead to this problem.

I'm wonder if we change the code to kill the process with normal reason will fix your problem.

silviucpp commented 1 year ago

Merged in master.

silviucpp / erlkaf

Potential consumer queue cleanup issue? #47