Open langesven opened 3 years ago
Hi, I have the same problem, but with ConsumerGroup. I solved it by giving the consumer groups for different topics (e.g. PartitionResize107 and PartitionResize108) different IDs. If you use the same GroupID for different topics, try setting unique GroupIDs.
Describe the bug The library notices that the number of partitions on a topic has changed (via the PartitionWatcher), but the subsequent stop/join flow doesn't produce a new generation of the ConsumerGroup. As a result, the consumer does not subscribe to the new partitions and doesn't consume any messages from them.
Kafka Version AWS MSK with kafka version 2.2.1
To Reproduce
Expected behavior I'd expect the partition watcher to always kick off a rebalance, after which the ConsumerGroup also consumes messages from the new partitions it discovered on the topic.
Additional context We've observed this while resizing partitions on a few topics in our cluster. Applications suddenly started skipping events at random with no clear reason why; no (visible) lag was building up, but things did not go as expected. It later turned out that the ConsumerGroups simply didn't subscribe to the new partitions, at least not all of them.
I've spent a bit of time debugging this on our end and I'm at a loss by now, hence the issue. We were using kafka-go in version 0.3.10 originally but I've also verified the same behaviour still happens for us in version 0.4.15.
Let me show you a snippet of the logs of what I can see happening. This is the output of the client, right after I resized the partitions on topics PartitionResize107 and PartitionResize108. We can see that the partition watcher correctly recognized that the partitions on the two topics changed and that a rebalance needs to happen. Then we can see that it rejoins the group twice (once for each of the planned rebalances), both times in the same generation, which is also the generation the ConsumerGroup was already in prior to the resize. As such, the ConsumerGroup is not subscribed to the new partitions on my test topics and completely ignores any messages sent to them. Once I restart the application, the ConsumerGroup joins as generation 340 and picks up the new partitions. If I then resize one of the topics again, usually the same thing happens and it doesn't pick them up.

My reader looks like this
and it's basically running in a loop to fetch stuff
I've read most of the issues here that relate to partitions and couldn't find similar problems that weren't explained by "you forgot to turn the partition watcher on", so I wouldn't be completely surprised if we're somehow using something wrong. Unless resizing partitions is such a rare thing that people don't run into this frequently. I've been wondering whether this relates to what is described at https://github.com/segmentio/kafka-go/blob/master/consumergroup.go#L341-L353 and in #357: maybe our client implementation doesn't allow the ConsumerGroup context to end, so the group never exits the current generation and stays in it instead of triggering a rebalance?
Sometimes it randomly works, but I can't reproduce that on demand. Sometimes I can resize a partition and it is picked up: a rebalance actually happens (I can confirm this with the Kafka broker logs, which show rebalance activity, unlike in the case pasted above) and the client is then, as expected, subscribed to the new partitions. I can't explain when this happens, and I can't explain why :/