Describe the bug
Given a consumer that consumes from fewer partitions than there are nodeIds in the Kafka cluster, kafkajs does not remove this consumer from its group even if it blocks for longer than the session timeout.
This is because kafkajs creates fetchers for all nodeIds in the cluster.
For a fetcher whose nodeId is not a partition leader, ConsumerGroup.fetch(nodeId) filters out all topic partitions for this nodeId, which results in an empty requests array.
It then sleeps for maxWaitTime and returns an empty batch (basically simulating a broker fetch to an empty partition).
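The mechanism can be modelled in a few lines. The following is only a simplified sketch, not KafkaJS source code: fetchFromNode, runFetcher, demo and the ctx object are hypothetical names, and the model assumes, as reported in this issue, that a heartbeat follows every fetch, including one that returned an empty batch.

```javascript
// Simplified model of the behaviour described in this issue.
// This is NOT KafkaJS source code; every name here is hypothetical.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Models ConsumerGroup.fetch(nodeId): keep only the partitions led by nodeId.
// If none remain, wait maxWaitTime and return an empty list of batches.
async function fetchFromNode(nodeId, assigned, leaderOf, maxWaitTime) {
  const requests = assigned.filter((partition) => leaderOf[partition] === nodeId);
  if (requests.length === 0) {
    await sleep(maxWaitTime);
    return [];
  }
  return requests.map((partition) => ({ partition, messages: ['payload'] }));
}

// Models one fetcher loop: fetch, hand each batch to the handler, heartbeat.
async function runFetcher(nodeId, ctx, rounds) {
  for (let i = 0; i < rounds; i++) {
    const batches = await fetchFromNode(nodeId, ctx.assigned, ctx.leaderOf, ctx.maxWaitTime);
    for (const batch of batches) {
      await ctx.handler(batch); // a stuck handler blocks only this fetcher
    }
    ctx.heartbeats.push(nodeId); // recorded even when batches is empty
  }
}

async function demo() {
  const ctx = {
    assigned: [0],      // the consumer owns a single partition
    leaderOf: { 0: 0 }, // whose leader is node 0
    maxWaitTime: 5,     // milliseconds, kept short for the demo
    heartbeats: [],
    handler: () => new Promise(() => {}), // never resolves: a stuck eachMessage
  };
  runFetcher(0, ctx, 3);       // hangs in the handler, never heartbeats
  await runFetcher(1, ctx, 3); // node 1 leads nothing, yet keeps heartbeating
  return ctx.heartbeats;       // contains only entries for node 1
}
```

In this model the fetcher for node 0 never gets past the stuck handler, while the fetcher for node 1 completes every round and records a heartbeat each time, which matches the "fetch from nodeId: 1" / "heartbeat" pattern in the observed log.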
Expected behavior
Rebalance after eachMessage handler has been stuck longer than the session timeout.
The stuck consumer should be removed from its group.
Observed behavior
The consumer keeps sending heartbeats to the broker nodeId that is not a partition leader.
This keeps the hanging consumer in the group.
{ value: 'Message number 17' }
fetch from nodeId: 1
heartbeat
{ value: 'Message number 19' }
Sleep 3 minutes.
fetch from nodeId: 1
heartbeat
fetch from nodeId: 1
heartbeat
Environment:
OS: Mac OS Ventura 13.4
KafkaJS version 2.2.4
Kafka version confluentinc/cp-kafka:5.3.1
NodeJS version v20.3.1
Additional context
The empty batch returned by ConsumerGroup.fetch will cause Runner.fetch(nodeId) to send a heartbeat.
To Reproduce
To reproduce this behaviour you can, for example, use a topic with 2 partitions and 1 consumer, or a topic with 3 partitions and 3 consumers.
In both cases, block one consumer in its eachMessage or eachBatch handler for longer than the session timeout.
See https://github.com/ckuehne/kafkajs/blob/47cca410b6dcf22e6268b70ebe59e6428e52539b/bug/consumerLongPause.js for the whole consumer.
Detailed reproduction hints below.
With 2 partitions and 1 consumer
See https://github.com/ckuehne/kafkajs/blob/47cca410b6dcf22e6268b70ebe59e6428e52539b/bug/reproduce-with-2-partition-topic.sh
With 3 partitions and 3 consumers
See https://github.com/ckuehne/kafkajs/blob/47cca410b6dcf22e6268b70ebe59e6428e52539b/bug/reproduce-with-3-partition-topic.sh
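For readers who do not want to open the linked files, a blocking consumer might look roughly like the sketch below. It is not the linked consumerLongPause.js: makeBlockingHandler and runConsumer are hypothetical helpers, and the broker address, topic, group id, sessionTimeout value and the "block on every second message" rule are all assumptions chosen for illustration. The 180000 ms pause mirrors the "Sleep 3 minutes." line in the observed log.

```javascript
// Hypothetical reproduction sketch; not the consumer linked in this issue.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Builds an eachMessage handler that logs every message and blocks for
// blockMs on every blockEvery-th message (both values are assumptions).
function makeBlockingHandler({ blockEvery, blockMs, log = console.log, pause = sleep }) {
  let seen = 0;
  return async ({ message }) => {
    seen += 1;
    log({ value: message.value.toString() });
    if (seen % blockEvery === 0) {
      log(`Sleep ${blockMs} ms.`);
      await pause(blockMs); // intended to exceed the session timeout
    }
  };
}

// Wires the handler into KafkaJS. Broker address, topic and group id are
// placeholders; adjust them to the cluster under test.
async function runConsumer() {
  const { Kafka } = require('kafkajs'); // loaded lazily so the handler can be tested alone
  const kafka = new Kafka({ clientId: 'stuck-consumer', brokers: ['localhost:9092'] });
  const consumer = kafka.consumer({ groupId: 'stuck-group', sessionTimeout: 30000 });
  await consumer.connect();
  await consumer.subscribe({ topics: ['test-topic'], fromBeginning: true });
  await consumer.run({
    eachMessage: makeBlockingHandler({ blockEvery: 2, blockMs: 180000 }),
  });
}
```

Calling runConsumer() against a cluster with more brokers than the consumer has assigned partitions should show whether the group rebalances once the pause exceeds the session timeout. The log and pause parameters exist only so the handler can be exercised without a broker.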