tulios / kafkajs

A modern Apache Kafka client for node.js
https://kafka.js.org
MIT License

kafkajs holding one partition but not fetching from it #1456

Open chengB12 opened 1 year ago

chengB12 commented 1 year ago

Describe the bug I have 2 load-balancing instances consuming a Kafka topic with 2 partitions.

When both started at about the same time, one pod reported getting partition 1 and then ran without any issue. The other pod never generated any Kafka logs, had no connection error log, and never fetched anything. Apparently, though, it was still holding partition 0, since the first instance never went through a group rebalance to pick up both partitions.

Other non-Kafka operations on the affected instance looked fine.

This situation lasted for hours until I became aware of it and killed and restarted the affected instance. The new instance got partition 0 on start and worked fine.

My guess is that some communication with the brokers failed at some point, but the heartbeat kept running, which makes the brokers treat this consumer as still alive.
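
For anyone debugging this, kafkajs's instrumentation events can distinguish "heartbeating" from "actually fetching", which is exactly the gap described here. A minimal watchdog sketch; the broker address, group id, and 5-minute threshold are placeholder assumptions:

```typescript
import { Kafka } from 'kafkajs'

const kafka = new Kafka({ clientId: 'stall-watchdog', brokers: ['localhost:9092'] })
const consumer = kafka.consumer({ groupId: 'my-group' })

// ... consumer.connect(), subscribe(), and run() as usual ...

let lastFetch = Date.now()

// HEARTBEAT fires when the group coordinator is pinged; FETCH fires when a
// fetch request completes. A consumer can keep heartbeating (so the broker
// considers it alive) while its fetch loop is silently stuck.
consumer.on(consumer.events.HEARTBEAT, () => {
  /* still a group member */
})
consumer.on(consumer.events.FETCH, () => {
  lastFetch = Date.now()
})

// Hypothetical threshold: if no fetch completes for 5 minutes while
// heartbeats continue, exit so the orchestrator restarts the pod and the
// group rebalances the partition to a healthy member.
setInterval(() => {
  if (Date.now() - lastFetch > 5 * 60 * 1000) {
    console.error('consumer appears stalled: still a group member but not fetching')
    process.exit(1)
  }
}, 30_000)
```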

To Reproduce Seems to be a one-off; can't reproduce.

Expected behavior If there is any trouble with the connection to Kafka, it should at least throw an error.

Observed behavior It connected and held the partition but did nothing, without generating any success or failure logs.
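
For reference, kafkajs does report hard failures through the CRASH instrumentation event; the problem reported here is that nothing is raised at all. Continuing the watchdog sketch above, a handler like this at least catches the failures that do surface:

```typescript
// Fires when the consumer gives up (retries exhausted, fatal errors).
// Exiting lets the orchestrator restart the pod and the group rebalance
// the partition. It does not fire in the silent case described above.
consumer.on(consumer.events.CRASH, ({ payload }) => {
  console.error('consumer crashed', payload.error)
  process.exit(1)
})
```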


jp928 commented 1 year ago

I probably have the same problem: within a consumer group with 3 consumers, when I produce messages to different partitions, only partition 0 gets a response.
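
One way to check whether the other consumers actually hold their partitions is to log the assignment each member receives on rebalance. A small sketch, reusing the consumer from the snippet above:

```typescript
// memberAssignment maps each subscribed topic to the partitions this
// member received in the latest rebalance, so it shows whether a consumer
// owns a partition it never seems to read from.
consumer.on(consumer.events.GROUP_JOIN, ({ payload }) => {
  console.log(`member ${payload.memberId} assigned:`, JSON.stringify(payload.memberAssignment))
})
```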

khoinguyen commented 10 months ago

I also get this issue quite often, but randomly; I cannot reliably reproduce it. My topic has 32 partitions, and at random one of the partitions is left behind while all the others are fully consumed and always keep up with the newly produced messages.

There are no errors or exceptions; the eachBatch callback keeps being invoked for the other partitions, and even when those partitions have only one new message, their offsetLag is always 0.

My deployment was 16 pods (on k8s) consuming a 32-partition topic. The number of pods can vary, as we have auto-scaling. Killing the pod holding the stuck partition sometimes does not help.
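
A per-partition lag check via the kafkajs admin client can pinpoint the left-behind partition from outside the consumers. A sketch assuming the v2 admin API; the topic, group id, and broker address are placeholders:

```typescript
import { Kafka } from 'kafkajs'

const kafka = new Kafka({ clientId: 'lag-report', brokers: ['localhost:9092'] })
const admin = kafka.admin()

// Compares the group's committed offset against the log-end offset for
// every partition. A stuck partition shows growing lag while the healthy
// ones stay near 0.
async function reportLag(topic: string, groupId: string) {
  await admin.connect()
  try {
    const logEnd = await admin.fetchTopicOffsets(topic)
    const [committed] = await admin.fetchOffsets({ groupId, topics: [topic] })
    for (const { partition, offset } of committed.partitions) {
      const end = logEnd.find((p) => p.partition === partition)
      if (!end) continue
      // Offsets come back as strings; a committed offset of -1 means
      // nothing has been committed for that partition yet.
      const lag = Number(end.offset) - Math.max(Number(offset), 0)
      console.log(`partition ${partition}: committed=${offset} end=${end.offset} lag=${lag}`)
    }
  } finally {
    await admin.disconnect()
  }
}

reportLag('my-topic', 'my-group').catch(console.error)
```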