Closed · keitsi closed this 5 days ago
hey @keitsi!
Thank you for providing detailed information. Based on your description, this does not appear to be a bug but rather a result of Kafka's behavior when handling ungraceful shutdowns.
It seems that one of the nodes was stopped abruptly (perhaps using `kill -9` or a similar forceful method), which caused Kafka to initiate a rebalance without allowing the consumers to stop gracefully. This can lead to a temporary system hang while Kafka waits for the consumers' assigned partitions to time out.
The delay you're experiencing is likely due to the `max.poll.interval.ms` consumer setting, which defaults to 5 minutes. It bounds the time Kafka allows between calls to `poll()`; consumer heartbeats are governed separately by `session.timeout.ms`. When the interval expires, Kafka considers the consumer dead and reassigns its partitions, which would explain why the system returns to normal after several minutes.
To avoid this issue in the future, ensure that nodes are stopped gracefully, allowing the consumers to commit offsets and exit cleanly. If you need further assistance or clarification, feel free to reach out.
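Since graceful shutdown is the crux here, a minimal sketch of the standard Kafka consumer shutdown pattern may help: a SIGTERM hook calls `wakeup()` so the poll loop exits, commits offsets, and leaves the group cleanly, letting the broker rebalance immediately instead of waiting for timeouts. The broker address, group id, and topic below are illustrative placeholders, not values from TBMQ's configuration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class GracefulConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "demo-group");              // placeholder group id
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        Thread mainThread = Thread.currentThread();

        // On SIGTERM (plain `kill`, not `kill -9`), wake the poll loop so
        // the consumer can leave the group cleanly.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            consumer.wakeup();
            try {
                mainThread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        try {
            consumer.subscribe(List.of("demo-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.value()));
                consumer.commitSync(); // persist progress before next poll
            }
        } catch (WakeupException e) {
            // Expected on shutdown; fall through to close().
        } finally {
            // close() sends a LeaveGroup request, so partitions are
            // reassigned immediately rather than after a timeout.
            consumer.close();
        }
    }
}
```

With `kill -9`, none of this runs: the process dies without leaving the group, and the broker only notices once the relevant timeouts expire, which matches the multi-minute hang described above.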
Got it, thanks for the detailed explanation.
We also ran some tests, and the results showed that the lag was indeed caused by the Kafka client parameters `max-poll-records` and `session-timeout-ms`.
If I set both `max-poll-records` and `session-timeout-ms` to 10000ms, what situations do I need to consider or balance? Thanks a lot.
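The question is left open in the thread, so here is a hedged sketch of the knobs involved, written against the raw Kafka consumer properties (`max-poll-records` and `session-timeout-ms` are the Spring-style spellings of Kafka's `max.poll.records` and `session.timeout.ms`; note that `max.poll.records` is a record count, not a duration). The values below mirror the question and are assumptions to balance, not recommendations.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class ConsumerTuning {
    static Properties tunedProps() {
        Properties props = new Properties();

        // Maximum records a single poll() may return. Larger batches raise
        // throughput, but the whole batch must be processed before the next
        // poll(); if that takes longer than max.poll.interval.ms (5 minutes
        // by default), the broker evicts the consumer and rebalances.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10000);

        // How long the broker waits for heartbeats before declaring the
        // consumer dead. Lower values detect a killed node faster (shorter
        // hangs after kill -9), but values that are too low cause spurious
        // rebalances during GC pauses or network jitter. The value must also
        // lie between the broker's group.min.session.timeout.ms and
        // group.max.session.timeout.ms.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 10000);

        // Rule of thumb from the Kafka docs: heartbeat at no more than one
        // third of the session timeout, so a few heartbeats can be missed
        // before the session expires.
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 3000);

        return props;
    }
}
```

In short, the balance is between failure-detection speed (a lower `session.timeout.ms` shortens the hang after an ungraceful stop) and stability under load (processing a 10000-record batch must reliably finish within `max.poll.interval.ms`, or the consumer is evicted even though it is healthy).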
Describe the bug
I am running two TBMQ nodes on the same OS with different ports. The remaining node seems stuck when one of the nodes is stopped manually.
Your Server Environment
Your Client Environment
Windows 10 64bit, EMQX
To Reproduce
Expected behavior
Expect the remaining node to be unaffected by the offline node.