thingsboard / tbmq

Open-source, scalable, and fault-tolerant MQTT broker able to handle 4M+ concurrent client connections, supporting at least 3M messages per second throughput per single cluster node with low latency delivery. The cluster mode supports more than 100M concurrently connected clients.
https://thingsboard.io/products/mqtt-broker/
Apache License 2.0

[Bug] TBMQ cluster stuck when one of the nodes goes down #181

Closed. keitsi closed this issue 5 days ago.

keitsi commented 6 days ago

Describe the bug I am running two TBMQ nodes on the same OS with different ports. The remaining node seems stuck when one of the nodes is stopped manually.

Your Server Environment

Your Client Environment Windows 10 64-bit, EMQX

To Reproduce

  1. Start two TBMQ nodes with different ports on the same OS.
  2. Create two MQTT clients and connect them to the two nodes respectively (one possible way to script this is sketched below).
  3. Test that subscribing and publishing both work well.
  4. Stop one of the TBMQ nodes.
  5. The remaining node seems stuck: it cannot deliver any messages and does not accept any new connection requests.
  6. It returns to normal after a few minutes of waiting.
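
For illustration, steps 2-3 could be scripted roughly like this with the Eclipse Paho Java client (a sketch only, not the reporter's actual EMQX-based setup; the ports and topic are placeholders for however the two nodes were started):

```java
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class ReproSketch {
    public static void main(String[] args) throws Exception {
        // One client per node; the ports are placeholders for the two TBMQ listeners.
        MqttClient subscriber = new MqttClient("tcp://localhost:1883", "repro-subscriber");
        MqttClient publisher  = new MqttClient("tcp://localhost:1884", "repro-publisher");

        MqttConnectOptions options = new MqttConnectOptions();
        options.setCleanSession(true);
        subscriber.connect(options);
        publisher.connect(options);

        // Step 3: verify that a message published through one node is
        // delivered to a subscriber connected to the other node.
        subscriber.subscribe("test/topic", (topic, message) ->
                System.out.println("received: " + new String(message.getPayload())));
        publisher.publish("test/topic", new MqttMessage("hello".getBytes()));
    }
}
```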

Expected behavior The remaining node should not be affected by the offline node.

dmytro-landiak commented 6 days ago

hey @keitsi!

Thank you for providing detailed information. Based on your description, this does not appear to be a bug but rather a result of Kafka's specific behavior in handling ungraceful shutdowns.

It seems that one of the nodes was stopped abruptly (perhaps using kill -9 or a similar forceful method), which caused Kafka to initiate a rebalance without allowing the consumers to stop gracefully. This can lead to a temporary system hang as Kafka waits for the consumers' assigned partitions to time out.

The timeout you're experiencing is likely due to the max.poll.interval.ms configuration in Kafka, which is set to 5 minutes by default. Within this interval, Kafka expects the consumer to poll again (and keep sending heartbeats) to prove it is still alive. When the timeout expires, Kafka considers the consumer dead and reassigns its partitions, which may explain why the system returns to normal after several minutes.
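
A minimal sketch of where these timeouts live on a plain Java Kafka consumer, with illustrative values only (this is not TBMQ's actual configuration; the broker address and group id are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerTimeoutSketch {
    public static KafkaConsumer<String, String> buildConsumer() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "tbmq-demo-group");          // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        // max.poll.interval.ms: the maximum time allowed between poll() calls
        // before the consumer is considered failed and its partitions are
        // reassigned. The default of 300000 ms matches the ~5 minute hang
        // described above.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000);

        // session.timeout.ms: how long the group coordinator waits for
        // heartbeats before declaring the consumer dead and rebalancing.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);

        return new KafkaConsumer<>(props);
    }
}
```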

To avoid this issue in the future, ensure that nodes are stopped gracefully, allowing the consumers to commit offsets and exit cleanly. If you need further assistance or clarification, feel free to reach out.
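
For reference, the usual graceful-shutdown pattern for a Java Kafka consumer looks roughly like the sketch below (a generic example, not TBMQ's internal code; the topic name is a placeholder). The important part is that consumer.close() notifies the group coordinator, so partitions are reassigned right away instead of after the timeout expires.

```java
import java.time.Duration;
import java.util.Collections;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class GracefulShutdownSketch {
    public static void runLoop(KafkaConsumer<String, String> consumer) {
        final Thread pollThread = Thread.currentThread();

        // On a normal stop (SIGTERM, not kill -9), wake the consumer so the
        // poll loop can exit, commit offsets, and leave the group cleanly.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            consumer.wakeup();
            try {
                pollThread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        try {
            consumer.subscribe(Collections.singletonList("demo-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                // ... process records ...
                consumer.commitSync();
            }
        } catch (WakeupException e) {
            // Expected when wakeup() is called during shutdown.
        } finally {
            // close() sends a LeaveGroup request, so the coordinator can
            // rebalance immediately instead of waiting for the session or
            // poll-interval timeout.
            consumer.close();
        }
    }
}
```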

keitsi commented 5 days ago

Got it, thanks for the detailed explanation.

We also ran some tests, and the results showed that the lag was indeed caused by the Kafka client parameters max-poll-records and session-timeout-ms.

If I set both max-poll-records and session-timeout-ms to 10000, what situations do I need to consider or balance? Thanks a lot.
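
As a generic illustration of the two knobs mentioned above and the trade-offs they control (illustrative values only, not a TBMQ recommendation):

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;

public class TuningSketch {
    public static Properties tunedConsumerProps() {
        Properties props = new Properties();

        // max.poll.records: how many records a single poll() may return.
        // Smaller batches mean less processing per poll, so the consumer is
        // less likely to exceed max.poll.interval.ms, at the cost of more
        // poll round-trips and lower per-poll throughput.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);

        // session.timeout.ms: lowering it (e.g. to 10000 ms) means a crashed
        // node's partitions are reassigned sooner, but the group also becomes
        // more sensitive to GC pauses or network hiccups, which can trigger
        // unnecessary rebalances. It must stay above heartbeat.interval.ms
        // and within the broker's group.min/max.session.timeout.ms bounds.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 10000);

        return props;
    }
}
```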