nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

A node in the NATS cluster suddenly 'crashed,' causing the entire cluster to stop providing services [v2.10.12] #5851

Closed: liushiqi1001 closed this issue 2 months ago

liushiqi1001 commented 2 months ago

Observed behavior

Here is the monitoring query for CPU idle.

[screenshot: CPU idle monitoring graph]

All MQTT clients are unable to connect, and clients that were already connected stop receiving data. (It's unclear whether all nodes stop providing service or only node 2.) After node 2 is stopped, nodes 1 and 3 appear to provide MQTT service normally again.

Node 2 log: dublin-nats-02_15_00_00-18_80_00.log

There are many 'timeout after 4s' entries in the logs, indicating that JetStream API (JSA) calls are timing out, but I'm not sure whether this is the root cause.

There are more than 20k MQTT clients.
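As a rough way to check the slow-reply symptom from the client side (this is not part of the original report; it assumes a nats.go client, a server reachable on localhost:4222, and JetStream enabled for the connecting account), a request to `$JS.API.INFO` with the same 4-second budget seen in the logs shows whether the JetStream API is answering in time:

```go
// jsa_probe.go — a minimal sketch: measure the round-trip latency of one
// JetStream API call. Assumes localhost:4222 and JetStream enabled for the
// account; adjust the URL/credentials for a real deployment.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// $JS.API.INFO returns JetStream account info; timing it with a 4s budget
	// mirrors the "timeout after 4s" entries reported in the server logs.
	start := time.Now()
	if _, err := nc.Request("$JS.API.INFO", nil, 4*time.Second); err != nil {
		log.Fatalf("JetStream API did not reply within 4s: %v", err)
	}
	fmt.Printf("JetStream API round trip: %v\n", time.Since(start))
}
```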

Expected behavior

All nodes provide MQTT services externally.

Server and client version

[nats-server] v2.10.12

[mqtt-client] 3.1.1

Host environment

AWS ECS, Linux 3.10.0-1160.102.1.el7.x86_64 #1 SMP Tue Oct 17 15:42:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Steps to reproduce

Not sure how to reproduce.

wallyqs commented 2 months ago

Do you think you would be able to find the exit reason in the logs somewhere? Was it an OOM? Is it running in Docker?

[1] 2024/08/21 18:20:34.785047 [WRN] JetStream request queue has high pending count: 45
[1] 2024/08/21 18:20:34.946822 [WRN] JetStream request queue has high pending count: 46
[1] 2024/08/21 18:20:34.947169 [WRN] JetStream request queue has high pending count: 47
[1] 2024/08/21 18:20:36.177984 [INF] Starting nats-server
[1] 2024/08/21 18:20:36.178013 [INF]   Git:      [121169ea]
[1] 2024/08/21 18:20:36.178022 [INF]   Node:     CQ10Rfcq
liushiqi1001 commented 2 months ago

> Do you think you would be able to find the exit reason in the logs somewhere? Was it an OOM? Is it running in Docker?

I'm now fairly certain the issue is at least partly caused by the asynchronous handling of the JetStream API (JSA): replies are extremely slow. I don't fully understand why they are so slow, but it leaves the entire NATS cluster unable to function properly, affecting the core NATS protocol as well.
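One way to look at this from the server side (again, not from the original report; it assumes the HTTP monitoring port is enabled on the default 8222 and uses the `/jsz` field names as served by 2.10.x) is to read the JetStream API counters exposed by the monitoring endpoint while the pending count is climbing:

```go
// jsz_api_stats.go — a rough sketch: print server-side JetStream API counters
// from the /jsz monitoring endpoint. Assumes monitoring is enabled on :8222.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:8222/jsz")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Only the fields of interest are decoded here.
	var jsz struct {
		Memory  int64 `json:"memory"`
		Storage int64 `json:"storage"`
		API     struct {
			Total  int64 `json:"total"`
			Errors int64 `json:"errors"`
		} `json:"api"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&jsz); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("JetStream API calls: total=%d errors=%d (mem=%d store=%d bytes)\n",
		jsz.API.Total, jsz.API.Errors, jsz.Memory, jsz.Storage)
}
```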

derekcollison commented 2 months ago

I would imagine you are overwhelming the system, so the servers keep growing in memory and eventually get OOM'd.
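If memory growth toward an OOM kill is the suspicion, one simple check (a sketch, not an official tool; it assumes the HTTP monitoring port on 8222 and reads the `mem` and `connections` fields of `/varz`) is to poll the server's reported memory over time and watch whether it climbs as MQTT clients connect:

```go
// memwatch.go — a minimal sketch: poll /varz and print process memory and
// connection count every 10 seconds, to see if memory grows before a restart.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	for {
		resp, err := http.Get("http://localhost:8222/varz")
		if err != nil {
			log.Fatal(err)
		}
		var varz struct {
			Mem         int64 `json:"mem"`
			Connections int   `json:"connections"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&varz); err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()
		fmt.Printf("%s mem=%d bytes connections=%d\n",
			time.Now().Format(time.RFC3339), varz.Mem, varz.Connections)
		time.Sleep(10 * time.Second)
	}
}
```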