do you think you would be able to find the exit reason in the logs somewhere? was it an OOM? is it running in Docker?
[1] 2024/08/21 18:20:34.785047 [WRN] JetStream request queue has high pending count: 45
[1] 2024/08/21 18:20:34.946822 [WRN] JetStream request queue has high pending count: 46
[1] 2024/08/21 18:20:34.947169 [WRN] JetStream request queue has high pending count: 47
[1] 2024/08/21 18:20:36.177984 [INF] Starting nats-server
[1] 2024/08/21 18:20:36.178013 [INF] Git: [121169ea]
[1] 2024/08/21 18:20:36.178022 [INF] Node: CQ10Rfcq
I'm now quite certain that the issue is at least partly caused by the asynchronous handling of the JetStream API (JSA): the replies are extremely slow. However, I don't fully understand why they are so slow. The slowness leaves the entire NATS cluster unable to function properly, affecting the core NATS protocol as well.
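For what it's worth, a minimal Go sketch along these lines (the server address is an assumption) can time a single JetStream API round trip from the client side, using the same 4s ceiling as the "timeout after 4s" entries in the server logs:

```go
// Minimal sketch (assumed server address): time one JetStream API
// request/reply from the client side with a 4s wait, matching the
// "timeout after 4s" entries seen in the server logs.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://dublin-nats-02:4222") // assumption: node 2 address
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer nc.Close()

	js, err := nc.JetStream(nats.MaxWait(4 * time.Second))
	if err != nil {
		log.Fatalf("jetstream context: %v", err)
	}

	start := time.Now()
	_, err = js.AccountInfo() // a single $JS.API request/reply
	elapsed := time.Since(start)
	if err != nil {
		log.Printf("JS API request failed after %v: %v", elapsed, err)
		return
	}
	log.Printf("JS API reply took %v", elapsed)
}
```

If this consistently approaches or hits the 4s limit even when the node otherwise looks healthy, that would support the idea that the JS API reply path, rather than the MQTT layer itself, is the bottleneck.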
I would imagine you are overwhelming the system and the servers are growing in memory and eventually get OOM'd.
Observed behavior
Here is the monitoring query for CPU idle.
All MQTT clients are unable to connect, and clients that were already connected stop receiving data. (It's unclear whether all nodes stop providing service, or only node 2.) After node 2 is stopped, nodes 1 and 3 might be able to provide MQTT services normally again.
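To narrow down whether only node 2 or every node stops serving MQTT, a small per-node probe like the sketch below could be run during an incident (host names and port 1883 are assumptions):

```go
// Hypothetical per-node MQTT connectivity probe; host names and port are assumptions.
package main

import (
	"fmt"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	nodes := []string{"dublin-nats-01", "dublin-nats-02", "dublin-nats-03"}
	for _, host := range nodes {
		opts := mqtt.NewClientOptions().
			AddBroker(fmt.Sprintf("tcp://%s:1883", host)).
			SetClientID("probe-" + host).
			SetConnectTimeout(5 * time.Second)

		c := mqtt.NewClient(opts)
		tok := c.Connect()
		if !tok.WaitTimeout(10*time.Second) || tok.Error() != nil {
			fmt.Printf("%s: MQTT connect failed: %v\n", host, tok.Error())
			continue
		}
		fmt.Printf("%s: MQTT connect OK\n", host)
		c.Disconnect(250)
	}
}
```

Running this while the problem is ongoing should show whether nodes 1 and 3 still accept MQTT connections while node 2 does not.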
Node 2 log: dublin-nats-02_15_00_00-18_80_00.log
There are many 'timeout after 4s' entries in the logs, which indicates that calls to the JetStream API (JSA) are timing out, but I'm not sure whether this is the root cause.
The number of MQTT clients exceeds 20k.
Expected behavior
All nodes provide MQTT services externally.
Server and client version
[nats-server] v2.10.12
[mqtt-client] MQTT 3.1.1
Host environment
AWS ECS, Linux 3.10.0-1160.102.1.el7.x86_64 #1 SMP Tue Oct 17 15:42:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Steps to reproduce
Not sure how to reproduce.