[Open] yadvlz opened this issue 3 years ago
Some questions:
The fact that all 3 nodes were experiencing high CPU usage makes me believe that it may have been due to the number of clients reconnecting at the same time. Otherwise, if the stopped node had been down for a while and required a lot of catching up on restart, that catch-up would happen with the leader, so I would expect only 2 nodes to, possibly, be very active.
CPU jumps in all nodes after restarting one of them.
To make sure I have correctly understood the description of the issue:
The issue was triggered when simply restarting one NATS Streaming server of the cluster that probably had between 12 to 15,000 clients. After the restart, all nodes were at 100% CPU for 6 to 8 hours. You then blocked traffic for all incoming clients for about 30 seconds, and that solved the issue.
Is it about right?
The issue was triggered when simply restarting one NATS Streaming server of the cluster that probably had between 12 to 15,000 clients.
Correct.
After the restart, all nodes were at 100% CPU for 6 to 8 hours.
Maybe more, I didn't test.
You then blocked traffic for all incoming clients for about 30 seconds, and that solved the issue.
Clients are located on the 10.0.0.0/8 network. I block all incoming packets for about 20 minutes, and after that the CPU load drops. If I unblock incoming packets all at once, the load increases again. To prevent this, I am forced to unblock incoming packets gradually, one /16 at a time every 30 seconds.
The issue was triggered when simply restarting one NATS Streaming server of the cluster that probably had between 12 to 15,000 clients.
12,000 - 15,000. Sorry.
12,000 - 15,000. Sorry.
Yes, I understood that, no worries.
To prevent this, I am forced to unblock incoming packets gradually, one /16 at a time every 30 seconds.
So do you get to a point where all clients traffic is resumed and servers behave normally (no high CPU usage on all nodes)?
So do you get to a point where all clients traffic is resumed and servers behave normally (no high CPU usage on all nodes)?
Correct. Right now the cluster is in this state and the unlock script is running on all nodes :)
Correct. Right now the cluster is in this state and the unlock script is running on all nodes :)
Oh so you mean that you are still filtering clients? So if you were to "open the gate" there would still be the issue? And are we sure that the client traffic is the same as before and that the issue really started with a restart of a node, and not simply with some client usage that has suddenly increased?
How many channels do you have, and what is the normal message rate? How many subscriptions exist, and are they mainly 1 to 1 or does 1 message go to many different subscriptions?
Oh so you mean that you are still filtering clients?
The problem has been observed for several months and every time it appears I have to solve it by blocking/unblocking clients.
And are we sure that the client traffic is the same as before and that the issue really started with a restart of a node, and not simply with some client usage that has suddenly increased?
Sometimes the load increases without restarting a node. The cluster works normally, then the load suddenly jumps. The problem is solved by blocking/unblocking clients.
How many channels do you have, and what is the normal message rate?
~42,000 channels.
{ "num_subscriptions": 124494, "num_cache": 10469384, "num_inserts": 117474553, "num_removes": 117350059, "num_matches": 2124441735, "cache_hit_rate": 0.0279807824430638, "max_fanout": 0, "avg_fanout": 0 }
Looks like you may have 1 channel per client since you mentioned about 40,000 clients and 40,000 channels. I see that there is no limit on these channels except for max_age or max_inactivity. Are there cases where those channels are removed due to inactivity, or is it mainly messages being removed after 6 days?
When the event occurs (either on server restart or on its own), have you correlated to an increase in message rate in some channels? Have you noticed some channels unexpectedly filling up with messages maybe? Are you using streaming only or do you have NATS core usage from client applications on this cluster?
I need the help of my teammate. He'll be able to respond in a few hours.
Looks like you may have 1 channel per client since you mentioned about 40,000 clients and 40,000 channels.
We have one channel with 40,000 subscribers and 40,000 1-to-1 channels.
Are there cases where those channels are removed due to inactivity, or is it mainly messages being removed after 6 days?
Messages removed after 6 days.
When the event occurs (either on server restart or on its own), have you correlated to an increase in message rate in some channels?
We need to check it out.
Have you noticed some channels unexpectedly filling up with messages maybe?
Noticed - where?
Are you using streaming only or do you have NATS core usage from client applications on this cluster?
Both of them.
We have one channel with 40,000 subscribers
Ouch, so 1 message will be sent to 40,000 subscribers... this could become a bottleneck if the message rate to this channel is high.
Noticed - where?
Are you using any kind of metric or the monitoring endpoints to see if some of the channels unexpectedly increase in size? In case you don't know, you can use monitoring endpoints to check on some of the streaming resources: https://docs.nats.io/nats-streaming-concepts/monitoring and https://docs.nats.io/nats-streaming-concepts/monitoring/endpoints
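As a side note, a minimal sketch of such a check, assuming Go, a placeholder host, and the HTTP monitor on port 8222 (the field names follow the documented /streaming/serverz response; polling interval is arbitrary):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Polls the streaming monitoring endpoint and prints a few counters.
// Watching the deltas of these counters over time makes a sudden jump in
// channels, subscriptions, or message volume easy to spot.
func main() {
	url := "http://127.0.0.1:8222/streaming/serverz"
	for range time.Tick(10 * time.Second) {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("poll error:", err)
			continue
		}
		var stats map[string]interface{}
		err = json.NewDecoder(resp.Body).Decode(&stats)
		resp.Body.Close()
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("clients=%v subs=%v channels=%v msgs=%v bytes=%v\n",
			stats["clients"], stats["subscriptions"], stats["channels"],
			stats["total_msgs"], stats["total_bytes"])
	}
}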
Both of them.
I have a nagging suspicion that it is either a sudden influx of streaming messages, something NATS related, or some subscriptions being constantly created/destroyed. The reason I say that is that all 3 nodes are affected. That can be the case if streaming messages are added to the system, since they get replicated to all 3 nodes. That being said, it is the same for any other activity: connection create/close, subscription create/close, messages delivered to subscriptions/ack'ed: all of that gets replicated. So an increase in activity would cause each node to have to work to process the replicated information.
If you have any way to measure normal activity in the cluster, you should then see if you notice any increase in activity. You could even have a simple group of NATS (not streaming) consumers on ">" that simply measure the message rate/size and report every second. It does not have to be too sophisticated, just a way to see a jump. Or, when the situation occurs (100% CPU), start a NATS consumer on ">", capture traffic for a few seconds, and try to see what the bulk of the traffic is.
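For illustration, a rough sketch of such a measuring consumer, assuming the nats.go client and a placeholder server URL (this is not part of the original discussion, just one way to implement the idea):

package main

import (
	"fmt"
	"sync/atomic"
	"time"

	"github.com/nats-io/nats.go"
)

// Subscribes to ">" (every subject, including the internal _STAN.> traffic)
// and prints the message and byte rate once per second. The URL is a
// placeholder for one of the cluster nodes.
func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	var msgs, bytes int64
	sub, err := nc.Subscribe(">", func(m *nats.Msg) {
		atomic.AddInt64(&msgs, 1)
		atomic.AddInt64(&bytes, int64(len(m.Data)))
	})
	if err != nil {
		panic(err)
	}
	// Remove the pending limits so this measuring subscription does not
	// itself become a slow consumer under heavy traffic.
	sub.SetPendingLimits(-1, -1)

	for range time.Tick(time.Second) {
		fmt.Printf("%d msgs/s, %d bytes/s\n",
			atomic.SwapInt64(&msgs, 0), atomic.SwapInt64(&bytes, 0))
	}
}

A jump in the reported rate when the CPU spikes would point to a traffic surge rather than something internal to the servers.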
As you can see, I am a bit at a loss to explain what is happening...
Ouch, so 1 message will be sent to 40,000 subscribers... this could become a bottleneck if the message rate to this channel is high.
1-2 messages per month.
Are you using any kind of metric or the monitoring endpoints to see if some of the channels unexpectedly increase in size? In case you don't know, you can use monitoring endpoints to check on some of the streaming resources: https://docs.nats.io/nats-streaming-concepts/monitoring and https://docs.nats.io/nats-streaming-concepts/monitoring/endpoints
We use prometheus-metrics-exporter, but we haven't set up triggers yet.
Two questions:
[ERR] STREAM: Asynchronous error on connection _NSS-stan-eas-general, subject _STAN.sub.stan-eas: nats: slow consumer, messages dropped
What is it? CPU load is 2900/3200.
[Client:replicator_412025] Duplicate durable subscription registration
They appear by the dozens per second. Could this be the cause?
[ERR] STREAM: Asynchronous error on connection _NSS-stan-eas-general, subject _STAN.sub.stan-eas: nats: slow consumer, messages dropped What is it? CPU load is 2900/3200.
The Streaming server is not a server per se (https://docs.nats.io/nats-streaming-concepts/intro and https://docs.nats.io/nats-streaming-concepts/relation-to-nats). It creates NATS connections to the NATS Server and uses subscriptions to receive messages from clients. This particular subscription is for "subscription requests" coming from clients. By default, NATS subscriptions have a limit on pending messages (messages waiting to be dispatched/processed), after which the connection does not enqueue them but drops them and notifies the async error handler. So the error above indicates that the server is receiving way more subscription requests than it can handle, and they are dropped.
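To illustrate that mechanism on a regular client connection (the server's internal NATS connections use the same machinery; the subject and limits below are just examples, not anything from this deployment):

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// The async error handler is invoked when, among other things, messages
	// are dropped because a subscription's pending limits were exceeded
	// (the "slow consumer" case from the log line above).
	nc, err := nats.Connect(nats.DefaultURL,
		nats.ErrorHandler(func(_ *nats.Conn, sub *nats.Subscription, err error) {
			if sub != nil {
				log.Printf("async error on %q: %v", sub.Subject, err)
			} else {
				log.Printf("async error: %v", err)
			}
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	sub, err := nc.Subscribe("requests", func(m *nats.Msg) {
		// A slow callback makes messages pile up in the subscription's
		// pending buffer instead of being processed.
		time.Sleep(time.Second)
	})
	if err != nil {
		log.Fatal(err)
	}
	// Pending limits default to finite values; once either limit is hit,
	// further messages for this subscription are dropped and the handler
	// above fires. The values here are arbitrary examples.
	sub.SetPendingLimits(1000, 8*1024*1024)

	select {} // keep running
}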
[Client:replicator_412025] Duplicate durable subscription registration They appear by the dozens per second. Could this be the cause?
Unless this is a new issue and/or a side effect of something else, it could indeed be the reason for the issue. Could it be that you have an application that tries to create a durable subscription and for some reason loops trying to do so? It could be that initially it got a timeout waiting for the response, but the durable was actually created. In that case, your app may want to check the error returned, and in the case of "Duplicate durable subscription" it would have to close the connection, otherwise any other attempt will fail.
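A hedged sketch of that client-side check, assuming stan.go; the client ID, subject, durable name, and the exact error text are assumptions for illustration (the cluster ID is taken from the logs in this thread):

package main

import (
	"log"
	"strings"

	"github.com/nats-io/stan.go"
)

func main() {
	// Placeholder client ID; cluster ID matches the logs above.
	sc, err := stan.Connect("stan-eas", "replicator-example")
	if err != nil {
		log.Fatal(err)
	}

	_, err = sc.Subscribe("updates",
		func(m *stan.Msg) { /* process message */ },
		stan.DurableName("example-durable"))
	if err != nil {
		// If a previous attempt timed out but the durable was actually
		// created server-side, a retry fails with a "duplicate durable"
		// error. Retrying in a tight loop only floods the server; close
		// the connection instead and let the app reconnect cleanly.
		if strings.Contains(strings.ToLower(err.Error()), "duplicate durable") {
			sc.Close()
			log.Fatalf("durable already registered: %v", err)
		}
		log.Fatal(err)
	}

	select {} // keep consuming
}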
@yadvlz Could it be that issue #1135 is related to this issue? If so, I have submitted a PR to add a new configuration option that would cause the server to close the current durable and accept the new durable subscription. That may help with the issue that you have seen?
@yadvlz Could it be that issue #1135 is related to this issue?
I think it is.
@yadvlz I was actually wondering if the issue was from a colleague of yours. If not, there have been 2 issues reported in the streaming server repo and 2 in the stan.go repo for the same "duplicate durable" issue in a short amount of time, so this is why I was wondering if they come from the same deployment or are actually different users. Nevertheless, I believe that PR #1136 will help.
I am closing this since PR #1136 has been merged. Thank you for your patience!
I deployed a new version of nats-streaming-server (with "replace_durable: true"), but my problem still exists. Log from the leader at the time of restarting the follower:
[32749] 2021/01/08 12:33:27.428845 [ERR] STREAM: [Client:replicator_659469] Timed out on heartbeats
[32749] 2021/01/08 12:33:28.690700 [ERR] STREAM: [Client:replicator_606369] Timed out on heartbeats
[32749] 2021/01/08 12:33:38.565970 [ERR] STREAM: [Client:replicator_693023] Timed out on heartbeats
[32749] 2021/01/08 12:33:52.827200 [ERR] STREAM: [Client:replicator_157785] Timed out on heartbeats
[32749] 2021/01/08 12:33:54.833447 [INF] 10.193.68.251:38968 - rid:10961 - Router connection closed: Read Error
[32749] 2021/01/08 12:33:55.699539 [ERR] STREAM: [Client:replicator_462420] Timed out on heartbeats
[32749] 2021/01/08 12:33:55.887039 [INF] STREAM: server lost leadership, performing leader stepdown actions
[32749] 2021/01/08 12:33:55.908709 [INF] 10.193.68.251:40086 - rid:97807 - Route connection created
[32749] 2021/01/08 12:33:57.227611 [WRN] 10.112.24.9:63226 - cid:97429 - Readloop processing time: 2.263951272s
[32749] 2021/01/08 12:33:57.230102 [WRN] 10.92.47.72:57609 - cid:97354 - Readloop processing time: 2.017834321s
[32749] 2021/01/08 12:33:57.262937 [INF] STREAM: finished leader stepdown actions
[32749] 2021/01/08 12:33:57.276524 [WRN] 10.87.91.162:64614 - cid:97364 - Readloop processing time: 2.246782296s
[32749] 2021/01/08 12:33:57.334539 [WRN] 10.100.141.210:57416 - cid:97291 - Readloop processing time: 2.106756313s
[32749] 2021/01/08 12:33:57.340296 [WRN] 10.106.78.130:64324 - cid:97318 - Readloop processing time: 2.310444523s
[32749] 2021/01/08 12:33:57.586867 [WRN] 10.109.121.2:51399 - cid:97379 - Readloop processing time: 2.623306455s
[32749] 2021/01/08 12:33:57.589297 [WRN] 10.113.77.67:63701 - cid:97433 - Readloop processing time: 2.60364601s
[32749] 2021/01/08 12:33:57.589529 [WRN] 10.108.213.197:52596 - cid:97422 - Readloop processing time: 2.625908581s
[32749] 2021/01/08 12:33:57.589593 [WRN] 10.92.108.200:61466 - cid:97273 - Readloop processing time: 2.615766941s
[32749] 2021/01/08 12:33:57.589661 [WRN] 10.92.84.194:55216 - cid:97287 - Readloop processing time: 2.669059357s
[32749] 2021/01/08 12:33:57.589721 [WRN] 10.24.84.2:60485 - cid:97323 - Readloop processing time: 2.647962912s
[32749] 2021/01/08 12:33:57.589780 [WRN] 10.93.96.146:58826 - cid:97378 - Readloop processing time: 2.669010652s
I am re-opening this issue. Seeing the timed out errors and the route disconnect (and leader step down), I wonder if you did not also have a network issue there. But otherwise, we are now back to the investigation I suggested in earlier comments.
[42312] 2021/01/11 11:55:42.398945 [FTL] STREAM: Failed to start: nats: no servers available for connection
[42349] 2021/01/11 11:55:42.711058 [INF] STREAM: Starting nats-streaming-server[stan-eas] version 0.19.0
[42349] 2021/01/11 11:55:42.711155 [INF] STREAM: ServerID: q1zdQ4ez66kGwQEFWz5stM
[42349] 2021/01/11 11:55:42.711163 [INF] STREAM: Go version: go1.15.6
[42349] 2021/01/11 11:55:42.711169 [INF] STREAM: Git commit: [not set]
[42349] 2021/01/11 11:55:42.712526 [INF] Starting nats-server version 2.1.9
[42349] 2021/01/11 11:55:42.712550 [INF] Git commit [not set]
[42349] 2021/01/11 11:55:42.712790 [INF] Starting http monitor on 0.0.0.0:8222
[42349] 2021/01/11 11:55:42.714242 [INF] Listening for client connections on 0.0.0.0:4222
[42349] 2021/01/11 11:55:42.714262 [INF] Server id is ND4V3Q4K6EOQX62KBQWBAPD4GUFBQGMC3X4PAISUG5XSHAK6JEMKK6TF
[42349] 2021/01/11 11:55:42.714268 [INF] Server is ready
[42349] 2021/01/11 11:55:42.714798 [INF] Listening for route connections on 0.0.0.0:6222
[42349] 2021/01/11 11:55:42.716096 [INF] 10.193.68.252:6222 - rid:1 - Route connection created
[42349] 2021/01/11 11:55:42.716140 [INF] 10.193.68.253:6222 - rid:2 - Route connection created
[42349] 2021/01/11 11:55:45.010386 [WRN] 10.77.28.252:63791 - cid:105 - Readloop processing time: 2.048394269s
[42349] 2021/01/11 11:55:45.021490 [WRN] 10.82.77.136:61951 - cid:64 - Readloop processing time: 2.014462329s
[42349] 2021/01/11 11:55:45.041510 [WRN] 10.80.91.131:53914 - cid:65 - Readloop processing time: 2.149629235s
[42349] 2021/01/11 11:55:45.049700 [WRN] 10.105.127.98:63396 - cid:120 - Readloop processing time: 2.087822186s
[42349] 2021/01/11 11:55:45.056968 [WRN] 10.41.89.2:63512 - cid:114 - Readloop processing time: 2.073160631s
[42349] 2021/01/11 11:55:45.060889 [FTL] STREAM: Failed to start: nats: no servers available for connection
[42385] 2021/01/11 11:55:45.210725 [INF] STREAM: Starting nats-streaming-server[stan-eas] version 0.19.0
[42385] 2021/01/11 11:55:45.210821 [INF] STREAM: ServerID: olFM1hi2L8w7kgv5B0pb6O
[42385] 2021/01/11 11:55:45.210832 [INF] STREAM: Go version: go1.15.6
[42385] 2021/01/11 11:55:45.210839 [INF] STREAM: Git commit: [not set]
[42385] 2021/01/11 11:55:45.212160 [INF] Starting nats-server version 2.1.9
[42385] 2021/01/11 11:55:45.212184 [INF] Git commit [not set]
[42385] 2021/01/11 11:55:45.212469 [INF] Starting http monitor on 0.0.0.0:8222
[42385] 2021/01/11 11:55:45.213904 [INF] Listening for client connections on 0.0.0.0:4222
[42385] 2021/01/11 11:55:45.213925 [INF] Server id is NAO7GBYFGAJ6OFZMRRRQ3I4TS6SIC2YD344ERBPRZNLTEUBBOTIOOR75
[42385] 2021/01/11 11:55:45.213930 [INF] Server is ready
These messages appear cyclically on the node after restart.
If I block clients, the log looks like this:
[42965] 2021/01/11 11:57:19.464101 [INF] Starting nats-server version 2.1.9
[42965] 2021/01/11 11:57:19.464116 [INF] Git commit [not set]
[42965] 2021/01/11 11:57:19.464408 [INF] Starting http monitor on 0.0.0.0:8222
[42965] 2021/01/11 11:57:19.467430 [INF] Listening for client connections on 0.0.0.0:4222
[42965] 2021/01/11 11:57:19.467444 [INF] Server id is NCYRM2UX76PYBASHN3PO4IN7EA6RBMAW62EW7GDL7B7I42RYZKPBA3V4
[42965] 2021/01/11 11:57:19.467448 [INF] Server is ready
[42965] 2021/01/11 11:57:19.468038 [INF] Listening for route connections on 0.0.0.0:6222
[42965] 2021/01/11 11:57:19.469184 [INF] 10.193.68.252:6222 - rid:1 - Route connection created
[42965] 2021/01/11 11:57:19.469228 [INF] 10.193.68.253:6222 - rid:2 - Route connection created
[42965] 2021/01/11 11:57:20.535651 [INF] STREAM: Recovering the state...
[42965] 2021/01/11 12:00:34.815674 [INF] STREAM: Recovered 42537 channel(s)
[42965] 2021/01/11 12:00:34.923018 [INF] STREAM: Cluster Node ID : a
[42965] 2021/01/11 12:00:34.923112 [INF] STREAM: Cluster Log Path: /storage/nats-streaming-server/nsslog
[42965] 2021/01/11 12:00:34.923121 [INF] STREAM: Cluster known peers:
[42965] 2021/01/11 12:00:34.923132 [INF] STREAM: peer 1: "b"
[42965] 2021/01/11 12:00:34.923138 [INF] STREAM: peer 2: "c"
[42965] 2021/01/11 12:00:35.166678 [INF] STREAM: restoring from snapshot
It looks to me that the server - as soon as routes are created - is overwhelmed with traffic to the point that the internal connection requests to the NATS server fail! Now the fact that the server is embedding its own NATS server makes it even more worrisome, because this is just loopback. The default connect timeout is 2 seconds, so, seeing those warnings, that may explain why it fails. If you can recompile the server, you could try bumping the connect timeout (this is not currently configurable through options). This would be here (based on the main branch): https://github.com/nats-io/nats-streaming-server/blob/814b51a89ccac0107d4e323c6b8270bbe1d23ccd/server/server.go#L1581 and you could add:
ncOpts.Timeout = 10 * time.Second
before the Connect() call. But to be honest, I fear that this is not going to help much. I am thinking that the volume is too much for a single server to handle.
I am thinking that the volume is too much for a single server to handle.
So how can I deal with it?
So how can I deal with it?
I described some of the ways we could try to debug this: by capturing some of the traffic and seeing if this traffic is "expected". From your earlier description, it sounded to me like there are times where all is well, but then suddenly an event (server restart or other) causes this issue, and the only way to recover is to basically stop client traffic for a while.
If we determine that the traffic is normal, just higher volume from time to time, then maybe you need to have several clusters and separate connections/traffic. If the traffic shows an abnormal pattern (it could have been a constant attempt to create durable subscriptions, as we suspected at one point), then it could help us fix the root cause of this traffic.
ncOpts.Timeout = 10 * time.Second
Tried this, didn't help.
If we determine that the traffic is normal, just higher volume from time to time, then maybe you need to have several clusters and separate connections/traffic.
I think the problem occurs in cases where multiple clients are trying to connect to the server (establish new connections) at the same time.
I think the problem occurs in cases where multiple clients are trying to connect to the server (establish new connections) at the same time.
If this is following a disconnect/reconnect, and depending on which library (and version) you use, we have some ways to add jitter to the reconnect logic. This may help prevent too many connections from reconnecting at the same time. https://github.com/nats-io/nats.go/blob/acca6d69b4e3ee1abef2b96cae85a3878aa2a520/nats.go#L790 or https://github.com/nats-io/nats.go/blob/acca6d69b4e3ee1abef2b96cae85a3878aa2a520/nats.go#L800 to implement your own back-off computation.
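A minimal sketch of those two nats.go options (the URL and all durations are just example values, not recommendations for this deployment):

package main

import (
	"log"
	"math/rand"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222",
		// Add up to 2s (5s over TLS) of random jitter to each reconnect
		// attempt so thousands of clients do not reconnect in lockstep.
		nats.ReconnectJitter(2*time.Second, 5*time.Second),
		// Alternatively, take full control of the back-off between attempts.
		nats.CustomReconnectDelay(func(attempts int) time.Duration {
			d := time.Duration(attempts) * 500 * time.Millisecond
			if d > 30*time.Second {
				d = 30 * time.Second
			}
			return d + time.Duration(rand.Int63n(int64(time.Second)))
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
}

A streaming client can then reuse such a connection via stan.Connect(clusterID, clientID, stan.NatsConn(nc)), so the jittered reconnect behavior also applies to the streaming traffic.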
Our cluster consists of three nodes:
Configuration file:
Number of clients: ~40,000. Average CPU consumption: 400/3200 (from "top"). If I restart one of the nodes, the utilization on all nodes jumps to 3200.
Log: