[Open] yadvlz opened this issue 3 years ago
Some questions:
The fact that all 3 nodes were experiencing high CPU usage makes me believe that it may have been due to the number of clients reconnecting at the same time. Otherwise, if the stopped node had been down for a while and required a lot of catching up on restart, that catch-up would happen with the leader, so I would expect only 2 nodes to, possibly, be very active.
CPU jumps in all nodes after restarting one of them.
To make sure I have correctly understood the description of the issue:
The issue was triggered when simply restarting one NATS Streaming server of the cluster that probably had between 12 to 15,000 clients. After the restart, all nodes were at 100% CPU for 6 to 8 hours. You then blocked traffic for all incoming clients for about 30 seconds, and that solved the issue.
Is it about right?
The issue was triggered when simply restarting one NATS Streaming server of the cluster that probably had between 12 to 15,000 clients.
Correct.
After the restart, all nodes were at 100% CPU for 6 to 8 hours.
Maybe more, I didn't test.
You then blocked traffic for all incoming clients for about 30 seconds, and that solved the issue.
Clients are located on the 10.0.0.0/8 network. I block all incoming packets for about 20 minutes, and after that the CPU load drops. If I unblock incoming packets all at once, the load increases again. To prevent this, I am forced to unblock incoming packets gradually, one /16 at a time every 30 seconds.
The issue was triggered when simply restarting one NATS Streaming server of the cluster that probably had between 12 to 15,000 clients.
12,000 - 15,000. Sorry.
12,000 - 15,000. Sorry.
Yes, I understood that, no worries.
To prevent this, I am forced to unblock incoming packets gradually, one /16 at a time every 30 seconds.
So do you get to a point where all clients traffic is resumed and servers behave normally (no high CPU usage on all nodes)?
So do you get to a point where all clients traffic is resumed and servers behave normally (no high CPU usage on all nodes)?
Correct. Right now the cluster is in this state and the unlock script is running on all nodes :)
Correct. Right now the cluster is in this state and the unlock script is running on all nodes :)
Oh so you mean that you are still filtering clients? So if you were to "open the gate" there would still be the issue? And are we sure that the client traffic is the same as before and that the issue really started with a restart of a node, and not simply with some client usage that has suddenly increased?
How many channels do you have, and what is the normal message rate? How many subscriptions exist, and are they mainly 1 to 1 or does 1 message go to many different subscriptions?
Oh so you mean that you are still filtering clients?
The problem has been observed for several months and every time it appears I have to solve it by blocking/unblocking clients.
And are we sure that the client traffic is the same as before and that the issue really started with a restart of a node, and not simply with some client usage that has suddenly increased?
Sometimes the load increases without restarting a node. The cluster works normally, then the load suddenly jumps. The problem is solved by blocking/unblocking clients.
How many channels do you have, and what is the normal message rate?
~42,000 channels.
{ "num_subscriptions": 124494, "num_cache": 10469384, "num_inserts": 117474553, "num_removes": 117350059, "num_matches": 2124441735, "cache_hit_rate": 0.0279807824430638, "max_fanout": 0, "avg_fanout": 0 }
Looks like you may have 1 channel per client since you mentioned about 40,000 clients and 40,000 channels. I see that there is no limit on these channels except for max_age or max_inactivity. Are there cases where those channels are removed due to inactivity, or is it mainly messages being removed after 6 days?
When the event occurs (either on server restart or on its own), have you correlated to an increase in message rate in some channels? Have you noticed some channels unexpectedly filling up with messages maybe? Are you using streaming only or do you have NATS core usage from client applications on this cluster?
I need the help of my teammate. He'll be able to respond in a few hours.
Looks like you may have 1 channel per client since you mentioned about 40,000 clients and 40,000 channels.
We have one channel with 40,000 subscribers and 40,000 1-to-1 channels.
Are there cases where those channels are removed due to inactivity, or is it mainly messages being removed after 6 days?
Messages removed after 6 days.
When the event occurs (either on server restart or on its own), have you correlated to an increase in message rate in some channels?
We need to check it out.
Have you noticed some channels unexpectedly filling up with messages maybe?
Noticed - where?
Are you using streaming only or do you have NATS core usage from client applications on this cluster?
Both of them.
We have one channel with 40,000 subscribers
Ouch, so 1 message will be sent to 40,000 subscribers... this could become a bottleneck if the message rate to this channel is high.
Noticed - where?
Are you using any kind of metric or the monitoring endpoints to see if some of the channels unexpectedly increase in size? In case you don't know, you can use monitoring endpoints to check on some of the streaming resources: https://docs.nats.io/nats-streaming-concepts/monitoring and https://docs.nats.io/nats-streaming-concepts/monitoring/endpoints
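As a side note, a minimal sketch of such a check, assuming Go, a placeholder host, and the HTTP monitor on port 8222 (the field names follow the documented /streaming/serverz response; polling interval is arbitrary):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Polls the streaming monitoring endpoint and prints a few counters.
// Watching the deltas of these counters over time makes a sudden jump in
// channels, subscriptions, or message volume easy to spot.
func main() {
	url := "http://127.0.0.1:8222/streaming/serverz"
	for range time.Tick(10 * time.Second) {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("poll error:", err)
			continue
		}
		var stats map[string]interface{}
		err = json.NewDecoder(resp.Body).Decode(&stats)
		resp.Body.Close()
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("clients=%v subs=%v channels=%v msgs=%v bytes=%v\n",
			stats["clients"], stats["subscriptions"], stats["channels"],
			stats["total_msgs"], stats["total_bytes"])
	}
}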
Both of them.
I have a nagging suspicion that it is either a sudden influx of streaming messages, something NATS related, or some subscriptions being constantly created/destroyed. The reason I say that is that all 3 nodes are affected. That can be the case if streaming messages are added to the system, since they get replicated to all 3 nodes. That being said, it is the same for any other activity: connection create/close, subscription create/close, messages delivered to subscriptions/ack'ed: all of that gets replicated. So an increase in activity would cause each node to have to work to process the replicated information.
If you have any way to measure normal activity in the cluster, you should then see if you notice any increase in activity. You could even have a simple group of NATS (not streaming) consumers on ">" that simply measure the message rate/size and report every second. It does not have to be too sophisticated, just a way to see a jump. Or, when the situation occurs (100% CPU), start a NATS consumer on ">", capture traffic for a few seconds, and try to see what the bulk of the traffic is.
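For illustration, a rough sketch of such a measuring consumer, assuming the nats.go client and a placeholder server URL (this is not part of the original discussion, just one way to implement the idea):

package main

import (
	"fmt"
	"sync/atomic"
	"time"

	"github.com/nats-io/nats.go"
)

// Subscribes to ">" (every subject, including the internal _STAN.> traffic)
// and prints the message and byte rate once per second. The URL is a
// placeholder for one of the cluster nodes.
func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	var msgs, bytes int64
	sub, err := nc.Subscribe(">", func(m *nats.Msg) {
		atomic.AddInt64(&msgs, 1)
		atomic.AddInt64(&bytes, int64(len(m.Data)))
	})
	if err != nil {
		panic(err)
	}
	// Remove the pending limits so this measuring subscription does not
	// itself become a slow consumer under heavy traffic.
	sub.SetPendingLimits(-1, -1)

	for range time.Tick(time.Second) {
		fmt.Printf("%d msgs/s, %d bytes/s\n",
			atomic.SwapInt64(&msgs, 0), atomic.SwapInt64(&bytes, 0))
	}
}

A jump in the reported rate when the CPU spikes would point to a traffic surge rather than something internal to the servers.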
As you can see, I am a bit at a loss to explain what is happening...
Ouch, so 1 message will be sent to 40,000 subscribers... this could become a bottleneck if the message rate to this channel is high.
1-2 messages per month.
Are you using any kind of metric or the monitoring endpoints to see if some of the channels unexpectedly increase in size? In case you don't know, you can use monitoring endpoints to check on some of the streaming resources: https://docs.nats.io/nats-streaming-concepts/monitoring and https://docs.nats.io/nats-streaming-concepts/monitoring/endpoints
We use prometheus-metrics-exporter, but we haven't set up triggers yet.
Two questions:
[ERR] STREAM: Asynchronous error on connection _NSS-stan-eas-general, subject _STAN.sub.stan-eas: nats: slow consumer, messages dropped
What is it? CPU load is 2900/3200.
[Client:replicator_412025] Duplicate durable subscription registration
They appear by the dozens per second. Could this be the cause?
[ERR] STREAM: Asynchronous error on connection _NSS-stan-eas-general, subject _STAN.sub.stan-eas: nats: slow consumer, messages dropped What is it? CPU load is 2900/3200.
The Streaming server is not a server per se (https://docs.nats.io/nats-streaming-concepts/intro and https://docs.nats.io/nats-streaming-concepts/relation-to-nats). It creates NATS connections to the NATS Server and uses subscriptions to receive messages from clients. This particular subscription is for "subscription requests" coming from clients. By default, NATS subscriptions have a limit on pending messages (messages waiting to be dispatched/processed), after which the connection does not enqueue them but drops them and notifies the async error handler. So the error above indicates that the server is receiving way more subscription requests than it can handle, and they are dropped.
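To illustrate that mechanism on a regular client connection (the server's internal NATS connections use the same machinery; the subject and limits below are just examples, not anything from this deployment):

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// The async error handler is invoked when, among other things, messages
	// are dropped because a subscription's pending limits were exceeded
	// (the "slow consumer" case from the log line above).
	nc, err := nats.Connect(nats.DefaultURL,
		nats.ErrorHandler(func(_ *nats.Conn, sub *nats.Subscription, err error) {
			if sub != nil {
				log.Printf("async error on %q: %v", sub.Subject, err)
			} else {
				log.Printf("async error: %v", err)
			}
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	sub, err := nc.Subscribe("requests", func(m *nats.Msg) {
		// A slow callback makes messages pile up in the subscription's
		// pending buffer instead of being processed.
		time.Sleep(time.Second)
	})
	if err != nil {
		log.Fatal(err)
	}
	// Pending limits default to finite values; once either limit is hit,
	// further messages for this subscription are dropped and the handler
	// above fires. The values here are arbitrary examples.
	sub.SetPendingLimits(1000, 8*1024*1024)

	select {} // keep running
}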
[Client:replicator_412025] Duplicate durable subscription registration They appear by the dozens per second. Could this be the cause?
Unless this is a new issue and/or a side effect of something else, it could indeed be the reason for the issue. Could it be that you have an application that tries to create a durable subscription and for some reason loops trying to do so? It could be that initially it got a timeout waiting for the response, but the durable was actually created. In that case, your app may want to check the error returned, and in the case of "Duplicate durable subscription" it would have to close the connection, otherwise any other attempt will fail.
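A hedged sketch of that client-side check, assuming stan.go; the client ID, subject, durable name, and the exact error text are assumptions for illustration (the cluster ID is taken from the logs in this thread):

package main

import (
	"log"
	"strings"

	"github.com/nats-io/stan.go"
)

func main() {
	// Placeholder client ID; cluster ID matches the logs above.
	sc, err := stan.Connect("stan-eas", "replicator-example")
	if err != nil {
		log.Fatal(err)
	}

	_, err = sc.Subscribe("updates",
		func(m *stan.Msg) { /* process message */ },
		stan.DurableName("example-durable"))
	if err != nil {
		// If a previous attempt timed out but the durable was actually
		// created server-side, a retry fails with a "duplicate durable"
		// error. Retrying in a tight loop only floods the server; close
		// the connection instead and let the app reconnect cleanly.
		if strings.Contains(strings.ToLower(err.Error()), "duplicate durable") {
			sc.Close()
			log.Fatalf("durable already registered: %v", err)
		}
		log.Fatal(err)
	}

	select {} // keep consuming
}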
@yadvlz Could it be that issue #1135 is related to this issue? If so, I have submitted a PR to add a new configuration option that would cause the server to close the current durable and accept the new durable subscription. That may help with the issue that you have seen?
@yadvlz Could it be that issue #1135 is related to this issue?
I think it is.
@yadvlz I was actually wondering if the issue was from a colleague of yours. If not, there have been 2 issues reported in the streaming server repo and 2 in the stan.go repo for the same "duplicate durable" issue in a short amount of time, so this is why I was wondering if they come from the same deployment or are actually different users. Nevertheless, I believe that PR #1136 will help.
I am closing this since PR #1136 has been merged. Thank you for your patience!
I deployed a new version of nats-streaming-server (with "replace_durable: true"), but my problem still exists. Log from the leader at the time of restarting the follower:
[32749] 2021/01/08 12:33:27.428845 [ERR] STREAM: [Client:replicator_659469] Timed out on heartbeats
[32749] 2021/01/08 12:33:28.690700 [ERR] STREAM: [Client:replicator_606369] Timed out on heartbeats
[32749] 2021/01/08 12:33:38.565970 [ERR] STREAM: [Client:replicator_693023] Timed out on heartbeats
[32749] 2021/01/08 12:33:52.827200 [ERR] STREAM: [Client:replicator_157785] Timed out on heartbeats
[32749] 2021/01/08 12:33:54.833447 [INF] 10.193.68.251:38968 - rid:10961 - Router connection closed: Read Error
[32749] 2021/01/08 12:33:55.699539 [ERR] STREAM: [Client:replicator_462420] Timed out on heartbeats
[32749] 2021/01/08 12:33:55.887039 [INF] STREAM: server lost leadership, performing leader stepdown actions
[32749] 2021/01/08 12:33:55.908709 [INF] 10.193.68.251:40086 - rid:97807 - Route connection created
[32749] 2021/01/08 12:33:57.227611 [WRN] 10.112.24.9:63226 - cid:97429 - Readloop processing time: 2.263951272s
[32749] 2021/01/08 12:33:57.230102 [WRN] 10.92.47.72:57609 - cid:97354 - Readloop processing time: 2.017834321s
[32749] 2021/01/08 12:33:57.262937 [INF] STREAM: finished leader stepdown actions
[32749] 2021/01/08 12:33:57.276524 [WRN] 10.87.91.162:64614 - cid:97364 - Readloop processing time: 2.246782296s
[32749] 2021/01/08 12:33:57.334539 [WRN] 10.100.141.210:57416 - cid:97291 - Readloop processing time: 2.106756313s
[32749] 2021/01/08 12:33:57.340296 [WRN] 10.106.78.130:64324 - cid:97318 - Readloop processing time: 2.310444523s
[32749] 2021/01/08 12:33:57.586867 [WRN] 10.109.121.2:51399 - cid:97379 - Readloop processing time: 2.623306455s
[32749] 2021/01/08 12:33:57.589297 [WRN] 10.113.77.67:63701 - cid:97433 - Readloop processing time: 2.60364601s
[32749] 2021/01/08 12:33:57.589529 [WRN] 10.108.213.197:52596 - cid:97422 - Readloop processing time: 2.625908581s
[32749] 2021/01/08 12:33:57.589593 [WRN] 10.92.108.200:61466 - cid:97273 - Readloop processing time: 2.615766941s
[32749] 2021/01/08 12:33:57.589661 [WRN] 10.92.84.194:55216 - cid:97287 - Readloop processing time: 2.669059357s
[32749] 2021/01/08 12:33:57.589721 [WRN] 10.24.84.2:60485 - cid:97323 - Readloop processing time: 2.647962912s
[32749] 2021/01/08 12:33:57.589780 [WRN] 10.93.96.146:58826 - cid:97378 - Readloop processing time: 2.669010652s
I am re-opening this issue. Seeing the timed out errors and the route disconnect (and leader step down), I wonder if you did not also have a network issue there. But otherwise, we are now back to the investigation I suggested in earlier comments.
[42312] 2021/01/11 11:55:42.398945 [FTL] STREAM: Failed to start: nats: no servers available for connection
[42349] 2021/01/11 11:55:42.711058 [INF] STREAM: Starting nats-streaming-server[stan-eas] version 0.19.0
[42349] 2021/01/11 11:55:42.711155 [INF] STREAM: ServerID: q1zdQ4ez66kGwQEFWz5stM
[42349] 2021/01/11 11:55:42.711163 [INF] STREAM: Go version: go1.15.6
[42349] 2021/01/11 11:55:42.711169 [INF] STREAM: Git commit: [not set]
[42349] 2021/01/11 11:55:42.712526 [INF] Starting nats-server version 2.1.9
[42349] 2021/01/11 11:55:42.712550 [INF] Git commit [not set]
[42349] 2021/01/11 11:55:42.712790 [INF] Starting http monitor on 0.0.0.0:8222
[42349] 2021/01/11 11:55:42.714242 [INF] Listening for client connections on 0.0.0.0:4222
[42349] 2021/01/11 11:55:42.714262 [INF] Server id is ND4V3Q4K6EOQX62KBQWBAPD4GUFBQGMC3X4PAISUG5XSHAK6JEMKK6TF
[42349] 2021/01/11 11:55:42.714268 [INF] Server is ready
[42349] 2021/01/11 11:55:42.714798 [INF] Listening for route connections on 0.0.0.0:6222
[42349] 2021/01/11 11:55:42.716096 [INF] 10.193.68.252:6222 - rid:1 - Route connection created
[42349] 2021/01/11 11:55:42.716140 [INF] 10.193.68.253:6222 - rid:2 - Route connection created
[42349] 2021/01/11 11:55:45.010386 [WRN] 10.77.28.252:63791 - cid:105 - Readloop processing time: 2.048394269s
[42349] 2021/01/11 11:55:45.021490 [WRN] 10.82.77.136:61951 - cid:64 - Readloop processing time: 2.014462329s
[42349] 2021/01/11 11:55:45.041510 [WRN] 10.80.91.131:53914 - cid:65 - Readloop processing time: 2.149629235s
[42349] 2021/01/11 11:55:45.049700 [WRN] 10.105.127.98:63396 - cid:120 - Readloop processing time: 2.087822186s
[42349] 2021/01/11 11:55:45.056968 [WRN] 10.41.89.2:63512 - cid:114 - Readloop processing time: 2.073160631s
[42349] 2021/01/11 11:55:45.060889 [FTL] STREAM: Failed to start: nats: no servers available for connection
[42385] 2021/01/11 11:55:45.210725 [INF] STREAM: Starting nats-streaming-server[stan-eas] version 0.19.0
[42385] 2021/01/11 11:55:45.210821 [INF] STREAM: ServerID: olFM1hi2L8w7kgv5B0pb6O
[42385] 2021/01/11 11:55:45.210832 [INF] STREAM: Go version: go1.15.6
[42385] 2021/01/11 11:55:45.210839 [INF] STREAM: Git commit: [not set]
[42385] 2021/01/11 11:55:45.212160 [INF] Starting nats-server version 2.1.9
[42385] 2021/01/11 11:55:45.212184 [INF] Git commit [not set]
[42385] 2021/01/11 11:55:45.212469 [INF] Starting http monitor on 0.0.0.0:8222
[42385] 2021/01/11 11:55:45.213904 [INF] Listening for client connections on 0.0.0.0:4222
[42385] 2021/01/11 11:55:45.213925 [INF] Server id is NAO7GBYFGAJ6OFZMRRRQ3I4TS6SIC2YD344ERBPRZNLTEUBBOTIOOR75
[42385] 2021/01/11 11:55:45.213930 [INF] Server is ready
These messages appear cyclically on the node after restart.
If I block clients, the log looks like this:
[42965] 2021/01/11 11:57:19.464101 [INF] Starting nats-server version 2.1.9
[42965] 2021/01/11 11:57:19.464116 [INF] Git commit [not set]
[42965] 2021/01/11 11:57:19.464408 [INF] Starting http monitor on 0.0.0.0:8222
[42965] 2021/01/11 11:57:19.467430 [INF] Listening for client connections on 0.0.0.0:4222
[42965] 2021/01/11 11:57:19.467444 [INF] Server id is NCYRM2UX76PYBASHN3PO4IN7EA6RBMAW62EW7GDL7B7I42RYZKPBA3V4
[42965] 2021/01/11 11:57:19.467448 [INF] Server is ready
[42965] 2021/01/11 11:57:19.468038 [INF] Listening for route connections on 0.0.0.0:6222
[42965] 2021/01/11 11:57:19.469184 [INF] 10.193.68.252:6222 - rid:1 - Route connection created
[42965] 2021/01/11 11:57:19.469228 [INF] 10.193.68.253:6222 - rid:2 - Route connection created
[42965] 2021/01/11 11:57:20.535651 [INF] STREAM: Recovering the state...
[42965] 2021/01/11 12:00:34.815674 [INF] STREAM: Recovered 42537 channel(s)
[42965] 2021/01/11 12:00:34.923018 [INF] STREAM: Cluster Node ID : a
[42965] 2021/01/11 12:00:34.923112 [INF] STREAM: Cluster Log Path: /storage/nats-streaming-server/nsslog
[42965] 2021/01/11 12:00:34.923121 [INF] STREAM: Cluster known peers:
[42965] 2021/01/11 12:00:34.923132 [INF] STREAM: peer 1: "b"
[42965] 2021/01/11 12:00:34.923138 [INF] STREAM: peer 2: "c"
[42965] 2021/01/11 12:00:35.166678 [INF] STREAM: restoring from snapshot
It looks to me that the server - as soon as routes are created - is overwhelmed with traffic to the point that the internal connection requests to the NATS server fail! Now the fact that the server is embedding its own NATS server makes it even more worrisome, because this is just loopback. The default connect timeout is 2 seconds, so, seeing those warnings, that may explain why it fails. If you can recompile the server, you could try bumping the connect timeout (this is not currently configurable through options). This would be here (based on the main branch): https://github.com/nats-io/nats-streaming-server/blob/814b51a89ccac0107d4e323c6b8270bbe1d23ccd/server/server.go#L1581 and you could add:
ncOpts.Timeout = 10 * time.Second
before the Connect() call. But to be honest, I fear that this is not going to help much. I am thinking that the volume is too much for a single server to handle.
I am thinking that the volume is too much for a single server to handle.
So how can I deal with it?
So how can I deal with it?
I described some of the ways we could try to debug this: by capturing some of the traffic and seeing if this traffic is "expected". From your earlier description, it sounded to me like there are times where all is well, but then suddenly an event (server restart or other) causes this issue, and the only way to recover is to basically stop client traffic for a while.
If we determine that the traffic is normal, just higher volume from time to time, then maybe you need to have several clusters and separate connections/traffic. If the traffic shows an abnormal pattern (it could have been a constant attempt to create durable subscriptions, as we suspected at one point), then it could help us fix the root cause of this traffic.
ncOpts.Timeout = 10 * time.Second
Tried this, didn't help.
If we determine that the traffic is normal, just higher volume from time to time, then maybe you need to have several clusters and separate connections/traffic.
I think the problem occurs in cases where multiple clients are trying to connect to the server (establish new connections) at the same time.
I think the problem occurs in cases where multiple clients are trying to connect to the server (establish new connections) at the same time.
If this is following a disconnect/reconnect, and depending on which library (and version) you use, we have some ways to add jitter to the reconnect logic. This may help prevent too many connections from reconnecting at the same time. https://github.com/nats-io/nats.go/blob/acca6d69b4e3ee1abef2b96cae85a3878aa2a520/nats.go#L790 or https://github.com/nats-io/nats.go/blob/acca6d69b4e3ee1abef2b96cae85a3878aa2a520/nats.go#L800 to implement your own back-off computation.
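A minimal sketch of those two nats.go options (the URL and all durations are just example values, not recommendations for this deployment):

package main

import (
	"log"
	"math/rand"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222",
		// Add up to 2s (5s over TLS) of random jitter to each reconnect
		// attempt so thousands of clients do not reconnect in lockstep.
		nats.ReconnectJitter(2*time.Second, 5*time.Second),
		// Alternatively, take full control of the back-off between attempts.
		nats.CustomReconnectDelay(func(attempts int) time.Duration {
			d := time.Duration(attempts) * 500 * time.Millisecond
			if d > 30*time.Second {
				d = 30 * time.Second
			}
			return d + time.Duration(rand.Int63n(int64(time.Second)))
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
}

A streaming client can then reuse such a connection via stan.Connect(clusterID, clientID, stan.NatsConn(nc)), so the jittered reconnect behavior also applies to the streaming traffic.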
Our cluster consists of three nodes:
Configuration file:
Number of clients: ~40,000. Average CPU consumption: 400/3200 (from "top"). If I restart one of the nodes, the utilization on all nodes jumps to 3200.
Log: