jason-magnetic-io opened this issue 3 years ago
This is a larger log snippet, including info about which cluster node was the source of each line: https://gist.github.com/jason-magnetic-io/096fbca1d797efc7e3680b0d27f76e98
Timestamp 2021/01/23 20:22:47 is when we began to manually restart the nodes.
@jason-magnetic-io Thank you for using NATS Streaming and sorry about your latest troubles.
Just to be clear: is the issue solely about the "timed out due to heartbeats" errors, or are you also saying that new connections cannot be made?
You mentioned that you do not notice any CPU usage increase that would explain this. Have you seen an increase in the number of clients connecting? Could there be a ulimit on the system that causes new connections to fail?
The "timed out on heartbeats" could be the result of applications being stopped without them having a chance to call the connection "Close()" API. As a refresher: the NATS Streaming server is not a server, but a client to core NATS server. Therefore, clients are not directly TCP connecting to the streaming "server". The streaming server knows that client are "connected" based on heartbeats, and so having client send a close protocol for the server to remove the client is required. Without the close, the server relies on heartbeats and after a configurable number of missed responses, it will consider the client connection lost. See: https://docs.nats.io/nats-streaming-concepts/intro and https://docs.nats.io/nats-streaming-concepts/relation-to-nats for a bit more details.
If new clients cannot connect at that same time, it could really be some networking issue, which would explain both types of failures.
Since I see that you are running the server with the monitoring port enabled, one thing we could try, when the server is in this situation of failing clients due to HB timeouts and rejecting new clients, is to capture the stacks of the leader process by hitting the monitoring endpoint: http://<leader host>:8222/stacksz
Note that this will dump the stacks of all the go routines without stopping the server. However, this is more of a debug practice and there is always a risk that it causes the server to fail (say because the result is huge and this causes it to allocate too much memory, etc.). But since your normal course of action in that situation is to restart the server (and clean the file store), I think it is worth the risk.
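For what it's worth, a plain HTTP GET is all that is needed. The sketch below (hypothetical leader host name) grabs the dump a few times in a row so the captures can be compared for goroutines stuck in the same place.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	const url = "http://stan-leader:8222/stacksz" // placeholder host

	// Take three captures a few seconds apart and write each to a file.
	for i := 1; i <= 3; i++ {
		resp, err := http.Get(url)
		if err != nil {
			log.Fatalf("capture %d failed: %v", i, err)
		}
		f, err := os.Create(fmt.Sprintf("stacks-%d.txt", i))
		if err != nil {
			log.Fatal(err)
		}
		if _, err := io.Copy(f, resp.Body); err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()
		f.Close()
		time.Sleep(5 * time.Second)
	}
}
```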
Thanks for the http://<leader host>:8222/stacksz debugging tip; that is helpful to know.
To clarify the language, here is the situation at the time of the issue:
We ruled out it being a network issue.
The 6 clients within the same Kubernetes cluster were unaffected.
Connections from within the same data centre and from a remote data centre were able to connect and authorise with NATS but could not connect with the NATS streaming cluster despite the NATS and NATS streaming Kubernetes Pods sharing the same VMs.
Regarding CPU and memory, the CPU load for each NATS streaming instance was between 0.1 and 0.2 vCPU and the memory usage was flat at 28MB per instance.
We have experienced issues in the past with connections not being cleaned up. All the clients take steps to actively disconnect, including intercepting the container shutdown hooks.
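For illustration, here is a minimal sketch of that kind of shutdown hook, assuming a Go client using stan.go; the cluster and client IDs are placeholders. The point is simply that Close() runs before the process exits, so the server does not have to wait for heartbeat failures.

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"

	"github.com/nats-io/stan.go"
)

func main() {
	sc, err := stan.Connect("test-cluster", "example-client") // placeholder IDs
	if err != nil {
		log.Fatalf("stan connect failed: %v", err)
	}

	// Block until the container is asked to stop (SIGTERM on pod shutdown),
	// then send the close protocol so the server drops this client right away.
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
	<-sigCh

	if err := sc.Close(); err != nil {
		log.Printf("error closing streaming connection: %v", err)
	}
}
```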
We have experience with what happens when the channel, subscription, message and memory limits are exceeded; that was not the behaviour we observed in this case.
> - 6 clients connected to NATS streaming without any issue. These were long-running connections established days earlier.
> - 3 clients connected but were timing out: these clients were able to connect long enough to subscribe to channels but started reporting that the connection was lost within 1-2 minutes of subscribing. These correspond to the timeouts seen in the stan log.
Were those 2 groups of clients located in the same "region" or connecting to the same server?
> - When we were actively debugging, 3 out of 4 connections failed to connect to NATS streaming despite being able to connect to NATS. We ruled out it being a network issue.
Well, it depends. Say you have a NATS core cluster of 3 servers called N1, N2 and N3. Now say that you have a streaming cluster consisting of S1, S2 and S3, which connect to that NATS cluster. They may have connected/reconnected to any NATS server in the cluster; that is, it is not guaranteed that S1 is connected to N1, etc.
When a client connects, it connects to NATS, so it is possible that, say, it connects to N3 while the streaming server leader is S1, which for instance is connected to N1. Your client will connect/authorize fine when connecting to N3, but the streaming connection request has to reach S1 (and be replicated to S2 and S3). So it is not out of the question that you still have network issues even if a client can TCP-connect successfully to its closest NATS core server.
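To make that concrete, here is a minimal sketch assuming the Go stan.go client and hypothetical host names n1-n3: the client may TCP-connect to any core NATS server, while the ConnectWait option bounds how long it waits for the streaming leader to answer the connect request.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/stan.go"
)

func main() {
	sc, err := stan.Connect(
		"test-cluster",   // cluster ID (placeholder)
		"example-client", // client ID (placeholder)
		// Any of the core NATS servers will do for the TCP connection.
		stan.NatsURL("nats://n1:4222,nats://n2:4222,nats://n3:4222"),
		// How long to wait for the streaming leader to answer the connect
		// request; if the leader never responds, this is where it times out.
		stan.ConnectWait(10*time.Second),
	)
	if err != nil {
		log.Fatalf("streaming connect failed (NATS dial or leader response): %v", err)
	}
	defer sc.Close()
}
```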
> Connections from within the same data centre and from a remote data centre were able to connect and authorise with NATS but could not connect with the NATS streaming cluster despite the NATS and NATS streaming Kubernetes Pods sharing the same VMs.
See above. I am not saying that there is a network issue, but we can't rule that out.
But it could be that somehow the server is locking up or is deadlocked, though then I would expect leadership to be lost, etc. So next time the problem occurs, try capturing the /stacksz output, say 2 or 3 times a few seconds apart (to see if something is deadlocked).
Thanks!
First, I want to say that we've been using NATS streaming for nearly 2 years without issues.
Last week we observed timeouts when new clients tried to connect to stan. Approximately 3 out of 4 connection attempts were unsuccessful.
The existing clients remained connected and were unaffected.
The only way we could resolve the issue was by restarting the cluster with empty file stores and reconnecting all the clients.
Versions: nats-streaming-server v0.19.0; nats-server v2.1.9; nats-account-server v0.8.4
The cluster is running on Kubernetes on Google Cloud GKE using this configuration:
There were no unusual spikes in memory or CPU usage.
These are the logs for stan-0, stan-1 and stan-2 building up to the issue: