Closed sm4ll-3gg closed 4 years ago
What are your client ping settings? Are you using the defaults or did you override them? The default is a ping every 5 seconds, with a maximum of 3 pings left without a response. Once the client reports the connection as closed, the streaming connection (and the NATS connection, if it is owned by the streaming connection) will be closed. The user is responsible for recreating it (along with subscriptions if applicable).
The streaming server leader can change role; this is out of our control and is based on what RAFT decides. It normally does not change, but missed or delayed RAFT heartbeats can lead to that. If the servers are overloaded, that could also cause a re-election, since those RAFT heartbeats may not be processed in time.
I would look at the trace in this log or others to see when the previous server lost leadership and how long it took for a new leader to be elected. It may be that this delay caused the application to consider the connection closed (due to the stan.Pings() settings).
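To make the relationship between the ping settings and the election delay concrete, here is a rough back-of-the-envelope sketch. It assumes the client tolerates about one ping interval per allowed unanswered ping; the helper names are hypothetical, and the exact timer behavior inside the client library may differ by an interval or so.

```go
package main

import "fmt"

// lostAfterSeconds approximates how long the leader may stay silent before
// the client declares the streaming connection lost. This is an estimate for
// reasoning about settings, not the library's exact timer logic.
func lostAfterSeconds(intervalSec, maxOut int) int {
	return intervalSec * maxOut
}

// survivesElection reports whether a leaderless gap of gapSec seconds fits
// inside the approximate detection window.
func survivesElection(gapSec, intervalSec, maxOut int) bool {
	return gapSec < lostAfterSeconds(intervalSec, maxOut)
}

func main() {
	// Defaults discussed above: ping every 5 seconds, at most 3 unanswered.
	fmt.Println(lostAfterSeconds(5, 3))     // 15
	fmt.Println(survivesElection(10, 5, 3)) // true
	fmt.Println(survivesElection(20, 5, 3)) // false

	// A wider window, e.g. 12 unanswered pings at the same interval.
	fmt.Println(lostAfterSeconds(5, 12)) // 60
}
```

If elections in your cluster regularly take longer than the window, passing a larger value through the stan.Pings(interval, maxOut) option at connect time is one way to trade slower failure detection for fewer spurious disconnects.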
Thank you for your reply! We're using default client ping settings.
Unfortunately, we don't have these logs anymore 😞
Could you please clarify what happens when a leader is re-elected? Do established connections keep working with the ex-leader, do they transparently reconnect to the new leader, or is the user responsible for reconnecting to the new leader?
If there is documentation describing this case, I would like to read it.
Only the leader responds to the client PING messages. So if leadership is lost, the client will not get any ping response until a new leader is elected. If the election completes within the number of pings the client sends before deciding that the connection is lost, you have nothing to do. Once the connection lost handler is invoked, it is the user's responsibility to recreate the connection and its subscriptions, if applicable. Here is some background info on the connection status: https://github.com/nats-io/stan.go#connection-status
Note that all of this happens at a higher level than the low-level NATS connection. That is, a client could be connected to a server in the cluster and never lose its TCP connection, and you could still have the STAN connection lost because there was no communication between the streaming client and the streaming server leader. (Think of it as a session, if that is less confusing.)
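As an illustration of "recreate it yourself", here is a minimal sketch of a retry loop that an application could trigger from the callback registered with stan.SetConnectionLostHandler. The code uses only the standard library: connectFunc and flakyConnect are hypothetical stand-ins, and in a real service the connect function would call stan.Connect and re-create the subscriptions.

```go
package main

import (
	"errors"
	"fmt"
)

// connectFunc is a hypothetical stand-in for code that would open a new
// streaming connection and re-create its subscriptions.
type connectFunc func() error

// reconnect calls connect until it succeeds or maxAttempts is reached.
// It returns the number of attempts made and the last error, if any.
func reconnect(connect connectFunc, maxAttempts int) (int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = connect(); lastErr == nil {
			return attempt, nil
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

// flakyConnect simulates a cluster that has no leader for the first
// `failures` connection attempts and then accepts the connection.
func flakyConnect(failures int) connectFunc {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("no leader yet")
		}
		return nil
	}
}

func main() {
	attempts, err := reconnect(flakyConnect(2), 5)
	fmt.Println(attempts, err) // 3 <nil>

	_, err = reconnect(flakyConnect(10), 3)
	fmt.Println(err) // giving up after 3 attempts: no leader yet
}
```

A production version would normally sleep between attempts (ideally with backoff) and run outside the handler's goroutine, so a long election does not turn into a tight retry loop.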
Oh, it's clear now! Thank you very much!
Hi! A few days ago my service started failing health checks (it panicked on a nil pointer dereference in conn.NatsConn().Status() because the underlying NATS connection was nil). In the logs we saw this message:
We started researching the root cause and saw this message in the stan logs at the same time:
I don't completely understand why the leader re-election happened and why it caused problems with pings. My guess is that clients can only perform actions with the leader, so they should reconnect to the leader after a re-election, but I didn't find any confirmation of that in the documentation.
Could you please help me with this case?
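For reference, the health-check panic described above can be avoided with a nil guard before calling Status(). The sketch below is self-contained and uses hypothetical stand-in types (streamConn, natsConn) that only mimic the shape of conn.NatsConn().Status(); it is not the real client API.

```go
package main

import "fmt"

// natsConn is a hypothetical stand-in for the low-level NATS connection.
type natsConn struct{ connected bool }

// Status reports a simplified connection state.
func (c *natsConn) Status() string {
	if c.connected {
		return "CONNECTED"
	}
	return "DISCONNECTED"
}

// streamConn is a hypothetical stand-in for the streaming connection.
// Its NatsConn method returns nil once the streaming connection is closed.
type streamConn struct{ nc *natsConn }

func (s *streamConn) NatsConn() *natsConn { return s.nc }

// healthy fetches the underlying connection once and checks it for nil
// before using it, so a closed streaming connection reports "unhealthy"
// instead of panicking.
func healthy(sc *streamConn) bool {
	if sc == nil {
		return false
	}
	nc := sc.NatsConn()
	if nc == nil {
		return false
	}
	return nc.Status() == "CONNECTED"
}

func main() {
	fmt.Println(healthy(nil))                                       // false
	fmt.Println(healthy(&streamConn{}))                             // false
	fmt.Println(healthy(&streamConn{nc: &natsConn{connected: true}})) // true
}
```

Storing the result of NatsConn() in a local variable matters here: calling it twice (once for the nil check, once for Status()) would leave a window in which the connection could be closed between the two calls.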