nats-io / stan.go

NATS Streaming System
https://nats.io
Apache License 2.0

Connection lost when leader is re-elected #313

Closed sm4ll-3gg closed 4 years ago

sm4ll-3gg commented 4 years ago

Hi! A few days ago my service started failing its health checks (it panicked with a nil pointer dereference on conn.NatsConn().Status() because the underlying NATS connection was nil). In the logs we saw this message:

... stan: connection lost due to PING failure

We started looking for the root cause and saw these messages in the stan logs at the same time:

[1] 2020/05/27 18:02:03.247410 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2020/05/27 18:02:03.251397 [INF] STREAM: finished leader promotion actions

I don't completely understand why the leader re-election happened or why it caused problems with pings. My guess is that clients can only perform actions against the leader, so they have to reconnect to the new leader after a re-election, but I didn't find anything in the documentation confirming that.

Could you please help me with this case?

kozlovic commented 4 years ago

What are your client ping settings? Are you using the defaults, or did you override them? The default is a ping every 5 seconds, with a maximum of 3 pings out. Once the client reports the connection as lost, the streaming connection (and the NATS connection, if it is owned by the streaming connection) will be closed. The user is responsible for recreating it (along with subscriptions, if applicable).

The streaming server leader can change role; this is out of our control and depends on what RAFT decides. Leadership normally does not change, but missed or delayed RAFT heartbeats can lead to that. If the servers are overloaded, that can also cause a re-election, since those RAFT heartbeats may not be processed in time.

I would look at the trace in this log, or in the other servers' logs, to see when the previous server lost leadership and how long it took for a new leader to be elected. It may be that delay that caused the application to consider the connection closed (due to the stan.Pings() settings).

sm4ll-3gg commented 4 years ago

Thank you for your reply! We're using default client ping settings.

Unfortunately, we don't have these logs anymore 😞

Could you please clarify what happens when a leader is re-elected? Do established connections keep working with the ex-leader, do they transparently reconnect to the new leader, or is the user responsible for reconnecting to the new leader?
If there is documentation describing this case, I would like to read it.

kozlovic commented 4 years ago

Only the leader responds to the client PING messages. So if leadership is lost, the client will not get any ping response until a new leader is elected. If the new leader is elected before the client has sent the number of pings it uses to decide that the connection is lost, you have nothing to do. Once the connection lost handler is invoked, it is the user's responsibility to recreate the connection and its subscriptions, if applicable. Here is some background info on the connection status: https://github.com/nats-io/stan.go#connection-status

Note that all of this happens at a higher level than the low-level NATS connection. That is, a client could be connected to a server in the cluster and never lose its TCP connection, and the STAN connection could still be lost because there was no communication between the streaming client and the streaming server leader (think of it as a session, if that is less confusing).

sm4ll-3gg commented 4 years ago

Oh, it's clear now! Thank you very much!