Are you using the default server-to-client heartbeats? If so, it will take time for the server to detect the failure. You seem to have configured Pings on the client side; what values have you set? It sounds like the client detects the failure quickly, but the server may still be configured with a much larger detection window.
On the server side, from the docs (https://docs.nats.io/nats-streaming-server/configuring/cfgfile):

- `hb_interval`: Interval at which the server sends a heartbeat to a client. Example: `hb_interval: "10s"`. Default: `30s`.
- `hb_timeout`: How long the server waits for a heartbeat response from the client before considering it a failed heartbeat. Example: `hb_timeout: "10s"`. Default: `10s`.
- `hb_fail_count`: Count of failed heartbeats before the server closes the client connection. The actual total wait is: `(fail count + 1) * (hb interval + hb timeout)`. Example: `hb_fail_count: 2`. Default: `10`.
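With the defaults above, the server's total wait before dropping a silent client works out to (10 + 1) * (30s + 10s) = 440 seconds, i.e. a bit over 7 minutes.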
We are using the default server-to-client heartbeat values. On the client side, we are calling

stan.Pings(3, 3)

So I guess this is relatively fast to fail? Should the server-to-client and client-to-server ping/heartbeat values align?
Your client settings are way too low. They will cause your client to abandon a Streaming connection after about 10 seconds of interruption with the server, which means that any leader election that takes a bit too long, a store recovery when running in FT mode, or a network glitch will cause your application to "fail", especially since you mentioned: "Client ConnectLost handler is called but we don't do much within it at all". Once this callback is invoked, your streaming connection is closed and you won't be able to send or receive any new messages.
I would recommend that you bump those numbers, but you can also reduce the ones on the server if you feel the defaults are too high. But yes, both should be aligned; this is why we changed the defaults in the client library to align with the server.
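For illustration, here is a minimal Go sketch of a client connection that passes the library's default ping values explicitly; those defaults were chosen to match the server's default heartbeat window. The cluster ID, client ID, and handler body are placeholders:

```go
package main

import (
	"log"

	stan "github.com/nats-io/stan.go"
)

func main() {
	// Explicitly pass the stan.go defaults: ping every 5 seconds, give up
	// after 88 missed pings, i.e. ~440s, which matches the server's default
	// window of (10 + 1) * (30s + 10s) = 440s.
	sc, err := stan.Connect(
		"test-cluster", // placeholder cluster ID
		"my-client",    // placeholder client ID
		stan.Pings(stan.DefaultPingInterval, stan.DefaultPingMaxOut),
		stan.SetConnectionLostHandler(func(_ stan.Conn, err error) {
			// Once this fires, the Streaming connection is closed for good
			// and a new one must be created.
			log.Printf("streaming connection lost: %v", err)
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()
}
```

If you tighten the server's heartbeat settings, lower the client ping values accordingly so that both sides detect a failure on a similar timescale.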
Hi, I have read
https://github.com/nats-io/stan.go/issues/333
but we too are seeing this error and do not know how to proceed. Here is the scenario:
1. Kubernetes cluster environment.
2. Go clients use https://github.com/nats-io/stan.go/blob/main/stan.go.
3. Clients have configured the stan connection with Pings.
4. Due to a weave pod update, TCP connectivity dropped for many seconds.
5. The ping timeout occurs, so https://github.com/nats-io/stan.go/blob/main/stan.go#L589 is called.
6. The client's ConnectionLost handler is called, but we don't do much within it at all.
7. For whatever reason, the pings/heartbeats between NATS Streaming and NATS did not fail.
8. The client attempts to make a new stan connection but receives "clientID already registered", since by now TCP connectivity is re-established.

No NATS, NATS Streaming, or client pods restarted at any point in this sequence. As a result, NATS Streaming still holds the "old" stan client ID and views the "old" client as still healthy.
Apart from creating a new connection with a new client ID (not preferred), how can a client get NATS Streaming to drop its stale stan connection?
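For reference, a minimal sketch of the retry approach we could fall back on, assuming the server eventually expires the stale registration through its own hb_interval / hb_timeout / hb_fail_count checks (the IDs, retry count, and delay below are illustrative, not a confirmed workaround):

```go
package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

// reconnect retries stan.Connect with the same client ID until the server
// has expired the stale registration and the connect succeeds. This is a
// sketch, not an official workaround; retry count and delay are arbitrary.
func reconnect(clusterID, clientID string) (stan.Conn, error) {
	var lastErr error
	for attempt := 1; attempt <= 30; attempt++ {
		sc, err := stan.Connect(clusterID, clientID)
		if err == nil {
			return sc, nil
		}
		// "clientID already registered" means the server still believes
		// the previous connection is healthy; wait and try again.
		lastErr = err
		log.Printf("reconnect attempt %d failed: %v", attempt, err)
		time.Sleep(10 * time.Second)
	}
	return nil, lastErr
}

func main() {
	sc, err := reconnect("test-cluster", "my-client") // placeholder IDs
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()
}
```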