nats-io / stan.go

NATS Streaming System
https://nats.io
Apache License 2.0

Client fails to connect to NATS server with error EOF #327

Closed codifierr closed 3 years ago

codifierr commented 4 years ago

In our stage and prod environments, we are seeing errors while publishing data to NATS: clients report the error EOF and fail to connect to NATS. Sometimes clients recover from this state automatically, typically after 4-12 hours. For now, the only way to recover is to restart the clients and the NATS servers. The function below returns EOF when we call stan.Connect, to which we pass the server URL, a new UUID (as the client ID), and "test-cluster" as the cluster ID.

// Connect will form a connection to the NATS Streaming subsystem.
// Note that clientID can contain only alphanumeric and - or _ characters.
func Connect(stanClusterID, clientID string, options ...Option) (Conn, error)

When does the client return an EOF error? Can somebody shed some light on this?

Library version: 0.6.0. NSS version: 0.17.0.

kozlovic commented 4 years ago

EOF means that the TCP connection was closed before the client was able to read everything from the socket. It could be that the server rejected the connection, or the TCP connection gets severed by some intermediate elements.

Are you using TLS for the NATS connection? Any log in the NATS and/or Streaming server logs?

codifierr commented 4 years ago

Thanks for replying, @kozlovic. I am not using TLS to connect to NATS, and the NSS server does not show any logs. The only logs we get are on the clients that are trying to push data to NATS. Is there anything else we can do to debug this further?

kozlovic commented 4 years ago

Is there anything between your client and the NATS Server? It could be that something is dropping the connection. Are you using a dedicated NATS Server (or cluster) or just the NATS Streaming server? Have you enabled debug in the server to see if at least the connection is accepted? Are you using any authentication? Could it be a timeout during authentication?

codifierr commented 4 years ago

@kozlovic Our deployment runs on k8s and we are using the NATS Streaming server. We do have Istio enabled on these clusters to monitor traffic, which was done recently. I will run NATS in debug mode and update here if I find anything more. We are not using any authentication, so we can rule out a timeout during authentication.

kozlovic commented 4 years ago

> we do have istio enabled on these clusters to monitor traffic which was done recently.

Likely to be the culprit. For instance, if you are using TLS, the NATS Server sends the first INFO message as plain text and then the client upgrades to TLS. Lots of proxies and similar components will cause failures because they expect TLS right away. If the traffic is randomly routed, that may cause failures too.

codifierr commented 4 years ago

@kozlovic We have one more observation. When clients lose their connection to NATS, they keep retrying indefinitely until they get a connection, but after the EOF error they fail to connect for a very long time. During the same period, memory usage on the NATS server keeps increasing until it hits the server's limit. It looks like the connection requests are reaching the server, which is why we see high CPU and memory usage there. The client failed to connect about 730129 times during this period. Attaching a Grafana screenshot.

[Screenshot: 2020-08-25 at 9:15:58 PM]

We have enabled debug mode on the NATS server and will update.

kozlovic commented 4 years ago

So it means that the connection request reaches the server. If the connection goes away without a proper close protocol, then the server will hold the connection until it times it out, which, by default, can take several minutes. This can be configured with hb_interval, hb_timeout and hb_fail_count. Even if the library gets disconnected and tries again, it should then be rejected by the server because it would be a duplicate client ID. So it would not store a new connection. What is the graph output: the NATS Streaming process memory or memory used on that machine?

Again, I would ask that you run NATS components without Istio in between. As you said, it started when you started to deploy it and I believe this is the reason for the EOFs.

codifierr commented 4 years ago

@kozlovic

> What is the graph output: the NATS Streaming process memory or memory used on that machine?

This graph comes from the NATS Grafana dashboard, and I am using nats-prometheus-exporter to export the NATS metrics. If I am not wrong, it is the NATS Streaming process memory as reported by NATS. I will disable Istio between NSS and the clients and try again. But if Istio were causing this, shouldn't it fail every time, so that we could never connect even once?

codifierr commented 3 years ago

After monitoring for more than 2 weeks since disabling Istio between the clients and NATS, we can assume the problem was caused by Istio in this case. Thank you very much, @kozlovic, for your help in fixing this problem. Closing this issue.