splitio / split-synchronizer

Golang agent for Split SDKs

Net::ReadTimeout with #<TCPSocket:(closed)> Errors #221

Closed roee-landesman closed 1 year ago

roee-landesman commented 1 year ago

Hi,

We are running the split proxy in our K8s clusters and are seeing sporadic timeouts between the Ruby SDK and the proxy. These occur at the 5s mark, which I see is the default http_timeout_config. Has anyone run into this issue before and been able to solve it at the root cause?

Our stack is EKS with Istio, running two deployments: one for the admin console and one for the split proxy.

hbqdev commented 1 year ago

Hi @roee-landesman Are you seeing the timeout from the SDK or from the Proxy? Do you have any logs you can share with us?

Regards

roee-landesman commented 1 year ago

I'm seeing them happen in calls from the SDK to the proxy. I know I can configure the SDK itself with a longer HTTP timeout, but I'm still unsure of the root cause.

hbqdev commented 1 year ago

Hi @roee-landesman Can you please share some SDK logs at DEBUG level, if possible, that include the error messages?

Regards,

roee-landesman commented 1 year ago

[Screenshot: 2023-04-11 at 9:18:33 AM]

hbqdev commented 1 year ago

Hi @roee-landesman

Are you aware of any networking issues on the k8s clusters? When the timeout happens, how long is it until the SDK is able to connect again? Does the issue happen at a specific time of day (high traffic) or randomly? Do you have any monitoring that shows the request traffic from the SDK to the proxy? Likewise, do you see any errors in the proxy dashboard?

For better communication, you can also send an email to support@split.io

Regards
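
The questions above (how long an outage lasts, and whether it correlates with time of day) can be answered with a small probe run from an application pod. The following is a minimal sketch using only Ruby's standard `Net::HTTP`, not the Split SDK; the helper name `probe`, the host, the port, and the path are all hypothetical and would need to match the actual proxy service. It disables `Net::HTTP`'s automatic retry of idempotent requests so that the measured time reflects a single attempt.

```ruby
require "net/http"

# Hypothetical helper (not part of the Split SDK): issue one GET against the
# proxy and classify the outcome, so open timeouts, read timeouts, and
# connection errors can be told apart.
def probe(host, port, path: "/", open_timeout: 1, read_timeout: 5)
  # Passing nil as the third argument skips any http_proxy environment setting.
  http = Net::HTTP.new(host, port, nil)
  http.open_timeout = open_timeout
  http.read_timeout = read_timeout
  # Net::HTTP retries idempotent requests once by default; turn that off so
  # the elapsed time covers exactly one attempt.
  http.max_retries = 0

  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  outcome =
    begin
      response = http.get(path)
      "http_#{response.code}"
    rescue Net::OpenTimeout
      "open_timeout"
    rescue Net::ReadTimeout
      "read_timeout"
    rescue SystemCallError, SocketError, EOFError, IOError => e
      "error_#{e.class}"
    end
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  { outcome: outcome, seconds: elapsed.round(3) }
end

# Example loop (host and port are assumptions, adjust to your service):
#   loop do
#     puts "#{Time.now.utc} #{probe('split-proxy.example.internal', 3000)}"
#     sleep 1
#   end
```

Logging one line per second makes it possible to see when a burst of `read_timeout` results starts and ends, and to compare those times against sidecar and CNI logs. An `open_timeout` or connection error points at connectivity, while a `read_timeout` means the connection was accepted but no response arrived in time.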

roee-landesman commented 1 year ago

I will continue to investigate networking issues in our clusters.

We are running the admin-console as a separate deployment from the split-proxy itself. Is there any way for us to point the admin console pods to read from the proxies, or do they only communicate over localhost?

Thanks!

hbqdev commented 1 year ago

Hi @roee-landesman Sorry for the late reply. Do you mean redirecting the SDK to admin-console? Unfortunately that is not possible, as the admin-console is only for showing the stats and the logs.

Regards,

roee-landesman commented 1 year ago

These intermittent errors continue to occur frequently, and we're still unsure why. Any help would be greatly appreciated.

As far as the admin dashboard goes, I was able to configure it per-pod and can now see these error logs:

[Screenshot: 2023-05-09 at 10:11:16 AM]

We are also running the pods in debug logging mode, but have not picked up anything useful on that front.

hbqdev commented 1 year ago

Hi @roee-landesman

As suspected this is a networking error coming from the k8s pod.

When these errors appear, how long is it until they recover? From the message, you can see that there was an interruption to the streaming connection, so it switched to polling.

Are you aware of any networking issues on the k8s cluster at the time? You can check the networking logs on the pods of the CNI service that you're using for your k8s networking and see whether they give more information about the network error above.

Regards

roee-landesman commented 1 year ago

They happen in spurts of about 1 min, during which time thousands of requests will fail.

[Screenshot: 2023-05-10 at 4:49:12 PM]

We have much higher-throughput services running in the cluster, all of which handle traffic fine. This leads me to suspect the split-proxy server itself.

I don't see anything that stands out in the CNI logs, but I will continue to monitor them and also look into the Istio sidecars to see if anything is wrong there.

hbqdev commented 1 year ago

Hi @roee-landesman

We also want to note that we have seen other issues with Istio in the past. Are you using PERMISSIVE or STRICT mode in your Istio config?

Regards

roee-landesman commented 1 year ago

We use permissive (by default)

hbqdev commented 1 year ago

Hi @roee-landesman We're just checking in to see if you have seen anything regarding the network.

Also, when you saw thousands of requests fail, was it during peak hours? What was happening in the cluster at that moment?

Regards,

hbqdev commented 1 year ago

Hi @roee-landesman We will archive this issue for now. If you require further assistance, please send an email to support@split.io and we will be ready to assist you.

Regards