process hangs due to default socket_timeout=None

zioproto commented 3 weeks ago

This is a recurring issue, previously reported in this GitHub issue.

I encountered the same problem while testing a Valkey Cluster on Kubernetes. When a Valkey Cluster pod is evicted, my client attempts to reconnect immediately, but gets stuck in the SYN_SENT state indefinitely, trying to connect to an IP address that no longer exists in the Kubernetes cluster. In Kubernetes, pods are ephemeral, and each new pod is assigned a new IP address.

From the ValkeyCluster documentation, it's not immediately clear that developers need to configure the TCP timeout. See the relevant section here. Furthermore, I don’t believe developers building applications that connect to Valkey should have to handle low-level TCP parameters.

Proposed solutions:

Set a default socket_timeout value
- Pros: This would solve the problem immediately.
- Cons: It may be difficult to choose a default value that fits all scenarios. For example, what would be a reasonable default value for Valkey traffic? Perhaps 2 seconds?
Make socket_timeout a mandatory parameter
- Pros: Avoids the need to guess a suitable default value.
- Cons: This would introduce a breaking change.

aiven-sal commented 3 weeks ago

Hi! Thanks for reporting the problem and providing a PR!

I'm not very keen on breaking compatibility, so I'd rule out solution 2. But also solution 1 has a potential of being a breaking change, regardless of the value. I'll need to dig a bit deeper to see how big of a problem this could be.

A solution 3 could be to improve the doc and make it very clear that socket_timeout is something that needs to be set. I agree that exposing a low-level setting is not ideal, but there is no way have a "one size fits all" default here, so many users may want to set this themselves anyway. Whatever new default value we choose isn't necessarily better than the current one, in the general case.

zioproto commented 3 weeks ago

@aiven-sal should we have a look what other Valkey / Redis client implementations ( even in different languages ) are doing when setting TCP timeouts?

For example I see the golang client uses a default value of 3 seconds for socket_timeout and 5 seconds for socket_connect_timeout: https://github.com/redis/go-redis/blob/80c9f5bb777dd4e6393d014cffe9d28428ecf756/options.go#L83-L97

The Node Client is using 5 seconds: https://github.com/redis/node-redis/blob/b87b8b113430eb5e860ab435dc84e141a2e8ae98/docs/client-configuration.md?plain=1#L11

It seems that not setting a default is not a good idea :(

aiven-sal commented 3 weeks ago

I agree that having a default would better. 2, 3 or 5 they should all be fine (but maybe I'd pick 5 just to stay on the safe side) I was just worried about the implications of changing the default: I think it's unlikely that someone out there is relying on the fact that the connection never times out, but not impossible.

But it's about time we make a 6.1 where we can introduce some breaking changes. So I think we can change this default to 5.

valkey-io / valkey-py

process hangs due to default socket_timeout=None #119

Proposed solutions: