valkey-io / valkey-glide

An open source Valkey client library that supports Valkey and Redis open source 6.2, 7.0 and 7.2. Valkey GLIDE is designed for reliability, optimized performance, and high-availability, for Valkey and Redis OSS based applications. GLIDE is a multi language client library, written in Rust with programming language bindings, such as Java and Python
Apache License 2.0
260 stars 53 forks source link

GLIDE fails to connect to server if it blocked by a long function #1505

Open Yury-Fridlyand opened 5 months ago

Yury-Fridlyand commented 5 months ago

Describe the bug

GLIDE fails to connect to server (standalone/cluster - both) if server is blocked by a long running function call.

ERROR logger_core: client creation - Connection error: Standalone(Received error for address `localhost:6379`: BUSY: Redis is busy running a script. You can only call FUNCTION KILL or SHUTDOWN NOSAVE.

Expected Behavior

GLIDE should connect

Current Behavior

GLIDE returns ClosingError to the wrapper:

ClosingException: Connection error: Standalone(Received error for address `localhost:6379`: BUSY: Redis is busy running a script. You can only call FUNCTION KILL or SHUTDOWN NOSAVE.

Reproduction Steps

  1. Create a function with an endless loop
  2. Call it
  3. Try to connect with GLIDE

1.1. Write to a file

#!lua name=experiment

local function sleep(keys, args)
  while (true) do
  end
  return 'OK'
end

redis.register_function{
function_name='sleep',
callback=sleep,
flags={ 'no-writes' }
}

1.2. Load it

cat mylib.lua | redis-cli -x FUNCTION LOAD REPLACE

2.0. Call the function

redis-cli FCALL sleep 0

To unblock the server after testing, run

redis-cli FUNCTION KILL

Possible Solution

Ignore [some] errors on INFO REPLICATION request automatically issued by GLIDE on connect.

Additional Information/Context

Wireshark dump image

Client version used

main @ 1415d4d

Redis Version

7.0

OS

Linux

Language

Python

Language Version

N/A

Cluster information

No response

Logs

No response

Other information

No response

barshaul commented 5 months ago

In general: long lua scripts are bad practice in redis, but if users are using long running scripts they should configure their client with long timeouts in order to handle them. For standalone: when we create a standalone client we need to find the primary node in the received nodes list. We do that by querying for REPLICATION INFO. If we fail to get the info replication in the primary node in all retries (which are configured by ConnectionRetryStrategy, that the user can modify), we won't be able to continue with client creation. So failing the client creation is the expected behavior. The user can increase the number of retries to try avoiding these errors.
For the cluster client: can you please share the logs? do you run the lua script on all nodes? again, if we can't query the nodes for their topology view, we won't be able to establish a new client and the client creation should fail

Yury-Fridlyand commented 5 months ago

please share the logs

Which logs do you need?

failing the client creation is the expected behavior

I don't agree. A user is trying to connect to do FUNCTION KILL. We should allow doing that regradless of topology info. I think I already mentioned on another GH issue, we need to support "connected with limitations" state.

asafpamzn commented 4 months ago

@barshaul what are the default timeouts? we should have it handled with the exponential backoff. The time slice in Valkey is 5 seconds. https://github.com/valkey-io/valkey/blob/32ca6e5b38bfc4a15878c2b6ff3d2c71da1026e3/src/config.c#L3177

What is the connection timeout for GLIDE?