nicolasff / webdis

A Redis HTTP interface with JSON output
https://webd.is
BSD 2-Clause "Simplified" License

Docker image becomes unresponsive after some time. #233

Closed: VagyokC4 closed this issue 10 months ago

VagyokC4 commented 1 year ago

One thing I'm seeing is that my Docker container becomes unresponsive at some point, and I have to restart it. Maybe add some health-check options so that it can recycle itself when it's not performing as expected?

I don't see any logging in the console one way or the other. It starts with a blank screen and crashes with a blank screen. Any ideas?

I get back a 503: [screenshot of the 503 response]

I can restart the pod and then we are good.. until some time has passed and we are not.

nicolasff commented 1 year ago

Hi @VagyokC4,

A 503 "Service Unavailable" means that Webdis itself is responding – it's the one sending back the 503 after all – but that Redis is no longer available. Are you seeing this with the setup you described in a previous GitHub issue, with an external Redis server? Or is Webdis sending commands to its own Redis instance embedded in the same container?

Note that an error from Redis due to something like having run out of memory would not cause this; you would get the error back via Webdis, e.g.

{"SET":[false,"OOM command not allowed when used memory > 'maxmemory'."]}

If the Redis instance went down though, you would get a 503. See for example the demo below using a container that was started with the following command:

docker run --rm -ti -p 127.0.0.1:7379:7379 -d --name webdis-test nicolas/webdis:latest

Let's first validate that Webdis passes commands to Redis and relays responses back, and that we see the two processes:

$ curl -s http://localhost:7379/PING
{"PING":[true,"PONG"]}

$ docker exec -ti webdis-test ps | head -3
PID   USER     TIME  COMMAND
    1 root      0:00 /usr/local/bin/webdis /etc/webdis.prod.json
    8 root      0:00 /usr/bin/redis-server 127.0.0.1:6379

Let's then kill Redis:

$ docker exec -ti webdis-test killall -9 redis-server

We see Webdis logging the disconnections:

$ docker exec -ti webdis-test cat /var/log/webdis.log
[1] 23 Apr 00:22:51 I Webdis listening on port 7379
[1] 23 Apr 00:22:51 I Webdis 0.1.21 up and running
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection

There are 8 disconnections logged, corresponding to the 8 connections established by Webdis, with 4 worker threads and 2 connections each by default – configured by threads and pool_size respectively.
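
For reference, both settings are top-level keys in webdis.json; a minimal sketch showing only these two keys, with their default values:

{
    "threads": 4,
    "pool_size": 2
}

Each worker thread keeps its own pool of connections to Redis, which is where the 4 × 2 = 8 total comes from.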

We then get a 503 if we send a request once Redis is down:

$ curl -v http://localhost:7379/PING 2>&1 | grep -w HTTP
> GET /PING HTTP/1.1
< HTTP/1.1 503 Service Unavailable

It is difficult to say what could be happening here without more information, so I would suggest you first take a look at the logs you have. If those don't bring up anything relevant, you could increase the Webdis log verbosity by setting "verbosity" to 4 for DEBUG or 8 for TRACE in its config file, although you probably shouldn't have to, since as shown above disconnections are logged at ERROR (0).

The levels are defined here, and only values 0–4 and 8 are used: https://github.com/nicolasff/webdis/blob/35f4900bbb85bd9ac4b71b95bf38ebb4d8a9acb7/src/slog.h#L6-L13 The code only logs events with a level lower than or equal to what's configured: https://github.com/nicolasff/webdis/blob/35f4900bbb85bd9ac4b71b95bf38ebb4d8a9acb7/src/slog.c#L86-L89

You can also make Webdis write its log file somewhere that you can follow with docker logs: if you use "/dev/stderr" or "/dev/stdout" for the "logfile" config value, you'll be able to tail the logs with docker logs -f $containerName. That said, you can also tail the logs without making any changes, using docker exec -ti webdis-test tail -f /var/log/webdis.log (adjust the container name as needed, or use the container hash).
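
As a sketch of the two logging-related keys mentioned above, both at the top level of webdis.json (the values shown are just one possible combination):

{
    "verbosity": 4,
    "logfile": "/dev/stdout"
}

With "logfile" pointing at /dev/stdout, docker logs -f webdis-test follows the Webdis log output directly.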

Try also taking a look at the active connections to the Redis instance if it's external, making sure that the connections from Webdis are not being closed for some reason. You can use a tool like lsof for this purpose.
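
For example, from inside the container (name taken from the demo above), filtering on the Redis port, 6379 here; adjust host and port for an external server. lsof may not be present in a minimal image, in which case BusyBox netstat is a reasonable substitute:

$ docker exec -ti webdis-test lsof -nP -iTCP:6379 -sTCP:ESTABLISHED
$ docker exec -ti webdis-test netstat -tn | grep 6379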

VagyokC4 commented 1 year ago

Or is Webdis sending commands to its own Redis instance embedded in the same container?

I'll test the logging when I get home, but I just wanted to respond and let you know that I'm connecting to a Redis Enterprise Cloud instance from Redis Labs, so I would expect that to never go down.

VagyokC4 commented 1 year ago

@nicolasff So it looks like it just needs some connection resiliency: [screenshot showing a "Connection reset by peer" error]

I did not get an error until I tried the request again some time later. If the connection is reset by the peer, it should recycle and restart its connection.

Thoughts?

nicolasff commented 1 year ago

I see, that explains the 503s.

In this case, "connection reset by peer" usually means that Webdis received a RST packet from the other side, either because the server explicitly disconnected or because the connection lost enough packets that the client (hiredis) could no longer consider it to be a valid connection. Seeing that it was noticed only when a request was sent, it's possible the connection was actually closed earlier but that for some reason this information did not make it all the way to Webdis – it might be a challenge to figure out exactly where and when it failed, given the layers involved with Docker and all.

Webdis does attempt to reconnect to Redis, of course. When a connection is detected to be broken Webdis will schedule a reconnection 100ms later (see the source for it here in pool_schedule_reconnect).

One thing you could try is to establish more connections to Redis, by setting the value of "pool_size" in the config file; it's 2 by default but you could try 4 to start with. If you have a way to configure Redis to close connections after a set amount of time, you could try that too. The first disconnection you received was after ~1h53m of uptime, so it would be interesting to try forcing Webdis to reconnect sooner.
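
One way to do the latter, if your Redis deployment allows it (managed services sometimes restrict CONFIG SET), is the Redis timeout setting, which makes the server close client connections that have been idle for that many seconds. A hedged sketch, sent through Webdis itself with an illustrative value of 300, which should return something like:

$ curl -s http://localhost:7379/CONFIG/SET/timeout/300
{"CONFIG":[true,"OK"]}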

On the Webdis side, a possible improvement here would be to "recycle" connections after some time, meaning that when a connection reaches a certain "age", it is dropped and replaced by a new one. This is not uncommon for long-running connections to a backend service and can help create a more stable link to it, but adding support for this would require a code change and I can't really give you a timeline for that.

I'll also take a look at hiredis to see if anything's available from that side, and will update this thread with what I find.

VagyokC4 commented 1 year ago

Is it possible to get some type of keep-alive heartbeat to ping the server, just to keep that connection open? I think what happens is that if there's no activity for some time, then the cloud Redis instance side wants to shut it down.

If the connection does get in this position and another request comes in, it should destroy and rebuild that connection (if needed), and retry the request again. Currently, only restarting the app will get the connection working again.

nicolasff commented 1 year ago

I think what happens is that if there's no activity for some time, then the cloud Redis instance side wants to shut it down.

Yes, that sounds right to me. There is a way to ask hiredis to enable SO_KEEPALIVE on its connections to Redis; this is an option for TCP sockets that makes the kernel send "keep-alive probes" at regular intervals to ensure a minimum amount of traffic is exchanged through the socket over time. In hiredis, the function to enable this feature takes an int interval parameter expressed in seconds, with an alternative that uses a default value of 15.

With interval set to 15, hiredis will divide it by 3 to get 5, which is the value set for TCP_KEEPINTVL. It also configures TCP_KEEPCNT with a value of 3, so what "15" means here is: the kernel starts sending keep-alive probes after 15 seconds of inactivity on the connection, sends one probe every 5 seconds after that, and considers the connection dead once 3 consecutive probes go unanswered.

See the logic used here in redisKeepAlive as well as man(7) tcp and man(7) socket for a longer description of these options.

The way this could work in Webdis would be to make this interval value configurable, maybe with a new "hiredis" section in the config file. Something like this (other keys omitted):

{
    "hiredis": {
        "keep_alive": 15
    }
}

Other hiredis options could be added there later if needed, without cluttering the "root" level of the config object.

I think it would make sense to keep the current behavior and not enable this by default. Hiredis requires an explicit call to turn this on and I'd prefer not to introduce surprises for users who upgrade and see a bunch of traffic they did not expect.

I think this is a useful addition and will figure out over the next few days how this can be implemented.

In the meantime, there *is* a way that you can keep connections alive, simply by sending regular commands to Redis – a simple PING would do. For a given thread with its connection pool, Webdis selects the Redis connection to use by going round-robin over the array of connections it maintains to the backend. This means that two consecutive PING commands served by the same Webdis thread will go over different Redis connections, and will therefore let Redis know that both sockets are still active and shouldn't be closed.
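
A minimal sketch of that workaround; the endpoint and port are from the examples above, and the 60-second interval and 8 requests per cycle are just illustrative values chosen to cover the default 4 threads × 2 connections (requests are spread across worker threads, so a few extra PINGs per cycle improve the odds that every pooled connection sees traffic):

#!/bin/sh
# Periodically send a handful of PINGs through Webdis so that the pooled
# Redis connections see some traffic before any server-side idle timeout.
while true; do
    for i in 1 2 3 4 5 6 7 8; do
        curl -s http://localhost:7379/PING > /dev/null
    done
    sleep 60
done

A Kubernetes HTTP health check pointed at /PING achieves much the same effect.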

VagyokC4 commented 1 year ago

In the meantime, there is a way that you can keep connections alive, simply by sending regular commands to Redis – a simple PING would do. For a given thread with its connection pool, Webdis selects the Redis connection to use by going round-robin over the array of connections it maintains to the backend. This means that two consecutive PING commands served by the same Webdis thread will go over different Redis connections, and will therefore let Redis know that both sockets are still active and shouldn't be closed.

This works. I can set our Kubernetes health-checks to do a PING and that will keep the connection alive, no problem.

nicolasff commented 10 months ago

Hello!

Following up on this work: I realized it had all already been implemented back in May, and very little was missing to integrate it. So I did that today and released Webdis 0.1.22.

The configuration is slightly different from what I had written above, but still very close:

"hiredis": {
    "keep_alive_sec": 15
}
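
For context, the new section sits at the top level of webdis.json, next to the existing connection settings; a minimal sketch (host and ports shown with the defaults, most other keys omitted):

{
    "redis_host": "127.0.0.1",
    "redis_port": 6379,
    "http_port": 7379,
    "hiredis": {
        "keep_alive_sec": 15
    }
}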

The feature is documented in the README, reproduced here:

Support for "Keep-Alive" connections to Redis: add "hiredis": { "keep_alive_sec": 15 } to webdis.json to enable it with the default value. See the Hiredis documentation for details, the value configured in webdis.json is the interval passed to redisEnableKeepAliveWithInterval. Important: note how it is used to set the value for TCP_KEEPALIVE (the same value) and to compute the value for TCP_KEEPINTVL (integer, set to 1/3 × interval).

The 1/3 value comes from here in Hiredis.

I would recommend looking up TCP_KEEPALIVE and TCP_KEEPINTVL to fully understand how this works in Hiredis. tcp(7) is a good place to start.

In a nutshell: with "keep_alive_sec": 15, the kernel starts sending keep-alive probes on a connection after 15 seconds of inactivity, sends one every 5 seconds after that, and reports the connection as dead if 3 consecutive probes go unanswered. This guarantees a minimum amount of traffic on otherwise idle connections, and should let Webdis notice broken ones much sooner.
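
If you want to confirm that keep-alive took effect after upgrading, one option is ss from iproute2 (it may need to be installed first in an Alpine-based image; the container name below is from the earlier demo). With -o, sockets that have keep-alive enabled show a timer entry:

# -t TCP sockets, -n numeric addresses, -o timer information;
# keep-alive sockets show an entry such as timer:(keepalive,...)
$ docker exec -ti webdis-test ss -tno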

I hope this addresses your disconnection issues.

nicolasff commented 10 months ago

Closing this issue since there is now a supported config option to enable TCP keep-alive, which should address the disconnections reported originally.