Closed VagyokC4 closed 10 months ago
Hi @VagyokC4,
A 503 "Service Unavailable" means that Webdis itself is responding – it's the one sending back the 503 after all – but that Redis is no longer available. Are you seeing this with the setup you described in a previous GitHub issue, with an external Redis server? Or is Webdis sending commands to its own Redis instance embedded in the same container?
Note that an error from Redis due to something like having run out of memory would not cause this; you would get the error back via Webdis, e.g.
{"SET":[false,"OOM command not allowed when used memory > 'maxmemory'."]}
If the Redis instance went down though, you would get a 503. See for example the demo below using a container that was started with the following command:
docker run --rm -ti -p 127.0.0.1:7379:7379 -d --name webdis-test nicolas/webdis:latest
Let's first validate that Webdis passes commands to Redis and responses back, and that we see the two processes:
$ curl -s http://localhost:7379/PING
{"PING":[true,"PONG"]}
$ docker exec -ti webdis-test ps | head -3
PID USER TIME COMMAND
1 root 0:00 /usr/local/bin/webdis /etc/webdis.prod.json
8 root 0:00 /usr/bin/redis-server 127.0.0.1:6379
Let's then kill Redis:
$ docker exec -ti webdis-test killall -9 redis-server
We see Webdis logging the disconnections:
$ docker exec -ti webdis-test cat /var/log/webdis.log
[1] 23 Apr 00:22:51 I Webdis listening on port 7379
[1] 23 Apr 00:22:51 I Webdis 0.1.21 up and running
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
[1] 23 Apr 00:23:09 E Error disconnecting: Server closed the connection
There are 8 disconnections logged, corresponding to the 8 connections established by Webdis: 4 worker threads with 2 connections each by default, configured by `threads` and `pool_size` respectively.
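For reference, a minimal sketch of the relevant `webdis.json` keys, assuming the defaults described above (other keys omitted):

```json
{
    "threads": 4,
    "pool_size": 2
}
```

With these values, each of the 4 worker threads keeps its own pool of 2 Redis connections, for 8 total.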
We then get a 503 if we send a request once Redis is down:
$ curl -v http://localhost:7379/PING 2>&1 | grep -w HTTP
> GET /PING HTTP/1.1
< HTTP/1.1 503 Service Unavailable
It is difficult to say what could be happening here without more information, so I would suggest you first take a look at the logs you have. If those don't bring up anything relevant, you could increase the Webdis log verbosity by setting `"verbosity"` to 4 for `DEBUG` or 8 for `TRACE` in its config file, although you probably shouldn't have to since, as shown above, disconnections are logged at `ERROR` (0). The levels are defined here and only values 0–4 and 8 are used: https://github.com/nicolasff/webdis/blob/35f4900bbb85bd9ac4b71b95bf38ebb4d8a9acb7/src/slog.h#L6-L13
The code only logs events with a level lower than or equal to what's configured: https://github.com/nicolasff/webdis/blob/35f4900bbb85bd9ac4b71b95bf38ebb4d8a9acb7/src/slog.c#L86-L89
You can also make Webdis write its log file somewhere that you can follow with `docker logs`: if you use `"/dev/stderr"` or `"/dev/stdout"` for the `"logfile"` config value, you'll be able to tail the logs with `docker logs -f $containerName`. That said, you can also tail the logs without making any changes using `docker exec -ti webdis-test tail -f /var/log/webdis.log` (adjust the container name as needed or use the container hash).
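For example, a sketch of the two config values mentioned here, assuming the key names used in the Webdis README (other keys omitted):

```json
{
    "logfile": "/dev/stderr",
    "verbosity": 4
}
```

With `"logfile"` pointing at stderr, the output becomes visible through `docker logs` without entering the container.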
Try also taking a look at the active connections to the Redis instance if it's external, making sure that the connections from Webdis are not being closed for some reason. You can use a tool like `lsof` for this purpose.
Or is Webdis sending commands to its own Redis instance embedded in the same container?
I'll test the log thing when I get home, but I just wanted to respond and let you know that I'm connecting to a Redis Enterprise Cloud instance from Redis Labs, so I would expect that to never go down.
@nicolasff So it looks like it just needs some connection resiliency:
I did not get an error until I tried the request again some time later. If the connection is disconnected by peer, it should recycle and restart its connection.
Thoughts?
I see, that explains the 503s.
In this case, "connection reset by peer" usually means that Webdis received a `RST` packet from the other side, either because the server explicitly disconnected or because the connection lost enough packets that the client (hiredis) could no longer consider it to be a valid connection. Seeing that it was noticed only when a request was sent, it's possible the connection was actually closed earlier but that for some reason this information did not make it all the way to Webdis – it might be a challenge to figure out exactly where and when it failed, given the layers involved with Docker and all.
Webdis does attempt to reconnect to Redis, of course. When a connection is detected to be broken, Webdis will schedule a reconnection 100 ms later (see the source for it here in `pool_schedule_reconnect`).
One thing you could try is to establish more connections to Redis, by setting the value of `"pool_size"` in the config file; it's 2 by default but you could try 4 to start with.
If you have a way to configure Redis to close connections after a set amount of time, you could try that too. The first disconnection you received was after ~1h53m of uptime, so it would be interesting to try forcing Webdis to reconnect sooner.
On the Webdis side, a possible improvement here would be to "recycle" connections after some time, meaning that when a connection reaches a certain "age", it is dropped and replaced by a new one. This is not uncommon for long-running connections to a backend service and can help create a more stable link to it, but adding support for this would require a code change and I can't really give you a timeline for that.
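To illustrate the recycling idea, here is a toy sketch in Python (not Webdis code; the names `RecyclingPool` and `max_age_sec` are made up for the example): a connection is replaced with a fresh one once it has been open longer than some maximum age.

```python
import time

class RecyclingPool:
    """Toy age-based connection recycling: drop and replace old connections."""

    def __init__(self, connect, max_age_sec):
        self.connect = connect      # callable that opens a new backend connection
        self.max_age = max_age_sec  # maximum age before a connection is recycled
        self.conn = None
        self.born = 0.0

    def get(self):
        now = time.monotonic()
        if self.conn is None or now - self.born > self.max_age:
            # Connection is missing or too old: open a fresh one.
            self.conn = self.connect()
            self.born = now
        return self.conn
```

A real implementation would also need to close the old connection and avoid recycling while a request is in flight, but the core idea is just an age check on each use.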
I'll also take a look at hiredis to see if anything's available from that side, and will update this thread with what I find.
Is it possible to get some type of keep-alive heartbeat to ping the server just to keep that connection open? I think what happens is that if there's no activity for some time, the cloud Redis instance side wants to shut it down.
If the connection does get in this position and another request comes in, it should destroy and rebuild that connection (if needed), and retry the request again. Currently, only restarting the app will get the connection working again.
I think what happens is if there's no activity for some time then it wants to shut it down from the cloud Redis instance side.
Yes, that sounds right to me. There is a way to ask hiredis to enable `SO_KEEPALIVE` on its connections to Redis; this is an option for TCP sockets that makes the kernel send "keep-alive probes" at regular intervals to ensure a minimum amount of traffic is exchanged through the socket over time.
In hiredis, the function to enable this feature takes an `int interval` parameter expressed in seconds, with an alternate using a default value of 15.
With `interval` set to 15, hiredis will divide it by 3 to get 5, which is the value set for `TCP_KEEPINTVL`. It also configures `TCP_KEEPCNT` with a value of 3, so what "15" means here is:
- `TCP_KEEPIDLE` = 15: start sending keep-alive probes after 15 seconds of idle time
- `TCP_KEEPINTVL` = 5: send a keep-alive probe every 5 seconds
- `TCP_KEEPCNT` = 3: consider the connection broken after 3 failed probes

See the logic used here in `redisKeepAlive`, as well as man(7) tcp and man(7) socket, for a longer description of these options.
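As a concrete sketch of those three options, here is the equivalent setup on a plain TCP socket in Python on Linux. This mirrors the hiredis behavior described above but is not hiredis itself, and the option names are Linux-specific (macOS/BSD use different names):

```python
import socket

interval = 15  # the hiredis "interval" parameter, in seconds

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)                # enable keep-alive probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, interval)        # idle 15s before first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval // 3)  # then one probe every 5s
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)                # broken after 3 failed probes
```

With these values, an idle but healthy connection sees a probe every 15 seconds, and a dead peer is detected roughly 15 + 3 × 5 = 30 seconds after the last activity.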
The way this could work in Webdis would be to make this `interval` value configurable, maybe with a new `"hiredis"` section in the config file. Something like this (other keys omitted):
{
"hiredis": {
"keep_alive": 15
}
}
Other hiredis options could be added there later if needed, without cluttering the "root" level of the config object.
I think it would make sense to keep the current behavior and not enable this by default. Hiredis requires an explicit call to turn this on and I'd prefer not to introduce surprises for users who upgrade and see a bunch of traffic they did not expect.
I think this is a useful addition and will figure out over the next few days how this can be implemented.
In the meantime, there *is* a way that you can keep connections alive, simply by sending regular commands to Redis – a simple `PING` would do. For a given thread with its connection pool, Webdis selects the Redis connection to use by going round-robin over the array of connections it maintains to the backend. This means that two consecutive `PING` commands served by the same Webdis thread will go over different Redis connections, and will therefore let Redis know that both sockets are still active and shouldn't be closed.
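A toy model of that round-robin selection (not Webdis code; just to illustrate why consecutive commands from one thread land on different connections when `pool_size` is 2):

```python
from itertools import cycle

# One worker thread with pool_size = 2: connections are picked round-robin.
pool = ["conn-0", "conn-1"]
next_conn = cycle(pool)

# Four consecutive commands served by this thread alternate between the
# two connections, so periodic PINGs keep both sockets active.
served_by = [next(next_conn) for _ in range(4)]
```

So a health check that sends a `PING` at least as often as the server's idle timeout, per connection, keeps the whole pool warm.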
In the meantime, there is a way that you can keep connections alive, simply by sending regular commands to Redis – a simple `PING` would do.
This works. I can set our Kubernetes health checks to do a `PING` and that will keep the connection alive no problem.
Hello!
Following up about this work: I realized it had all already been implemented back in May, and very little was missing to integrate it. So I did that today and released Webdis 0.1.22:
The configuration is slightly different from what I had written above, but still very close:
"hiredis": {
"keep_alive_sec": 15
}
The feature is documented in the README, reproduced here:
Support for "Keep-Alive" connections to Redis: add `"hiredis": { "keep_alive_sec": 15 }` to `webdis.json` to enable it with the default value. See the Hiredis documentation for details; the value configured in `webdis.json` is the interval passed to `redisEnableKeepAliveWithInterval`. Important: note how it is used to set the value for `TCP_KEEPALIVE` (the same value) and to compute the value for `TCP_KEEPINTVL` (integer, set to 1/3 × `interval`).
The 1/3 value comes from here in Hiredis.
I would recommend looking up `TCP_KEEPALIVE` and `TCP_KEEPINTVL` to fully understand how this works in Hiredis. tcp(7) is a good place to start.
In a nutshell:
- `TCP_KEEPALIVE`: how long a socket can be idle before sending keep-alive probes.
- `TCP_KEEPINTVL`: how long to wait between each keep-alive probe (set to 1/3 of the `TCP_KEEPALIVE` value).

I hope this addresses your disconnection issues.
Closing this issue since there is now a supported config option to enable TCP keep-alive, which should address the disconnections reported originally.
One thing I'm seeing is that my Docker container becomes unresponsive at some point, and I have to restart it. Maybe add some health-check options so that it can recycle itself when it's not performing as expected?
I don't see any logging in the console one way or the other. It starts with a blank screen and crashes with a blank screen. Any ideas?
I get back a 503
I can restart the pod and then we are good.. until some time has passed and we are not.