towest-nr opened 2 weeks ago
Pretty anecdotal, but changing `NET_MAX_WRITES_PER_EVENT` from `(1024 * 64)` to `(1024 * 16)` either gets rid of the SSL errors or makes them infrequent enough that I can't repro locally anymore. I have no idea what other effects this change would have, though.
OK, I found an even easier way to reproduce this issue. It might even be possible to repro on a single Redis node (not clustered, no replicas), but I'll test that out later today.
I am able to reliably reproduce the problem by simply running `valkey-benchmark`:

```
valkey-benchmark -a <auth_token> --tls --cacert /tls/ca.crt --cluster -d 32768 -n 100000 -r 100000 --threads 10 -t lrange
```
Depending on where in the valkey code it fails, the benchmark will return failure modes like:
```
Summary:
  throughput summary: 24026.91 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.751     0.040     0.607     2.095    54.207    76.799
Error: Server closed the connection
Error: Success
Error: Success
```
or
```
Summary:
  throughput summary: 19447.69 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.340     0.048     0.799     2.511    62.367    85.695
Error: Server closed the connection
```
or even this fun one:
```
Summary:
  throughput summary: 15130.88 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        3.034     0.088     1.751     6.879    35.231   108.287
Error: Server closed the connections=10.6 (overall: 277.8) avg_msec=9.739 (overall: 9.739)
Segmentation fault
```
If you have `verbose` logging turned on, you can see the SSL `bad length` error in the logs on the node:

```
1:M 08 Oct 2024 17:59:33.081 - Error writing to client: error:0A00010F:SSL routines::bad length
```
FWIW, I tried running this locally using a single valkey instance natively on my M1, and I'm struggling to reproduce it there. I was able to repro it every time while running in a local k8s environment, so I'm not yet sure what the key is to making it happen consistently.
@madolson Not sure if you've had a chance to look into this yet, but I have another data point and steps to reproduce with a single node running locally:
```
make BUILD_TLS=yes
./utils/gen-test-certs.sh
./src/valkey-server --tls-port 6379 --port 0 --tls-cert-file ./tests/tls/valkey.crt --tls-key-file ./tests/tls/valkey.key --tls-ca-cert-file ./tests/tls/ca.crt --prefetch-batch-max-size 128 --max-new-tls-connections-per-cycle 5 --loglevel verbose
```

In another terminal:

```
./src/valkey-benchmark --tls --cert ./tests/tls/valkey.crt --key ./tests/tls/valkey.key --cacert ./tests/tls/ca.crt -d 32768 -n 100000 -r 100000 --threads 50 -t lrange
```
This reproduces nearly every time for me with this config. However, even if I revert these configs in our testing environment, I still continue to see the `bad length` issues under enough load. So it looks like these configs may just be allowing the bug to surface at a lower level of load.
Describe the bug
When running a high-throughput Redis cluster (3 shards, 1 replica per shard), the read replicas start to hit a strange issue where we see very frequent reconnects and an inability to maintain a stable connection. Our cluster is set up with TLS (self-signed cert), and after enabling `verbose` logging I correlated these log lines to the reconnects that we're seeing in our application:

To reproduce
I have attached two simple Kotlin apps that can reproduce this behavior, even against a local 3-shard cluster. This does not appear to be an io-threads issue, because I set `io-threads: 1` to rule that out as being part of the cause.

valkey-push.tar.gz
valkey-range.tar.gz
Expected behavior
When running the `LPUSH` on the primary nodes and the `LRANGE + UNLINK` on the read replica, it should be able to handle the throughput without frequent reconnects and `bad length` errors in the `verbose` logs.

Additional information
I spent some time adding additional logging, trying to figure out the source of the bug. So far I haven't found the conclusive cause, but I do have some hints. The error above happens when `SSL_write` returns the error code `SSL_ERROR_WANT_WRITE`, but the next write that occurs contains a different length than it expected (since `SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER` appears to be enabled, it doesn't need to be the exact same buffer).

What I see happening locally is that we have some data to write (say, `81856` bytes). In the `_writeToClient` method the reply list is not empty, so we jump into `writevToClient()`. In this method the 81856 bytes get split up into 4 arrays and put into `iov` (with an `iovcount` of `4`). Each array has `20464` bytes of data.

These arrays get passed into `connTLSWritev()` in `tls.c`. The total size of the arrays is larger than `NET_MAX_WRITES_PER_EVENT` (64k), so it's going to invoke `connTLSWrite` once for each array. The first call happens and tries to write 20464 bytes of data, but it's actually only able to write 16k (`16384`) bytes. We hit the following statement and break out of the loop:

That sends us back to `writevToClient()` in `networking.c`. We increment `totwritten` and modify the underlying `iov` arrays to move past the 16k of data that we just wrote. After this, we call back into `connTLSWritev()` in `tls.c`. This time we have `81856 - 16384 = 65472` bytes to write. This is just under the 64k limit, so we copy the arrays into a single buffer and try to write it out. Again, `SSL_write` only writes 16kb of data (I think this is the max allowed per write?), and we repeat the process above until we've written all the data out.

This seems to work fine most of the time, but I'm frequently seeing cases where the `c->reply` list on the connection has a different length between the time that we write out the first `16kb` of data and the time that we re-enter `writevToClient()` and attempt to send the unsent data (i.e., the exact payload that the `SSL_write` call is expecting to see). It's not clear to me whether it's expected that the `c->reply` list would change in the midst of writing a payload, but I assume it's just continually appended to? And perhaps the issue here is just that there's no handling of the case where an exact payload needs to be re-written?

I still have a pretty weak grasp on a lot of this code, so I'm really hoping that someone who knows a lot more than me can tell what's going on. I'm more than happy to try out any code changes locally to validate a fix.