valkey-io / valkey

A flexible distributed key-value datastore that supports both caching and beyond caching workloads.
https://valkey.io

[BUG] 0A00010F:SSL routines::bad length causing high latency and reconnects #1135

Open towest-nr opened 2 weeks ago

towest-nr commented 2 weeks ago

Describe the bug

When running a high-throughput Redis cluster (3 shards, 1 replica per shard), the read replicas hit a strange issue: we see very frequent reconnects and an inability to maintain a stable connection. Our cluster is set up with TLS (self-signed cert), and after enabling verbose logging I correlated the reconnects we're seeing in our application with log lines like this one:

1:S 08 Oct 2024 00:41:39.521 - Error writing to client: error:0A00010F:SSL routines::bad length

To reproduce

I have attached two simple Kotlin apps that can reproduce this behavior, even against a local 3-shard cluster. This does not appear to be an io-threads issue, because I set io-threads 1 to rule that out as part of the cause.

valkey-push.tar.gz valkey-range.tar.gz

Expected behavior

When running the LPUSH on the primary nodes and the LRANGE + UNLINK on the read replica, the replica should be able to handle the throughput without frequent reconnects or bad length errors in the verbose logs.

Additional information

I spent some time adding additional logging to try to figure out the source of the bug. So far I haven't found a conclusive cause, but I do have some hints. The error above happens when SSL_write returns the error code SSL_ERROR_WANT_WRITE but the next write passes a different length than OpenSSL expects (since SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER appears to be enabled, it doesn't need to be the exact same buffer address).
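To make that retry contract concrete, here is a rough caller-side sketch of how I understand a non-blocking SSL_write() has to be retried after SSL_ERROR_WANT_WRITE: the same data (and at least the same length) must be offered again, and SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER only relaxes the requirement that the buffer live at the same address. This is illustrative code, not anything from valkey.

```c
/* Illustrative sketch of the OpenSSL retry contract, not valkey code.
 * Assumes `ssl` is an established connection on a non-blocking socket. */
#include <openssl/ssl.h>
#include <stddef.h>

static int write_all(SSL *ssl, const void *buf, size_t len) {
    size_t off = 0;
    while (off < len) {
        /* On SSL_ERROR_WANT_WRITE, `off` is unchanged, so the retry below
         * re-offers exactly the same region and length; shrinking or
         * reshuffling it between retries is what produces "bad length". */
        int n = SSL_write(ssl, (const char *)buf + off, (int)(len - off));
        if (n > 0) {
            off += (size_t)n;
            continue;
        }
        int err = SSL_get_error(ssl, n);
        if (err == SSL_ERROR_WANT_WRITE || err == SSL_ERROR_WANT_READ)
            continue; /* real code would poll for readiness here; the key
                         point is that the arguments do not change */
        return -1;    /* hard error */
    }
    return 0;
}
```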

What I see happening locally is that we have some data to write (say, 81856 bytes). In the _writeToClient method the reply list is not empty, so we jump into writevToClient(). There, the 81856 bytes get split into 4 arrays and put into iov (with an iov count of 4); each array holds 20464 bytes of data.
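For anyone following along, this is roughly what I mean by "split into 4 arrays". It's an illustrative sketch, not the actual writevToClient() code; the struct, names, and block size are assumptions chosen to make the numbers above concrete.

```c
/* Illustrative sketch only: packing a linked list of reply blocks into an
 * iovec array. In the report, 81856 bytes end up as 4 entries of 20464
 * bytes each. replyBlock and IOV_MAX_USED are hypothetical names. */
#include <sys/uio.h>
#include <stddef.h>

typedef struct replyBlock {
    struct replyBlock *next;
    size_t used;        /* bytes of payload in buf */
    char buf[20464];    /* illustrative block size matching the report */
} replyBlock;

#define IOV_MAX_USED 16 /* illustrative cap on iovec entries */

static int buildIov(replyBlock *head, struct iovec *iov, size_t *total) {
    int iovcnt = 0;
    *total = 0;
    for (replyBlock *b = head; b && iovcnt < IOV_MAX_USED; b = b->next) {
        if (b->used == 0) continue;
        iov[iovcnt].iov_base = b->buf;
        iov[iovcnt].iov_len = b->used;
        *total += b->used;
        iovcnt++;
    }
    return iovcnt;
}
```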

These arrays get passed into connTLSWritev() in tls.c. The total size of the arrays is larger than NET_MAX_WRITES_PER_EVENT (64 KB), so it invokes connTLSWrite() once for each array. The first call tries to write 20464 bytes of data but is actually only able to write 16 KB (16384 bytes). We hit the following statement and break out of the loop:

if ((size_t)sent != iov[i].iov_len) break;
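Paraphrasing that fallback path as I understand it (an illustrative sketch, not the real connTLSWritev(); tls_write() is a hypothetical stand-in for connTLSWrite()):

```c
/* Illustrative sketch of the fallback path described above, not the real
 * connTLSWritev(). tls_write() is a hypothetical stand-in for connTLSWrite();
 * it is assumed to return bytes written, or <= 0 on error. */
#include <sys/types.h>
#include <sys/uio.h>
#include <stddef.h>

#define NET_MAX_WRITES_PER_EVENT (1024 * 64)

extern ssize_t tls_write(void *conn, const void *buf, size_t len);

static ssize_t writev_sketch(void *conn, const struct iovec *iov, int iovcnt, size_t total) {
    if (total <= NET_MAX_WRITES_PER_EVENT) {
        /* Batches at or under the limit get coalesced into one buffer and
         * written with a single call; not implemented in this sketch. */
        return -1;
    }
    /* Larger batches: one TLS write per iovec entry. */
    ssize_t totwritten = 0;
    for (int i = 0; i < iovcnt; i++) {
        ssize_t sent = tls_write(conn, iov[i].iov_base, iov[i].iov_len);
        if (sent <= 0) return totwritten > 0 ? totwritten : sent;
        totwritten += sent;
        /* A short write (e.g. 16384 of the 20464 bytes) ends the batch;
         * the remainder has to be retried on a later call. */
        if ((size_t)sent != iov[i].iov_len) break;
    }
    return totwritten;
}
```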

That sends us back to writevToClient() in networking.c. We increment totwritten and adjust the underlying iov arrays to move past the 16 KB of data that was just written. After this, we call back into connTLSWritev() in tls.c. This time we have 81856 - 16384 = 65472 bytes to write. This is just under the 64 KB limit, so we copy the arrays into a single buffer and try to write it out. Again, SSL_write only writes 16 KB of data (I think this is the maximum allowed per write?), and we repeat the process above until all of the data has been written out.
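The iov adjustment I'm describing is essentially this kind of bookkeeping (again an illustrative sketch, not the actual networking.c code):

```c
/* Illustrative sketch: after a partial write, advance the iovec array past
 * the bytes already sent before calling the writev path again. In the
 * example above, 16384 of 81856 bytes were written, leaving 65472. */
#include <sys/uio.h>
#include <stddef.h>

static void advanceIov(struct iovec **iov, int *iovcnt, size_t written) {
    while (*iovcnt > 0 && written > 0) {
        if (written >= (*iov)[0].iov_len) {
            /* This entry was fully written; drop it. */
            written -= (*iov)[0].iov_len;
            (*iov)++;
            (*iovcnt)--;
        } else {
            /* Partially written entry: advance the base and shrink the length. */
            (*iov)[0].iov_base = (char *)(*iov)[0].iov_base + written;
            (*iov)[0].iov_len -= written;
            written = 0;
        }
    }
}
```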

This seems to work fine most of the time, but I'm frequently seeing cases where the c->reply list on the connection appears to have a different length between the time we write out the first 16 KB of data and the time we re-enter writevToClient() and attempt to send the unsent data (i.e., the exact payload that the SSL_write call is expecting to see). It's not clear to me whether the c->reply list is expected to change in the middle of writing a payload, but I assume it is continually appended to? And perhaps the issue here is just that there's no handling of the case where an exact payload needs to be re-written?
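To show what I mean by "handling the case where an exact payload needs to be re-written", here is the rough kind of bookkeeping I'm imagining. This is purely illustrative, the struct and function names are hypothetical, and I'm not proposing it as the fix; it only detects the mismatch rather than resolving it.

```c
/* Purely illustrative, hypothetical names: remember the length handed to
 * SSL_write() when it returns SSL_ERROR_WANT_WRITE, and detect a retry
 * that offers less data than the interrupted write did. */
#include <openssl/ssl.h>
#include <errno.h>
#include <stddef.h>

typedef struct tls_conn_sketch {
    SSL *ssl;
    size_t pending_write_len; /* length of the write that got WANT_WRITE, 0 if none */
} tls_conn_sketch;

static int tls_write_sketch(tls_conn_sketch *c, const void *buf, size_t len) {
    if (c->pending_write_len && len < c->pending_write_len) {
        /* This is exactly the situation that seems to trigger "bad length":
         * the retry offers less data than the write OpenSSL is still holding. */
        errno = EINVAL;
        return -1;
    }
    int n = SSL_write(c->ssl, buf, (int)len);
    if (n > 0) {
        c->pending_write_len = 0;
        return n;
    }
    if (SSL_get_error(c->ssl, n) == SSL_ERROR_WANT_WRITE)
        c->pending_write_len = len; /* this much data must be re-offered */
    return -1;
}
```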

I still have a pretty weak grasp on a lot of this code so I'm really hoping that someone who knows a lot more than me would be able to tell what's going on. I'm more than happy to try out any code changes locally to validate a fix.

towest-nr commented 2 weeks ago

Pretty anecdotal, but updating NET_MAX_WRITES_PER_EVENT from (1024 * 64) to (1024 * 16) either gets rid of the SSL errors or makes them infrequent enough that I can't repro them locally anymore. I have no idea what other effects this would have, though.
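For reference, the experiment was just this define change. 16384 also happens to be the maximum TLS record plaintext size, which might be why capping the batch there avoids the multi-write retry path, though that's speculation on my part.

```c
/* Experimental change only, not a proposed fix: cap the per-event write
 * batch at 16 KB instead of 64 KB. */
#define NET_MAX_WRITES_PER_EVENT (1024 * 16) /* was (1024 * 64) */
```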

towest-nr commented 2 weeks ago

OK, I found an even easier way to reproduce this issue. It might even be possible to repro on a single Redis node (not clustered, no replicas), but I'll test that out later today.

I am able to reliably reproduce the problem by simply running valkey-benchmark:

valkey-benchmark -a <auth_token> --tls --cacert /tls/ca.crt --cluster -d 32768 -n 100000 -r 100000 --threads 10 -t lrange

Depending on where in the valkey code it fails, the benchmark returns failure modes like:

Summary:
  throughput summary: 24026.91 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.751     0.040     0.607     2.095    54.207    76.799
Error: Server closed the connection
Error: Success
Error: Success

or

Summary:
  throughput summary: 19447.69 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.340     0.048     0.799     2.511    62.367    85.695
Error: Server closed the connection

or even this fun one:

Summary:
  throughput summary: 15130.88 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        3.034     0.088     1.751     6.879    35.231   108.287
Error: Server closed the connections=10.6 (overall: 277.8) avg_msec=9.739 (overall: 9.739)
Segmentation fault

If you have verbose logging turned on, you can see the SSL bad length error in the logs on the node:

1:M 08 Oct 2024 17:59:33.081 - Error writing to client: error:0A00010F:SSL routines::bad length

towest-nr commented 2 weeks ago

FWIW, I tried running this locally using a single valkey instance natively on my M1 and I'm struggling to reproduce it there. I was able to repro it every time while running in a local k8s environment, so I'm not yet sure what the key is to getting it to happen every time.

towest-nr commented 2 weeks ago

@madolson Not sure if you've had a chance to look into this yet, but I have another data point and steps to reproduce with a single node running locally.

  1. make BUILD_TLS=yes
  2. ./utils/gen-test-certs.sh
  3. ./src/valkey-server --tls-port 6379 --port 0 --tls-cert-file ./tests/tls/valkey.crt --tls-key-file ./tests/tls/valkey.key --tls-ca-cert-file ./tests/tls/ca.crt --prefetch-batch-max-size 128 --max-new-tls-connections-per-cycle 5 --loglevel verbose

In another terminal:

  1. ./src/valkey-benchmark --tls --cert ./tests/tls/valkey.crt --key ./tests/tls/valkey.key --cacert ./tests/tls/ca.crt -d 32768 -n 100000 -r 100000 --threads 50 -t lrange

This reproduces nearly every time for me with this config. However, even if I revert these configs in our testing environment, I still see the bad length issues under enough load. So it looks like these configs might just be allowing the bug to surface at a lower level of load.