redpanda-data / redpanda

Explosion of errors & latency after a few hours of sustained production #3643

Open arbfay opened 2 years ago

arbfay commented 2 years ago

Version & Environment

Redpanda version (rpk version): v21.11.3 (rev b3e78b1)

The producer is a Rust program that uses rust-rdkafka (a wrapper for librdkafka, at v1.8.2). It produces ~200 msg/s across 14 topics (1 partition, no replication, not compacted), with the following settings:

The Redpanda cluster is a single broker running in production mode on an Ubuntu-based r6gd.large (arm64) AWS instance, with idempotency enabled and SASL enforced. The producer runs on another, nearby instance in the same subnet (which IMO makes network issues an unlikely cause; see below).

What went wrong?

In an image:

Screenshot 2022-01-28 at 10 43 27

After several hours, the number of errors explodes at both the producer and the broker. After reproducing this several times and trying different producer settings, I came to the conclusion that the problem lies with the broker, i.e. Redpanda.

At first I was using an older version of Redpanda and compacted topics, but I reproduced this with the latest Redpanda version and with 14 single-partition, non-compacted topics.

What should have happened instead?

The cluster should stay stable and keep working as usual.

How to reproduce the issue?

  1. Start a single-node cluster on an ARM64 machine with idempotency and SASL enabled (see the sketch after this list)
  2. Start a producer with the settings and behaviour described above
  3. Let it run for several hours (sometimes it happens early, sometimes it takes more than 10 hours)
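
A minimal sketch of step 1 on a fresh Ubuntu/systemd package install (the property names, topic name, and flags here are assumptions based on the v21.11 config schema, not the exact commands used):

# configure and start the broker in production mode
sudo rpk redpanda mode production
sudo rpk redpanda tune all
sudo rpk config set redpanda.enable_idempotence true
sudo rpk config set redpanda.enable_sasl true
sudo systemctl start redpanda
# create a SCRAM user with 'rpk acl user create' (flags vary by rpk version),
# then create the topics; with SASL enforced, rpk needs credentials
# (--user/--password/--sasl-mechanism)
rpk topic create topic_1 --partitions 1 --replicas 1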

Additional information

Here is what the producer says (many, many times):

ERROR librdkafka > librdkafka: FAIL [thrd:sasl_plaintext://10.2.0.24:9092/bootstrap]: sasl_plaintext://10.2.0.24:9092/1: 3 request(s) timed out: disconnect (after 418567ms in state UP)
ERROR rdkafka::client > librdkafka: Global error: OperationTimedOut (Local: Timed out): sasl_plaintext://10.2.0.24:9092/1: 3 request(s) timed out: disconnect (after 418567ms in state UP)
ERROR rdkafka::client > librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 1/1 brokers are down

What the broker says:

DEBUG 2022-01-27 20:49:34,595 [shard 0] storage-gc - disk_log_impl.cc:236 - [{kafka/topic_1/0}] time retention timestamp: {timestamp: 1642711774595}, first segment max timestamp: {timestamp: 1643252751288}
DEBUG 2022-01-27 20:49:34,595 [shard 0] storage-gc - disk_log_impl.cc:236 - [{kafka/topic_5/0}] gc[time_based_retention] requested to remove segments up to -9223372036854775808 offset
INFO  2022-01-28 13:00:07,060 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
INFO  2022-01-28 13:00:07,326 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
INFO  2022-01-28 13:00:07,859 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
INFO  2022-01-28 13:00:08,912 [shard 0] kafka - connection_context.cc:308 - Detected error processing request: std::runtime_error (Unexpected request during authentication: 3)
WARN  2022-01-28 13:01:13,845 [shard 0] rpc - error connecting to 0.0.0.0:33145 - std::__1::system_error (error system:111, Connection refused)

We also observed a sustained increase in memory usage (~1 GB every 5 hours) with the same workload.

JIRA Link: CORE-824

jrkinley commented 2 years ago

Thank you for reporting this @arbfay.

The 3 at the end of "Unexpected request during authentication" is the API key for the Metadata request type (17 is SaslHandshake). It's reported at INFO level, so it's possibly a red herring.

There are a couple of things we can try to help narrow down the problem:

  1. Disable SCRAM authentication
  2. Disable idempotency (see the sketch below)
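
For example, something like this might work (assuming enable_sasl and enable_idempotence are still node-level properties in redpanda.yaml on this version, so a broker restart is needed to apply them):

sudo rpk config set redpanda.enable_sasl false
sudo rpk config set redpanda.enable_idempotence false
sudo systemctl restart redpanda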

Would it be possible to try these out in your environment?

arbfay commented 2 years ago

Reproduced with no auth. Now trying with idempotency disabled.

arbfay commented 2 years ago

Same issue after a couple of hours. I just deployed a single-broker Redpanda cluster and started it in production mode. No config...

PR #1222 is enlightening as to what might be happening.

FYI I'm now using this repo to test for a long period of time in a complex scenario.

jrkinley commented 2 years ago

We appreciate the feedback here @arbfay and we're looking into it.

WRT #1222: qdc is disabled by default, so the lack of back-pressure control at the Kafka layer might explain why the producers experience issues when requests start to build up on the broker. But that doesn't explain why requests build up in the first place if the producer load is consistent and Redpanda performance is stable for hours. Is broker memory consumption steadily increasing over this time until it reaches a saturation point? Or does something suddenly change in AWS (e.g. disk or network latency spikes)?

You could try with these queue depth settings:

sudo rpk config set redpanda.kafka_qdc_enable true
sudo rpk config set redpanda.kafka_qdc_idle_depth 8
sudo rpk config set redpanda.kafka_qdc_max_depth 32
sudo rpk config set redpanda.kafka_qdc_max_latency_ms 4
sudo rpk config set redpanda.rpc_server_tcp_recv_buf 65536

arbfay commented 2 years ago

@jrkinley Nothing was detected on AWS's side. As for memory, it is still ever-increasing (a memory leak?), but it never reached the maximum in my tests:

Screenshot 2022-01-31 at 15 13 41

I'm running again with your suggested settings. I also noticed the logs said Redpanda didn't know which tuned parameters to use for the r6gd.large instance type, so I ran rpk iotune myself and now see a ~10x latency improvement on Redpanda's side! There are still occasional errors (same as above), so we'll see how long it lasts.
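
For reference, roughly what I ran (assuming the default output location; rpk iotune writes its results to /etc/redpanda/io-config.yaml, which Redpanda reads on the next start):

sudo rpk iotune
sudo systemctl restart redpanda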

arbfay commented 2 years ago

Interestingly, those two curves are the same.

Screenshot 2022-02-01 at 10 18 54

jrkinley commented 2 years ago

Hi @arbfay. Have the back pressure settings helped to stabilise the memory?

vectorized_io_queue_total_bytes is a counter, so it's always going to increase, but the close correlation with allocated memory is interesting. It would be useful to compare the other io_queue metrics here too: vectorized_io_queue_delay and vectorized_io_queue_queue_length. And just to double-check: is the Redpanda data directory on the NVMe drive rather than EBS?
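
For example, something like this should work (assuming the admin/metrics endpoint is on the default port 9644):

curl -s http://localhost:9644/metrics | grep -E 'vectorized_io_queue_(delay|queue_length|total_bytes)'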

WRT the sudden latency spikes: this could be related to fstrim starting up to trim blocks on the storage device. Can you please confirm that fstrim is disabled? It's disabled by default in production mode, as per https://github.com/vectorizedio/redpanda/issues/3068.
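
One way to check on a systemd-based Ubuntu host (a generic systemd check, nothing Redpanda-specific):

systemctl status fstrim.timer
systemctl list-timers --all | grep fstrim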

arbfay commented 2 years ago

Hi @jrkinley

The broker's memory peaked at ~90% (14.8 GB) and has remained at that level ever since (5 days). However, the io queue metric is still growing, so it is indeed expected to increase (I guess it means something like "total bytes processed through the io queue").

Yes, the data directory is indeed on the NVMe drive. I also just checked fstrim, and it is already disabled.

It looks like it is now stable! 🎊

jrkinley commented 2 years ago

@arbfay this is good news. It appears the back-pressure settings are doing their job. Redpanda is designed to use all available resources, so consuming 90% of memory is OK as long as it's stable. The IO queue total bytes metric is a counter, so it will increase indefinitely. It's better to keep an eye on vectorized_io_queue_delay and vectorized_io_queue_queue_length, as they are gauges.