arbfay opened this issue 2 years ago
Thank you for reporting this @arbfay.
The 3 at the end of "Unexpected request during authentication" is the API key for the Metadata request type (17 is SaslHandshake). It's reported as INFO, so it's possibly a red herring.
There are a couple of things we can try to help narrow down the problem:
Would it be possible to try these out in your environment?
Reproduced with no auth. Now trying with idempotency disabled.
Same issue after a couple of hours. I just deployed a single-broker Redpanda cluster and started it in production mode. No config...
This PR #1222 is enlightening on what might be happening.
FYI, I'm now using this repo for a long-running test in a complex scenario.
We appreciate the feedback here @arbfay and we're looking into it.
WRT #1222: qdc is disabled by default, so the lack of back pressure control at the Kafka layer might explain why the producers experience issues when requests start to build up on the broker. But that doesn't explain why the requests are building up if the producer load is consistent and Redpanda performance is stable for hours. Is broker memory consumption steadily increasing over this time to reach a saturation point? Or does something suddenly change in AWS (e.g. disk or network latency spikes)?
You could try these queue depth settings:
sudo rpk config set redpanda.kafka_qdc_enable true
sudo rpk config set redpanda.kafka_qdc_idle_depth 8
sudo rpk config set redpanda.kafka_qdc_max_depth 32
sudo rpk config set redpanda.kafka_qdc_max_latency_ms 4
sudo rpk config set redpanda.rpc_server_tcp_recv_buf 65536
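To confirm the settings took effect, a minimal check (assuming the default node config at /etc/redpanda/redpanda.yaml and a systemd unit named redpanda; a broker restart is likely needed for node-level properties in this version):
# Verify the queue depth settings landed in the node config
grep -E 'kafka_qdc|rpc_server_tcp_recv_buf' /etc/redpanda/redpanda.yaml
# Restart so the broker picks them up
sudo systemctl restart redpanda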
@jrkinley Nothing was detected on AWS's side. As for memory, it's still ever-increasing (memory leak?), but it never reached the maximum in my tests:
I'm running again with your suggested settings. Also, I noticed the logs said Redpanda didn't know which tuned parameters to use for the AWS instance type r6gd.large, so I ran rpk iotune myself (roughly as sketched at the end of this comment) and now observe a 10x improvement in latency on Redpanda's side!
There are still errors sometimes (same as above), so we'll see how long it lasts.
Interestingly those 2 curves are the same.
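For reference, the iotune step was roughly the following (a sketch; the systemd unit name redpanda and the restart to pick up the generated io-config are assumptions):
# Benchmark the local disks and generate io properties for this instance type
sudo rpk iotune
# Restart the broker so it reads the new io-config
sudo systemctl restart redpanda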
Hi @arbfay. Have the back pressure settings helped to stabilise the memory?
vectorized_io_queue_total_bytes is a counter, so it's always going to increase, but the close correlation to allocated memory is interesting. It would be useful to compare the other io_queue metrics here too: vectorized_io_queue_delay and vectorized_io_queue_queue_length. And just to double-check: is the Redpanda data directory writing to the NVMe drive and not EBS? (A quick way to check both is sketched below.)
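A quick way to sample those metrics and check the mount (a sketch, assuming the metrics endpoint on the default admin port 9644 and a data directory of /var/lib/redpanda/data):
# Sample the io_queue gauges from the Prometheus endpoint
curl -s http://localhost:9644/metrics | grep -E 'vectorized_io_queue_(delay|queue_length)'
# Confirm which device backs the data directory (NVMe vs EBS)
findmnt -T /var/lib/redpanda/data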
WRT the sudden latency spikes: this could be related to fstrim starting up to trim blocks on the storage device. Please can you confirm that fstrim is disabled? It's disabled by default in production mode as per https://github.com/vectorizedio/redpanda/issues/3068.
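A quick way to verify, assuming systemd manages the timer:
# fstrim should be disabled/masked in production mode
systemctl is-enabled fstrim.timer
systemctl status fstrim.timer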
Hi @jrkinley
The broker's memory peaked at ~90% (14.8 GB) and has remained at that level ever since (5 days). However, the io queue metric is still growing, so it is indeed supposed to increase; I guess it means something like "total bytes processed through the io queue".
Yes, the data directory is indeed on the NVMe drive. I just checked fstrim, and it is already disabled.
It looks like it is now stable! 🎊
@arbfay this is good news. It appears the back pressure settings are doing their job. Redpanda is designed to use all available resources, so consuming 90% of memory is OK as long as it's stable. The IO queue total bytes metric is a counter, so it will increase indefinitely. It's better to keep an eye on vectorized_io_queue_delay and vectorized_io_queue_queue_length, as they are gauges (a couple of example queries are sketched below).
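If Prometheus is scraping the broker (an assumption; adjust the address to your setup), the counter is still useful as a rate alongside the gauges:
# Per-second IO throughput derived from the counter
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=rate(vectorized_io_queue_total_bytes[5m])'
# Current queue delay gauge
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=vectorized_io_queue_delay'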
Version & Environment
Redpanda version (rpk version): v21.11.3 (rev b3e78b1)
The producer is a Rust program that uses rust-rdkafka (a wrapper for librdkafka, at v1.8.2). It produces ~200 msg/s sent to 14 topics (1 partition, no replication, not compacted), with the following librdkafka settings (approximated under "How to reproduce the issue?" below):
security.protocol: "sasl_plaintext"
sasl.mechanism: "SCRAM-SHA-256"
message.timeout.ms: "50"
queue.buffering.max.ms: "1"
enable.idempotence: "true"
message.send.max.retries: "10"
retry.backoff.ms: "1"
The Redpanda cluster is a single broker in production mode running on an Ubuntu-based r6gd.large (arm64) AWS instance, with idempotency enabled and SASL enforced. The producer runs on another instance close by in the same subnet, which IMO makes network issues an unlikely cause (see below).
What went wrong?
In an image:
After hours, the number of errors explodes on both the producer and the broker. After reproducing this several times and trying different settings for the producer, I came to the conclusion that the problem is the broker, i.e. Redpanda.
At first I was using an older version of Redpanda and compacted topics, but I reproduced this with the latest Redpanda version and with 14 single-partition, non-compacted topics.
What should have happened instead?
It should stay stable and work as usual.
How to reproduce the issue?
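A rough way to approximate the producer load with kcat instead of the original Rust program (a sketch; broker address, credentials, and topic name are placeholders, and the loop only roughly targets ~200 msg/s against one of the 14 topics):
# Stream messages with the same librdkafka settings as the Rust producer
while true; do
  echo "payload-$(date +%s%N)"
  sleep 0.005
done | kcat -P -b <broker>:9092 -t topic-01 \
  -X security.protocol=sasl_plaintext \
  -X sasl.mechanism=SCRAM-SHA-256 \
  -X sasl.username=<user> -X sasl.password=<password> \
  -X message.timeout.ms=50 \
  -X queue.buffering.max.ms=1 \
  -X enable.idempotence=true \
  -X message.send.max.retries=10 \
  -X retry.backoff.ms=1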
Additional information
Here is what the producer says (many, many times):
What the broker says:
We also observed a sustained increase in memory usage (~1 GB per 5 hours) with the same workload.
JIRA Link: CORE-824