Open · chris-kimberley opened this issue 1 year ago
I'm really interested to know why the broker was unable to recover, and what impact we might see from disabling maintenance mode on the broker.
SolidIO
Can you say more about what this is? I googled "SolidIO", "SolidIO storage", etc., and couldn't find any references at all.
As it turns out, that's the name of the internally developed persistent disk management layer we use within Kubernetes (I thought it was external). It just handles PV lifetimes and such. You can ignore it.
We made an attempt to bring the broker out of maintenance mode. We can see from metrics that leadership of partitions was transferred to it. But it started hitting the same assert again and we had to put it back into maintenance mode to prevent further impact.
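For anyone retracing this, a minimal sketch of the rpk commands involved in toggling maintenance mode, assuming the affected broker's node ID is 0 (the ID here is a placeholder):

```
# Take the broker out of maintenance mode; partition leadership
# is transferred back to it as it rejoins normal operation.
rpk cluster maintenance disable 0

# Re-enable maintenance mode to drain leadership away again
# once the assert reappears.
rpk cluster maintenance enable 0

# Watch the per-node draining state.
rpk cluster maintenance status
```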
Version & Environment
Redpanda version: rpk version says latest (rev c8d4be2). This image was built by @travisdowns with the version name of v28_df58cced6e47.
OS: uname -a reports Linux redpanda-0 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux
/etc/os-release: The issue occurred on a broker running in a self-hosted Kubernetes cluster. We're using a locally built Helm chart based on the standard Redpanda Helm chart. Each pod has a dedicated, persistent 2 TB SSD formatted with XFS (a quick way to verify this from a pod is sketched at the end of this section).
kubectl version:

StatefulSet manifest (redacted):
We're using the Python confluent-kafka library (librdkafka-based).
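For completeness, a hedged sketch of how the disk setup described above can be checked from a pod. The data directory path /var/lib/redpanda/data is the chart default and an assumption here, as is the availability of xfs_info inside the image:

```
# Confirm the data volume's filesystem type and size from inside the pod.
kubectl exec redpanda-0 -- df -hT /var/lib/redpanda/data

# Print XFS geometry (block size, allocation groups) for the mount.
kubectl exec redpanda-0 -- xfs_info /var/lib/redpanda/data
```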
What went wrong?
After running stably for weeks, we encountered this assert 3 times (with different values for fallocation_offset and committed_offset) on one broker, followed immediately by that broker crashing.

Assert failure: (../../../src/v/storage/segment_appender.cc:507) 'false' Could not dma_write: std::__1::system_error (error system:5, Input/output error) {no_of_chunks:64, closed:0, fallocation_offset:33554432, committed_offset:11748024, bytes_flush_pending:0}
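Note that error system:5 is EIO on Linux, i.e. the write was rejected below Redpanda, by the filesystem or the block device. A hedged sketch of how one might look for a matching host-side error; the grep patterns are guesses at typical kernel messages, and the commands must be run on the node hosting the pod:

```
# Scan kernel logs for block-layer or XFS errors around the crash time.
dmesg -T | grep -iE 'i/o error|blk_update_request|xfs'

# Same via journald, if persistent kernel journals are enabled on the node.
journalctl -k | grep -iE 'i/o error|xfs'
```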
The root cause of this write failure is not known; we suspect that it was caused by an issue on the host system, not within the Redpanda container. Kubernetes restarted the pod using the same PV, but the broker was unable to recover and began crashlooping. Each crash was caused by the following assert/callstack:
I was able to stop the crashlooping by putting the broker in maintenance mode. It was then able to join the cluster and remain as a "healthy" member. We have not tried to disable maintenance mode on that broker.
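A minimal sketch of how that "healthy" membership can be checked with rpk, assuming rpk is configured to reach the cluster (a generic check, not necessarily the exact commands we used):

```
# Cluster-wide health: down nodes, leaderless or under-replicated partitions.
rpk cluster health

# Broker list with node IDs, to confirm the node rejoined as a member.
rpk cluster info
```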
What should have happened instead?
After the write failure and the pod restart, Redpanda should have been able to recover and the broker should have rejoined the cluster correctly.
How to reproduce the issue?
Unknown.
Additional information
All metrics for that broker were in line with the other brokers in the cluster up to the point of the initial crash. There is a Slack thread with some discussion.
JIRA Link: CORE-1095