Open solsson opened 1 year ago
This issue hasn't seen activity in 3 months. If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.
Adding replication team. It sounds like poking raft in some way may have unstuck health reporting or affected an offset edge case of some sort?
I spotted the same issue once more around the time I created this issue, but since then I haven't seen it again.
Version & Environment
Redpanda version: (use rpk version): v23.2.7
GKE k8s 1.25.10-gke.1200
Redpanda helm-chart 5.2.0 (attaching my values file)
What went wrong?
Cluster health reported an under-replicated topic for 10+ hours while redpanda_kafka_max_offset reported the same offset on all brokers and rpk cluster logdirs describe reported the same size. The false positive was resolved after a record was produced to the topic.
I run this cluster on quite heavily utilized nodes, with overprovisioned redpanda. Hence I don't expect optimum stability. The issue here however is that the cluster was seemingly healthy but reported >0 under replicated partitions.
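For anyone trying to spot the same false positive, the comparison described above can be scripted. The following is only a minimal sketch: it assumes the metrics text from each broker has already been fetched by some other means, and it assumes both metrics carry `redpanda_topic` and `redpanda_partition` labels (the label names are an assumption, adjust them to what your scrape actually contains). The helper names `parse_samples` and `suspicious_partitions` are made up for this illustration.

```python
import re
from collections import defaultdict

# Matches one Prometheus text-format sample without a timestamp: name{labels} value
SAMPLE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text, metric):
    """Yield (labels, value) for every sample of `metric` in one scrape."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m and m.group(1) == metric:
            yield dict(LABEL_RE.findall(m.group(2))), float(m.group(3))


def partition_key(labels):
    # Label names are assumed, not confirmed by the issue.
    return (labels.get("redpanda_topic", ""), labels.get("redpanda_partition", ""))


def suspicious_partitions(scrapes):
    """scrapes maps broker name -> metrics text.

    Returns partitions that some broker reports as under-replicated even
    though at least two brokers report a max offset and all of them agree.
    """
    offsets = defaultdict(dict)  # partition key -> {broker: max offset}
    under_replicated = set()
    for broker, text in scrapes.items():
        for labels, value in parse_samples(text, "redpanda_kafka_max_offset"):
            offsets[partition_key(labels)][broker] = value
        for labels, value in parse_samples(
            text, "redpanda_kafka_under_replicated_replicas"
        ):
            if value > 0:
                under_replicated.add(partition_key(labels))
    return sorted(
        key
        for key in under_replicated
        if len(offsets.get(key, {})) > 1
        and len(set(offsets[key].values())) == 1
    )
```

Matching offsets do not prove that replication is healthy, so a hit from this check is a hint that the health report may be stale, not a diagnosis.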
What should have happened instead?
rpk cluster health and the redpanda_kafka_under_replicated_replicas metric should not have reported the topic as under replicated, except at the time of the replication issue.
How to reproduce the issue?
I don't know, but I found the following in the logs from around the time the unhealthy state started.
Around that time redpanda-0 repeatedly logs
redpanda-2 repeatedly logs
while redpanda-1 blames the other two
and around a minute later redpanda-1 logs on warn level
Additional information
Helm values: values.yaml.gz
This is the second time I have observed this behavior.
JIRA Link: CORE-1438