redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

Under replicated partitions: false positives until next produced record #13358

Open solsson opened 1 year ago

solsson commented 1 year ago

Version & Environment

Redpanda version (rpk version): v23.2.7

GKE k8s 1.25.10-gke.1200

Redpanda Helm chart 5.2.0 (my values file is attached)

What went wrong?

Cluster health reported an under-replicated topic for 10+ hours, even though redpanda_kafka_max_offset reported the same offset on all brokers and rpk cluster logdirs describe reported the same size on every broker. The false positive was resolved only after a record was produced to the topic.
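
For reference, this is roughly how I compared offsets across brokers. It is only a sketch: I am assuming the public metrics endpoint on the admin port (9644 by default) and the redpanda_topic / redpanda_partition label names, which may differ between versions.

# Sketch: compare redpanda_kafka_max_offset across brokers to spot real lag.
# Assumptions (not from the issue body): metrics are scraped from the admin
# port's /public_metrics endpoint and carry redpanda_topic / redpanda_partition
# labels. Hostnames are the in-cluster DNS names from the logs, so this has to
# run inside the cluster network.
import re
from collections import defaultdict

import requests

BROKERS = [
    "redpanda-0.redpanda.kafka-v3.svc.cluster.local",
    "redpanda-1.redpanda.kafka-v3.svc.cluster.local",
    "redpanda-2.redpanda.kafka-v3.svc.cluster.local",
]
METRIC = "redpanda_kafka_max_offset"
SAMPLE = re.compile(rf'{METRIC}{{(?P<labels>[^}}]*)}}\s+(?P<value>\S+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')

offsets = defaultdict(dict)  # (topic, partition) -> {broker: offset}
for broker in BROKERS:
    text = requests.get(f"http://{broker}:9644/public_metrics", timeout=5).text
    for m in SAMPLE.finditer(text):
        labels = dict(LABEL.findall(m.group("labels")))
        key = (labels.get("redpanda_topic"), labels.get("redpanda_partition"))
        offsets[key][broker] = float(m.group("value"))

for (topic, partition), per_broker in sorted(offsets.items()):
    if len(set(per_broker.values())) > 1:
        print(f"offset mismatch for {topic}/{partition}: {per_broker}")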

I run this cluster on quite heavily utilized nodes, with overprovisioned Redpanda, so I don't expect optimal stability. The issue here, however, is that the cluster was seemingly healthy yet reported >0 under-replicated partitions.

What should have happened instead?

rpk cluster health and the redpanda_kafka_under_replicated_replicas metric should not have reported the topic as under-replicated, except around the time of the actual replication issue.
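
As a stopgap, producing a single record to the affected topic clears the stale report, as described above. A minimal sketch using confluent-kafka; the bootstrap address, port, and topic name are placeholders for my setup, not values from this issue:

# Workaround sketch: produce one record to the stuck topic so the health
# report refreshes. Bootstrap address, port, and topic name are placeholders.
from confluent_kafka import Producer

producer = Producer(
    {"bootstrap.servers": "redpanda-0.redpanda.kafka-v3.svc.cluster.local:9092"}
)
producer.produce("my-under-replicated-topic", value=b"health-report-nudge")
producer.flush(10)  # wait up to 10 s for delivery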

How to reproduce the issue?

I don't know, but I found the following in the logs from around the time the unhealthy state began.

Around that time redpanda-0 repeatedly logs

rpc - transport.cc:208 - RPC timeout (100 ms) to {host: redpanda-1.redpanda.kafka-v3.svc.cluster.local., port: 33145}, method: node_status_rpc::node_status, correlation id: 411494, 1 in flight, time since: {init: 100 ms, enqueue: 100 ms, memory_reserved: 100 ms dispatch: 100 ms, written: 100 ms}, flushed: true

redpanda-2 repeatedly logs

rpc - transport.cc:208 - RPC timeout (100 ms) to {host: redpanda-1.redpanda.kafka-v3.svc.cluster.local., port: 33145}, method: node_status_rpc::node_status, correlation id: 8099141, 1 in flight, time since: {init: 100 ms, enqueue: 100 ms, memory_reserved: 100 ms dispatch: 100 ms, written: 100 ms}, flushed: true

while redpanda-1 blames the other two

INFO  2023-09-09 00:49:18,219 [shard 0] rpc - transport.cc:208 - RPC timeout (100 ms) to {host: redpanda-2.redpanda.kafka-v3.svc.cluster.local., port: 33145}, method: node_status_rpc::node_status, correlation id: 8095576, 1 in flight, time since: {init: 106 ms, enqueue: 106 ms, memory_reserved: 106 ms dispatch: 106 ms, written: 106 ms}, flushed: true
INFO  2023-09-09 00:49:18,316 [shard 0] rpc - transport.cc:208 - RPC timeout (100 ms) to {host: redpanda-0.redpanda.kafka-v3.svc.cluster.local., port: 33145}, method: node_status_rpc::node_status, correlation id: 8095014, 1 in flight, time since: {init: 106 ms, enqueue: 106 ms, memory_reserved: 106 ms dispatch: 106 ms, written: 106 ms}, flushed: true

and around a minute later redpanda-1 logs at warn level

WARN  2023-09-09 00:50:14,209 [shard 0] kafka - snc_quota_manager.cc:419 - qb - Quota balancer is invoked too often (-5457), enforcing minimum sleep time. Consider increasing the balancer period.
WARN  2023-09-09 00:50:26,044 [shard 0] cluster - node_isolation_watcher.cc:57 - Change is_isolated status. Is node isolated: true
WARN  2023-09-09 00:50:26,910 [shard 0] cluster - node_isolation_watcher.cc:57 - Change is_isolated status. Is node isolated: false

Additional information

Helm values: values.yaml.gz

This is the second time I have observed this behavior.
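
If it happens again I will try to capture the timeline with something like the following watch loop around rpk cluster health (just a sketch; the interval and output handling are arbitrary):

# Sketch: periodically record `rpk cluster health` output so the start/end of
# a stale under-replication report can be timestamped. 60 s interval is arbitrary.
import subprocess
import time
from datetime import datetime, timezone

previous = None
while True:
    result = subprocess.run(
        ["rpk", "cluster", "health"], capture_output=True, text=True
    )
    output = result.stdout.strip()
    if output != previous:
        stamp = datetime.now(timezone.utc).isoformat()
        print(f"--- {stamp} ---\n{output}\n")
        previous = output
    time.sleep(60)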

JIRA Link: CORE-1438

github-actions[bot] commented 9 months ago

This issue hasn't seen activity in 3 months. If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.

dotnwat commented 9 months ago

Adding the replication team. It sounds like poking raft in some way may have unstuck health reporting, or hit an offset edge case of some sort?

solsson commented 4 months ago

I spotted the same issue once more around the time I created this issue, but since then I haven't seen it again.