Open · travisdowns opened 10 months ago
From the logs attached it seems that controller_backend is not active at all on affected shards. It may be the case that some of the partitions failed to stop. I will try to reproduce the issue, as the log doesn't contain entries from the moment when the problem occurred (i.e. the creation/deletion of the 22,000-partition topic).
@mmaslankaprv I learned overnight that simply creating/destroying the 23k partitions is not enough: this was working fine when run 100s of times in a loop. So the actual thing I was doing was a short load test (omb_validation_test.py::OMBValidationTest.test_max_partitions) in between, and I guess this is somehow relevant. Will provide more details soon.
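For a rough picture, the workflow is along the lines of the sketch below; the topic name and exact partition count are illustrative, and the load-test step stands in for the OMB run mentioned above:

```bash
# Rough sketch of the create/load/delete loop described above.
# Topic name and counts are placeholders, not the exact test parameters.
for i in $(seq 1 100); do
  rpk topic create stress-topic -p 22800 -r 3
  # ... run a short load test against the cluster here (e.g. the OMB-based
  # omb_validation_test.py::OMBValidationTest.test_max_partitions) ...
  rpk topic delete stress-topic
done
```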
this may be related: https://github.com/redpanda-data/redpanda/issues/15392
Interesting thing is that these partitions are seen in the logs (grep for test-topic-) and each shard hosts 300-400 partitions of a single topic:
This suggests that when a topic is deleted and partitions are stopped, there is a chance that on one of the shards they will get stuck waiting for some common resource (as opposed to partitions getting stuck independently).
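One way to see this per-shard breakdown from a single node's log, a minimal sketch assuming the log file is redpanda.log and that each line carries the usual "[shard N]" tag:

```bash
# Count test-topic- log lines per shard on one node.
# File name and the shard-tag format are assumptions; adjust as needed.
grep 'test-topic-' redpanda.log \
  | grep -oE 'shard [0-9]+' \
  | sort | uniq -c | sort -rn
```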
It did strike me that there might be some sort of deadlock, e.g., via ssg exhaustion or some other route.
I think I was able to reproduce this with a simpler setup and will provide more details soon.
@travisdowns let us know if you have any updates here
@mmaslankaprv - we ended up discussing this a bit on Slack, and I think the current status is that you have the timed watchdog checked in to give more info when delta application stalls?
Version & Environment
Redpanda version: 23.2.17
What went wrong?
When repeatedly creating a topic with a high number of partitions (22,800 in this case), some shards seem to get into a state where they cannot complete the creation of replica state. Notes:
Here is a specific example of a node with shards in a bad state.
We create the 30-partition topic `ttta` using `rpk topic create ttta -p30`.
`rpk cluster health` shows some `ttta` partitions as under-replicated (ignore the other partitions):
This state is persistent.
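For reference, the two commands from the steps above, runnable as-is against the affected cluster (rpk targets localhost:9092 by default):

```bash
# Create the small topic, then check cluster health for stuck partitions.
rpk topic create ttta -p30
rpk cluster health    # some ttta partitions stay under-replicated indefinitely
```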
We can look at the number of partitions per shard on broker 1 (the controller leader) and see that some shards have many fewer partitions:
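One way such a per-shard count can be pulled from a single broker, a sketch that assumes the admin API's /v1/partitions endpoint and its per-partition "core" field (host name and port are placeholders):

```bash
# Count partitions per shard (core) reported by one broker's admin API.
curl -s http://broker-1:9644/v1/partitions | jq -r '.[].core' | sort -n | uniq -c
```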
All the cores with 4 partitions are stuck: they are not able to reconcile new replicas. The 4 partitions they do have are from topics like `__consumer_offsets` which existed before the problem manifested.
We can look at a specific partition, `kafka/ttta/3`, which had a replica assigned on shard 11 on all 3 brokers. On brokers 0 and 2, where shard 11 is not stuck, the partition table reports the partition as done, like so:
On broker 1, however, the partition is shown as "in_progress":
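A sketch of how this comparison across brokers can be scripted, assuming the admin API's /v1/partitions/<ns>/<topic>/<partition> endpoint reports the status shown above (host names are placeholders):

```bash
# Compare the reconciliation status of kafka/ttta/3 on each broker.
for b in broker-0 broker-1 broker-2; do
  echo -n "$b: "
  curl -s "http://$b:9644/v1/partitions/kafka/ttta/3" | jq -r '.status'
done
```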
Find here the full `TRACE` logs for the period including the creation of the `ttta` topic on broker 1 (the controller). You can see that after `ttta/3` is assigned to shard 11 there are no log lines at all regarding any reconciliation happening in controller_backend, unlike unaffected partitions which have them (the only log lines are the result of a leader being elected for this partition on one of the other two brokers, and corresponding updates to the leader table).
This partition is still available since it is healthy on the other two brokers, but shard 19 is stuck on both brokers 0 and 1, so partitions on that shard (for example, `ttta/11`) are leaderless and unavailable.
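A minimal sketch of the kind of check described above, assuming the TRACE log is in redpanda.log and that reconciliation lines mention both controller_backend and the NTP (adjust the patterns to match the exact log format):

```bash
# Look for any controller_backend reconciliation activity for kafka/ttta/3.
# No matches on the affected broker/shard is the symptom described above.
grep 'controller_backend' redpanda.log | grep 'kafka/ttta/3'
```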
What should have happened instead?
Partitions should reconcile and become available.
How to reproduce the issue?
in_progress topics on the leader.

JIRA Link: CORE-1656