redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

Failure in `raft_availability_test.RaftAvailabilityTest.test_leader_restart` #3468

Open gousteris opened 2 years ago

gousteris commented 2 years ago
[INFO  - 2022-01-16 07:27:36,890 - runner_client - log - lineno:266]: RunnerClient: rptest.tests.raft_availability_test.RaftAvailabilityTest.test_leader_restart: Summary: AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/raft_availability_test.py", line 404, in test_leader_restart
    assert new_leader_id == self._get_leader()[0]
AssertionError

ref: https://buildkite.com/vectorized/redpanda/builds/6129#cd2aede7-c30f-4777-829b-beeb64247ca2

JIRA Link: CORE-815
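For context on the assertion, here is a minimal sketch (not the actual rptest code) of what the failing check amounts to. The only details taken from the traceback are that self._get_leader()[0] yields the current leader's node ID and that it is compared against the previously observed new_leader_id; everything else is illustrative.

def check_leader_stable(get_leader, new_leader_id):
    # get_leader() is assumed to return a tuple whose first element is the
    # leader's node ID, analogous to RaftAvailabilityTest._get_leader().
    current_leader_id = get_leader()[0]
    # This is the comparison that raised the AssertionError above: the freshly
    # queried leader did not match the leader the test had already observed.
    assert new_leader_id == current_leader_id, (
        f"expected leader {new_leader_id}, metadata reports {current_leader_id}")

# Failing scenario with made-up values: the test expected node 2, but a later
# leadership query reports node 3.
try:
    check_leader_stable(lambda: (3, 1), new_leader_id=2)
except AssertionError as e:
    print(e)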

dotnwat commented 2 years ago

Note there are two failures in that CI link. @gousteris is there a separate issue tracking the coproc failure?

dotnwat commented 2 years ago

@mmaslankaprv do I remember correctly that you recently added a heuristic where, if leadership wasn't available, you'd return the leader from the health manager or something?
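A rough sketch of the kind of fallback being asked about, purely illustrative: the names (leaders_table, health_reports, resolve_leader) are placeholders, not the actual redpanda API. The idea is that when the leaders table has no current leader for a partition, the answer falls back to the last leader seen in health reports, which can be stale.

def resolve_leader(ntp, leaders_table, health_reports):
    # leaders_table maps an ntp (namespace/topic/partition) to its current
    # leader, or None while an election is in progress.
    leader = leaders_table.get(ntp)
    if leader is not None:
        return leader
    # Heuristic fallback: answer with the last leader seen via health reports.
    # The risk is that this value can be stale, so a caller may be told about
    # a node that no longer holds (or never held) leadership in this term.
    return health_reports.get(ntp)

ntp = ("kafka", "topic-ghcmgkqdld", 0)
print(resolve_leader(ntp, {ntp: None}, {ntp: 3}))  # -> 3, a possibly stale answer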

gousteris commented 2 years ago

again https://buildkite.com/vectorized/redpanda/builds/6175#a595a840-ed9e-4a90-adc4-6c21b849836f

gousteris commented 2 years ago

@dotnwat I can only see coproc_fixture_rpunit in addition to the raft one

mmaslankaprv commented 2 years ago

I think this may be related to #3486.

graphcareful commented 2 years ago

Filed #3491 to track the coproc test failure

jcsp commented 2 years ago

This looks different from #3486: the leadership is not reading as a previous value; it's an unexpected new value.

Client is reading the leader as node 3, but node 3 is not a leader and never has been for any partition.

Client:

[DEBUG - 2022-01-16 07:27:32,893 - kafka_cat - _cmd_raw - lineno:54]: {"originating_broker":{"id":2,"name":"docker_n_17:9092/2"},"query":{"topic":"*"},"controllerid":2,"brokers":[{"id":3,"name":"docker_n_38:9092"},{"id":2,"name":"docker_n_17:9092"},{"id":1,"name":"docker_n_16:9092"}],"topics":[{"topic":"topic-ghcmgkqdld","partitions":[{"partition":0,"leader":3,"replicas":[{"id":1},{"id":2},{"id":3}],"isrs":[{"id":1},{"id":2},{"id":3}]}]}]}

Server:

TRACE 2022-01-16 07:27:32,892 [shard 0] kafka - request_context.h:146 - [172.18.0.51:47574] sending 3:metadata response {throttle_time_ms=0 brokers={{node_id=3 host=docker_n_38 port=9092 rack={nullopt}}, {node_id=2 host=docker_n_17 port=9092 rack={nullopt}}, {node_id=1 host=docker_n_16 port=9092 rack={nullopt}}} cluster_id={nullopt} controller_id=2 topics={{error_code={ error_code: none [0] } name={topic-ghcmgkqdld} is_internal=false partitions={{error_code={ error_code: none [0] } partition_index=0 leader_id=3 leader_epoch=-1 replica_nodes={{1}, {2}, {3}} isr_nodes={{1}, {2}, {3}} offline_replicas={}}} topic_authorized_operations=0}} cluster_authorized_operations=-2147483648}

# ^ that's node 2, which *is* the leader, telling the client that node 3 is the leader.

Not yet clear how that can possibly happen -- perhaps we have some place where a node ID is being used to index into an array and the nodes aren't in the order expected.
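To make that hypothesis concrete, a small illustration (not redpanda code) of how an ID-vs-index mix-up could surface the wrong node. In the metadata response above the broker list is ordered [3, 2, 1], so a zero or defaulted index would happen to land on node 3.

brokers = [{"id": 3}, {"id": 2}, {"id": 1}]  # order as it appears in the response above

def broker_by_id(node_id):
    # Correct lookup: match on the ID field.
    return next((b for b in brokers if b["id"] == node_id), None)

def broker_by_position(index):
    # Buggy lookup: treat an ID (or an uninitialised/default value such as 0)
    # as a positional index into the list.
    return brokers[index]

print(broker_by_id(2))        # {'id': 2}
print(broker_by_position(0))  # {'id': 3} -- one way a stray 3 could appear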

jcsp commented 2 years ago

Hmm, so the most recent update to the partition table is a null leader, as in #3486, but somehow that's showing up in a metadata response as "3" rather than -1.

TRACE 2022-01-16 07:27:32,061 [shard 1] cluster - partition_leaders_table.cc:113 - updated partition: {kafka/topic-ghcmgkqdld/0} leader: {term: 1, current leader: {nullopt}, previous leader: {2}}
TRACE 2022-01-16 07:27:32,892 [shard 0] kafka - request_context.h:146 - [172.18.0.51:47574] sending 3:metadata response {throttle_time_ms=0 brokers={{node_id=3 host=docker_n_38 port=9092 rack={nullopt}}, {node_id=2 host=docker_n_17 port=9092 rack={nullopt}}, {node_id=1 host=docker_n_16 port=9092 rack={nullopt}}} cluster_id={nullopt} controller_id=2 topics={} cluster_authorized_operations=-2147483648}

This is probably fixed by the fix for #3486, but there's still a mystery (at least to me) about where that 3 is coming from, which might be another bug in its own right.
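For reference, the Kafka metadata convention is leader_id = -1 when there is no current leader, so a null entry in the partition leaders table should surface as -1, never as a concrete node ID. A minimal sketch of that expected mapping:

NO_LEADER = -1  # Kafka metadata convention for "no current leader"

def metadata_leader_id(table_leader):
    # Expected mapping from the partition leaders table to the wire value.
    return table_leader if table_leader is not None else NO_LEADER

print(metadata_leader_id(None))  # -1, what the response above should have reported
print(metadata_leader_id(2))     # 2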

jcsp commented 2 years ago

Removing ci-failure because I'm fairly sure the conditions that reproduced this are fixed by the fix for #3486, but keeping the ticket open for comments on how we could have seen leader=3 under these conditions: @mmaslankaprv @dotnwat can you think of any mechanisms?

gousteris commented 2 years ago

Seen again, along with cluster_config_test.ClusterConfigTest.test_restart: https://buildkite.com/vectorized/redpanda/builds/6240#4d23da5e-9696-4f1e-875b-df7298790740