Lazin opened 1 year ago
The error code in the test means that leadership indeed cannot be transferred, at least at the moment.
```
INFO 2022-10-23 07:00:10,029 [shard 1] admin_api_server - admin_server.cc:1399 - Leadership transfer request for raft group 1 to node {1}
INFO 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2937 - Starting leadership transfer from {id: {3}, revision: {18}} to {id: {1}, revision: {18}} in term 1
TRACE 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2792 - transfer leadership: preparing target={id: {1}, revision: {18}}, dirty_offset=0
TRACE 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2798 - transfer leadership: cleared oplock
DEBUG 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2821 - transfer leadership: starting node {id: {1}, revision: {18}} recovery
INFO 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2840 - transfer leadership: waiting for node {id: {1}, revision: {18}} to catch up
TRACE 2022-10-23 07:00:10,030 [shard 1] raft - [follower: {id: {1}, revision: {18}}] [group_id:1, {kafka/topic-ffsauwcmof/0}] - recovery_stm.cc:535 - Finished recovery
INFO 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2856 - transfer leadership: finished waiting on node {id: {1}, revision: {18}} recovery
WARN 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:3021 - Cannot transfer leadership: {id: {1}, revision: {18}} needs recovery (-9223372036854775808, -9223372036854775808, 0)
```
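The two `-9223372036854775808` values in the WARN line are `INT64_MIN`, i.e. the follower's `match_index` and `last_dirty_log_index` are still at their default-initialized values even though recovery just reported finished. A minimal sketch of the comparison (the real `needs_recovery()` lives in consensus internals; the struct and signature here are assumptions for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplification of the follower-stats check behind the WARN
// line; field names mirror the log output, not the real Redpanda types.
struct follower_stats {
    int64_t match_index;
    int64_t last_dirty_log_index;
};

bool needs_recovery(const follower_stats& meta, int64_t leader_dirty_offset) {
    // The follower is considered behind if either tracked offset trails
    // the leader's dirty offset.
    return meta.match_index < leader_dirty_offset
           || meta.last_dirty_log_index < leader_dirty_offset;
}
```

With both follower offsets at `INT64_MIN` and the leader's `dirty_offset` at 0 (as in the log), this is trivially true, so the transfer is refused despite the "Finished recovery" trace just above.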
Looking further to understand the root cause
Reopening because it's not clear if #7297 is the end of the story or if there is more work here to deal with how the consensus object got into the state to begin with: https://github.com/redpanda-data/redpanda/pull/7297#issuecomment-1317658532
A prior chat with @jcsp for reference.
Alexey Biryukov
As I can see from the log, recovery of the follower that leadership is being transferred to starts and immediately finishes, as if no recovery were needed. However, that is not the case, and `needs_recovery()` still returns true.
For that to happen, `recovery_stm::is_recovery_finished()` must return true immediately. The last statement cannot do that (we know the values from the log), and the follower's meta is still there (the log proves that), so it must be one of the preliminary checks. There's nothing about shutting down in the log, so
`_ptr->_as.abort_requested() || _ptr->_bg.is_closed()`
is false. The current leader was elected just a millisecond ago, but the recovery_stm is created after that, so
`_term != _ptr->term() || !_ptr->is_elected_leader()`
is false. That leaves the `_stop_requested` flag, which in this case reflects either the follower's meta going missing or `is_recovery_finished()` having returned false before.
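The elimination argument above can be sketched as a toy model of the early-exit logic (a hedged simplification: the member names follow the quoted snippets, but the structure is an assumption, not the real `recovery_stm`):

```cpp
#include <cassert>
#include <cstdint>

// Toy model of recovery_stm::is_recovery_finished()'s preliminary checks,
// for illustrating which branch can fire in the logged scenario.
struct recovery_stm_model {
    bool abort_requested = false;  // _ptr->_as.abort_requested()
    bool bg_closed = false;        // _ptr->_bg.is_closed()
    int64_t term = 1;              // term captured when the stm was created
    int64_t current_term = 1;      // _ptr->term()
    bool elected_leader = true;    // _ptr->is_elected_leader()
    bool stop_requested = false;   // _stop_requested
    bool follower_caught_up = false;

    bool is_recovery_finished() const {
        // Any of these preliminary checks ends recovery immediately,
        // regardless of whether the follower actually caught up.
        if (abort_requested || bg_closed) { return true; }
        if (term != current_term || !elected_leader) { return true; }
        if (stop_requested) { return true; }
        // Otherwise, recovery is finished only once the follower caught up.
        return follower_caught_up;
    }
};
```

In the logged scenario the shutdown and term checks are all false, so `stop_requested` is the only branch that can explain the instant "Finished recovery".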
John Spray
The leader thinks recovery is finished, but has not updated the follower stats to reflect that. It updates them 130us later, the next time it gets a heartbeat reply from the node.
I think we have two bugs here:
- A node's follower stats should be updated when we exit recovery, maybe recovery_stm should be doing this? I haven't looked in detail, maybe you can look and/or talk to Michal about it next week
- We are returning 504 instead of 503 in this case: it's not really a timeout, it's a flap (we thought recovery was done but it's not) -- even without the first bug this can happen, and is a 503 (plz retry) rather than a 504 (we couldn't do it in time).
For the second one, can you open a PR to change this block:
```cpp
auto& meta = _fstats.get(target_rni);
if (needs_recovery(meta, _log.offsets().dirty_offset)) {
    vlog(
      _ctxlog.warn,
      "Cannot transfer leadership: {} needs recovery ({}, {}, "
      "{})",
      target_rni,
      meta.match_index,
      meta.last_dirty_log_index,
      _log.offsets().dirty_offset);
    return seastar::make_ready_future<std::error_code>(
      make_error_code(errc::timeout));
}
```
If we return `errc::exponential_backoff` there instead, then admin_server.cc will convert it to a 503 instead of a 504.
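The proposed status-code fix can be sketched as the error-to-HTTP mapping John describes (the `http_status_for` helper and the `errc` enum here are hypothetical stand-ins for the actual conversion in admin_server.cc):

```cpp
#include <cassert>

// Hypothetical mirror of the raft error codes involved in the fix.
enum class errc { success, timeout, exponential_backoff };

// Sketch of the admin-server conversion: a genuine deadline miss maps to
// 504 (gateway timeout), while a retryable flap such as "follower still
// needs recovery" maps to 503 (service unavailable, please retry).
int http_status_for(errc e) {
    switch (e) {
    case errc::timeout:
        return 504;
    case errc::exponential_backoff:
        return 503;
    default:
        return 200;
    }
}
```

Under this mapping, returning `errc::exponential_backoff` from the `needs_recovery` branch turns the client-visible 504 into a retryable 503.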
Version & Environment
Redpanda version (use `rpk version`): dev

The `transfer_leadership` admin API call times out (504): https://buildkite.com/redpanda/redpanda/builds/17112#01840352-dae2-40a6-9e26-a1a068b8f1bb

JIRA Link: CORE-1055