Lazin opened 1 year ago
The error code in the test means that leadership indeed cannot be transferred, at least at the moment.
```
INFO 2022-10-23 07:00:10,029 [shard 1] admin_api_server - admin_server.cc:1399 - Leadership transfer request for raft group 1 to node {1}
INFO 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2937 - Starting leadership transfer from {id: {3}, revision: {18}} to {id: {1}, revision: {18}} in term 1
TRACE 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2792 - transfer leadership: preparing target={id: {1}, revision: {18}}, dirty_offset=0
TRACE 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2798 - transfer leadership: cleared oplock
DEBUG 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2821 - transfer leadership: starting node {id: {1}, revision: {18}} recovery
INFO 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2840 - transfer leadership: waiting for node {id: {1}, revision: {18}} to catch up
TRACE 2022-10-23 07:00:10,030 [shard 1] raft - [follower: {id: {1}, revision: {18}}] [group_id:1, {kafka/topic-ffsauwcmof/0}] - recovery_stm.cc:535 - Finished recovery
INFO 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2856 - transfer leadership: finished waiting on node {id: {1}, revision: {18}} recovery
WARN 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:3021 - Cannot transfer leadership: {id: {1}, revision: {18}} needs recovery (-9223372036854775808, -9223372036854775808, 0)
```
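The two `-9223372036854775808` values in the WARN line are `INT64_MIN`, i.e. the follower's `match_index` and `last_dirty_log_index` are still at their default-initialized values even though recovery just reported finished. A minimal sketch of the comparison (the real `needs_recovery()` lives in consensus internals; the struct and signature here are assumptions for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplification of the follower-stats check behind the WARN
// line; field names mirror the log output, not the real Redpanda types.
struct follower_stats {
    int64_t match_index;
    int64_t last_dirty_log_index;
};

bool needs_recovery(const follower_stats& meta, int64_t leader_dirty_offset) {
    // The follower is considered behind if either tracked offset trails
    // the leader's dirty offset.
    return meta.match_index < leader_dirty_offset
           || meta.last_dirty_log_index < leader_dirty_offset;
}
```

With both follower offsets at `INT64_MIN` and the leader's `dirty_offset` at 0 (as in the log), this is trivially true, so the transfer is refused despite the "Finished recovery" trace just above.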
Looking further to understand the root cause
Reopening because it's not clear if #7297 is the end of the story or if there is more work here to deal with how the consensus object got into the state to begin with: https://github.com/redpanda-data/redpanda/pull/7297#issuecomment-1317658532
A prior chat with @jcsp for reference.
Alexey Biryukov
As I can see from the log, recovery of the follower that leadership is being transferred to starts and immediately finishes, as if no recovery were needed. However, that is not the case, and `needs_recovery()` still returns true.
For that to happen, `recovery_stm::is_recovery_finished()` must return true immediately. The last statement cannot do that (we know the values from the log), and the follower's meta is still there (the log proves that), so it must be one of the preliminary checks. There's nothing about shutting down in the log, so
`_ptr->_as.abort_requested() || _ptr->_bg.is_closed()`
is false. The current leader was elected just a millisecond ago, but the recovery_stm is created after that, so
`_term != _ptr->term() || !_ptr->is_elected_leader()`
is false. That leaves the `_stop_requested` flag, which in this case reflects either the follower's meta going missing or `is_recovery_finished()` having returned false before.
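The elimination argument above can be sketched as a toy model of the early-exit logic (a hedged simplification: the member names follow the quoted snippets, but the structure is an assumption, not the real `recovery_stm`):

```cpp
#include <cassert>
#include <cstdint>

// Toy model of recovery_stm::is_recovery_finished()'s preliminary checks,
// for illustrating which branch can fire in the logged scenario.
struct recovery_stm_model {
    bool abort_requested = false;  // _ptr->_as.abort_requested()
    bool bg_closed = false;        // _ptr->_bg.is_closed()
    int64_t term = 1;              // term captured when the stm was created
    int64_t current_term = 1;      // _ptr->term()
    bool elected_leader = true;    // _ptr->is_elected_leader()
    bool stop_requested = false;   // _stop_requested
    bool follower_caught_up = false;

    bool is_recovery_finished() const {
        // Any of these preliminary checks ends recovery immediately,
        // regardless of whether the follower actually caught up.
        if (abort_requested || bg_closed) { return true; }
        if (term != current_term || !elected_leader) { return true; }
        if (stop_requested) { return true; }
        // Otherwise, recovery is finished only once the follower caught up.
        return follower_caught_up;
    }
};
```

In the logged scenario the shutdown and term checks are all false, so `stop_requested` is the only branch that can explain the instant "Finished recovery".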
John Spray
The leader thinks recovery is finished, but has not updated the follower stats to reflect that. It updates them 130us later, the next time it gets a heartbeat reply from the node.
I think we have two bugs here:
- A node's follower stats should be updated when we exit recovery, maybe recovery_stm should be doing this? I haven't looked in detail, maybe you can look and/or talk to Michal about it next week
- We are returning 504 instead of 503 in this case: it's not really a timeout, it's a flap (we thought recovery was done but it's not) -- even without the first bug this can happen, and is a 503 (plz retry) rather than a 504 (we couldn't do it in time).
For the second one, can you open a PR to change this block:
```cpp
auto& meta = _fstats.get(target_rni);
if (needs_recovery(meta, _log.offsets().dirty_offset)) {
    vlog(
      _ctxlog.warn,
      "Cannot transfer leadership: {} needs recovery ({}, {}, "
      "{})",
      target_rni,
      meta.match_index,
      meta.last_dirty_log_index,
      _log.offsets().dirty_offset);
    return seastar::make_ready_future<std::error_code>(
      make_error_code(errc::timeout));
}
```
If we return `errc::exponential_backoff` there instead, then admin_server.cc will convert it to a 503 instead of a 504.
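The proposed status-code fix can be sketched as the error-to-HTTP mapping John describes (the `http_status_for` helper and the `errc` enum here are hypothetical stand-ins for the actual conversion in admin_server.cc):

```cpp
#include <cassert>

// Hypothetical mirror of the raft error codes involved in the fix.
enum class errc { success, timeout, exponential_backoff };

// Sketch of the admin-server conversion: a genuine deadline miss maps to
// 504 (gateway timeout), while a retryable flap such as "follower still
// needs recovery" maps to 503 (service unavailable, please retry).
int http_status_for(errc e) {
    switch (e) {
    case errc::timeout:
        return 504;
    case errc::exponential_backoff:
        return 503;
    default:
        return 200;
    }
}
```

Under this mapping, returning `errc::exponential_backoff` from the `needs_recovery` branch turns the client-visible 504 into a retryable 503.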
Version & Environment
Redpanda version (use `rpk version`): dev

The `transfer_leadership` admin API call times out (504): https://buildkite.com/redpanda/redpanda/builds/17112#01840352-dae2-40a6-9e26-a1a068b8f1bb

JIRA Link: CORE-1055