Closed Lazin closed 4 days ago
Thanks, @Lazin -- can you please mark which tickets close out with this fix?
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58368#01934ae6-8eed-4373-9c01-375c81802be1
/backport v24.3.x
/backport v24.2.x
Several methods of `archival_metadata_stm` are invoked indirectly by the `ntp_archiver` and require a lock to be held. The methods are `do_sync`, `do_replicate_commands`, and `do_add_segments`. Each of them checks this invariant with an assertion.
If the lock is not taken, this assertion is triggered. Another assertion was also triggered, one that indicates that `do_sync` was called concurrently with `do_replicate_commands`. The problem was caused by the code in the `command_batch_builder`, in the way it invoked `do_replicate_commands`.

There, `do_replicate_commands` is called first and creates a future. Then `finally` is called on this future with a continuation. The continuation is supposed to prolong the lifetime of the `units` to guarantee that the invariant is not broken. But if this method is invoked concurrently with the `stop` method of `archival_metadata_stm`, it is possible that the units are acquired successfully but the `_gate.hold()` call throws. In this case the future keeps running in the background, and it becomes possible to call `do_sync` or `do_replicate_commands` again and break the invariant.

The check inside `do_replicate_commands` that calls `_lock.try_get_units()` to verify that the lock is held can still pass even if the semaphore was broken. `try_get_units` can't throw, so when the race happens (when `_gate.hold` throws) the check can still pass.

In the failed test the assertion was triggered after the STM was stopped, and I was able to reproduce the issue by inserting sleeps manually. I don't think a reliable reproducer in the form of a unit test or ducktape test is possible here. The problem can only affect shutdown and is difficult to trigger.
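To illustrate why a check built on a non-throwing try-acquire can pass on a broken lock, here is a minimal standalone model. It does not use Seastar: `toy_lock`, its `units` type, and `looks_held` are hypothetical names, and the model assumes that a broken lock simply refuses `try_get_units()` instead of throwing.

```cpp
#include <optional>
#include <utility>

// Hypothetical stand-in for a mutex with try-acquire semantics.
// Assumption: once broken, try_get_units() returns nullopt and never throws.
class toy_lock {
public:
    // RAII handle: releases the lock when destroyed.
    class units {
    public:
        explicit units(toy_lock* owner) noexcept : _owner(owner) {}
        units(units&& o) noexcept : _owner(std::exchange(o._owner, nullptr)) {}
        units(const units&) = delete;
        units& operator=(const units&) = delete;
        units& operator=(units&&) = delete;
        ~units() {
            if (_owner) {
                _owner->_held = false;
            }
        }

    private:
        toy_lock* _owner;
    };

    std::optional<units> try_get_units() noexcept {
        if (_broken || _held) {
            return std::nullopt;
        }
        _held = true;
        return units(this);
    }

    void broken() noexcept { _broken = true; }

private:
    bool _held = false;
    bool _broken = false;
};

// "Lock is held" check in the style described above:
// if the lock cannot be acquired, assume somebody holds it.
bool looks_held(toy_lock& lock) noexcept {
    // Any units acquired here are released at the end of the expression.
    return !lock.try_get_units().has_value();
}
```

In this model, `looks_held` returns `true` both while somebody holds the units and after `broken()` has been called with nobody holding them, so the check alone cannot tell the two states apart.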
The fix is twofold:
- [...] `replicate`, so we won't get a dangling future if `_gate.hold` throws
- [...] `persisted_stm::stop`
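The ordering problem and the first part of the fix can be modelled without Seastar. The sketch below is an illustration under stated assumptions, not the code from this PR: `toy_gate`, `toy_holder`, `toy_future`, `start_replicate`, `racy_order`, and `safe_order` are all hypothetical, and it assumes the gate holder was obtained while building the `finally` continuation. It relies on C++17 evaluation order, where the call that produces the future is sequenced before the arguments to `finally`.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types; none of these are Seastar or Redpanda APIs.
struct toy_holder {};

struct toy_gate {
    bool closed = false;
    // Throws when the gate is closed, as a stand-in for a failing hold().
    toy_holder hold() const {
        if (closed) {
            throw std::runtime_error("gate closed");
        }
        return {};
    }
};

struct toy_future {
    std::vector<std::string>* log;
    template <typename Func>
    toy_future finally(Func&&) {
        log->push_back("continuation attached");
        return *this;
    }
};

// Stand-in for the call that starts the replication work.
toy_future start_replicate(std::vector<std::string>& log) {
    log.push_back("replicate started");
    return toy_future{&log};
}

// Racy ordering: the work is started first, then hold() throws while the
// continuation is being built, so the continuation is never attached.
std::vector<std::string> racy_order(const toy_gate& gate) {
    std::vector<std::string> log;
    try {
        start_replicate(log).finally([h = gate.hold()] {});
    } catch (const std::runtime_error&) {
        log.push_back("hold threw");
    }
    return log;
}

// Safer ordering: take the holder first, so a throw happens before any
// work has been started.
std::vector<std::string> safe_order(const toy_gate& gate) {
    std::vector<std::string> log;
    try {
        auto h = gate.hold();
        start_replicate(log).finally([h = std::move(h)] {});
    } catch (const std::runtime_error&) {
        log.push_back("hold threw");
    }
    return log;
}
```

With a closed gate, `racy_order` records that the work started before the exception, while `safe_order` records only the exception. The second case is what "no dangling future" means in this model.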
Backports Required
Release Notes