redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com
9.69k stars 591 forks source link

[CORE-8010] cluster: Fix race condition in the `archival_metadata_stm` #24210

Closed Lazin closed 4 days ago

Lazin commented 4 days ago

Several methods of the archival_metadata_stm are invoked by the ntp_archiver indirectly and require a lock to be held. The methods are do_sync, do_replicate_commands, and do_add_segments.

The methods are checking the invariant using this code:

    vassert(
      !_lock.try_get_units().has_value(),
      "Attempt to replicate STM command while not under lock");

So basically if the lock is not taken this assertion will be triggered. Another assertion that indicates that the method do_sync was called concurrently with do_replicate_commands was triggered. The problem was caused by the code in the command_batch_builder. The code was invoking the do_replicate_commands method this way:

    auto units = co_await _stm.get()._lock.get_units(_as);

    ...!scheduling point!...

    auto batch = std::move(_builder).build();
    auto f = _stm.get()
               .do_replicate_commands(std::move(batch), _as)
               .finally([u = std::move(units), h = _stm.get()._gate.hold());

    co_return co_await ssx::with_timeout_abortable(
      std::move(f), model::no_timeout, _as);

Here the do_replicate_commands method is called first. It creates a future. Then we're calling finally method on this future and passing the continuation. This continuation is supposed to prolong the lifetime of the units to guarantee that the invariant is not broken. But if this method is invoked concurrently with stop method of the archival_metadata_stm it is possible that the units will be acquired successfully but _gate.hold() call will throw the exception. In this case the future will be running in the background and it will be possible to call do_sync or do_replicate_commands again and break the invariant.

The check inside the do_replicate_commands that calls _lock.try_get_units() to verify that the lock is held can still pass even if the semaphore was broken. The try_get_units method can't throw and when the race happens (when _gate.hold throws) can already pass.

In the failed test the assertion was triggered after the STM was stopped and I was able to reproduce the issue by inserting sleeps manually. I don't think that reliable reproducer in form of unit-test or ducktape test is possible here. The problem can only affect shutdown and is difficult to trigger.

The fix is twofold:

Backports Required

Release Notes

piyushredpanda commented 4 days ago

Thanks, @Lazin -- can you please mark which tickets close out with this fix?

vbotbuildovich commented 4 days ago

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58368#01934ae6-8eed-4373-9c01-375c81802be1

vbotbuildovich commented 4 days ago

/backport v24.3.x

vbotbuildovich commented 4 days ago

/backport v24.2.x