Open vbotbuildovich opened 2 months ago
node crashed
ERROR 2024-06-28 20:41:19,249 [shard 0:admi] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-02363aa4813e8de7f-1/redpanda/redpanda/src/v/raft/mux_state_machine.h:344) 'it != _results.end()' last applied offset 33 is greater than returned replicate result 33. this must imply existence of a result in results map
ERROR 2024-06-28 20:41:19,249 [shard 0:admi] assert - Backtrace below:
0xb1b26 /opt/redpanda_installs/ci/lib/libseastar.so+0x221563b /opt/redpanda_installs/ci/lib/libseastar.so+0x2217128 /opt/redpanda_installs/ci/lib/libseastar.so+0x2217ac0 /opt/redpanda_installs/ci/lib/libv_v_cluster.so+0xca498f8 /opt/redpanda_installs/ci/lib/libv_v_cluster.so+0xca4860a /opt/redpanda_installs/ci/lib/libv_v_cluster.so+0xca48217 /opt/redpanda_installs/ci/lib/libv_v_application.so+0x43574dc /opt/redpanda_installs/ci/lib/libv_v_application.so+0x4357053 /opt/redpanda_installs/ci/lib/libv_v_application.so+0x4356e06 /opt/redpanda_installs/ci/lib/libseastar.so+0x1ac2792 /opt/redpanda_installs/ci/lib/libseastar.so+0x1ac8d36 /opt/redpanda_installs/ci/lib/libseastar.so+0x1acb8b5 /opt/redpanda_installs/ci/lib/libseastar.so+0x1ac9d47 /opt/redpanda_installs/ci/lib/libseastar.so+0x185ca65 /opt/redpanda_installs/ci/lib/libseastar.so+0x185adff /opt/redpanda_installs/ci/lib/libv_v_application.so+0x4a6eb8a 0x143a67 /opt/redpanda_installs/ci/lib/libc.so.6+0x2a087 /opt/redpanda_installs/ci/lib/libc.so.6+0x2a14a 0x6ca84
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::futurize<seastar::future<void>>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>>(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&), void>
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::futurize<seastar::future<void>>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>>(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&), void>
--------
seastar::internal::coroutine_traits_base<void>::promise_type
The assertion was triggered because the same base offset was assigned twice to different batches:
TRACE 2024-06-28 20:41:19,236 [shard 0:admi] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:168 - Leader append result: {time_since_append: 20, base_offset: 33, last_offset: 33, last_term: 1, byte_size: 297}
TRACE 2024-06-28 20:41:19,242 [shard 0:main] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:168 - Leader append result: {time_since_append: 3, base_offset: 33, last_offset: 33, last_term: 1, byte_size: 75}
The problem is caused by a race condition in the storage layer.
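To check other logs for the same symptom, a small script can scan for `Leader append result` lines and flag any base offset that appears more than once for the same raft group and term. This is only a sketch: the regular expression is written against the TRACE line format quoted above, the function name `find_duplicate_base_offsets` is made up for illustration, and it assumes that a repeated (group, term, base_offset) triple on a leader always indicates the double assignment described here.

```python
import re
from collections import defaultdict

# Matches the "Leader append result" TRACE lines as they appear in this issue.
# The pattern is an assumption based on those lines only.
APPEND_RE = re.compile(
    r"\[group_id:(?P<group>\d+), \{(?P<ntp>[^}]+)\}\].*?"
    r"Leader append result: \{time_since_append: \d+, "
    r"base_offset: (?P<base>\d+), last_offset: (?P<last>\d+), "
    r"last_term: (?P<term>\d+), byte_size: (?P<size>\d+)\}"
)


def find_duplicate_base_offsets(lines):
    """Return {(group_id, term, base_offset): [matching lines]} for every
    key that was reported by more than one leader append."""
    seen = defaultdict(list)
    for line in lines:
        m = APPEND_RE.search(line)
        if m is None:
            continue  # not a leader append result line
        key = (int(m["group"]), int(m["term"]), int(m["base"]))
        seen[key].append(line.strip())
    # Keep only the keys that occurred at least twice.
    return {key: hits for key, hits in seen.items() if len(hits) > 1}
```

Calling it with the lines of a `redpanda.log` file (for example `find_duplicate_base_offsets(open(path))`) returns an empty dict when every base offset was assigned once. Keying on the term as well as the offset avoids flagging an offset that is reused after a leadership change.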
Seen this same assert failure several times in https://buildkite.com/redpanda/redpanda/builds/53808:
docker-rp-14/redpanda.log:TRACE 2024-08-30 10:59:34,874 [shard 0:main] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:168 - Leader append result: {time_since_append: 1, base_offset: 24, last_offset: 24, last_term: 1, byte_size: 107}
docker-rp-14/redpanda.log:TRACE 2024-08-30 10:59:42,874 [shard 0:main] raft - [group_id:0, {redpanda/controller/0}] replicate_entries_stm.cc:168 - Leader append result: {time_since_append: 0, base_offset: 24, last_offset: 24, last_term: 1, byte_size: 107}
In the above build, a couple of tests also fail with a timeout during alter topic config. The logs show that the same assert failure crashed the broker there, but the timeout occurred a couple of seconds before the crash, so it may be a separate issue (still looking into that).
WARN 2024-08-30 10:59:41,498 [shard 0:main] kafka - config_utils.h:286 - Failed to alter topic properties of topic(s) {{kafka/panda-topic}} error_code observed: cluster::errc::timeout
followed by:
ERROR 2024-08-30 10:59:42,876 [shard 0:main] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0c0dcb9e76459e3f4-1/redpanda/redpanda/src/v/raft/mux_state_machine.h:344) 'it != _results.end()' last applied offset 24 is greater than returned replicate result 24. this must imply existence of a result in results map
https://buildkite.com/redpanda/redpanda/builds/50894