@andrewhsu if you scroll up further in the log you'll see:
TRACE 2022-02-17 18:53:26,723 [shard 0] raft - [group_id:0, {kafka/group_0/0}] replicate_entries_stm.cc:184 - Leader append result: {append_time:1208561, base_offset:{0}, last_offset:{0}, byte_size:167}
--
| flush() called concurrently with other operations.
| called from 0x12f8f44 0xec84fe 0x10419dd 0xd40360 0xd4cbad 0x10f02ef 0x10f3747 0x10f0b75 0x105921c 0x1056f00 0xbf5897 0x10b9e3f /lib64/libpthread.so.0+0x9298 /lib64/libc.so.6+0x1006a2
| --------
| seastar::continuation<seastar::internal::promise_base_with_type<void>, storage::segment_appender::dispatch_background_head_write()::$_20::operator()()::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::'lambda'(unsigned long), seastar::future<void> seastar::future<unsigned long>::then_impl_nrvo<storage::segment_appender::dispatch_background_head_write()::$_20::operator()()::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::'lambda'(unsigned long), seastar::future<void> >(storage::segment_appender::dispatch_background_head_write()::$_20::operator()()::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::'lambda'(unsigned long)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, storage::segment_appender::dispatch_background_head_write()::$_20::operator()()::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::'lambda'(unsigned long)&, seastar::future_state<unsigned long>&&), unsigned long>
....pending op: {sanitizer_op: ss::future<size_t>::write_dma(pos:32768, *void, len:16384)
Looks like it is coming from the segment appender:
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<storage::segment_appender::dispatch_background_head_write()::$_20::operator()()::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::'lambda'(), false>, seastar::futurize<seastar::future<void> >::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<storage::segment_appender::dispatch_background_head_write()::$_20::operator()()::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::'lambda'(), false> >(seastar::future<void>::finally_body<storage::segment_appender::dispatch_background_head_write()::$_20::operator()()::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::'lambda'(), false>&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<storage::segment_appender::dispatch_background_head_write()::$_20::operator()()::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::'lambda'(), false>&, seastar::future_state<seastar::internal::monostate>&&), void>
We made the segment appender safe for concurrent flush and append, but this error originates from the file handle sanitizer wrapper. It's not clear whether this is a false positive or a genuinely invalid use of the file handle.
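For context, the check that produces this warning is conceptually just bookkeeping around the wrapped file: count in-flight operations and complain if flush() arrives while the count is non-zero. A minimal standalone sketch of that idea follows (this is not Seastar's or Redpanda's actual sanitizer code; the class and member names are made up for illustration):

```cpp
// Illustration only: the kind of bookkeeping a sanitizing file wrapper can do
// to detect flush() being issued while other operations are still in flight.
#include <atomic>
#include <cstdio>

class sanitizing_file {
public:
    void begin_write() { _pending_ops.fetch_add(1, std::memory_order_relaxed); }
    void end_write() { _pending_ops.fetch_sub(1, std::memory_order_relaxed); }

    void flush() {
        // The warning in the log corresponds to this condition: flush()
        // observed while at least one write (or other op) has not completed.
        if (_pending_ops.load(std::memory_order_relaxed) > 0) {
            std::fprintf(stderr,
                         "flush() called concurrently with other operations\n");
        }
        // ... issue the real flush here ...
    }

private:
    std::atomic<int> _pending_ops{0};
};

int main() {
    sanitizing_file f;
    f.begin_write(); // a dma write has been dispatched but not yet completed
    f.flush();       // prints the warning: the flush overlaps the pending write
    f.end_write();
}
```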
CI failure triage: this is very rare, but I bet it is a genuine bug. Need to read the code carefully.
agree. this needs someone to sit down expecting a full day of analyzing this.
Ok, looks like there are a fair number of red herrings in this issue :)
Red herring 1: jitter_tests.cc has nothing to do with the issue. It just happens to be the first file in the sources list for the test_raft_rpunit executable, and that's what seastar sets as the main test module name. The actual failure is:
../../../src/v/raft/tests/raft_group_fixture.h(552): fatal error: in "replace_whole_group": Timeout elapsed while wating for: new nodes are up to date
Red herring 2: concurrent flush and append have nothing to do with the test failure! Confusingly, these "flush() called concurrently" messages contain stack traces and that makes them look similar to assertion failure messages, but really they are just warnings and don't affect any test outcomes. Moreover, the messages (there are several) appear after the test failure. So either there are two separate issues or the causality is in reverse.
Ok, looks like the replace_whole_group test failure is a duplicate of #342 and was recently fixed in #5696 (it was a test issue, redpanda is fine).
But this concurrent flush and append message still requires investigation, so not closing the issue yet.
So I managed to reproduce concurrent flush+write with some sleeps; it is indeed allowed by segment_appender. The approximate scenario: the _prev_head_write mutex gets replaced, and another write of the 0-16K chunk gets dispatched (bg_head_write2). I can't see an easy fix without sacrificing concurrency. But maybe concurrent flush/dma_writes are not that bad? Not sure.
cc @dotnwat
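To make the suspected interleaving concrete, here is a heavily simplified sketch using plain std::thread/std::mutex in place of seastar fibers; head_lock and do_dma_write are hypothetical stand-ins, not the real _prev_head_write machinery. The point it illustrates: once the head-write mutex is swapped out, a flush that synchronizes on the new mutex can proceed while a write dispatched under the old one is still in flight.

```cpp
// Simplified analogue of the suspected race (assumption: this mirrors the
// mutex-replacement behaviour described above; it is not the real code).
#include <chrono>
#include <cstdio>
#include <memory>
#include <mutex>
#include <thread>

std::shared_ptr<std::mutex> head_lock = std::make_shared<std::mutex>();

void do_dma_write() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50)); // slow I/O
    std::puts("write_dma completes");
}

int main() {
    // A background head write grabs the *current* head lock and dispatches I/O.
    std::thread bg([lock = head_lock] {
        std::lock_guard<std::mutex> g(*lock);
        do_dma_write();
    });

    // Meanwhile the appender replaces the head-write mutex (new chunk), so a
    // subsequent flush synchronizes on the new lock only...
    head_lock = std::make_shared<std::mutex>();

    {
        std::lock_guard<std::mutex> g(*head_lock); // ...and acquires it at once,
        std::puts("flush issued");                 // overlapping the pending write
    }

    bg.join();
}
```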
@ztlpn nice analysis.
when i added background flushing into the segment appender i wanted to allow appends and flushes to run concurrently--that was a key part of the optimization by which raft would be able to pipeline i/o.
however, i definitely wasn't considering that the file handles that the segment appender was using would be wrapped in the sanitizer, which is catching the "flush() called concurrently with other operations" warning.
is the sanitizer too conservative for use in the segment appender? if the following are true, then I think we could probably relax the rule:
Seeing this in testing for #5318, build here.
is the sanitizer too conservative for use in the segment appender?
From the FS (e.g. XFS) perspective, yes (assuming we honor that flush only affects previously-completed operations). Good point about the seastar part.
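To spell out the filesystem-level contract being referenced (plain POSIX calls here, not the seastar file API): a flush only promises durability for writes that had completed before it was issued, so a write that overlaps the flush is simply not covered by it and must be covered by a later flush.

```cpp
// Sketch of the fsync/fdatasync semantics the appender relies on (POSIX).
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = ::open("appender_demo.log", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return 1;

    const char* a = "record A\n";
    if (::pwrite(fd, a, std::strlen(a), 0) < 0) return 1; // completes first
    ::fdatasync(fd); // guarantees durability of record A only

    const char* b = "record B\n";
    // A later (or concurrent) write: the fdatasync above says nothing about
    // it; only a subsequent flush makes it durable.
    if (::pwrite(fd, b, std::strlen(b), 9) < 0) return 1;
    ::fdatasync(fd);
    ::close(fd);
    return 0;
}
```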
Removing ci-failure because the test doesn't fail; the ticket is open to investigate a warning.
@dotnwat @ztlpn do you think we have next steps on this, or shall we close?
i don't think this is a proper bug in the segment appender (or a violation of the appender's api rules). if we start getting CI failures because of this we should look at fixing it again, but it's clearly very rare.
we might want to investigate this one, though: https://github.com/redpanda-data/redpanda/issues/5433
from nightly test job on dev branch: https://buildkite.com/redpanda/redpanda/builds/7400#23a4a0e4-52db-44b8-ada4-bd2cd6982708/6-5332

earlier in the same logs: