scylladb / seastar

High performance server-side application framework
http://seastar.io
Apache License 2.0

Crash/assert in seastar::aio_general_context::flush(): Assertion `clock::now() < retry_until' failed. #1784

Open. StephanDollberg opened this issue 1 year ago.

StephanDollberg commented 1 year ago

Hi folks,

In Redpanda we are seeing the following assert/crash under heavy load:

seastar::aio_general_context::flush(): Assertion `clock::now() < retry_until' failed.

https://github.com/scylladb/seastar/blob/master/src/core/reactor_backend.cc#L349-L356

[Backtrace #0]
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/include/seastar/util/backtrace.hh:60
 (inlined by) seastar::backtrace_buffer::append_backtrace() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:786
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:816
 (inlined by) seastar::print_with_backtrace(char const*, bool) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:828
seastar::sigabrt_action() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:4065
 (inlined by) operator() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:4041
 (inlined by) __invoke at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:4037
/opt/redpanda/lib/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=6e7b96dfb83f0bdcb6a410469b82f86415e5ada3, for GNU/Linux 3.2.0, stripped

__sigaction at ??:?
pthread_key_delete at ??:?
raise at ??:?
abort at ??:?
?? ??:0
__assert_fail at ??:?
seastar::aio_general_context::flush() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor_backend.cc:338
 (inlined by) seastar::reactor_backend_aio::kernel_submit_work() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor_backend.cc:522
seastar::reactor::poll_once() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3407
 (inlined by) operator() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3300
 (inlined by) seastar::reactor::do_run() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3328
operator() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:4522
 (inlined by) decltype ((std::declval<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_2&>())()) std::__1::__invoke[abi:v160004]<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_2&>(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_2&) at /vectorized/llvm/bin/../include/c++/v1/__functional/invoke.h:394
 (inlined by) void std::__1::__invoke_void_return_wrapper<void, true>::__call<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_2&>(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_2&) at /vectorized/llvm/bin/../include/c++/v1/__functional/invoke.h:487
 (inlined by) std::__1::__function::__alloc_func<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_2, std::__1::allocator<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_2>, void ()>::operator()[abi:v160004]() at /vectorized/llvm/bin/../include/c++/v1/__functional/function.h:185
 (inlined by) std::__1::__function::__func<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_2, std::__1::allocator<seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_2>, void ()>::operator()() at /vectorized/llvm/bin/../include/c++/v1/__functional/function.h:356
std::__1::__function::__value_func<void ()>::operator()[abi:v160004]() const at /vectorized/llvm/bin/../include/c++/v1/__functional/function.h:510
 (inlined by) std::__1::function<void ()>::operator()() const at /vectorized/llvm/bin/../include/c++/v1/__functional/function.h:1156
 (inlined by) seastar::posix_thread::start_routine(void*) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/posix.cc:73
pthread_condattr_setpshared at ??:?
__xmknodat at ??:?

I don't have an easy reproducer but I can reliably reproduce it with one of our internal load tests if further debugging is needed.

This bit of code has some history and has already been changed a couple of times.

I am not sure I fully understand the code. The comment mentions an "infinite loop" but I guess what it means is a longer stall in flush? I assume the 1 second was chosen arbitrarily? IMO stalling for 1 (or 5 or 10) seconds is better than crashing?
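For reference, here is the loop in question as it stands (pre-patch code from the link above, reproduced for readability):

size_t aio_general_context::flush() {
    auto begin = iocbs.get();
    using clock = std::chrono::steady_clock;
    constexpr clock::time_point no_time_point = clock::time_point(clock::duration(0));
    auto retry_until = no_time_point;
    while (begin != last) {
        auto r = io_submit(io_context, last - begin, begin);
        if (__builtin_expect(r > 0, true)) {
            begin += r;
            continue;
        }
        // errno == EAGAIN is expected here. We don't explicitly assert that
        // since the assert below prevents an endless loop for any reason.
        if (retry_until == no_time_point) {
            // allow retrying for 1 second
            retry_until = clock::now() + 1s;
        } else {
            assert(clock::now() < retry_until);  // <-- the assert that fires
        }
    }
    auto nr = last - iocbs.get();
    last = iocbs.get();
    return nr;
}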

cc @bhalevy

Thanks, Stephan

bhalevy commented 1 year ago

I am not sure I fully understand the code. The comment mentions an "infinite loop" but I guess what it means is a longer stall in flush? I assume the 1 second was chosen arbitrarily? IMO stalling for 1 (or 5 or 10) seconds is better than crashing?

Correct. The 1 second is arbitrary and was meant to detect abnormal cases where no further waiting would help. Does increasing the 1 second timeout to a higher value help in your case?

StephanDollberg commented 1 year ago

Does increasing the 1 second timeout to a higher value help in your case?

Will give that a try today and will also confirm we are indeed stalling on EAGAIN.

StephanDollberg commented 1 year ago

Yeah, so I ran a test and confirmed a few things.

Do we possibly need to call io_getevents in between the io_submits to free up the kernel's internal queue? I had a very rough look at the kernel code, and it seems like the only reason for EAGAIN is the internal queue being full and the kernel not being able to grow it.
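Roughly what I have in mind (a sketch only; process_events is a placeholder for however the reactor would consume the reaped completions, not an existing function):

while (begin != last) {
    auto r = io_submit(io_context, last - begin, begin);
    if (r > 0) {
        begin += r;
        continue;
    }
    // presumably errno == EAGAIN: the ring is full of completed-but-unreaped
    // events, so reap some with a zero timeout to free up slots before retrying
    io_event events[128];
    ::timespec zero = {0, 0};
    auto n = io_getevents(io_context, 1, 128, events, &zero);
    if (n <= 0) {
        break; // nothing completed yet either; bail out rather than spin
    }
    process_events(events, n); // placeholder: hand the completions back to the reactor
}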

bhalevy commented 1 year ago

I think that just returning a "short" flush should work. Can you please try the following patch?

diff --git a/src/core/reactor_backend.cc b/src/core/reactor_backend.cc
index e4a6a8f6..85d3457c 100644
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -337,26 +337,18 @@ void aio_general_context::queue(linux_abi::iocb* iocb) {

 size_t aio_general_context::flush() {
     auto begin = iocbs.get();
-    using clock = std::chrono::steady_clock;
-    constexpr clock::time_point no_time_point = clock::time_point(clock::duration(0));
-    auto retry_until = no_time_point;
     while (begin != last) {
         auto r = io_submit(io_context, last - begin, begin);
-        if (__builtin_expect(r > 0, true)) {
-            begin += r;
-            continue;
-        }
-        // errno == EAGAIN is expected here. We don't explicitly assert that
-        // since the assert below prevents an endless loop for any reason.
-        if (retry_until == no_time_point) {
-            // allow retrying for 1 second
-            retry_until = clock::now() + 1s;
-        } else {
-            assert(clock::now() < retry_until);
+        if (__builtin_expect(r <= 0, false)) {
+            // errno == EAGAIN is expected here, but we don't explicitly assert that.
+            // just adjust the queue and return the number of iocbs we were able to submit
+            std::memmove(iocbs.get(), begin, (last - begin) * sizeof(iocb*));
+            break;
         }
+        begin += r;
     }
-    auto nr = last - iocbs.get();
-    last = iocbs.get();
+    auto nr = begin - iocbs.get();
+    last -= nr;
     return nr;
 }
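For readability, flush() with this patch applied becomes (same code as the diff, just reassembled):

size_t aio_general_context::flush() {
    auto begin = iocbs.get();
    while (begin != last) {
        auto r = io_submit(io_context, last - begin, begin);
        if (__builtin_expect(r <= 0, false)) {
            // errno == EAGAIN is expected here, but we don't explicitly assert that.
            // just adjust the queue and return the number of iocbs we were able to submit
            std::memmove(iocbs.get(), begin, (last - begin) * sizeof(iocb*));
            break;
        }
        begin += r;
    }
    auto nr = begin - iocbs.get();
    last -= nr;
    return nr;
}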
StephanDollberg commented 1 year ago

Thanks, did some quick testing and yeah it looks better.

I think that just returning a "short" flush should work.

I was wondering about "short" flushes. Does the code not otherwise assume that after a call to flush all work has been submitted?

bhalevy commented 1 year ago

We want all iocbs that were queued to be eventually submitted to the kernel, but I saw no hard requirement to flush the whole queue in a single call. When enqueuing a new iocb we don't care if the queue is empty or not.

We could in fact implement a circular queue; that would make the error path simpler, but it would make the error-free path more complex. Hence I preferred the memmove on error, which moves the queue tail to the front of the queue, allowing more requests to be queued, on the assumption that this would be a rare event.
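For illustration, the circular-queue variant would look roughly like this (a sketch with hypothetical ring/head/tail/capacity members, not actual Seastar code):

size_t flush_ring() {
    size_t submitted = 0;
    while (head != tail) {
        // io_submit() wants a contiguous array, so a wrapped queue has to be
        // submitted in up to two chunks, which is the extra work on the error-free path
        size_t end = (tail > head) ? tail : capacity;
        auto r = io_submit(io_context, end - head, ring + head);
        if (r <= 0) {
            break; // EAGAIN: nothing to move; head/tail already describe the remainder
        }
        head = (head + r) % capacity;
        submitted += r;
    }
    return submitted;
}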

bhalevy commented 1 year ago

@avikivity are you okay with the above approach?

avikivity commented 1 year ago

Do we know why we get EAGAIN?

bhalevy commented 1 year ago

Based on https://github.com/scylladb/seastar/issues/1784#issuecomment-1677396251, the kernel queue is full, and since we're busy looping in flush, we don't reap completions.

avikivity commented 1 year ago

There's no kernel queue involved. aio allocates enough resources to run all iocbs that were declared in io_setup().

bhalevy commented 1 year ago

There's no kernel queue involved. aio allocates enough resources to run all iocbs that were declared in io_setup().

Maybe there is an external consumer that eats up aio resources?

avikivity commented 1 year ago

There really shouldn't be. They're local to the io context.

StephanDollberg commented 1 year ago

You guys probably understand this code a lot better than I do, but I looked at when EAGAIN can occur in the kernel.

In any case, reaping completed events with io_getevents seems to unblock things, as it removes them from the queue.

StephanDollberg commented 1 year ago

ping @avikivity @bhalevy, any opinions on the above?

I further noticed that Seastar already uses similar retry-on-EAGAIN handling in the aio_storage_context: https://github.com/scylladb/seastar/blob/master/src/core/reactor_backend.cc#L233. That path checks the available size further up, though I guess we could still be stuck in that loop for a long time?

avikivity commented 1 year ago

Here we handle EAGAIN on the iocb itself, not EAGAIN from io_submit().

The io context should be sized exactly right. Maybe there is an off-by-one somewhere.

The kmem_cache_alloc() call could legitimately fail (though it never does for such small allocations). It's reason enough to handle EAGAIN, but I'm sure it's not the real cause.

StephanDollberg commented 11 months ago

It's reason enough to handle EAGAIN, but I'm sure it's not the real cause.

Fair, do I understand you correctly though that you think the patch is worth merging in any case because of this?

We have just hit this in prod so it's become a bit more urgent for us :)

I had another look at the reactor code now, and I have a few questions regarding the aio_general_context, where all of this happens. Above you write:

aio allocates enough resources to run all iocbs that were declared in io_setup().

This makes sense (ignoring the above for a second); however:

Finally, on the --max-networking-io-control-blocks reactor option (which is where the 10k default limit comes from): I now see the following help line: "This translates to the number of sockets supported per shard." Is this based on the idea that for every socket we only ever have a single iocb outstanding? That isn't being enforced anywhere though, is it? Nothing would prevent us from creating 15k sockets.

travisdowns commented 11 months ago

@avikivity wrote:

The io context should be sized exactly right. Maybe there is an off-by-one somewhere.

I don't think that's right, at least in the sense that --max-networking-io-control-blocks=N means we can support N sockets, because each socket can queue up to 3 different iocbs (as @StephanDollberg mentions above): aio_pollable_fd_state::_iocb_pollin, aio_pollable_fd_state::_iocb_pollout and aio_pollable_fd_state::_iocb_pollhup. In fact it is not odd for all 3, or at least 2, to be enqueued at the same time for the same socket (especially once we started using the pollhup one).
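For reference, the per-socket state involved looks roughly like this (abbreviated sketch; see reactor_backend.hh for the real definition):

class aio_pollable_fd_state : public pollable_fd_state {
    internal::linux_abi::iocb _iocb_pollin;   // armed while waiting for readability
    internal::linux_abi::iocb _iocb_pollout;  // armed while waiting for writability
    internal::linux_abi::iocb _iocb_pollhup;  // armed to detect input shutdown
    // ...
};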

Indeed, a small test app shows that when you try to create 10k sockets with the default of 10k network iocbs, it will fail on the 5001st socket with the last < end assert: https://github.com/scylladb/seastar/blob/4c0634293d1bb1e0522caec28151508df822e4e3/src/core/reactor_backend.cc#L334, as at that point it tries to queue the 10,001st iocb (evidently two iocbs get enqueued per socket in this test: 5000 × 2 = 10,000). So you can only use half as many sockets as you'd expect, unless you are careful not to submit more iocbs in one interval than the array can handle.

I believe there may actually also be an off-by-1-or-2 error sometimes, since in addition we may enqueue the hrtimer iocb and sometimes another one (smp wakeup or whatever) depending on the path taken in the reactor, though this does not trigger in the test case (I guess because we flush before going down that path).

The same reasoning means that N sockets can have up to 3*N iocbs in flight in the kernel, though this does not seem to produce EAGAIN from io_submit as soon as the number of in-flight iocbs exceeds the number passed to io_setup.

travisdowns commented 10 months ago

@avikivity - here's a more concrete repro to back the claim that N iocbs is not enough to support N sockets (nor N-C sockets for small C).

https://github.com/travisdowns/seastar/tree/td-eagain-repro

Run the server.py from the second commit (./server.py); then, to reproduce the clock::now() < retry_until assert, run:

ninja socket_test && DO_SLEEP=1 SOCKET_COUNT=1000 tests/unit/socket_test -t open_n -- --reactor-backend=linux-aio --smp=1 --memory=1G --max-networking-io-control-blocks=1000

This tries to create 1000 sockets with --max-networking-io-control-blocks also set to 1000; each socket is written to, read from, and has a wait_for_input_shutdown callback attached.

It fails after two-thirds of the sockets have been created. Each socket consumes 3 iocbs. The reason it fails after 2/3 and not 1/3 have been created is that the kernel somewhat arbitrarily multiplies the number of supported iocbs you request in io_setup by 2. However, we should not rely on that, and in any case a factor of 2 is still not enough, as this test case shows.
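In numbers: io_setup(1000) actually leaves room for about 2 × 1000 = 2000 in-flight iocbs, while 1000 sockets × 3 iocbs each = 3000, so the test runs out after roughly 2000 / 3 ≈ 667 sockets, i.e. two-thirds of the way.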

It also fails with smaller numbers; however, because the kernel rounds iocb counts up to fill a full page, there is a lot of quantization and it may pass for some values (e.g., with 10 sockets it will pass).

If you run it with DO_SLEEP=0 instead, it will reproduce the last < end crash after 1/3 of the sockets have been created, as we exceed the fixed-size array used to track enqueued iocbs. In our fork @StephanDollberg has also applied a fix for this issue: https://github.com/redpanda-data/seastar/pull/87/commits/d9175fcf84d31f21c0b72d0bea405f73ac1009b4 and that's something we'd like to upstream too.

ninja socket_test && DO_SLEEP=0 SOCKET_COUNT=1000 tests/unit/socket_test -t open_n -- --reactor-backend=linux-aio --smp=1 --memory=1G --max-networking-io-control-blocks=1000

cc @bhalevy

avikivity commented 8 months ago

Hum. So there's an off-by-a-factor-of-three-plus-one here. Maybe it's trying to solve the Collatz conjecture.

Doesn't it mean we should allocate more iocbs (or promise to support fewer sockets)? Waiting won't help us, since nothing guarantees any of those IOCBs will be resolved.
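If we went the "allocate more iocbs" route, the change would be shaped roughly like this (illustrative names only, not Seastar's actual code; the real change would go wherever the backend sets up the aio context):

// reserve for the worst case of 3 iocbs per socket (pollin, pollout, pollhup),
// plus some slack for the hrtimer and smp-wakeup iocbs mentioned earlier.
// constants and variable names are illustrative, not Seastar's actual code.
constexpr size_t iocbs_per_socket = 3;
constexpr size_t reactor_slack = 2;
size_t nr_events = max_networking_io_control_blocks * iocbs_per_socket + reactor_slack;
setup_aio_context(nr_events, &io_context);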

Really we should move to io_uring, but the performance regression is preventing us from doing so.

roydahan commented 8 months ago

@avikivity / @bhalevy, would it be OK to work around this in the configuration when one sets --developer-mode? It's the most common use case for people who use Docker, and it's blocking them from progressing.

avikivity commented 8 months ago

An easy workaround is --reactor-backend=epoll
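For the Docker case that would be something like docker run -d scylladb/scylla --reactor-backend=epoll, appending the flag to whatever other options are already passed (exact invocation depends on the setup).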

roydahan commented 8 months ago

An easy workaround is --reactor-backend=epoll

Yes, indeed works. I'll update the issue and the community forum.