scylladb / seastar

High performance server-side application framework
http://seastar.io
Apache License 2.0
8.34k stars 1.54k forks source link

aio_general_context::queue(linux_abi::iocb *): Assertion `last < end' failed. #976

Open bhalevy opened 2 years ago

bhalevy commented 2 years ago

Seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/1004/artifact/logs-all.release.2/1637736003906_update_cluster_layout_tests.TestUpdateClusterLayout.increment_decrement_counters_in_threads_nodes_restarted_test/node3.log

Scylla version 4.7.dev-0.20211123.965ea4a3fa with build-id 1bc1daed97736a2174a304e55861b4660b9dc327 starting ...

scylla: ./seastar/src/core/reactor_backend.cc:269: void seastar::aio_general_context::queue(linux_abi::iocb *): Assertion `last < end' failed.
Aborting on shard 0.
Backtrace:
  0x3e04938
  0x3e35792
  0x7fc2a71fca1f
  /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/965ea4a3facc50f337733349eba316e9e8b9b9ae/scylla/libreloc/libc.so.6+0x3d2a1
  /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/965ea4a3facc50f337733349eba316e9e8b9b9ae/scylla/libreloc/libc.so.6+0x268a3
  /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/965ea4a3facc50f337733349eba316e9e8b9b9ae/scylla/libreloc/libc.so.6+0x26788
  /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/965ea4a3facc50f337733349eba316e9e8b9b9ae/scylla/libreloc/libc.so.6+0x35a15
  0x3e5bc80
  0x3e5bc9d
  0x3dfad9e
  0x3e5bda3
  0x3dfb85c
  0x412c80e
  0x1723dbf
  0x1aaad5d
  0x1ab244b
  0x13d9e60
  0x3e17304
  0x3e186f7
  0x3e1793c
  0x3dc11ce
  0x3dc0546
  0xf5023b
  /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/965ea4a3facc50f337733349eba316e9e8b9b9ae/scylla/libreloc/libc.so.6+0x27b74
  0xf4d3ad

Decoded:

[Backtrace #0]
void seastar::backtrace(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59
 (inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:760
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:790
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:802
 (inlined by) seastar::sigabrt_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:3720
 (inlined by) operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:3702
 (inlined by) __invoke at ./build/release/seastar/./seastar/src/core/reactor.cc:3698
?? ??:0
/data/scylla-s3-reloc.cache/by-build-id/1bc1daed97736a2174a304e55861b4660b9dc327/extracted/scylla/libexec/scylla/libreloc/libc.so.6 0x3d2a1 
/data/scylla-s3-reloc.cache/by-build-id/1bc1daed97736a2174a304e55861b4660b9dc327/extracted/scylla/libexec/scylla/libreloc/libc.so.6 0x268a3 
/data/scylla-s3-reloc.cache/by-build-id/1bc1daed97736a2174a304e55861b4660b9dc327/extracted/scylla/libexec/scylla/libreloc/libc.so.6 0x26788 
/data/scylla-s3-reloc.cache/by-build-id/1bc1daed97736a2174a304e55861b4660b9dc327/extracted/scylla/libexec/scylla/libreloc/libc.so.6 0x35a15 
seastar::aio_general_context::queue(seastar::internal::linux_abi::iocb*) at ./build/release/seastar/./seastar/src/core/reactor_backend.cc:269
 (inlined by) seastar::reactor_backend_aio::poll(seastar::pollable_fd_state&, int) at ./build/release/seastar/./seastar/src/core/reactor_backend.cc:536
seastar::reactor_backend_aio::readable(seastar::pollable_fd_state&) at ./build/release/seastar/./seastar/src/core/reactor_backend.cc:544
seastar::reactor::readable(seastar::pollable_fd_state&) at ./build/release/seastar/./seastar/src/core/reactor.cc:1013
 (inlined by) seastar::pollable_fd_state::readable() at ./build/release/seastar/./seastar/src/core/reactor.cc:399
 (inlined by) seastar::reactor::do_read_some(seastar::pollable_fd_state&, seastar::internal::buffer_allocator*) at ./build/release/seastar/./seastar/src/core/reactor.cc:268
seastar::reactor_backend_aio::read_some(seastar::pollable_fd_state&, seastar::internal::buffer_allocator*) at ./build/release/seastar/./seastar/src/core/reactor_backend.cc:586
seastar::pollable_fd_state::read_some(seastar::internal::buffer_allocator*) at ./build/release/seastar/./seastar/src/core/reactor.cc:373
seastar::pollable_fd::read_some(seastar::internal::buffer_allocator*) at ./build/release/seastar/./seastar/include/seastar/core/internal/pollable_fd.hh:125
 (inlined by) seastar::net::posix_data_source_impl::get() at ./build/release/seastar/./seastar/src/net/posix-stack.cc:575
seastar::data_source::get() at ././seastar/include/seastar/core/iostream.hh:69
 (inlined by) seastar::input_stream::read_exactly(unsigned long) at ././seastar/include/seastar/core/iostream-impl.hh:190
cql_transport::cql_server::connection::read_frame() at ./transport/server.cc:360
cql_transport::cql_server::connection::process_request() at ./transport/server.cc:580
operator() at ./generic_server.cc:65
 (inlined by) seastar::internal::do_until_state::run_and_dispose() at ././seastar/include/seastar/core/loop.hh:295
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2378
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2787
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2956
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2839
seastar::app_template::run_deprecated(int, char**, std::function&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:228
seastar::app_template::run(int, char**, std::function ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:128
main at ./main.cc:492
/data/scylla-s3-reloc.cache/by-build-id/1bc1daed97736a2174a304e55861b4660b9dc327/extracted/scylla/libexec/scylla/libreloc/libc.so.6 0x27b74 
_start at ??:?
bhalevy commented 2 years ago

@avi it looks like this is triggered by too low max-networking-io-control-blocks (set to 100 in the dtest environment). I can reproduce this issue, as well as #975 (with lower probability) using:

CASSANDRA_DIR=../scylla/build/release SCYLLA_EXT_OPTS="--cpus 2 --memory 1G --max-networking-io-control-blocks 20" ./scripts/run_test.sh update_cluster_layout_tests:TestUpdateClusterLayout.increment_decrement_counters_in_threads_nodes_restarted_test

The root cause looks like lack of back pressure in aio_general_context. It is initialized with max_polls() iocbs in reactor_backend_aio::_polling_io but I don't see anything preventing queuing more than that.

We need to take into account the number of outstanding iocbs at or before aio_general_context::queue in conjunction with reactor_backend_aio::await_events that processes the respective completions for them.

avikivity commented 2 years ago

That is very sad. I propose to just increase the number, adding back pressure is some work, and it doesn't help if you really need 100 connections. 100 may be too low anyway for a real cluster.