Open bhalevy opened 2 years ago
@avi it looks like this is triggered by a max-networking-io-control-blocks value that is too low
(set to 100 in the dtest environment).
I can reproduce this issue, as well as #975 (with lower probability) using:
CASSANDRA_DIR=../scylla/build/release SCYLLA_EXT_OPTS="--cpus 2 --memory 1G --max-networking-io-control-blocks 20" ./scripts/run_test.sh update_cluster_layout_tests:TestUpdateClusterLayout.increment_decrement_counters_in_threads_nodes_restarted_test
The root cause looks like a lack of back pressure in aio_general_context. It is initialized with max_polls() iocbs in reactor_backend_aio::_polling_io, but I don't see anything preventing more than that from being queued.
We need to take into account the number of outstanding iocbs at or before aio_general_context::queue, in conjunction with reactor_backend_aio::await_events, which processes the respective completions for them.
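To make the idea concrete, here is a rough sketch of what accounting for outstanding iocbs could look like. This is not Seastar code: bounded_submit_queue, complete_one and the integer request ids are hypothetical names made up for illustration. The only assumption carried over from above is that there is a fixed number of iocb slots, and that completions (reaped in something like await_events) free them.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for a submission context with a fixed number of
// iocb slots. It is NOT the real aio_general_context; it only illustrates
// counting outstanding requests and deferring the excess.
class bounded_submit_queue {
    std::size_t _capacity;          // number of iocb slots available
    std::size_t _outstanding = 0;   // requests submitted but not yet completed
    std::deque<int> _deferred;      // request ids waiting for a free slot
public:
    explicit bounded_submit_queue(std::size_t capacity) : _capacity(capacity) {}

    // Returns true if the request took a slot immediately,
    // false if it was deferred because all slots are in use.
    bool queue(int request_id) {
        if (_outstanding < _capacity) {
            ++_outstanding;
            return true;
        }
        _deferred.push_back(request_id);
        return false;
    }

    // Called once per reaped completion. Frees one slot and, if a request
    // was deferred, promotes the oldest one into that slot.
    // Returns the promoted request id, or -1 if nothing was waiting.
    int complete_one() {
        assert(_outstanding > 0);
        --_outstanding;
        if (_deferred.empty()) {
            return -1;
        }
        int id = _deferred.front();
        _deferred.pop_front();
        ++_outstanding;
        return id;
    }

    std::size_t outstanding() const { return _outstanding; }
    std::size_t deferred() const { return _deferred.size(); }
};
```

With a capacity of 2, a third queue() call is deferred rather than overflowing the slots, and it gets promoted on the next completion. In the real reactor the deferred side would presumably be a future the caller waits on rather than a plain deque, but the counting is the same.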
That is very sad. I propose just increasing the number: adding back pressure is some work, and it doesn't help if you really need 100 connections. 100 may be too low for a real cluster anyway.
Seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/1004/artifact/logs-all.release.2/1637736003906_update_cluster_layout_tests.TestUpdateClusterLayout.increment_decrement_counters_in_threads_nodes_restarted_test/node3.log
Decoded: