redpanda-data / redpanda


Reactor stall in leader balancer with many partitions #16352

Closed StephanDollberg closed 3 months ago

StephanDollberg commented 5 months ago

Version & Environment

Redpanda version: dev

What went wrong?

Jan 29 18:08:13 ip-172-31-22-251 rpk[14601]: Reactor stalled for 69 ms on shard 0. Backtrace: 0x89ff46f 0x8a00bcd 0x3dbaf 0x6b1d632 0x6b124b4 0x3b177aa 0x8a23c9f 0x8a27411 0x8a245a6 0x8925dd0 0x89241c8 0x2dd65f6 0x94dc069 0x27b89 0x27c4a 0x2dcee24
[Backtrace #0]
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}&&) at /build/redpanda/vbuild/release/clang/v_deps_build/seastar-prefix/src/seastar/include/seastar/util/backtrace.hh:64
 (inlined by) seastar::backtrace_buffer::append_backtrace_oneline() at /build/redpanda/vbuild/release/clang/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:838
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at /build/redpanda/vbuild/release/clang/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:857
seastar::internal::cpu_stall_detector::generate_trace() at /build/redpanda/vbuild/release/clang/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:1539
 (inlined by) seastar::internal::cpu_stall_detector::maybe_report() at /build/redpanda/vbuild/release/clang/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:1376
 (inlined by) seastar::internal::cpu_stall_detector::on_signal() at /build/redpanda/vbuild/release/clang/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:1396
 (inlined by) seastar::reactor::block_notifier(int) at /build/redpanda/vbuild/release/clang/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:1597
?? ??:0
std::__1::vector<cluster::leader_balancer_types::random_reassignments::replica, std::__1::allocator<cluster::leader_balancer_types::random_reassignments::replica> >::size[abi:v160004]() const at /build/redpanda/vbuild/llvm/install/bin/../include/c++/v1/vector:546
 (inlined by) std::__1::vector<cluster::leader_balancer_types::random_reassignments::replica, std::__1::allocator<cluster::leader_balancer_types::random_reassignments::replica> >::at(unsigned long) const at /build/redpanda/vbuild/llvm/install/bin/../include/c++/v1/vector:1488
 (inlined by) fragmented_vector<cluster::leader_balancer_types::random_reassignments::replica, 8192ul>::operator[](unsigned long) const at /build/redpanda/src/v/container/include/container/fragmented_vector.h:178
 (inlined by) fragmented_vector<cluster::leader_balancer_types::random_reassignments::replica, 8192ul>::operator[](unsigned long) at /build/redpanda/src/v/container/include/container/fragmented_vector.h:182
 (inlined by) cluster::leader_balancer_types::random_reassignments::generate_reassignment() at /build/redpanda/src/v/cluster/scheduling/leader_balancer_random.h:70
 (inlined by) cluster::leader_balancer_types::random_hill_climbing_strategy::find_movement(absl::lts_20230802::flat_hash_set<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> >, absl::lts_20230802::hash_internal::Hash<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::equal_to<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > >, std::__1::allocator<detail::base_named_type<long, raft::raft_group_id_type, std::__1::integral_constant<bool, true> > > > const&) at /build/redpanda/src/v/cluster/scheduling/leader_balancer_random.h:124
cluster::leader_balancer::balance() at /build/redpanda/src/v/cluster/scheduling/leader_balancer.cc:498
std::__1::coroutine_handle<seastar::internal::coroutine_traits_base<seastar::bool_class<seastar::stop_iteration_tag> >::promise_type>::resume[abi:v160004]() const at /build/redpanda/vbuild/llvm/install/bin/../include/c++/v1/__coroutine/coroutine_handle.h:169
 (inlined by) seastar::internal::coroutine_traits_base<seastar::bool_class<seastar::stop_iteration_tag> >::promise_type::run_and_dispose() at /build/redpanda/vbuild/release/clang/rp_deps_install/include/seastar/core/coroutine.hh:83
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at /build/redpanda/vbuild/release/clang/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2750
 (inlined by) seastar::reactor::run_some_tasks() at /build/redpanda/vbuild/release/clang/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3213
seastar::reactor::do_run() at /build/redpanda/vbuild/release/clang/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3397
seastar::reactor::run() at /build/redpanda/vbuild/release/clang/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3265
seastar::app_template::run_deprecated(int, char**, std::__1::function<void ()>&&) at /build/redpanda/vbuild/release/clang/v_deps_build/seastar-prefix/src/seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::__1::function<seastar::future<int> ()>&&) at /build/redpanda/vbuild/release/clang/v_deps_build/seastar-prefix/src/seastar/src/core/app-template.cc:167
application::run(int, char**) at /build/redpanda/src/v/redpanda/application.cc:416
main at /build/redpanda/src/v/redpanda/main.cc:22
?? ??:0
?? ??:0
_start at ??:?
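
Reading the trace from the bottom up, the stalled task is the `cluster::leader_balancer::balance()` coroutine, and the innermost application frames are `random_hill_climbing_strategy::find_movement` and `random_reassignments::generate_reassignment`. Both run synchronously inside a single reactor task. A common way to keep such a loop from monopolizing a shard is to give it a time budget and hand control back when the budget is spent. The following is a minimal, standalone sketch of that idea, not Redpanda or Seastar code: `run_with_budget`, `step`, and `yield` are hypothetical names introduced for illustration. In a real Seastar coroutine the yield point would typically be `co_await seastar::coroutine::maybe_yield()` rather than a callback.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical helper for illustration only (not part of Redpanda or Seastar).
// Calls `step(i)` for i in [0, iterations) and invokes `yield` whenever at
// least `budget` has elapsed since the start of the current time slice.
// Returns the number of times `yield` was invoked.
inline std::size_t run_with_budget(
  std::size_t iterations,
  std::chrono::steady_clock::duration budget,
  const std::function<void(std::size_t)>& step,
  const std::function<void()>& yield) {
    using clock = std::chrono::steady_clock;
    std::size_t yields = 0;
    auto slice_start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        step(i);
        // Checking the clock on every iteration is the simplest form; a real
        // loop might check only every N iterations to keep overhead low.
        if (clock::now() - slice_start >= budget) {
            yield();
            ++yields;
            slice_start = clock::now();
        }
    }
    return yields;
}
```

With a budget well below the stall detector threshold, a long scan is split into many short slices. The issue does not state that this was the fix that was applied; the thread attributes the resolution to changes in the fragmented vector.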

What should have happened instead?

The leader balancer should not stall the reactor.

How to reproduce the issue?

~T6 (45k partitions) under load

mmaslankaprv commented 4 months ago

It seems to be resolved by the latest changes to the fragmented vector from Tyler.
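
For context on why the container matters here: the backtrace shows `fragmented_vector<..., 8192ul>::operator[]` forwarding to a bounds-checked `std::vector::at`, so every element lookup in the balancer's hot loop pays for fragment selection plus a range check. The sketch below illustrates the general shape of a fragmented (chunked) vector. It is a simplified illustration, not Redpanda's `fragmented_vector`; the class name `chunked_vector_sketch` is hypothetical, and treating `8192` as a fragment size in bytes is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified illustration only; NOT Redpanda's fragmented_vector.
// Elements live in fixed-capacity fragments so that no single allocation
// grows beyond FragmentBytes (assumed here to be a size in bytes).
template <typename T, std::size_t FragmentBytes = 8192>
class chunked_vector_sketch {
    static_assert(sizeof(T) <= FragmentBytes, "element must fit in a fragment");
    static constexpr std::size_t elems_per_frag = FragmentBytes / sizeof(T);

public:
    void push_back(const T& v) {
        // Start a new fragment when there is none or the last one is full.
        if (frags_.empty() || frags_.back().size() == elems_per_frag) {
            frags_.emplace_back();
            frags_.back().reserve(elems_per_frag);
        }
        frags_.back().push_back(v);
        ++size_;
    }

    // Two-level lookup: pick the fragment, then the slot inside it.
    // Both levels use the bounds-checked at(), mirroring the at() call
    // visible in the backtrace.
    const T& operator[](std::size_t i) const {
        return frags_.at(i / elems_per_frag).at(i % elems_per_frag);
    }

    std::size_t size() const { return size_; }
    std::size_t fragment_count() const { return frags_.size(); }

private:
    std::vector<std::vector<T>> frags_;
    std::size_t size_ = 0;
};
```

Each lookup costs a division, a modulo, and two range checks, which is cheap once but adds up when a search loop performs it many times per balancing pass over tens of thousands of partitions.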

piyushredpanda commented 3 months ago

Closing per above comment from Michal.