Open travisdowns opened 2 years ago
Great report, and not surprising that this is happening. I think we're going to find quite a few of these.
FYI this stall accounts for 3070 of the 5866 unique stacks in a large load test (1152 cores) performed recently (you probably know the one). Most stalls are modest, ~260 ms, with only a handful of larger ones (up to minutes, but the node appears unhealthy in those cases).
up to minutes
woah. that's unexpected. presumably that isn't real computation, but the reactor thread blocked on something?
Yeah, as I sort of vaguely alluded to with "unhealthy node", the tops of the stacks for the very long stalls looked different, e.g., they are in things like `internal::throw_bad_alloc()` in sstring.hh,
i.e., we are OOM. It isn't immediately clear why that causes a minutes-long stall: perhaps there is some kind of loop with continual OOMs if the exception doesn't make it all the way up to the top level, but I think we can basically consider that a different issue.
we are OOM
ahh, got it. all sorts of stuff happens in a low-memory situation: there is the normal allocator looking for memory, the reclaimer processing our batch cache, cross-core compare-and-swap loops trying to reclaim foreign memory.
loop with continual OOMs
yeh. I could see this happening.
Version & Environment
Redpanda version: 21.11.10
What went wrong?
Reactor stalls while handling client metadata requests for a topic with a large number of partitions (45k).
What should have happened instead?
No or very infrequent stalls.
How to reproduce the issue?
Should reproduce while consuming from a large topic.
Additional information
We don't see a single stall location but rather several, almost all relating to operations that iterate over some intermediate version of the response object (usually a vector of objects which themselves contain many small vectors, e.g., little vectors for each replica set).
Here is a backtrace showing time being spent copying the very large response vector:
This one can actually be fixed with an additional `std::move` at the return: `res` is a very large, deeply nested intermediate response object, and replacing `return res` with `return std::move(res)` would avoid the extra copy which shows up in many traces.
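For context on why the explicit move matters, here is a minimal sketch, assuming the return happens somewhere implicit move-on-return does not apply (e.g. the returned object is a class member rather than a plain local); the types here (`metadata_response`, `response_builder`, etc.) are hypothetical stand-ins, not Redpanda's actual structures:

```cpp
#include <utility>
#include <vector>

// Illustrative stand-ins for the deeply nested response: a vector of topics,
// each holding many small per-partition vectors (one per replica set).
struct partition_md {
    std::vector<int> replicas;
};
struct topic_md {
    std::vector<partition_md> partitions;
};
using metadata_response = std::vector<topic_md>;

struct response_builder {
    metadata_response res;

    // `res` is a class member, so implicit move-on-return does not apply:
    // `return res;` deep-copies every topic and partition vector.
    metadata_response finish_copy() { return res; }

    // `std::move(res)` transfers the underlying buffers instead, leaving the
    // member empty; that's fine if the builder is discarded right afterwards.
    metadata_response finish_move() { return std::move(res); }
};
```

If `res` were an ordinary local variable, NRVO or implicit move would normally make the explicit `std::move` redundant, so a copy showing up in traces points at one of the cases where those don't apply.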
Other stacks don't involve copies, but rather transformations of the response object, e.g. here:
The one above is inside the transformation implemented by `make_topic_response_from_topic_metadata`.

A practical approach would be to implement the "front end" of the metadata response in a vanilla (synchronous) way that involves no O(partition_count) operations, i.e., just collecting data from the metadata cache, performing the authorization checks, etc., followed by a final step which does the O(partition_count) work of iterating over the collected structures and generating the result. That final iteration should be done in an async-aware way, e.g., with `seastar::do_for_each` or similar, and this might also extend down to the response itself: i.e., maybe the response needs to be streamed, with possible suspension while the response is in progress.
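A rough sketch of that two-phase shape, with the final pass under `seastar::do_for_each`; all type and function names here (`partition_md`, `build_response`, etc.) are hypothetical stand-ins rather than Redpanda's actual code, and the sketch assumes the synchronous collection phase has already run:

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>   // seastar::do_for_each
#include <vector>

// Hypothetical stand-ins: what the synchronous front end collects per
// partition from the metadata cache, and the response rows built from it.
struct partition_md {
    int id;
    std::vector<int> replicas;   // small per-replica-set vector
};
struct partition_response {
    int id;
    std::vector<int> replicas;
};
struct metadata_response {
    std::vector<partition_response> partitions;
};

// Phase 2: the only O(partition_count) pass. do_for_each can yield back to
// the reactor between elements, so a 45k-partition topic no longer has to be
// processed in one uninterrupted loop.
seastar::future<metadata_response>
build_response(std::vector<partition_md> collected) {
    metadata_response resp;
    resp.partitions.reserve(collected.size());
    co_await seastar::do_for_each(collected, [&resp](const partition_md& p) {
        resp.partitions.push_back(partition_response{p.id, p.replicas});
    });
    co_return std::move(resp);
}
```

The suspension points introduced by the loop would also be natural places for a streaming response to flush partial output, if that turns out to be needed.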
JIRA Link: CORE-881