Open travisdowns opened 2 years ago
Great report, and not surprising that this is happening. I think we're going to find quite a few of these.
FYI this stall accounts for 3070 of the 5866 unique stacks in a large load test (1152 cores) performed recently (you probably know the one). Most stalls are modest, ~260 ms, with only a handful of larger ones (up to minutes, but the node appears unhealthy in those cases).
up to minutes
woah. that's unexpected. presumably that isn't real computation, but the reactor thread blocked on something?
Yeah, as I sort of vaguely alluded to with "unhealthy node", the tops of the stacks for the very long stalls looked different, e.g., they are in things like `internal::throw_bad_alloc()` in sstring.hh,
i.e., we are OOM. It isn't immediately clear why that causes a minutes-long stall: perhaps there is some kind of loop with continual OOMs if the exception doesn't make it all the way up to the top level, but I think we can basically consider that a different issue.
we are OOM
ahh, got it. all sorts of stuff happens in a low-memory situation: there is the normal allocator looking for memory, the reclaimer processing our batch cache, cross-core compare-and-swap loops trying to reclaim foreign memory.
loop with continual OOMs
yeh. I could see this happening.
Version & Environment
Redpanda version: 21.11.10
What went wrong?
Reactor stalls while handling client metadata requests for a topic with a large number of partitions (45k).
What should have happened instead?
No or very infrequent stalls.
How to reproduce the issue?
Should reproduce while consuming from a large topic.
Additional information
We don't see a single stall location but rather several, almost all relating to operations that iterate over some intermediate version of the response object (usually a vector of objects which themselves contain many small vectors, e.g., little vectors for each replica set).
Here is a backtrace showing time being spent copying the very large response vector:
This one can actually be fixed with an additional `std::move` at the return: `res` is a very large, deeply nested intermediate response object, and replacing `return res` with `return std::move(res)` would avoid the extra copy which shows up in many traces.
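For context on why the explicit move matters, here is a minimal sketch, assuming the return happens somewhere implicit move-on-return does not apply (e.g. the returned object is a class member rather than a plain local); the types here (`metadata_response`, `response_builder`, etc.) are hypothetical stand-ins, not Redpanda's actual structures:

```cpp
#include <utility>
#include <vector>

// Illustrative stand-ins for the deeply nested response: a vector of topics,
// each holding many small per-partition vectors (one per replica set).
struct partition_md {
    std::vector<int> replicas;
};
struct topic_md {
    std::vector<partition_md> partitions;
};
using metadata_response = std::vector<topic_md>;

struct response_builder {
    metadata_response res;

    // `res` is a class member, so implicit move-on-return does not apply:
    // `return res;` deep-copies every topic and partition vector.
    metadata_response finish_copy() { return res; }

    // `std::move(res)` transfers the underlying buffers instead, leaving the
    // member empty; that's fine if the builder is discarded right afterwards.
    metadata_response finish_move() { return std::move(res); }
};
```

If `res` were an ordinary local variable, NRVO or implicit move would normally make the explicit `std::move` redundant, so a copy showing up in traces points at one of the cases where those don't apply.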
Other stacks don't involve copies, but rather transformations of the response object, e.g. here:
The one above is inside the transformation implemented by `make_topic_response_from_topic_metadata`.

A practical approach would be to implement the "front end" of the metadata response in a vanilla (synchronous) way that involves no O(partition_count) operations, i.e., just collecting data from the metadata cache, performing the authorization checks, etc., followed by a final step which does the O(partition_count) work of iterating over the collected structures and generating the result. That final iteration should be done in an async-aware way, e.g., with `seastar::do_for_each` or similar, and this might also extend down to the response itself: i.e., maybe the response needs to be streamed, with possible suspension while the response is in progress.
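A rough sketch of that two-phase shape, with the final pass under `seastar::do_for_each`; all type and function names here (`partition_md`, `build_response`, etc.) are hypothetical stand-ins rather than Redpanda's actual code, and the sketch assumes the synchronous collection phase has already run:

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>   // seastar::do_for_each
#include <vector>

// Hypothetical stand-ins: what the synchronous front end collects per
// partition from the metadata cache, and the response rows built from it.
struct partition_md {
    int id;
    std::vector<int> replicas;   // small per-replica-set vector
};
struct partition_response {
    int id;
    std::vector<int> replicas;
};
struct metadata_response {
    std::vector<partition_response> partitions;
};

// Phase 2: the only O(partition_count) pass. do_for_each can yield back to
// the reactor between elements, so a 45k-partition topic no longer has to be
// processed in one uninterrupted loop.
seastar::future<metadata_response>
build_response(std::vector<partition_md> collected) {
    metadata_response resp;
    resp.partitions.reserve(collected.size());
    co_await seastar::do_for_each(collected, [&resp](const partition_md& p) {
        resp.partitions.push_back(partition_response{p.id, p.replicas});
    });
    co_return std::move(resp);
}
```

The suspension points introduced by the loop would also be natural places for a streaming response to flush partial output, if that turns out to be needed.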
JIRA Link: CORE-881