Open xemul opened 7 months ago
@michoecho - I believe you've fixed it?
> @michoecho - I believe you've fixed it?
No. I fixed #13437, but that was about CQL compression allocating too many large buffers. This backtrace is about RPC compression and is not related to large allocations.
This stall is to be expected if the RPC frame is huge. But we don't know whether that's really what happened in this case. If we truly get such huge RPC frames in practice, then the fix would be to make large compressions preemptible. But we don't know whether that's really the case, or whether this is an artifact of something else (e.g. scylladb/scylladb#8828).
By the way, this is a duplicate of https://github.com/scylladb/scylla-enterprise/issues/3142. Avi suggested there that we could dump the top of the stack during stall reports, so we could extract some interesting parameters from it (here: the size of the compressed buffer). An easier but less general diagnostic would be to add a special log for stalling compressions.
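To illustrate the "make large compressions preemptible" idea mentioned above: a minimal sketch in Python, using zlib as a stand-in for the RPC compressor (the chunk size and the `maybe_yield` hook are assumptions, not Seastar API). Instead of compressing the whole frame in one nonpreemptible call, it compresses incrementally and offers a preemption point between chunks:

```python
import zlib

def compress_preemptible(data: bytes, chunk_size: int = 128 * 1024,
                         maybe_yield=lambda: None) -> bytes:
    """Compress `data` incrementally, calling maybe_yield() between
    chunks. In Seastar terms, maybe_yield() would check need_preempt()
    and yield to the reactor; here it is just a hook."""
    comp = zlib.compressobj()
    out = []
    for off in range(0, len(data), chunk_size):
        out.append(comp.compress(data[off:off + chunk_size]))
        maybe_yield()  # preemption point between chunks
    out.append(comp.flush())
    return b"".join(out)

# Round-trips exactly like a one-shot compression:
payload = b"x" * (1024 * 1024)
assert zlib.decompress(compress_preemptible(payload)) == payload
```

The output is identical to one-shot compression because the incremental compressor carries its state across chunks; only the stall pattern changes.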
There's no timing or frame size in the report.
> There's no timing or frame size in the report.
This report has been copied here from https://github.com/scylladb/scylla-enterprise/issues/3121 — a real-world scenario, not a synthetic test. So there can't be a frame size in the report, because Scylla doesn't report it in any way.
The stall time is 70ms. But that doesn't tell you much without the frame size, does it?
I don't believe compression of a reasonable frame can take 70ms.
lz4 is reported as compressing > 500MB/s/core (amazing). A 1 MB frame would take 2ms. 70ms at that speed would be 35MB.
The reference system runs at 4.9GHz, so maybe real world speed is slower, but still.
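Spelling out that arithmetic (the throughput figure is the assumed ~500 MB/s/core from this thread, not a measurement):

```python
# Back-of-the-envelope compression-time math. The 500 MB/s/core figure
# is the assumed lz4 throughput quoted above, not a measurement.
def compress_ms(size_mb, throughput_mb_per_s):
    """Time (ms) to compress size_mb at the given throughput."""
    return size_mb / throughput_mb_per_s * 1000

def mb_in_stall(stall_ms, throughput_mb_per_s):
    """How many MB fit into a stall of stall_ms at the given throughput."""
    return throughput_mb_per_s * stall_ms / 1000

print(compress_ms(1, 500))   # 2.0 ms for a 1 MB frame
print(mb_in_stall(70, 500))  # 35.0 MB in a 70 ms stall
```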
> lz4 is reported as compressing > 500MB/s/core (amazing).
The benchmark actually says 780MB/s/core. Even Zstd can do 500MB/s/core. But remember it's reported per core, not per hyperthread.
> A 1 MB frame would take 2ms. 70ms at that speed would be 35MB.
This stall comes from repair. What's the max size of frames produced by repair? Isn't it 32 MiB? That would about check out.
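A quick sanity check of the 32 MiB guess against the assumed throughput (both numbers come from this thread, neither is measured):

```python
# If repair's max frame is 32 MiB and lz4 does ~500 MB/s/core,
# one frame would stall the reactor for about 67 ms -- in the
# right ballpark for the reported 70 ms stall.
MIB = 1024 * 1024

def stall_ms(frame_bytes, throughput_mb_per_s):
    return frame_bytes / (throughput_mb_per_s * 1e6) * 1000

print(round(stall_ms(32 * MIB, 500), 1))  # 67.1
```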
> This stall comes from repair.
Someone from the original issue disagrees with that
> > This stall comes from repair.
>
> Someone from the original issue disagrees with that
It doesn't come from repair code, but it's caused by repair's RPC frames. There's no disagreement here.
Repair can't do anything about the fact that the RPC layer doesn't preempt inside frames. It could, however, send smaller frames, if RPC can't deal with large frames well.
Doesn't repair send its data via output_stream engine, not via sockets directly?
> Doesn't repair send its data via output_stream engine, not via sockets directly?
It sends its data via Scylla's messaging_service. Who says anything about sockets?
> It sends its data via Scylla's messaging_service
Apparently it doesn't: messaging_service has no send/recv facilities of its own, it just provides sockets. What I meant is that to send fragments over the wire, repair negotiates a stream socket, wraps it into rpc::sink/rpc::source, and then calls operator() on it, passing serializable arguments. I thought it was a "classical" iostream, but now that I've refreshed my memory, I see it's its own thing.
> > It sends its data via Scylla's messaging_service
>
> Apparently it doesn't, messaging service doesn't have send/recv facilities, it just provides sockets.
That's what I meant. But you are using the word "socket" differently than me. Should "rpc::sink" be called a socket?
Anyway, I meant that the serializable arguments might be arbitrarily large, and compression will compress the entire message at once. Repair can't help that, but it could split what it sends into smaller pieces.
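The "split what it sends into smaller pieces" alternative could look like this sketch (the 1 MB frame cap is an arbitrary assumption chosen to keep per-frame compression time small):

```python
def split_message(payload: bytes, max_frame: int = 1024 * 1024):
    """Split one large serialized message into bounded-size frames so
    each frame compresses within a small, predictable stall budget."""
    return [payload[i:i + max_frame]
            for i in range(0, len(payload), max_frame)]

# A 35 MB message becomes 35 frames of at most 1 MB each:
frames = split_message(b"r" * (35 * 1024 * 1024))
assert all(len(f) <= 1024 * 1024 for f in frames)
assert b"".join(frames) == b"r" * (35 * 1024 * 1024)
```

With the earlier ~500 MB/s/core assumption, each 1 MB frame would cost about 2 ms of compression, well under any stall-report threshold.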
If this is about repair's large messages, then we should be able to confirm it with a test: enable internode_compression, start a Scylla cluster, do something that results in a large partition being repaired, and observe whether the stalls happen.
If we confirm that, we can discuss whether this is something that should be fixed on Scylla's side (by sending smaller messages), or on Seastar's side (by making RPC compression preemptible).
repair uses rpc streaming. the unit frame there is a row. so if the row is reasonable, so will be its compression time.
> the unit frame there is a row
Are you sure about that?
As far as I can see from the code, the unit is `repair_row_on_wire`, which is actually a batch of rows, despite the name.
> so if the row is reasonable, so will be its compression time.
And if the row is "too large", shouldn't we have a large-data-handler record? (The default threshold is 10 MB.)
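If that threshold holds, it also bounds how long a single row can stall the compressor (the 10 MB threshold and 500 MB/s/core throughput are the figures assumed earlier in this thread):

```python
# A row under the 10 MB large-data threshold compresses in at most
# ~20 ms at the assumed 500 MB/s/core, so a single unflagged row
# alone would not account for a 70 ms stall.
print(10 / 500 * 1000)  # 20.0 ms
```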
> > the unit frame there is a row
>
> Are you sure about that?
>
> As far as I can see from the code, the unit is `repair_row_on_wire`, which is actually a batch of rows, despite the name.
@asias ?