Open xemul opened 7 months ago
@michoecho - I believe you've fixed it?
> @michoecho - I believe you've fixed it?
No. I fixed #13437, but that was about CQL compression allocating too many large buffers. This backtrace is about RPC compression and is not related to large allocations.
This stall is to be expected if the RPC frame is huge. But we don't know whether that's really what happened in this case. If we truly get such huge RPC frames in practice, then the fix would be to make large compressions preemptible. But we don't know whether that's really the case, or whether this is an artifact of something else (e.g. scylladb/scylladb#8828).
By the way, this is a duplicate of https://github.com/scylladb/scylla-enterprise/issues/3142. Avi suggested there that we could dump the top of the stack during stall reports, so we could extract some interesting parameters from it (here: the size of the compressed buffer). An easier but less general diagnostic would be to add a special log for stalling compressions.
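To illustrate the "make large compressions preemptible" idea mentioned above: a minimal sketch in Python, using zlib as a stand-in for the RPC compressor (the chunk size and the `maybe_yield` hook are assumptions, not Seastar API). Instead of compressing the whole frame in one nonpreemptible call, it compresses incrementally and offers a preemption point between chunks:

```python
import zlib

def compress_preemptible(data: bytes, chunk_size: int = 128 * 1024,
                         maybe_yield=lambda: None) -> bytes:
    """Compress `data` incrementally, calling maybe_yield() between
    chunks. In Seastar terms, maybe_yield() would check need_preempt()
    and yield to the reactor; here it is just a hook."""
    comp = zlib.compressobj()
    out = []
    for off in range(0, len(data), chunk_size):
        out.append(comp.compress(data[off:off + chunk_size]))
        maybe_yield()  # preemption point between chunks
    out.append(comp.flush())
    return b"".join(out)

# Round-trips exactly like a one-shot compression:
payload = b"x" * (1024 * 1024)
assert zlib.decompress(compress_preemptible(payload)) == payload
```

The output is identical to one-shot compression because the incremental compressor carries its state across chunks; only the stall pattern changes.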
There's no timing or frame size in the report.
> There's no timing or frame size in the report.
This report has been copied here from https://github.com/scylladb/scylla-enterprise/issues/3121 — a real-world scenario, not a synthetic test. So there can't be a frame size in the report, because Scylla doesn't report it in any way.
The stall time is 70ms. But that doesn't tell you much without the frame size, does it?
I don't believe compression of a reasonable frame can take 70ms.
lz4 is reported as compressing > 500MB/s/core (amazing). A 1 MB frame would take 2ms. 70ms at that speed would be 35MB.
The reference system runs at 4.9GHz, so maybe real world speed is slower, but still.
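Spelling out that arithmetic (the throughput figure is the assumed ~500 MB/s/core from this thread, not a measurement):

```python
# Back-of-the-envelope compression-time math. The 500 MB/s/core figure
# is the assumed lz4 throughput quoted above, not a measurement.
def compress_ms(size_mb, throughput_mb_per_s):
    """Time (ms) to compress size_mb at the given throughput."""
    return size_mb / throughput_mb_per_s * 1000

def mb_in_stall(stall_ms, throughput_mb_per_s):
    """How many MB fit into a stall of stall_ms at the given throughput."""
    return throughput_mb_per_s * stall_ms / 1000

print(compress_ms(1, 500))   # 2.0 ms for a 1 MB frame
print(mb_in_stall(70, 500))  # 35.0 MB in a 70 ms stall
```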
> lz4 is reported as compressing > 500MB/s/core (amazing).
The benchmark actually says 780MB/s/core. Even Zstd can do 500MB/s/core. But remember it's reported per core, not per hyperthread.
> A 1 MB frame would take 2ms. 70ms at that speed would be 35MB.
This stall comes from repair. What's the max size of frames produced by repair? Isn't it 32 MiB? That would about check out.
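A quick sanity check of the 32 MiB guess against the assumed throughput (both numbers come from this thread, neither is measured):

```python
# If repair's max frame is 32 MiB and lz4 does ~500 MB/s/core,
# one frame would stall the reactor for about 67 ms -- in the
# right ballpark for the reported 70 ms stall.
MIB = 1024 * 1024

def stall_ms(frame_bytes, throughput_mb_per_s):
    return frame_bytes / (throughput_mb_per_s * 1e6) * 1000

print(round(stall_ms(32 * MIB, 500), 1))  # 67.1
```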
> This stall comes from repair.
Someone from the original issue disagrees with that
> > This stall comes from repair.
>
> Someone from the original issue disagrees with that
It doesn't come from repair code, but it's caused by repair's RPC frames. There's no disagreement here.
Repair can't do anything about the fact that the RPC layer doesn't preempt inside frames. It could, however, send smaller frames, if RPC can't deal with large frames well.
Doesn't repair send its data via output_stream engine, not via sockets directly?
> Doesn't repair send its data via output_stream engine, not via sockets directly?
It sends its data via Scylla's messaging_service. Who says anything about sockets?
> It sends its data via Scylla's messaging_service
Apparently it doesn't: messaging_service has no send/recv facilities of its own, it just provides sockets. What I meant is that to send fragments over the wire, repair negotiates a stream socket, wraps it into rpc::sink/rpc::source, and then calls operator() on it, passing serializable arguments. I thought it was a "classical" iostream, but now that I've refreshed my memory, I see it's its own thing.
> > It sends its data via Scylla's messaging_service
>
> Apparently it doesn't, messaging service doesn't have send/recv facilities, it just provides sockets.
That's what I meant. But you are using the word "socket" differently than me. Should "rpc::sink" be called a socket?
Anyway, I meant that the serializable arguments might be arbitrarily large, and compression will compress the entire message at once. Repair can't help that, but it could split what it sends into smaller pieces.
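The "split what it sends into smaller pieces" alternative could look like this sketch (the 1 MB frame cap is an arbitrary assumption chosen to keep per-frame compression time small):

```python
def split_message(payload: bytes, max_frame: int = 1024 * 1024):
    """Split one large serialized message into bounded-size frames so
    each frame compresses within a small, predictable stall budget."""
    return [payload[i:i + max_frame]
            for i in range(0, len(payload), max_frame)]

# A 35 MB message becomes 35 frames of at most 1 MB each:
frames = split_message(b"r" * (35 * 1024 * 1024))
assert all(len(f) <= 1024 * 1024 for f in frames)
assert b"".join(frames) == b"r" * (35 * 1024 * 1024)
```

With the earlier ~500 MB/s/core assumption, each 1 MB frame would cost about 2 ms of compression, well under any stall-report threshold.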
If this is about repair's large messages, then we should be able to confirm it with a test: enable internode_compression, start a Scylla cluster, do something that results in a large partition being repaired, and observe whether the stalls happen.
If we confirm that, we can discuss whether this is something that should be fixed on Scylla's side (by sending smaller messages), or on Seastar's side (by making RPC compression preemptible).
repair uses rpc streaming. the unit frame there is a row. so if the row is reasonable, so will be its compression time.
> the unit frame there is a row
Are you sure about that?
As far as I can see from the code, the unit is `repair_row_on_wire`, which is actually a batch of rows, despite the name.
> so if the row is reasonable, so will be its compression time.
And if the row is "too large", shouldn't we have a large-data-handler record? (The default threshold is 10 MB.)
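If that threshold holds, it also bounds how long a single row can stall the compressor (the 10 MB threshold and 500 MB/s/core throughput are the figures assumed earlier in this thread):

```python
# A row under the 10 MB large-data threshold compresses in at most
# ~20 ms at the assumed 500 MB/s/core, so a single unflagged row
# alone would not account for a 70 ms stall.
print(10 / 500 * 1000)  # 20.0 ms
```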
> > the unit frame there is a row
>
> Are you sure about that?
>
> As far as I can see from the code, the unit is `repair_row_on_wire`, which is actually a batch of rows, despite the name.
@asias ?