icerpc / icerpc-csharp

A C# RPC framework built for QUIC, with bidirectional streaming, first-class async/await, and Protobuf support.
https://docs.icerpc.dev
Apache License 2.0

Add `StreamLastAndReadCompleted` frame to avoid sending an extra `StreamReadCompleted` frame #3218

Closed bentoi closed 1 year ago

bentoi commented 1 year ago

For bidirectional streams, the remote stream always sends a StreamReadCompleted frame when reads are closed. Reads are considered closed once the data from the remote stream input has been consumed.

Instead, we could look into not sending this frame in the scenario where writes are not closed.

The remote stream would send a new StreamLastAndReadCompleted frame instead of a StreamLast frame. Upon receiving this frame, the local stream can consider the remote stream's reads and writes as closed. As a consequence, the connection can release the bidirectional stream semaphore.
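
To make the proposal concrete, here is a minimal sketch of the write-side decision; the FrameType values and the ClosingFrame helper are hypothetical names for illustration, not the actual Slic protocol code.

```csharp
// Hypothetical frame types for illustration; not the actual Slic frame definitions.
internal enum FrameType
{
    Stream,                       // stream data
    StreamLast,                   // last stream data: writes are closed
    StreamReadsCompleted,         // reads are closed (all buffered data consumed)
    StreamLastAndReadCompleted    // proposed: last data and reads closed in one frame
}

internal static class ClosingFrame
{
    // If reads are already completed on a bidirectional stream when the last data is
    // written, fold the reads-completed notification into the closing frame and save
    // the separate StreamReadsCompleted frame (and the extra network write).
    public static FrameType Select(bool isBidirectional, bool readsCompleted) =>
        isBidirectional && readsCompleted
            ? FrameType.StreamLastAndReadCompleted
            : FrameType.StreamLast;
}
```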

bernardnormier commented 1 year ago

It sounds like a worthwhile optimization for the common case of small RPCs.

bernardnormier commented 1 year ago

Could we just use 2 frames instead of 3?

With this proposal, we'd have three frames involved in closing a stream. It seems to me that two would be sufficient.

In particular for unidirectional streams, StreamComplete(s) would replace StreamReadsCompleted, where the reader tells the writer "I'm done".

bentoi commented 1 year ago

There's also the following scenario: I'm done with reading (please stop sending) but I still have something to send you. In this scenario, the remote stream needs to be able to send StreamReadsCompleted before sending the data, right?

bentoi commented 1 year ago

With Quic, I think this is handled with the MAX_STREAMS frame. If initial_max_streams_bidi is set to 1 and the client creates a bi-dir stream, it won't be able to open a new bi-dir stream until the peer sends a MAX_STREAMS(MaxStreams=1) frame to allow the client to open a new stream.

The server could actually send this frame even if the first stream is still open.

So with Quic:

- client sends its STREAM_LAST frame
- server sends its STREAM_LAST frame
- server sends MAX_STREAMS(MaxStreams=1) to allow the client to open a new stream

We'd have to check this with Wireshark... If that's how it works, the question will be whether or not we should do the same with Slic. A two-way icerpc request would still require 3 frames here (2 x STREAM_LAST + 1 MAX_STREAMS).
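
For the Wireshark check, here is a minimal sketch, assuming .NET's System.Net.Quic API (certificate and listener setup omitted), of constraining the connection to a single inbound bidirectional stream, i.e. initial_max_streams_bidi = 1, so the MAX_STREAMS grants stand out in the capture:

```csharp
using System.Collections.Generic;
using System.Net.Quic;
using System.Net.Security;

// Sketch only: with this limit, the client can have at most one bidirectional stream
// open until the server sends a MAX_STREAMS frame granting more credit.
var serverConnectionOptions = new QuicServerConnectionOptions
{
    MaxInboundBidirectionalStreams = 1,
    DefaultStreamErrorCode = 0,
    DefaultCloseErrorCode = 0,
    ServerAuthenticationOptions = new SslServerAuthenticationOptions
    {
        // The ALPN protocol name is a placeholder; a real setup also needs a certificate.
        ApplicationProtocols = new List<SslApplicationProtocol> { new("icerpc") }
    }
};
```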

The question is whether or not this has a significant impact on icerpc latency.

bernardnormier commented 1 year ago

> There's also the following scenario: I'm done with reading (please stop sending) but I still have something to send you. In this scenario, the remote stream needs to be able to send StreamReadsCompleted before sending the data, right?

Yes, you're right.

bernardnormier commented 1 year ago

> The question is whether or not this has a significant impact on icerpc latency.

You mean if sending 3 frames instead of 2 has a significant impact on latency? I bet it does.

bernardnormier commented 1 year ago

> server sends MAX_STREAMS(MaxStreams=1) to allow the client to open a new stream

When does the server decide to send this frame? It doesn't know when the client reads the STREAM_LAST frame it sent, so arguably it could "bundle" it with the STREAM_LAST frame, which is in a way what we're talking about here with this StreamLastAndReadCompleted frame.

An alternative to this new StreamLastAndReadCompleted frame could be to send both StreamLast and StreamReadsCompleted together in a single write.

The logic being: when reading StreamLast from a bidir stream that is still writable, defer the sending of StreamReadsCompleted until the next write, and bundle it with the next Stream/StreamLast frame.
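
A rough sketch of that deferral logic, with hypothetical names (WriteSideSketch is illustrative, not the actual SlicStream implementation):

```csharp
using System;

// Illustrative only: when reads complete on a bidirectional stream whose write side is
// still open, the StreamReadsCompleted frame isn't sent on its own; it is bundled with
// the next Stream or StreamLast frame in a single write.
internal sealed class WriteSideSketch
{
    private readonly Action<string> _write; // stands in for a single buffered transport write
    private bool _readsCompletedPending;

    public WriteSideSketch(Action<string> write) => _write = write;

    // Called when the application has consumed all incoming data while writes are still open.
    public void DeferReadsCompleted() => _readsCompletedPending = true;

    public void WriteFrame(bool last)
    {
        string frame = last ? "StreamLast" : "Stream";
        if (_readsCompletedPending)
        {
            _readsCompletedPending = false;
            _write(frame + " + StreamReadsCompleted"); // two frames, one write
        }
        else
        {
            _write(frame);
        }
    }
}
```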

bentoi commented 1 year ago

> When does the server decide to send this frame? It doesn't know when the client reads the STREAM_LAST frame it sent, so arguably it could "bundle" it with the STREAM_LAST frame

Why would it matter to the server whether the client has read the STREAM_LAST frame? I expect a well-behaved Quic implementation sends the MAX_STREAMS frame once both the remote stream reads and writes are closed, regardless of the local stream state.

See also the stream states here. "Reads closed" = the "DataRead" state, "Writes closed" = the "DataRcvd" state.
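
For reference, these are the receiving-part stream states from RFC 9000 (Section 3.2); the comments map them to the terms used in this discussion:

```csharp
// Receiving-part stream states from RFC 9000, Section 3.2, with comments mapping them
// to the "reads closed" / "writes closed" terminology used in this discussion.
internal enum ReceivingStreamState
{
    Recv,        // still receiving data
    SizeKnown,   // a frame with the FIN bit was received; the final size is known
    DataRcvd,    // all data received and buffered: the peer's writes are closed
    DataRead,    // all buffered data consumed by the application: reads are closed
    ResetRecvd,  // a RESET_STREAM frame was received
    ResetRead    // the application observed the reset
}
```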

The remote stream should consider that reads are closed only once all the buffered data has been consumed by the application (DataRead state). It shouldn't consider reads closed when it received the STREAM_LAST frame (DataRcvd state).

I suspect this is the issue with .NET Quic (or msquic). It sends the MAX_STREAMS frame too soon, before the data buffered on the remote stream is consumed by the application.

We'll see what Microsoft has to say, it's also quite possible I'm missing something.

If we want to avoid the extra frame, the StreamLastAndReadsCompleted frame sounds simpler to me.

bernardnormier commented 1 year ago

> and bundle it with the next Stream/StreamLast frame.

One potential advantage over the StreamLastAndReadsCompleted frame (other than one fewer frame type) is that we'd communicate StreamReadsCompleted earlier in the event the server keeps writing for a while.

Question: we currently have this exchange:

StreamLast ->
StreamReadsCompleted <-
StreamLast <-

The client doesn't need to send a StreamReadsCompleted to the server:

StreamReadsCompleted ->

because it's the "accepter" (the server here) that flow-controls the streams?

And here, as soon as the accepter sends StreamLast, it can dispose the stream?

bentoi commented 1 year ago

Let's use "local stream" for the stream created by the client and "remote stream" for the stream accepted by the server.

Yes, the local stream never sends StreamReadsCompleted, see https://github.com/icerpc/icerpc-csharp/blob/8262c7083a59411234a9824fde3cc5f47d724130/src/IceRpc/Transports/Slic/Internal/SlicStream.cs#L149

> because it's the "accepter" (the server here) that flow-controls the streams?

Yes, it's the server that decides when the client can create a new stream.

The remote stream's reads and writes are closed when the remote stream has sent StreamReadsCompleted and StreamLast, respectively.

When the local stream receives these two frames, the connection bidirectional stream semaphore is released to unblock a CreateStreamAsync call.
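
A rough sketch of that bookkeeping (illustrative only, not the actual SlicConnection code; the per-stream flags are shown inline for brevity):

```csharp
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: the bidirectional stream semaphore is released, unblocking a
// pending CreateStreamAsync call, once the remote stream has sent both
// StreamReadsCompleted (reads closed) and StreamLast (writes closed).
internal sealed class BidirectionalStreamGate
{
    private readonly SemaphoreSlim _semaphore;
    private bool _remoteReadsClosed;
    private bool _remoteWritesClosed;

    public BidirectionalStreamGate(int maxBidirectionalStreams) =>
        _semaphore = new SemaphoreSlim(maxBidirectionalStreams);

    // Awaited before creating a new bidirectional stream.
    public Task WaitAsync(CancellationToken cancellationToken = default) =>
        _semaphore.WaitAsync(cancellationToken);

    public void OnStreamReadsCompletedReceived()
    {
        _remoteReadsClosed = true;
        ReleaseIfFullyClosed();
    }

    public void OnStreamLastReceived()
    {
        _remoteWritesClosed = true;
        ReleaseIfFullyClosed();
    }

    private void ReleaseIfFullyClosed()
    {
        // Only the call that observes both sides closed releases the semaphore.
        if (_remoteReadsClosed && _remoteWritesClosed)
        {
            _semaphore.Release();
        }
    }
}
```

With the proposed StreamLastAndReadCompleted frame, a single incoming frame would set both flags.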

bentoi commented 1 year ago

Here are the results with the addition of a StreamLastAndReadsCompleted frame. The results vary from run to run, but overall they follow this trend. I don't understand why the task count 16 and 64 benchmarks are slower with the change.

| Method | Job | Protocol | Mean | Error | StdDev | Median | Op/s | Ratio |
|---|---|---|---|---|---|---|---|---|
| InvokeLatency | 0.1.0 | icerpc | 141.51 us | 1.932 us | 1.807 us | 141.71 us | 7,066.7 | baseline |
| InvokeLatency | 0.1.1-slicfix | icerpc | 98.13 us | 0.397 us | 0.352 us | 98.11 us | 10,190.9 | 1.44x faster |
| InvokeRPSWithTaskCount1 | 0.1.0 | icerpc | 129.57 us | 2.094 us | 2.722 us | 128.80 us | 7,717.8 | baseline |
| InvokeRPSWithTaskCount1 | 0.1.1-slicfix | icerpc | 99.51 us | 1.952 us | 2.324 us | 99.20 us | 10,049.3 | 1.30x faster |
| InvokeRPSWithTaskCount4 | 0.1.0 | icerpc | 62.18 us | 0.902 us | 0.844 us | 62.49 us | 16,082.0 | baseline |
| InvokeRPSWithTaskCount4 | 0.1.1-slicfix | icerpc | 56.74 us | 0.763 us | 0.713 us | 56.82 us | 17,622.8 | 1.10x faster |
| InvokeRPSWithTaskCount16 | 0.1.0 | icerpc | 49.81 us | 0.955 us | 2.516 us | 48.93 us | 20,076.2 | baseline |
| InvokeRPSWithTaskCount16 | 0.1.1-slicfix | icerpc | 61.14 us | 1.198 us | 1.935 us | 61.64 us | 16,355.6 | 1.20x slower |
| InvokeRPSWithTaskCount64 | 0.1.0 | icerpc | 50.80 us | 1.347 us | 3.971 us | 49.73 us | 19,684.9 | baseline |
| InvokeRPSWithTaskCount64 | 0.1.1-slicfix | icerpc | 57.00 us | 1.390 us | 4.098 us | 56.69 us | 17,542.7 | 1.13x slower |

bentoi commented 1 year ago

gRPC vs IceRPC w/ Slic fix

| Method | Benchmark | Mean | Error | StdDev | Op/s |
|---|---|---|---|---|---|
| PingLatency | IceRPC(icerpc,tcp) | 101.63 us | 1.978 us | 1.850 us | 9,839.2 |
| PingLatency | gRPC(http/2,tcp) | 147.01 us | 1.438 us | 1.274 us | 6,802.1 |
| PingRPSWithTaskCount1 | IceRPC(icerpc,tcp) | 95.83 us | 1.214 us | 1.136 us | 10,435.6 |
| PingRPSWithTaskCount4 | IceRPC(icerpc,tcp) | 55.35 us | 0.955 us | 0.893 us | 18,066.7 |
| PingRPSWithTaskCount16 | IceRPC(icerpc,tcp) | 55.79 us | 1.053 us | 1.253 us | 17,923.0 |
| PingRPSWithTaskCount64 | IceRPC(icerpc,tcp) | 52.35 us | 1.009 us | 0.894 us | 19,101.0 |
| PingRPSWithTaskCount1 | gRPC(http/2,tcp) | 136.44 us | 0.846 us | 0.791 us | 7,329.2 |
| PingRPSWithTaskCount4 | gRPC(http/2,tcp) | 53.42 us | 0.957 us | 1.278 us | 18,720.4 |
| PingRPSWithTaskCount16 | gRPC(http/2,tcp) | 20.39 us | 0.402 us | 0.376 us | 49,046.8 |
| PingRPSWithTaskCount64 | gRPC(http/2,tcp) | 13.52 us | 0.267 us | 0.262 us | 73,944.5 |