icerpc / icerpc-csharp

A C# RPC framework built for QUIC, with bidirectional streaming, first-class async/await, and Protobuf support.
https://docs.icerpc.dev
Apache License 2.0

Add `StreamLastAndReadCompleted` frame to avoid sending an extra `StreamReadCompleted` frame #3218

Closed bentoi closed 1 year ago

bentoi commented 1 year ago

For bidirectional streams, the remote stream always sends a StreamReadCompleted frame when reads are closed. Reads are considered closed once the data from the remote stream input has been consumed.

Instead, we could look into not sending this frame in the scenario where writes are not closed.

The remote stream would send a new StreamLastAndReadCompleted frame instead of a StreamLast frame. Upon receiving this frame, the local stream can consider the remote stream's reads and writes as closed. As a consequence, the connection can release the bidirectional stream semaphore.
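
To make the proposal concrete, here is a minimal sketch of the write-side decision; the FrameType values and the ClosingFrame helper are hypothetical names for illustration, not the actual Slic protocol code.

```csharp
// Hypothetical frame types for illustration; not the actual Slic frame definitions.
internal enum FrameType
{
    Stream,                       // stream data
    StreamLast,                   // last stream data: writes are closed
    StreamReadsCompleted,         // reads are closed (all buffered data consumed)
    StreamLastAndReadCompleted    // proposed: last data and reads closed in one frame
}

internal static class ClosingFrame
{
    // If reads are already completed on a bidirectional stream when the last data is
    // written, fold the reads-completed notification into the closing frame and save
    // the separate StreamReadsCompleted frame (and the extra network write).
    public static FrameType Select(bool isBidirectional, bool readsCompleted) =>
        isBidirectional && readsCompleted
            ? FrameType.StreamLastAndReadCompleted
            : FrameType.StreamLast;
}
```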

bernardnormier commented 1 year ago

It sounds like a worthwhile optimization for the common case of small RPCs.

bernardnormier commented 1 year ago

Could we just use 2 frames instead of 3?

With this proposal, we'd have three frames involved in closing a stream. It seems to me that two would be sufficient.

In particular for unidirectional streams, StreamComplete(s) would replace StreamReadsCompleted, where the reader tells the writer "I'm done".

bentoi commented 1 year ago

There's also the following scenario: I'm done with reading (please stop sending) but I still have something to send you. In this scenario, the remote stream needs to be able to send StreamReadsCompleted before sending the data, right?

bentoi commented 1 year ago

With Quic, I think this is handled with the MAX_STREAMS frame. If initial_max_streams_bidi is set to 1 and the client creates a bi-dir stream, it won't be able to open a new bi-dir stream until the peer sends a MAX_STREAMS(MaxStreams=1) frame to allow the client to open a new stream.

The server could actually send this frame even if the first stream is still open.

So with Quic:

- client sends its STREAM_LAST frame
- server sends its STREAM_LAST frame
- server sends MAX_STREAMS(MaxStreams=1) to allow the client to open a new stream

We'd have to check this with Wireshark... If that's how it works, the question will be whether or not we should do the same with Slic. A two-way icerpc request would still require 3 frames here (2 x STREAM_LAST + 1 MAX_STREAMS).
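
For the Wireshark check, here is a minimal sketch, assuming .NET's System.Net.Quic API (certificate and listener setup omitted), of constraining the connection to a single inbound bidirectional stream, i.e. initial_max_streams_bidi = 1, so the MAX_STREAMS grants stand out in the capture:

```csharp
using System.Collections.Generic;
using System.Net.Quic;
using System.Net.Security;

// Sketch only: with this limit, the client can have at most one bidirectional stream
// open until the server sends a MAX_STREAMS frame granting more credit.
var serverConnectionOptions = new QuicServerConnectionOptions
{
    MaxInboundBidirectionalStreams = 1,
    DefaultStreamErrorCode = 0,
    DefaultCloseErrorCode = 0,
    ServerAuthenticationOptions = new SslServerAuthenticationOptions
    {
        // The ALPN protocol name is a placeholder; a real setup also needs a certificate.
        ApplicationProtocols = new List<SslApplicationProtocol> { new("icerpc") }
    }
};
```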

The question is whether or not this has a significant impact on icerpc latency.

bernardnormier commented 1 year ago

> There's also the following scenario: I'm done with reading (please stop sending) but I still have something to send you. In this scenario, the remote stream needs to be able to send StreamReadsCompleted before sending the data, right?

Yes, you're right.

bernardnormier commented 1 year ago

> The question is whether or not this has a significant impact on icerpc latency.

You mean if sending 3 frames instead of 2 has a significant impact on latency? I bet it does.

bernardnormier commented 1 year ago

> server sends MAX_STREAMS(MaxStreams=1) to allow the client to open a new stream

When does the server decide to send this frame? It doesn't know when the client reads the STREAM_LAST frame it sent, so arguably it could "bundle" it with the STREAM_LAST frame, which is in a way what we're talking about here with this StreamLastAndReadCompleted frame.

An alternative to this new StreamLastAndReadCompleted frame could be to send both StreamLast and StreamReadsCompleted together in a single write.

The logic being: when reading StreamLast from a bidir stream that is still writable, defer the sending of StreamReadsCompleted until the next write, and bundle it with the next Stream/StreamLast frame.
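
A rough sketch of that deferral logic, with hypothetical names (WriteSideSketch is illustrative, not the actual SlicStream implementation):

```csharp
using System;

// Illustrative only: when reads complete on a bidirectional stream whose write side is
// still open, the StreamReadsCompleted frame isn't sent on its own; it is bundled with
// the next Stream or StreamLast frame in a single write.
internal sealed class WriteSideSketch
{
    private readonly Action<string> _write; // stands in for a single buffered transport write
    private bool _readsCompletedPending;

    public WriteSideSketch(Action<string> write) => _write = write;

    // Called when the application has consumed all incoming data while writes are still open.
    public void DeferReadsCompleted() => _readsCompletedPending = true;

    public void WriteFrame(bool last)
    {
        string frame = last ? "StreamLast" : "Stream";
        if (_readsCompletedPending)
        {
            _readsCompletedPending = false;
            _write(frame + " + StreamReadsCompleted"); // two frames, one write
        }
        else
        {
            _write(frame);
        }
    }
}
```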

bentoi commented 1 year ago

> When does the server decide to send this frame? It doesn't know when the client reads the STREAM_LAST frame it sent, so arguably it could "bundle" it with the STREAM_LAST frame

Why would it matter to the server whether the client has read the STREAM_LAST frame? I expect a well-behaved Quic implementation sends the MAX_STREAMS frame once both the remote stream reads and writes are closed, regardless of the local stream state.

See also the stream states here. "Reads closed" = the "DataRead" state, "Writes closed" = the "DataRcvd" state.
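
For reference, these are the receiving-part stream states from RFC 9000 (Section 3.2); the comments map them to the terms used in this discussion:

```csharp
// Receiving-part stream states from RFC 9000, Section 3.2, with comments mapping them
// to the "reads closed" / "writes closed" terminology used in this discussion.
internal enum ReceivingStreamState
{
    Recv,        // still receiving data
    SizeKnown,   // a frame with the FIN bit was received; the final size is known
    DataRcvd,    // all data received and buffered: the peer's writes are closed
    DataRead,    // all buffered data consumed by the application: reads are closed
    ResetRecvd,  // a RESET_STREAM frame was received
    ResetRead    // the application observed the reset
}
```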

The remote stream should consider that reads are closed only once all the buffered data has been consumed by the application (DataRead state). It shouldn't consider reads closed when it received the STREAM_LAST frame (DataRcvd state).

I suspect this is the issue with .NET Quic (or msquic). It sends the MAX_STREAMS frame too soon, before the data buffered on the remote stream is consumed by the application.

We'll see what Microsoft has to say, it's also quite possible I'm missing something.

If we want to avoid the extra frame, the StreamLastAndReadsCompleted frame sounds simpler to me.

bernardnormier commented 1 year ago

> and bundle it with the next Stream/StreamLast frame.

One potential advantage over the StreamLastAndReadsCompleted frame (other than one fewer frame type) is that we'd communicate StreamReadsCompleted earlier in the event the server keeps writing for a while.

Question: we currently have this exchange:

StreamLast ->
StreamReadsCompleted <-
StreamLast <-

The client doesn't need to send a StreamReadsCompleted to the server:

StreamReadsCompleted ->

because it's the "accepter" (the server here) that flow-controls the streams?

And here, as soon as the accepter sends StreamLast, it can dispose the stream?

bentoi commented 1 year ago

Let's use "local stream" for the stream created by the client and "remote stream" for the stream accepted by the server.

Yes, the local stream never sends StreamReadsCompleted, see https://github.com/icerpc/icerpc-csharp/blob/8262c7083a59411234a9824fde3cc5f47d724130/src/IceRpc/Transports/Slic/Internal/SlicStream.cs#L149

> because it's the "accepter" (the server here) that flow-controls the streams?

Yes, it's the server that decides when the client can create a new stream.

The remote stream's reads and writes are closed when the remote stream has sent StreamReadsCompleted and StreamLast, respectively.

When the local stream receives these two frames, the connection bidirectional stream semaphore is released to unblock a CreateStreamAsync call.
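
A rough sketch of that bookkeeping (illustrative only, not the actual SlicConnection code; the per-stream flags are shown inline for brevity):

```csharp
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: the bidirectional stream semaphore is released, unblocking a
// pending CreateStreamAsync call, once the remote stream has sent both
// StreamReadsCompleted (reads closed) and StreamLast (writes closed).
internal sealed class BidirectionalStreamGate
{
    private readonly SemaphoreSlim _semaphore;
    private bool _remoteReadsClosed;
    private bool _remoteWritesClosed;

    public BidirectionalStreamGate(int maxBidirectionalStreams) =>
        _semaphore = new SemaphoreSlim(maxBidirectionalStreams);

    // Awaited before creating a new bidirectional stream.
    public Task WaitAsync(CancellationToken cancellationToken = default) =>
        _semaphore.WaitAsync(cancellationToken);

    public void OnStreamReadsCompletedReceived()
    {
        _remoteReadsClosed = true;
        ReleaseIfFullyClosed();
    }

    public void OnStreamLastReceived()
    {
        _remoteWritesClosed = true;
        ReleaseIfFullyClosed();
    }

    private void ReleaseIfFullyClosed()
    {
        // Only the call that observes both sides closed releases the semaphore.
        if (_remoteReadsClosed && _remoteWritesClosed)
        {
            _semaphore.Release();
        }
    }
}
```

With the proposed StreamLastAndReadCompleted frame, a single incoming frame would set both flags.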

bentoi commented 1 year ago

Here are the results with the addition of a StreamLastAndReadsCompleted frame. The results vary from run to run, but overall they follow this trend. I don't understand why the task count 16 and 64 benchmarks are slower with the change.

| Method | Job | Protocol | Mean | Error | StdDev | Median | Op/s | Ratio |
|---|---|---|---|---|---|---|---|---|
| InvokeLatency | 0.1.0 | icerpc | 141.51 us | 1.932 us | 1.807 us | 141.71 us | 7,066.7 | baseline |
| InvokeLatency | 0.1.1-slicfix | icerpc | 98.13 us | 0.397 us | 0.352 us | 98.11 us | 10,190.9 | 1.44x faster |
| InvokeRPSWithTaskCount1 | 0.1.0 | icerpc | 129.57 us | 2.094 us | 2.722 us | 128.80 us | 7,717.8 | baseline |
| InvokeRPSWithTaskCount1 | 0.1.1-slicfix | icerpc | 99.51 us | 1.952 us | 2.324 us | 99.20 us | 10,049.3 | 1.30x faster |
| InvokeRPSWithTaskCount4 | 0.1.0 | icerpc | 62.18 us | 0.902 us | 0.844 us | 62.49 us | 16,082.0 | baseline |
| InvokeRPSWithTaskCount4 | 0.1.1-slicfix | icerpc | 56.74 us | 0.763 us | 0.713 us | 56.82 us | 17,622.8 | 1.10x faster |
| InvokeRPSWithTaskCount16 | 0.1.0 | icerpc | 49.81 us | 0.955 us | 2.516 us | 48.93 us | 20,076.2 | baseline |
| InvokeRPSWithTaskCount16 | 0.1.1-slicfix | icerpc | 61.14 us | 1.198 us | 1.935 us | 61.64 us | 16,355.6 | 1.20x slower |
| InvokeRPSWithTaskCount64 | 0.1.0 | icerpc | 50.80 us | 1.347 us | 3.971 us | 49.73 us | 19,684.9 | baseline |
| InvokeRPSWithTaskCount64 | 0.1.1-slicfix | icerpc | 57.00 us | 1.390 us | 4.098 us | 56.69 us | 17,542.7 | 1.13x slower |

bentoi commented 1 year ago

gRPC vs IceRPC w/ Slic fix

| Method | Benchmark | Mean | Error | StdDev | Op/s |
|---|---|---|---|---|---|
| PingLatency | IceRPC(icerpc,tcp) | 101.63 us | 1.978 us | 1.850 us | 9,839.2 |
| PingLatency | gRPC(http/2,tcp) | 147.01 us | 1.438 us | 1.274 us | 6,802.1 |
| PingRPSWithTaskCount1 | IceRPC(icerpc,tcp) | 95.83 us | 1.214 us | 1.136 us | 10,435.6 |
| PingRPSWithTaskCount4 | IceRPC(icerpc,tcp) | 55.35 us | 0.955 us | 0.893 us | 18,066.7 |
| PingRPSWithTaskCount16 | IceRPC(icerpc,tcp) | 55.79 us | 1.053 us | 1.253 us | 17,923.0 |
| PingRPSWithTaskCount64 | IceRPC(icerpc,tcp) | 52.35 us | 1.009 us | 0.894 us | 19,101.0 |
| PingRPSWithTaskCount1 | gRPC(http/2,tcp) | 136.44 us | 0.846 us | 0.791 us | 7,329.2 |
| PingRPSWithTaskCount4 | gRPC(http/2,tcp) | 53.42 us | 0.957 us | 1.278 us | 18,720.4 |
| PingRPSWithTaskCount16 | gRPC(http/2,tcp) | 20.39 us | 0.402 us | 0.376 us | 49,046.8 |
| PingRPSWithTaskCount64 | gRPC(http/2,tcp) | 13.52 us | 0.267 us | 0.262 us | 73,944.5 |