Closed: thomaseizinger closed this pull request 6 months ago.
I added two commits that make the test pass. Let me know if this fix is okay :)
Seems reasonable to me!
Thanks for the PR! I'd like to discuss the motivation in some more depth. My knee-jerk reaction is discomfort with the presumptions this makes about the application's interests, and I wonder if your requirements could be addressed with a different pattern at the application layer instead.
> if the `recv_stream` at the opposite side is still alive, calling `finish` will work but if it is not, `finish` will fail.
The intention of the current API is that, if you don't want to stop incoming streams, then you always read them to the end, even if application-layer reasoning tells you to expect no further data.
> Whilst [not calling `finish`] may be fine for some usecases, I'd argue that it is not very clean
What cases do you feel `finish` must be called in? My intuition is that you should call it iff you want to know about the stream being stopped, and it seems like here you don't, so why call it at all?
> `finish` should in fact be idempotent.
This seems like a separate concern from whether streams stopped during `finish` should raise errors.
> we check whether all data we sent has been acknowledged and in that case, exit `finish` gracefully
Returning an error if and only if we haven't received some ACKs for the stream is extremely fragile. Packet loss or reordering might cause the `STOP_SENDING` frame to be received before any outstanding `ACK`s, even if the peer has indeed received everything. This creates a very similar problem to the original motivation, where `finish` may unpredictably fail, except it cannot be prevented by changing application-layer behavior.
> Thanks for the PR! I'd like to discuss the motivation in some more depth.
Thank you for the comment. I'll try to expand a bit.
> > Whilst [not calling `finish`] may be fine for some usecases, I'd argue that it is not very clean
>
> What cases do you feel `finish` must be called in? My intuition is that you should call it iff you want to know about the stream being stopped, and it seems like here you don't, so why call it at all?
In `rust-libp2p`, we abstract over many different kinds of streams and QUIC is just one of the possible transports. They all share the `AsyncRead` + `AsyncWrite` API and we forward `AsyncWrite::close` to `finish`.

The issue we had in https://github.com/libp2p/rust-libp2p/pull/4747 was resolved by not calling `close` (and thereby not calling `finish`). This works for now but it is a massive footgun for protocol developers on top of `rust-libp2p`.
It just doesn't seem correct to fail closing the stream even though everything worked correctly and we just had a timing issue with the remote on who sent what when.
> > we check whether all data we sent has been acknowledged and in that case, exit `finish` gracefully
>
> Returning an error if and only if we haven't received some ACKs for the stream is extremely fragile. Packet loss or reordering might cause the `STOP_SENDING` frame to be received before any outstanding `ACK`s, even if the peer has indeed received everything. This creates a very similar problem to the original motivation, where `finish` may unpredictably fail, except it cannot be prevented by changing application-layer behavior.
I am not sure this is true.
(4) is what I think should not fail even if the remote has already stopped the stream. There cannot be any missing ACKs because the server would not have responded if it didn't receive the entire request from (1). If we did in fact still send some data that the server hasn't acked yet, then closing should fail! In this case however, the two peers don't seem to agree on the protocol because the server apparently did not expect further data to be sent and dropped the stream without continuing to read.
> It just doesn't seem correct to fail closing the stream even though everything worked correctly and we just had a timing issue with the remote on who sent what when.
Could you instead ignore all errors from `finish`, perhaps unconditionally in your `AsyncWrite::close` binding? This would also protect you from a peer that closes the entire connection rather than just the stream.
Note that even a successful `finish` is not assurance that the remote application layer has seen all data you sent, so ignoring `finish` errors categorically doesn't put you at increased risk of data loss. Just like in TCP, the only way to know for sure that a peer's application layer has processed your data is with an application-layer ACK.
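(A minimal sketch of that pattern, assuming the async `finish` being discussed here; the helper name is illustrative and not part of quinn or rust-libp2p.)

```rust
use quinn::SendStream;

/// Illustrative helper: close our send side and treat any `finish` error as a
/// successful close. A failure here typically means the peer already stopped
/// the stream or closed the connection, so there is nothing useful left to do.
async fn close_ignoring_errors(send: &mut SendStream) {
    let _ = send.finish().await;
}
```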
> I am not sure this is true.
Successful receipt of a server's application-layer response does not guarantee successful receipt of all ACKs covering application data that prompted that response. This information may have been carried in separate packets.
For clarity, note that your example illustrates a particular bidirectional request/response pattern. Application protocols routinely employ other patterns, which may have fewer round-trips. For example, if you open, write to, and immediately finish a unidirectional stream, and the peer stops that stream, it is impossible to reliably predict whether or not an error will be raised under this PR's changes, because packets carrying ACK frames can easily be reordered or lost when the STOP_SENDING frame is received.
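As an illustration of that unidirectional pattern (a sketch only; the function and payload are hypothetical, and the quinn calls assume the API as it existed during this discussion):

```rust
use quinn::Connection;

// Open a unidirectional stream, write one message, and finish immediately.
// If the peer stops the stream right after reading, whether `finish` reports
// `Stopped` under this PR's change would depend on how its STOP_SENDING frame
// is ordered on the wire relative to the ACKs for our data.
async fn send_one_shot(conn: &Connection, msg: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
    let mut send = conn.open_uni().await?;
    send.write_all(msg).await?;
    send.finish().await?; // may race with an incoming STOP_SENDING
    Ok(())
}
```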
Thank you for the clarification!
It seems that the safe thing to do is to instantly close the stream once we are done with writing before we start reading data from the remote. That should make this race condition go away.
Still, it leaves me with an uneasy feeling that it matters when I close my stream. My argument would be that it shouldn't matter. Perhaps `finish` should not fail at all and always return `()`? Or maybe it should return something other than `Result`, so that users don't run into the trap of using `?` and aborting their entire task even though nothing actually failed?
Perhaps the last part is interesting to debate: Why is it an "error" to `finish` a stream when the remote has closed it already?
> It seems that the safe thing to do is to instantly close the stream once we are done with writing before we start reading data from the remote. That should make this race condition go away.
That's a more conventional pattern, though if the peer stops the stream immediately after reading data, and the connection was very low latency, you could still theoretically race.
> Perhaps `finish` should not fail at all and always return `()`?
This would be less of a footgun, and it's tempting. Maybe even make it synchronous, and thereby prevent similarly error-prone application sensitivity to transport-layer ACKs. This places it in similar territory to a successful `write` -- you don't know how far the data's gotten, just that it's out of your hands now.
Our intention was to support application protocols in which `STOP_SENDING` has greater significance. Because an application can ensure that streams are never stopped in normal operation by always reading streams to the end, a stopped stream could have special meaning, whether to indicate a logic error in the peer, or other application-layer significance. An asynchronous, fallible `finish` allows reliable detection of stops that would otherwise be lost.
> Or maybe it should return something other than `Result`
This is a good thought. On review of the API, I think any form of async, data-bearing `SendStream::finish` future might actually be redundant to a synchronous, infallible `finish` followed by waiting on `SendStream::stopped` if you really want to.
> This is a good thought. On review of the API, I think any form of async, data-bearing `SendStream::finish` future might actually be redundant to a synchronous, infallible `finish` followed by waiting on `SendStream::stopped` if you really want to.
Sounds good!
> > Or maybe it should return something other than `Result`
>
> This is a good thought. On review of the API, I think any form of async, data-bearing `SendStream::finish` future might actually be redundant to a synchronous, infallible `finish` followed by waiting on `SendStream::stopped` if you really want to.
I really like this!
So if I understand you correctly: We essentially design `finish` to be the same as `reset`: Synchronous and just changing the state of the underlying stream.

Looking at the API though, I'd have a different question: Why do we expose the `UnknownStream` error from `SendStream::reset` and other APIs? Surely that should not be hit when I am calling the function on a `SendStream`?
> We essentially design `finish` to be the same as `reset`: Synchronous and just changing the state of the underlying stream.
Yep!
> Surely that should not be hit when I am calling the function on a `SendStream`?
Per the doc comments, `UnknownStream` can arise if the stream becomes fully closed, such as after calling `finish` yourself. Perhaps we should rename it for clarity.
> > Surely that should not be hit when I am calling the function on a `SendStream`?
>
> Per the doc comments, `UnknownStream` can arise if the stream becomes fully closed, such as after calling `finish` yourself. Perhaps we should rename it for clarity.
But didn't we just conclude that such an error should only be returned from awaiting `stopped`? I guess the question is, should it be a synchronous error to `reset` a finished stream and vice versa?

What if we say instead:

- `reset` and `finish` just change local state and may queue a frame to be sent.
- `reset` or `finish` on a stopped stream does nothing.
- Errors are only returned from awaiting `stopped`.

This means if I really care that a stream finished gracefully, I first call `finish`, then await `stopped` and check that this didn't fail.

I don't know the internals that well though. Will calling `reset` and then `stopped` give me a stop-code that allows me to check that it was actually reset?
> But didn't we just conclude that such an error should only be returned from awaiting `stopped`?
To be clear, it's usually a "you're calling methods in a nonsensical order" error. `stopped` after `finish` or perhaps `reset` is the only time when it should arise in a non-buggy program. `reset` after `finish` is a corner case.
> should it be a synchronous error to `reset` a finished stream and vice versa?
Maybe it's useful for `finish` to return an error synchronously if used on a stream that you previously `reset`, since that indicates a logic error by the caller. Calling `reset` twice is also a logic error. I like the simplicity of making these cases a no-op, but maybe we should try to make caller bugs obvious here? I could go either way.
Conversely, `reset` after `finish` is perfectly sensible (read it as "give up on any retransmits"), and we should probably not return `UnknownStream` errors there ever, since otherwise the behavior is racey.
> Calling `reset` or `finish` on a stopped stream does nothing.
It feels a bit weird that `finish()` would succeed when `write(&[])` would fail with `Stopped`, but I think I'm satisfied that this is the most constructive take regardless. We can think of `finish` as morally just setting a bit on the most recent preceding `write`.
> await `stopped` and check that this didn't fail.
For this to work as written, we should probably replace `Err(UnknownStream)` with an `Ok(None)` case or similar. I think that's reasonable.
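A sketch of how the resulting caller-side pattern might look, assuming the redesign discussed here: a synchronous `finish` plus a `stopped()` whose `Ok(None)` arm replaces the old `UnknownStream` error. The signatures and semantics reflect the proposal, not the API as it existed at the time, and the helper is purely illustrative:

```rust
use quinn::SendStream;

// Hypothetical caller-side pattern: queue the FIN synchronously, then await
// `stopped()` only if we care whether the peer stopped the stream.
async fn finish_and_confirm(send: &mut SendStream) -> bool {
    // Under the proposed API this is synchronous: it just marks the stream
    // finished and queues a (possibly empty) STREAM frame carrying the FIN bit.
    let _ = send.finish();

    match send.stopped().await {
        // Proposed `Ok(None)` arm: the stream closed without ever being stopped.
        Ok(None) => true,
        // The peer sent STOP_SENDING with this application error code.
        Ok(Some(code)) => {
            eprintln!("peer stopped the stream with code {code}");
            false
        }
        // Connection-level failure.
        Err(err) => {
            eprintln!("connection lost: {err}");
            false
        }
    }
}
```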
> Will calling `reset` and then `stopped` give me a stop-code that allows me to check that it was actually reset?
Stream resets are advisory, and a peer that has received a RESET_STREAM frame from you probably won't bother to send STOP_SENDING, since you've already declared intent not to send. A stream is "actually reset" whenever you say it is, and what the peer makes of that is up to them.
> > But didn't we just conclude that such an error should only be returned from awaiting `stopped`?
>
> To be clear, it's usually a "you're calling methods in a nonsensical order" error.
Perhaps a `debug_assert` (and a warning?) is appropriate then? If I understand correctly, this can't be triggered by a remote so should be safe? It would be a no-op in production anyway though.
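Something like the following sketch, where the type and field names are invented purely for illustration and do not reflect quinn's internals:

```rust
/// Hypothetical "loud in debug, no-op in release" guard for a nonsensical
/// call order such as `finish` after `reset`.
struct StreamState {
    reset: bool,
    finished: bool,
}

impl StreamState {
    fn finish(&mut self) {
        if self.reset {
            // Caller bug: surface it in debug builds, do nothing in production
            // (a log warning could be emitted here as well).
            debug_assert!(false, "finish() called after reset()");
            return;
        }
        self.finished = true;
    }
}
```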
> We can think of `finish` as morally just setting a bit on the most recent preceding `write`.
So what happens when I call `finish` but there are no more pending writes? Will that create an empty frame and send it? Is that a thing in QUIC? Sorry, I am not too familiar with the details :grimacing:
> So what happens when I call `finish` but there are no more pending writes? Will that create an empty frame and send it?
Yep, exactly!
Okay, I might have a stab at this next week :)
Closing in favor of https://github.com/quinn-rs/quinn/pull/1840, where I've implemented something like what we discussed here, since the next release is very close. Feedback welcome!
At the moment, calling `finish` on a stream that has been stopped by the remote fails. This can lead to weird race conditions in application protocols. I added a test that showcases this. Depending on where you place the `send_stream.finish` in the client, the test either succeeds or fails. In other words, if the `recv_stream` at the opposite side is still alive, calling `finish` will work but if it is not, `finish` will fail.

In a real-world setting, we cannot influence how fast the remote is in processing our data and / or when they drop (i.e. stop) the stream. I think `quinn` should not fail `finish` if the other party has already stopped the stream. Instead, we should assume that they've read all the data they expected and thus no longer need the stream.

Unfortunately, the only workaround for this issue is to not call `finish` at all. Whilst that may be fine for some usecases, I'd argue that it is not very clean and `finish` should in fact be idempotent.

To fix this issue, we check whether all data we sent has been acknowledged and in that case, exit `finish` gracefully, pretending that the remote acked our `FIN`.
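For reference, a sketch of the two client-side placements being described (hypothetical request/response payloads and helper; quinn API as at the time of this PR). The second variant is the one that can fail once the server has already dropped its `recv_stream`:

```rust
use quinn::Connection;

type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

// Variant A: finish before reading the reply. The server is normally still
// reading our stream when the FIN arrives, so `finish` succeeds here.
async fn finish_before_read(conn: &Connection) -> Result<Vec<u8>> {
    let (mut send, mut recv) = conn.open_bi().await?;
    send.write_all(b"request").await?;
    send.finish().await?;
    read_all(&mut recv).await
}

// Variant B: finish after the reply has been read. By now the server may have
// dropped (stopped) its receive side, and `finish` can fail with `Stopped`.
async fn finish_after_read(conn: &Connection) -> Result<Vec<u8>> {
    let (mut send, mut recv) = conn.open_bi().await?;
    send.write_all(b"request").await?;
    let response = read_all(&mut recv).await?;
    send.finish().await?; // races with the server dropping its RecvStream
    Ok(response)
}

// Read the peer's side of the stream to the end.
async fn read_all(recv: &mut quinn::RecvStream) -> Result<Vec<u8>> {
    let mut data = Vec::new();
    let mut buf = [0u8; 4096];
    while let Some(n) = recv.read(&mut buf).await? {
        data.extend_from_slice(&buf[..n]);
    }
    Ok(data)
}
```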