quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0

fix: don't fail when `finish` is called on a stopped stream #1699

Closed thomaseizinger closed 6 months ago

thomaseizinger commented 1 year ago

At the moment, calling finish on a stream that has been stopped by the remote fails. This can lead to weird race conditions in application protocols. I added a test that showcases this. Depending on where you place the send_stream.finish in the client, the test either succeeds or fails. In other words, if the recv_stream at the opposite side is still alive, calling finish will work but if it is not, finish will fail.

In a real-world setting, we cannot influence how quickly the remote processes our data and/or when it drops (i.e. stops) the stream. I think quinn should not fail finish if the other party has already stopped the stream. Instead, we should assume that they've read all the data they expected and thus no longer need the stream.

Unfortunately, the only workaround for this issue is to not call finish at all. Whilst that may be fine for some use cases, I'd argue that it is not very clean and finish should in fact be idempotent.

To fix this issue, we check whether all the data we sent has been acknowledged and, in that case, exit finish gracefully, pretending that the remote acked our FIN.
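The proposed check can be sketched as a small state machine. This is a minimal illustration with made-up names (`SendState`, `FinishResult`, the field names), not quinn's actual internals: if the peer sent STOP_SENDING but every byte we wrote has already been acknowledged, `finish` succeeds as if the peer had acked our FIN.

```rust
#[derive(Debug, PartialEq)]
enum FinishResult {
    Ok,           // stream finished (or treated as finished) gracefully
    Stopped(u64), // peer stopped the stream with this error code
}

// Illustrative stand-in for a send stream's bookkeeping.
struct SendState {
    bytes_sent: u64,
    bytes_acked: u64,
    stop_code: Option<u64>, // Some(code) once STOP_SENDING was received
}

fn finish(state: &SendState) -> FinishResult {
    match state.stop_code {
        // Everything we sent was acked: assume the peer read all it
        // wanted and exit gracefully despite the stop.
        Some(_) if state.bytes_acked == state.bytes_sent => FinishResult::Ok,
        // Unacked data remains: surface the stop as an error.
        Some(code) => FinishResult::Stopped(code),
        // Stream was never stopped: finish normally.
        None => FinishResult::Ok,
    }
}
```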

thomaseizinger commented 1 year ago

I added two commits that make the test pass. Let me know if this fix is okay :)

djc commented 1 year ago

Seems reasonable to me!

Ralith commented 1 year ago

Thanks for the PR! I'd like to discuss the motivation in some more depth. My knee-jerk reaction is discomfort with the presumptions this makes about the application's interests, and I wonder if your requirements could be addressed with a different pattern at the application layer instead.

if the recv_stream at the opposite side is still alive, calling finish will work but if it is not, finish will fail.

The intention of the current API is that, if you don't want to stop incoming streams, then you always read them to the end, even if application-layer reasoning tells you to expect no further data.

Whilst [not calling finish] may be fine for some use cases, I'd argue that it is not very clean

What cases do you feel finish must be called in? My intuition is that you should call it iff you want to know about the stream being stopped, and it seems like here you don't, so why call it at all?

finish should in fact be idempotent.

This seems like a separate concern from whether streams stopped during finish should raise errors.

we check whether all the data we sent has been acknowledged and, in that case, exit finish gracefully

Returning an error if and only if we haven't received some ACKs for the stream is extremely fragile. Packet loss or reordering might cause the STOP_SENDING frame to be received before any outstanding ACKs, even if the peer has indeed received everything. This creates a very similar problem to the original motivation, where finish may unpredictably fail, except it cannot be prevented by changing application-layer behavior.

thomaseizinger commented 1 year ago

Thanks for the PR! I'd like to discuss the motivation in some more depth.

Thank you for the comment. I'll try to expand a bit.

Whilst [not calling finish] may be fine for some use cases, I'd argue that it is not very clean

What cases do you feel finish must be called in? My intuition is that you should call it iff you want to know about the stream being stopped, and it seems like here you don't, so why call it at all?

In rust-libp2p, we abstract over many different kinds of streams and QUIC is just one of the possible transports. They all share the AsyncRead + AsyncWrite API and we forward AsyncWrite::close to finish.

The issue we had in https://github.com/libp2p/rust-libp2p/pull/4747 was resolved by not calling close (and thereby not calling finish). This works for now but it is a massive footgun for protocol developers on top of rust-libp2p.

It just doesn't seem correct to fail closing the stream even though everything worked correctly and we just had a timing issue with the remote on who sent what when.

we check whether all the data we sent has been acknowledged and, in that case, exit finish gracefully

Returning an error if and only if we haven't received some ACKs for the stream is extremely fragile. Packet loss or reordering might cause the STOP_SENDING frame to be received before any outstanding ACKs, even if the peer has indeed received everything. This creates a very similar problem to the original motivation, where finish may unpredictably fail, except it cannot be prevented by changing application-layer behavior.

I am not sure this is true.

  1. Client sends some data
  2. Server receives data & responds
  3. Client receives response
  4. Client finishes stream

(4) is what I think should not fail even if the remote has already stopped the stream. There cannot be any missing ACKs because the server would not have responded if it didn't receive the entire request from (1). If we did in fact still send some data that the server hasn't acked yet, then closing should fail! In this case however, the two peers don't seem to agree on the protocol because the server apparently did not expect further data to be sent and dropped the stream without continuing to read.

Ralith commented 1 year ago

It just doesn't seem correct to fail closing the stream even though everything worked correctly and we just had a timing issue with the remote on who sent what when.

Could you instead ignore all errors from finish, perhaps unconditionally in your AsyncWrite::close binding? This would also protect you from a peer that closes the entire connection rather than just the stream.

Note that even a successful finish is not assurance that the remote application layer has seen all data you sent, so ignoring finish errors categorically doesn't put you at increased risk of data loss. Just like in TCP, the only way to know for sure that a peer's application layer has processed your data is with an application-layer ACK.
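The suggested workaround can be sketched as follows. `Finishable` is a stand-in trait for quinn's SendStream (whose real `finish` is async and returns a richer error type), and `close` is a hypothetical reduction of an `AsyncWrite::close` binding to its synchronous core:

```rust
// Stand-in for quinn's SendStream; the error type is simplified.
trait Finishable {
    fn finish(&mut self) -> Result<(), String>;
}

// Hypothetical close binding: discard the error unconditionally. Even a
// successful finish doesn't prove the peer's application read the data,
// so nothing of value is lost by ignoring the failure case.
fn close<S: Finishable>(stream: &mut S) {
    let _ = stream.finish();
}

// A stream the peer has already stopped: finish always errors.
struct StoppedStream;

impl Finishable for StoppedStream {
    fn finish(&mut self) -> Result<(), String> {
        Err("stream stopped by peer".into())
    }
}
```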

I am not sure this is true.

Successful receipt of a server's application-layer response does not guarantee successful receipt of all ACKs covering application data that prompted that response. This information may have been carried in separate packets.

For clarity, note that your example illustrates a particular bidirectional request/response pattern. Application protocols routinely employ other patterns, which may have fewer round-trips. For example, if you open, write to, and immediately finish a unidirectional stream, and the peer stops that stream, it is impossible to reliably predict whether or not an error will be raised under this PR's changes, because packets carrying ACK frames can easily be reordered or lost when the STOP_SENDING frame is received.

thomaseizinger commented 1 year ago

Thank you for the clarification!

It seems that the safe thing to do is to instantly close the stream once we are done with writing before we start reading data from the remote. That should make this race condition go away.

Still, it leaves me with an uneasy feeling that it matters when I close my stream. My argument would be that it shouldn't matter. Perhaps finish should not fail at all and always return ()? Or maybe it should return something other than Result, to avoid users running into the trap of using ? and aborting their entire task even though nothing actually failed?

Perhaps the last part is interesting to debate: Why is it an "error" to finish a stream when the remote has closed it already?

Ralith commented 1 year ago

It seems that the safe thing to do is to instantly close the stream once we are done with writing before we start reading data from the remote. That should make this race condition go away.

That's a more conventional pattern, though if the peer stops the stream immediately after reading data, and the connection was very low latency, you could still theoretically race.

Perhaps finish should not fail at all and always return ()?

This would be less of a footgun, and it's tempting. Maybe even make it synchronous, and thereby prevent similarly error-prone application sensitivity to transport-layer ACKs. This places it in similar territory to a successful write -- you don't know how far the data's gotten, just that it's out of your hands now.

Our intention was to support application protocols in which STOP_SENDING has greater significance. Because an application can ensure that streams are never stopped in normal operation by always reading streams to the end, a stopped stream could have special meaning, whether to indicate a logic error in the peer, or other application layer significance. An asynchronous, fallible finish allows reliable detection of stops that would otherwise be lost.

Or maybe it should return something other than Result

This is a good thought. On review of the API, I think any form of async, data-bearing SendStream::finish future might actually be redundant to a synchronous, infallible finish followed by waiting on SendStream::stopped if you really want to.
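The split being discussed might look roughly like this. These are hypothetical signatures for illustration, not quinn's shipped API: `finish` becomes synchronous and infallible, and a caller that cares about the stream's fate separately inspects `stopped` (here modeled as a plain method rather than a future):

```rust
struct SendStream {
    fin_sent: bool,
    stop_code: Option<u64>,
}

impl SendStream {
    /// Synchronous and infallible: queues the FIN and hands the rest to
    /// the transport, much like a successful write.
    fn finish(&mut self) {
        self.fin_sent = true;
    }

    /// Stand-in for awaiting `stopped`: Some(code) if the peer stopped
    /// the stream, None if it closed (or will close) without a stop.
    fn stopped(&self) -> Option<u64> {
        self.stop_code
    }
}
```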

djc commented 1 year ago

This is a good thought. On review of the API, I think any form of async, data-bearing SendStream::finish future might actually be redundant to a synchronous, infallible finish followed by waiting on SendStream::stopped if you really want to.

Sounds good!

thomaseizinger commented 1 year ago

Or maybe it should return something other than Result

This is a good thought. On review of the API, I think any form of async, data-bearing SendStream::finish future might actually be redundant to a synchronous, infallible finish followed by waiting on SendStream::stopped if you really want to.

I really like this!

So if I understand you correctly: We essentially design finish to be the same as reset: Synchronous and just changing the state of the underlying stream.

Looking at the API though, I'd have a different question: Why do we expose the UnknownStream error from SendStream::reset and other APIs? Surely that should not be hit when I am calling the function on a SendStream?

Ralith commented 1 year ago

We essentially design finish to be the same as reset: Synchronous and just changing the state of the underlying stream.

Yep!

Surely that should not be hit when I am calling the function on a SendStream?

Per the doc comments, UnknownStream can arise if the stream becomes fully closed, such as after calling finish yourself. Perhaps we should rename it for clarity.

thomaseizinger commented 1 year ago

Surely that should not be hit when I am calling the function on a SendStream?

Per the doc comments, UnknownStream can arise if the stream becomes fully closed, such as after calling finish yourself. Perhaps we should rename it for clarity.

But didn't we just conclude that such an error should only be returned from awaiting stopped? I guess the question is, should it be a synchronous error to reset a finished stream and vice versa?

What if we say instead:

- Calling reset or finish on a stopped stream does nothing.

This means if I really care that a stream finished gracefully, I first call finish, then await stopped and check that this didn't fail.

I don't know the internals that well though. Will calling reset and then stopped give me a stop-code that allows me to check that it was actually reset?

Ralith commented 1 year ago

But didn't we just conclude that such an error should only be returned from awaiting stopped?

To be clear, it's usually a "you're calling methods in a nonsensical order" error. stopped after finish or perhaps reset is the only time when it should arise in a non-buggy program. reset after finish is a corner case.

should it be a synchronous error to reset a finished stream and vice versa?

Maybe it's useful for finish to return an error synchronously if used on a stream that you previously reset, since that indicates a logic error by the caller. Calling reset twice is also a logic error. I like the simplicity of making these cases a no-op, but maybe we should try to make caller bugs obvious here? I could go either way.

Conversely, reset after finish is perfectly sensible (read it as "give up on any retransmits"), and we should probably never return UnknownStream errors there, since otherwise the behavior is racy.

Calling reset or finish on a stopped stream does nothing.

It feels a bit weird that finish() would succeed when write(&[]) would fail with Stopped, but I think I'm satisfied that this is the most constructive take regardless. We can think of finish as morally just setting a bit on the most recent preceding write.

await stopped and check that this didn't fail.

For this to work as written, we should probably replace Err(UnknownStream) with an Ok(None) case or similar. I think that's reasonable.

Will calling reset and then stopped give me a stop-code that allows me to check that it was actually reset?

Stream resets are advisory, and a peer that has received a RESET_STREAM frame from you probably won't bother to send STOP_SENDING, since you've already declared intent not to send. A stream is "actually reset" whenever you say it is, and what the peer makes of that is up to them.

thomaseizinger commented 1 year ago

But didn't we just conclude that such an error should only be returned from awaiting stopped?

To be clear, it's usually a "you're calling methods in a nonsensical order" error.

Perhaps a debug_assert (and a warning?) is appropriate then? If I understand correctly, this can't be triggered by a remote, so it should be safe? It would be a no-op in production anyway.
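An illustrative sketch of that idea, using a double `reset` as the caller bug (all names here are hypothetical, not quinn's internals): the assertion fires in debug builds, while release builds silently treat the redundant call as a no-op.

```rust
#[derive(PartialEq)]
enum StreamState {
    Open,
    Reset,
}

struct ResettableStream {
    state: StreamState,
}

impl ResettableStream {
    fn reset(&mut self) {
        // A remote peer can't drive us into this branch; only a local
        // logic error calls reset twice, so asserting here is safe.
        // debug_assert! compiles to nothing in release builds.
        debug_assert!(
            self.state != StreamState::Reset,
            "reset called twice: caller bug"
        );
        self.state = StreamState::Reset;
    }
}
```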

We can think of finish as morally just setting a bit on the most recent preceding write.

So what happens when I call finish but there are no more pending writes? Will that create an empty frame and send it? Is that a thing in QUIC? Sorry, I am not too familiar with the details :grimacing:

Ralith commented 1 year ago

So what happens when I call finish but there are no more pending writes? Will that create an empty frame and send it?

Yep, exactly!

thomaseizinger commented 1 year ago

Okay, I might have a stab at this next week :)

Ralith commented 6 months ago

Closing in favor of https://github.com/quinn-rs/quinn/pull/1840, where I've implemented something like what we discussed here, since the next release is very close. Feedback welcome!