quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0
3.76k stars 380 forks source link

clarification on stateless reset #1443

Closed bmwill closed 1 year ago

bmwill commented 1 year ago

I was hoping to get some clarification on how the stateless reset functionality works in quinn/QUIC. My understanding of the stateless reset feature is that if a connection exists between two endpoints A and B and one side of the connection state is lost for some reason (say endpoint A crashed and was restarted) then the side that lost state would be able to inform the other side and the old connection would be closed quickly on the side that didn't lose state. In my limited testing i've found this to not be the case though, and instead (via logs) i'm able to see that the side the lost sate (A) is sending a stateless reset but the other side (B) doesn't seem to handle the reset and close the connection on its side, instead the connection hangs until the configured idle timeout period at which point the connection is timed out.

Is my understanding of how the stateless reset is supposed to work not correct? Is there some configuration that I need to be doing differently in order to have this work properly? (eg do you need to use a non-default EndpointConfig)

Ralith commented 1 year ago

Your understanding is correct. However, stateless resets are cryptographically secured via EndpointConfig::reset_key, which ensures that an attacker can't kill your connection by injecting packets. Endpoints drop stateless resets that use an incorrect key, so if you want to send useful resets after a restart, you need to persist your reset key somehow.

bmwill commented 1 year ago

Got it, so the fact that a new, random reset key is being created each time an endpoint is created (because of using Endpoint::default) is leading to this behavior but it can be fixed by configuring an explicit reset key. I'll investigate doing that and report back if I still run into any snags. Thanks for the clarification!

bmwill commented 1 year ago

Ok I'm back! Even with a consistent reset key i seem to be running into some issues. Let me explain the setup:

Is there something else I'm missing or is there some issue with a restarted "server" side of a connection sending a stateless reset to the client side of a connection?

Ralith commented 1 year ago

Stateless resets should work in both directions. If anything, server-sent ones should be more robust. This is tested here: https://github.com/quinn-rs/quinn/blob/1d390af2facdf424c7feb470909fffc29a1dca6c/quinn-proto/src/tests/mod.rs#L166-L193

Do you have a packet capture and/or reproducible test case? It's interesting that you're seeing too-short packet errors, as stateless resets shouldn't trigger that, so we may have a bug there.

bmwill commented 1 year ago

Yes i do have a reproducible case which you can find here: https://github.com/bmwill/anemo/tree/stateless-reset

Specifically you can run the following:

# Start the server side
$ RUST_LOG=quinn=debug cargo run --bin stateless-reset -- server

# In another terminal start the client side
$ RUST_LOG=quinn=debug cargo run --bin stateless-reset

You can then Ctrl-C either side of the connection, and then restart it, in order to see the different behavior.

The reproducible case is written in terms of a p2p rpc framework which uses quinn internally. I can try to spend some time distilling this into a lower level repro case if needed (although that's going to take me a bit more time).

Ralith commented 1 year ago

I'm unable to reproduce using that example. Stateless resets are recognized in both directions.

bmwill commented 1 year ago

Hmm I'm able to reproduce the issue locally very consistently (and seeing the too-short packet errors), although i have seen it succeed once or twice. For what its worth I'm doing this on an M1 mac.

Is there any other information that I could capture locally that would help with the debugging?

Ralith commented 1 year ago

A packet capture could be revealing.

djc commented 1 year ago

FWIW, I can also reproduce this on my M1 Mac -- for me it seemed to work okay when I restarted the client, but when I restarted the server it got stuck resending stateless resets.

bmwill commented 1 year ago

@djc Glad you're able to reproduce this issue as well. I was worried for a second this was going to be a case of "this only doesn't work on my machine" :D

@Ralith I'll work on getting a packet capture

Ralith commented 1 year ago

It's definitely plausible for there to be quirks on macOS, as we have slightly different low level I/O code paths there. Still, hard to imagine how this behavior would result. Looking forward to seeing details.

bmwill commented 1 year ago

Ok here is a packet capture for the scenario where the server is restarted, tries to send a stateless reset, but the client side is unable to process the reset. https://gist.github.com/bmwill/bfa7a2b64cd5e05c69f40837434a227b

https://gist.github.com/bmwill/bfa7a2b64cd5e05c69f40837434a227b#file-gistfile1-txt-L174-L178 this is the last packet sent from the server-side before it was killed.

https://gist.github.com/bmwill/bfa7a2b64cd5e05c69f40837434a227b#file-gistfile1-txt-L270-L279 is the first time that the server starts sending packets again once its been restarted, and this should be a capture of the reset packets being sent to the client. https://gist.github.com/bmwill/bfa7a2b64cd5e05c69f40837434a227b#file-gistfile1-txt-L290-L299 is the second set of reset packets sent from the server to the client.

Finally after the idle-timeout period the client kills the old connection and then begins re-establishing a new connection starting at https://gist.github.com/bmwill/bfa7a2b64cd5e05c69f40837434a227b#file-gistfile1-txt-L300.

Ralith commented 1 year ago

Any chance you can provide that in pcap format, with accompanying Quinn trace logs?

bmwill commented 1 year ago

Yes of course, sorry. I've uploaded a new set of files here: https://github.com/bmwill/anemo/commit/7ea31e2b5235cc1dbe46db761cfe5d2a69a1bdc6.

There's also versions of client/server logs that have color codes if that's useful.

Ralith commented 1 year ago

Thanks! That should be plenty for now.

Ralith commented 1 year ago

This was due to PING-only packets generating very small stateless resets, combined with us not checking for stateless resets before bailing out if a packet is too small to be decoded. Fix for both issues up in https://github.com/quinn-rs/quinn/pull/1446.

bmwill commented 1 year ago

Thanks for the quick fix, I've tested it locally and stateless resets from the server now work perfectly!