clarification on stateless reset

bmwill commented 1 year ago

I was hoping to get some clarification on how the stateless reset functionality works in quinn/QUIC. My understanding of the stateless reset feature is that if a connection exists between two endpoints A and B and one side of the connection state is lost for some reason (say endpoint A crashed and was restarted) then the side that lost state would be able to inform the other side and the old connection would be closed quickly on the side that didn't lose state. In my limited testing i've found this to not be the case though, and instead (via logs) i'm able to see that the side the lost sate (A) is sending a stateless reset but the other side (B) doesn't seem to handle the reset and close the connection on its side, instead the connection hangs until the configured idle timeout period at which point the connection is timed out.

Is my understanding of how the stateless reset is supposed to work not correct? Is there some configuration that I need to be doing differently in order to have this work properly? (eg do you need to use a non-default EndpointConfig)

Ralith commented 1 year ago

Your understanding is correct. However, stateless resets are cryptographically secured via EndpointConfig::reset_key, which ensures that an attacker can't kill your connection by injecting packets. Endpoints drop stateless resets that use an incorrect key, so if you want to send useful resets after a restart, you need to persist your reset key somehow.

bmwill commented 1 year ago

Got it, so the fact that a new, random reset key is being created each time an endpoint is created (because of using Endpoint::default) is leading to this behavior but it can be fixed by configuring an explicit reset key. I'll investigate doing that and report back if I still run into any snags. Thanks for the clarification!

bmwill commented 1 year ago

Ok I'm back! Even with a consistent reset key i seem to be running into some issues. Let me explain the setup:

Endpoint A and B both are using a fixed reset key of [0; 64]
A establishes a connection with B (so A acts as the client and B as the server)

If A restarts it'll end up seeing packets from B for the old connection and send a stateless reset. B properly processes the stateless reset and immediately closes the old connection.

2022-11-15T22:17:39.299060Z DEBUG quinn_proto::endpoint: sending stateless reset for c5adab5a34cb7941 to [::1]:8080
2022-11-15T22:17:39.299225Z DEBUG quinn_proto::connection: got stateless reset

If B restarts it'll end up seeing packets from A for the old connection and send a stateless reset. A then seems to reject the stateless reset and doesn't close the old connection till the idle timeout period has elapsed.

2022-11-15T22:20:41.833882Z DEBUG quinn_proto::endpoint: sending stateless reset for 7fcaf33e19482a04 to [::1]:8081
2022-11-15T22:20:41.834624Z TRACE quinn_proto::connection: unable to complete packet decoding: invalid header: packet too short to extract header protection sample
2022-11-15T22:20:41.834752Z TRACE quinn_proto::connection: unable to complete packet decoding: invalid header: packet too short to extract header protection sample

Is there something else I'm missing or is there some issue with a restarted "server" side of a connection sending a stateless reset to the client side of a connection?

Ralith commented 1 year ago

Stateless resets should work in both directions. If anything, server-sent ones should be more robust. This is tested here: https://github.com/quinn-rs/quinn/blob/1d390af2facdf424c7feb470909fffc29a1dca6c/quinn-proto/src/tests/mod.rs#L166-L193

Do you have a packet capture and/or reproducible test case? It's interesting that you're seeing too-short packet errors, as stateless resets shouldn't trigger that, so we may have a bug there.

bmwill commented 1 year ago

Yes i do have a reproducible case which you can find here: https://github.com/bmwill/anemo/tree/stateless-reset

Specifically you can run the following:

# Start the server side
$ RUST_LOG=quinn=debug cargo run --bin stateless-reset -- server

# In another terminal start the client side
$ RUST_LOG=quinn=debug cargo run --bin stateless-reset

You can then Ctrl-C either side of the connection, and then restart it, in order to see the different behavior.

The reproducible case is written in terms of a p2p rpc framework which uses quinn internally. I can try to spend some time distilling this into a lower level repro case if needed (although that's going to take me a bit more time).

Ralith commented 1 year ago

I'm unable to reproduce using that example. Stateless resets are recognized in both directions.

bmwill commented 1 year ago

Hmm I'm able to reproduce the issue locally very consistently (and seeing the too-short packet errors), although i have seen it succeed once or twice. For what its worth I'm doing this on an M1 mac.

Is there any other information that I could capture locally that would help with the debugging?

Ralith commented 1 year ago

A packet capture could be revealing.

djc commented 1 year ago

FWIW, I can also reproduce this on my M1 Mac -- for me it seemed to work okay when I restarted the client, but when I restarted the server it got stuck resending stateless resets.

bmwill commented 1 year ago

@djc Glad you're able to reproduce this issue as well. I was worried for a second this was going to be a case of "this only doesn't work on my machine" :D

@Ralith I'll work on getting a packet capture

Ralith commented 1 year ago

It's definitely plausible for there to be quirks on macOS, as we have slightly different low level I/O code paths there. Still, hard to imagine how this behavior would result. Looking forward to seeing details.

bmwill commented 1 year ago

Ok here is a packet capture for the scenario where the server is restarted, tries to send a stateless reset, but the client side is unable to process the reset. https://gist.github.com/bmwill/bfa7a2b64cd5e05c69f40837434a227b

https://gist.github.com/bmwill/bfa7a2b64cd5e05c69f40837434a227b#file-gistfile1-txt-L174-L178 this is the last packet sent from the server-side before it was killed.

https://gist.github.com/bmwill/bfa7a2b64cd5e05c69f40837434a227b#file-gistfile1-txt-L270-L279 is the first time that the server starts sending packets again once its been restarted, and this should be a capture of the reset packets being sent to the client. https://gist.github.com/bmwill/bfa7a2b64cd5e05c69f40837434a227b#file-gistfile1-txt-L290-L299 is the second set of reset packets sent from the server to the client.

Finally after the idle-timeout period the client kills the old connection and then begins re-establishing a new connection starting at https://gist.github.com/bmwill/bfa7a2b64cd5e05c69f40837434a227b#file-gistfile1-txt-L300.

Ralith commented 1 year ago

Any chance you can provide that in pcap format, with accompanying Quinn trace logs?

bmwill commented 1 year ago

Yes of course, sorry. I've uploaded a new set of files here: https://github.com/bmwill/anemo/commit/7ea31e2b5235cc1dbe46db761cfe5d2a69a1bdc6.

stateless-reset.pcap is the packet capture
client contains the client logs with RUST_LOG=quinn=trace
server contains the client logs with RUST_LOG=quinn=trace

There's also versions of client/server logs that have color codes if that's useful.

Ralith commented 1 year ago

Thanks! That should be plenty for now.

Ralith commented 1 year ago

This was due to PING-only packets generating very small stateless resets, combined with us not checking for stateless resets before bailing out if a packet is too small to be decoded. Fix for both issues up in https://github.com/quinn-rs/quinn/pull/1446.

bmwill commented 1 year ago

Thanks for the quick fix, I've tested it locally and stateless resets from the server now work perfectly!

quinn-rs / quinn

clarification on stateless reset #1443