private-octopus / picoquic

Minimal implementation of the QUIC protocol
MIT License

Orderly Shutdown #1199

Closed victorstewart closed 2 years ago

victorstewart commented 3 years ago

in the ideal case, the client and server would cooperate in the shutdown procedure to make sure all pending data was read, acted upon, sent and ACK-ed, before terminating either process. Given that server process upgrades are frequent in this continuous integration era, this procedure occurs frequently, thus it's important to do everything that can be done to not lose data unnecessarily.

on the server side this would look something like... 1) stop accepting new connections and new streams, 2) fin all streams while pushing out all buffered packets, 3) wait on ACKs for all packets for each connection, 4) once all packets on a connection have been ACK-ed, close the connection.

on the client side... if the server shuts down its side of the stream (assuming all bi-directional streams here), the client should interpret that as the initiation of shutdown and fin its write side of the stream after pushing the last data onto it... same as is done when recv returns 0 after one side of a TCP socket shuts down its write pipe with shutdown(fd, SHUT_WR).

now with the tentative_max_number_connections parameter, it's technically possible to stop accepting new connections by setting it to 0, but not streams. and an implementer could store all connections and stream ids in application level data structures and mostly trigger this process, but again there's no way to know when connections have been reliably drained (aka all packets ACK-ed).

so i think this warrants inclusion in the library.

the technical implementation is trivial, the only open ended question to me is what to do with uncooperative or adversarial clients who continue writing on their streams and never fin them. I guess the only solution is some kind of master timer... maybe set to some multiple of RTT. the implementor could also provide a maximum timeout.

I envision an interface like below:

int picoquic_drain_then_shutdown(picoquic_quic_t* quic, uint64_t timeAllowanceMs)
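A minimal sketch of the semantics such an API would imply, assuming a connection is only closed cleanly once all of its packets have been ACKed, with the time allowance as a hard cap. All types and names here are hypothetical mocks, not actual picoquic API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock, not picoquic API: each connection tracks how many
 * packets are still awaiting ACKs. */
typedef struct {
    uint64_t unacked_packets; /* sent but not yet ACKed */
    bool closed;
} mock_cnx_t;

/* One pass of the drain loop at time now_ms. Returns how many
 * connections were closed cleanly (fully ACKed); connections still
 * holding unacked packets are force-closed once the deadline hits. */
size_t drain_pass(mock_cnx_t *cnx, size_t n,
                  uint64_t now_ms, uint64_t deadline_ms)
{
    size_t clean = 0;
    for (size_t i = 0; i < n; i++) {
        if (cnx[i].closed)
            continue;
        if (cnx[i].unacked_packets == 0) {
            cnx[i].closed = true; /* orderly: all data delivered */
            clean++;
        } else if (now_ms >= deadline_ms) {
            cnx[i].closed = true; /* deadline hit: abrupt close */
        }
    }
    return clean;
}
```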

victorstewart commented 3 years ago

it might be easier for the server to be time agnostic as well, and the implementor simply manages the timeout and terminates the process if it expires before we flip an "orderly shutdown complete" flag

huitema commented 3 years ago

Some things are simple and clear, and some are not. The simple and clear part is "refusing new connections". The server can just reply "I am busy" and the client will try again after a short while. The "FIN all streams" part is less clear. In HTTP like protocols, a stream is something like a web page or a download. The server cannot just say "I am done" and "FIN the stream" -- that would mean load half a page, or half the bits in a JPEG, and the user impact would not be good. The server also has no clean way to "refuse new streams": it has previously given credits for a number of new streams to the client, and it cannot renege on that without closing the connection. The only application independent way to do an orderly closing would be:

1) Refuse new connection attempts, respond busy.
2) Stop giving credits for new streams.
3) Finish serving the existing streams.
4) Then close.

But that's clearly not optimal. It gets better if the application can cooperate. In that case, the application server can tell the client that it wants to close, and the client can do something orderly like start a connection to another server in the pool and when that's done engage in an orderly shutdown.

In fact, even responding "I am busy" is not exactly what we want. The signal is ambiguous. It does not really mean "this particular instance in the server pool is going to be cycled, just try a new connection". What it does mean is "there is some kind of temporary overload, please wait a while and then retry".

There are two protocol elements that could be used to stop ongoing streams in a somewhat clean manner. "Stop sending" is used to tell the peer to stop sending data on one stream, such as "the browser has moved away from this page, no point continuing streaming this video". "Reset stream" indicates that the peer has to abruptly close a stream due to some kind of local problem, and will effectively "FIN the stream". Both elements carry a reason code to explain exactly why they are sent, but the reason code is application specific.

I guess we could establish an application control shutdown with something like:

int picoquic_drain_then_shutdown(picoquic_quic_t* quic, uint64_t timeAllowanceMs, uint64_t reasonCode)

The implementation would be:

1) Stop accepting new connections, respond busy if one arrives. In an ideal setup, this is coordinated with a load balancer stopping feeding load to the server that's being recycled, so the busy signal is probably OK.

2) Continue feeding open streams for the specified time allowance. (Detail: all picoquic APIs use microseconds, so probably better to use microseconds instead of milliseconds. Or maybe pass a target close time.) Maybe stop providing credits for opening more streams. Ideally, the application is telling the client about the shutdown, so the client should stop opening new streams on its own.

3) After the delay, start cancelling existing streams using "RESET STREAM" with the reason code specified in the API. Maybe also send "STOP SENDING" with the specified reason in response to any stream data sent by the client.

4) When all streams are closed or reset, send "Application Close" with the specified reason code. In fact, maybe do that without going through step (3), because Application Close has pretty much the same effect as a series of stream resets, so why bother.
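The four steps above (with the step-3 shortcut) can be sketched as a small phase machine. These names are purely illustrative, not picoquic API:

```c
#include <stdint.h>

/* Illustrative phase machine, not picoquic API. DRAINING covers steps
 * 1-2 (refuse new connections, keep feeding open streams), RESETTING
 * is step 3, CLOSED is step 4 (Application Close). */
typedef enum {
    SRV_DRAINING,
    SRV_RESETTING,
    SRV_CLOSED
} srv_phase_t;

srv_phase_t srv_next_phase(srv_phase_t p, uint64_t now_us,
                           uint64_t deadline_us, int open_streams)
{
    switch (p) {
    case SRV_DRAINING:
        if (open_streams == 0)
            return SRV_CLOSED;    /* nothing left: skip step 3 */
        if (now_us >= deadline_us)
            return SRV_RESETTING; /* allowance spent: start resets */
        return SRV_DRAINING;
    case SRV_RESETTING:
        /* Since Application Close has much the same effect as a series
         * of stream resets, this phase could also be skipped. */
        return open_streams == 0 ? SRV_CLOSED : SRV_RESETTING;
    default:
        return SRV_CLOSED;
    }
}
```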

Makes me think that maybe we should write a draft in which we explain the scenario and look at solutions. Do we need an application independent message to tell the application that the server is shutting down, signalling the application specific reason code and the delay? Do we need to extend the load-balancer draft?

victorstewart commented 3 years ago

this procedure requires buy-in from the application layer, if for no reason other than that when the opposing peer receives notification (at whatever layer) that we are transitioning to drain and shutdown, that peer might have partial or pending writes on a stream... thus it can't be a transport- or library-only construct.

so the usage of picoquic_drain_then_shutdown requires preexistence in the application protocol. the only question is whether the peer is adversarial or not, hence a timeout.

true, the server responding "i am busy" is not optimal. firstly, since there is no connection established and thus no application layer, without a transport-level server state of "i am draining and shutting down" there is no way to correctly inform the peer of what's going on. but this is largely a routing/networking failure of the implementer that should be handled by removing itself from the load balancing pool, or by no longer listening on its anycast address (having previously transitioned its connections to its Server Preferred Address)... the only unavoidable case is if the server were targeted directly, because even with multiple listening sockets the OS will still load balance the traffic to the same socket, so there's no avoiding it. but that edge case seems irrelevant and tolerable to me.

we could send a MAX_STREAMS frame set to the current number of streams, de facto stopping the peer from opening more. but there's a race condition here, and it's not strictly necessary. if the peer happens to open one... we just send it the same "enter drain and shutdown phase" "frame" telling it to hurry up.

STOP_SENDING doesn’t have the right semantics, because we still want to allow further stream communication to occur.

An endpoint uses a STOP_SENDING frame (type=0x05) to communicate that incoming data is being discarded on receipt at application request.

3 Orderly Shutdown States:
State 1 = Peer A tells Peer B it wants to drain the connection and promises not to initiate any more communication, but will fully respond to new prompts
State 2 = Peer B tells Peer A it has written everything it will write to the stream
State 3 = Peer A tells Peer B it has finished responding

how about something like this....

1) Peer A begins orderly shutdown by notifying State 1 on every open stream (and stops accepting new connections if server)

2) Peers B receive notification of State 1 and finish writing to their streams then signal State 2

3) for each stream, Peer A receives and replies to all final data sent by Peer B and once it gets to the State 2 frame it signals State 3 on the stream

4) as Peer A and Peers B transition to State 3 on each stream, once all packets on a stream have been ACK-ed, Peer A resets the stream

5) once every stream of a connection has been reset, or if the connection had no open streams to begin with, Peer A closes the connection

6) once every connection has been closed, Peer A responds to a poll from the application informing such
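The three states and the per-stream progression above can be sketched as a tiny state machine. This is purely an illustration of the proposal, not picoquic API:

```c
/* Illustrative per-stream state machine for the three shutdown states
 * proposed above, plus the final reset once everything is ACKed.
 * A sketch of the proposal, not picoquic API. */
typedef enum {
    SS_OPEN,
    SS_STATE1, /* A announced drain, will only respond from now on */
    SS_STATE2, /* B wrote everything it will write */
    SS_STATE3, /* A finished responding */
    SS_RESET   /* all packets ACKed, stream reset by A */
} ss_state_t;

typedef enum {
    EV_A_ANNOUNCE, /* step 1: A signals State 1 */
    EV_B_FIN,      /* step 2: B signals State 2 */
    EV_A_DONE,     /* step 3: A signals State 3 */
    EV_ALL_ACKED   /* step 4: last packet on the stream ACKed */
} ss_event_t;

/* Events arriving out of order are ignored rather than advancing the
 * state, which keeps the progression strictly 1 -> 2 -> 3 -> reset. */
ss_state_t ss_step(ss_state_t s, ss_event_t e)
{
    switch (s) {
    case SS_OPEN:   return e == EV_A_ANNOUNCE ? SS_STATE1 : s;
    case SS_STATE1: return e == EV_B_FIN      ? SS_STATE2 : s;
    case SS_STATE2: return e == EV_A_DONE     ? SS_STATE3 : s;
    case SS_STATE3: return e == EV_ALL_ACKED  ? SS_RESET  : s;
    default:        return s;
    }
}
```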

and as far as i see it we have two choices for state notification here:

1) either we add an "orderly shutdown" transport parameter, extension frame with a state parameter, and another event "orderly shutdown state changed"

2) or allow this procedure to remain entirely at the application level (but maybe allow library support for sending State 1 on every open stream)

huitema commented 3 years ago

Sending MAX STREAMS with a value lower than the current one is a no op. Per spec, the peer just keeps the previous higher value. But I think this is the right direction:

1) We create a new frame, "REDUCE_MAX_STREAMS", that advertises a lower value and an application error code.
2) Server sends the frame with the new proposed max stream value.
3) Client receives that some time later, by which time it may well have started a few new streams. Client should maybe send an explicit acknowledgment, "REDUCING_MAX_STREAMS", with a value maybe slightly higher than the server's proposal.
4) Once all the streams opened by the client have been served, the server can safely close the connection without risk of losing data.

Maybe even simpler, the new frames could just mean "STOP_OPENING_NEW_STREAMS" with just an error code and a reason, and the client replies "STOPPING_OPENING_NEW_STREAMS" with the highest stream numbers opened for that connection. And once that is done, the server can do a clean close.
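Since client-initiated bidirectional streams always carry IDs 0, 4, 8, ... (per RFC 9000, the two low bits of a stream ID encode initiator and directionality), a "STOPPING_OPENING_NEW_STREAMS" reply naming the highest such ID tells the server exactly which streams remain to be served. A sketch under those assumptions, with hypothetical helper names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Client-initiated bidirectional streams use IDs 0, 4, 8, ... per
 * RFC 9000. These helpers are a sketch, not picoquic API. */

#define NO_STREAMS UINT64_MAX /* sentinel: client never opened one */

/* Highest client bidi stream ID after n_opened streams. */
uint64_t highest_client_bidi_stream(uint64_t n_opened)
{
    return n_opened == 0 ? NO_STREAMS : 4 * (n_opened - 1);
}

/* Server may close cleanly once every stream up to the ID the client
 * reported in its reply has been fully served. */
bool server_can_close(uint64_t highest_reported, uint64_t highest_served)
{
    return highest_reported == NO_STREAMS
        || highest_served >= highest_reported;
}
```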

That would work fine for DNS over QUIC, and generally for all procedures in which the client maps new transactions to new streams.

Or maybe the server does not

victorstewart commented 3 years ago

i do agree it's cleaner to tie off the opening of new streams, but i still believe it's also unnecessary chatter. by negotiating the peer through the 3 states and sharing the drain + shutdown deadline, the peer can derive every action it needs to take.

i think it's enough to say: this is the deadline, you're aware of the RTT, so continue opening streams at your own peril... though we will service them until the deadline we shared. even in the case of owning both the client and server software, that client software might be running on a compromised device where an adversary has modified and taken control of the binary. so the data loss guarantee can only extend as far as a cooperating peer, and there's no option to service all streams to finality in perpetuity.

and i can't think of any situations where necessary, but maybe upon transitioning to State 1, some applications might want to open a new stream for some shutdown isolated communication?

that said if you believe it's cleaner to also negotiate that no streams will be opened, then let's do it.

where do you stand on what level the States should be negotiated at? purely application level? or in the library as extension frames with a new event enum? i think it's cleaner as an extension frame. and while i agree with Lucas's point about brittleness and long term desired adaptation, i believe the states as i've described them are anchored purely in transport concepts, and i don't see how they could or would change without QUIC evolution necessitating it... because all application level negotiation, if any, must still and always occur at the application level. so i can see a lot of sense in not exposing stream transport level negotiation to the application.

huitema commented 3 years ago

I would like a clean transport level solution that could be exposed to the application. That's why I don't think this is too much chatter. I would see something like:

1) Application tells server to start graceful close, either globally or per connection.
2) Server sends STOP_OPENING_NEW_STREAMS to client (for each connection).
3) Client receives frame from server, notifies application.
4) Client app says OK, does whatever is necessary.
5) Client sends server the confirmation message, explaining how many streams are still open.
6) Server finishes these streams and closes cleanly.

I am less concerned by brittleness, because in practice the QUIC implementation ships with the app, so the app will ship a version of QUIC that has the extension. But if the extension is not present, the alternative would be:

1) Server signals to app that it is closing. Or vice versa.
2) App tells peer that it is closing, maybe sending the message on a new stream.
3) Client app knows how many streams are pending. When it has received the expected data, it tells the server app that it is done.
4) Server app receives the client app message, and then closes the connection.

That works well if the application has some kind of control channel.

huitema commented 3 years ago

But yes, there is a tension between "orderly" and "shutdown". Orderly in practice means a triple handshake, equivalent to the server asking for a shutdown and the client waiting to be clean and then shutting down. And that implies a wait time that the server does not control. Your proposal is something like the server telling the client "shutdown in 5 seconds no matter what". That has the advantage of being simple and not chatty, does not even require any kind of actual closing message. But then it is not orderly...

victorstewart commented 3 years ago

well, orderly but with a deadline... the application gets to specify that deadline: maybe none, maybe 60 seconds, or 10 minutes if it so desires. it's necessary to prevent a single adversary from keeping the server in limbo forever. of course, in 99.999%+ of connections the orderly shutdown will complete in just a few RTTs. i don't imagine the deadline would ever be hit, but having it ensures the server can never be held hostage.

but i think that sequence sounds fantastic, and we should move forward on it.

huitema commented 3 years ago

Should we write a draft defining a "SHUTDOWN_WARNING(delay=D, app-error-code=A, reason=R)" extension? That would enable the simplest mechanism that we have described so far.
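If such an extension frame were defined, its fields would presumably be encoded as QUIC variable-length integers (RFC 9000, section 16), with the reason phrase length-prefixed as in CONNECTION_CLOSE. A sketch of what that encoding could look like; the frame type value 0x4040 is an arbitrary placeholder for illustration, not a registered or proposed codepoint:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* QUIC variable-length integer encoding per RFC 9000 section 16:
 * the top two bits of the first byte give the length (1/2/4/8 bytes),
 * the rest is the value in network byte order. */
size_t varint_encode(uint64_t v, uint8_t *out)
{
    if (v < 0x40) {
        out[0] = (uint8_t)v;
        return 1;
    }
    if (v < 0x4000) {
        out[0] = 0x40 | (uint8_t)(v >> 8);
        out[1] = (uint8_t)v;
        return 2;
    }
    if (v < 0x40000000) {
        out[0] = 0x80 | (uint8_t)(v >> 24);
        out[1] = (uint8_t)(v >> 16);
        out[2] = (uint8_t)(v >> 8);
        out[3] = (uint8_t)v;
        return 4;
    }
    out[0] = 0xC0 | (uint8_t)(v >> 56);
    for (int i = 1; i < 8; i++)
        out[i] = (uint8_t)(v >> (8 * (7 - i)));
    return 8;
}

/* Sketch of a SHUTDOWN_WARNING(delay, app-error-code, reason) frame.
 * The type 0x4040 is a made-up placeholder; the layout just mirrors
 * how QUIC frames carry a varint type, varint fields, and a
 * length-prefixed reason phrase. Returns total bytes written. */
size_t shutdown_warning_encode(uint8_t *out, uint64_t delay_us,
                               uint64_t error_code, const char *reason)
{
    size_t len = 0;
    size_t rlen = strlen(reason);
    len += varint_encode(0x4040, out + len);     /* placeholder type */
    len += varint_encode(delay_us, out + len);   /* shutdown delay */
    len += varint_encode(error_code, out + len); /* app error code */
    len += varint_encode(rlen, out + len);       /* reason length */
    memcpy(out + len, reason, rlen);
    len += rlen;
    return len;
}
```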

The only point that is still unclear for me is whether we should describe a re-connection policy, with a choice between:

victorstewart commented 3 years ago

yes let's write a draft. where do we begin?

i think the assumed behavior should be "start polling to reconnect immediately".

the biggest problems with telling the client when to start trying to reconnect are... 1) it's impossible for the server to know how long it'll take to drain all open connections and 2) impossible for it to know how long it'll take to restart itself once all connections have been closed. so at best the server would have to vastly overestimate, which is a bad situation for user experience on the client.

the only way to minimize client downtime is to begin polling for reconnect immediately.

and in the VAST majority of cases, anyone using this extension, will have other capacity immediately available for reconnect.

the only downside i see is a server process that's using the same socket to receive connection attempts as all connected traffic, thus is unable to avoid drowning in a flood of reconnect attempts which slows down the process of draining open connections. but that's the implementor's failure of networking and routing, and possibly resource capacity, so i'm not concerned.

huitema commented 3 years ago

Yes, predicting delays is hard. The problem gets much simpler if the server that is being cycled has been using SPA, and the active connections have been moved to a server specific port. The server can then stop listening on the shared port, and another server instance can start immediately.

victorstewart commented 3 years ago

exactly. that's mostly the socket design i chose personally: one socket listening on my anycast address for new connection traffic, then transitioning each connection to another socket on the server's unique public IP. then pull the plug on the anycast socket during orderly shutdown.

the problem with changing port is that middleware might throttle or blackhole non-443 traffic. but changing IP is always safe.

huitema commented 3 years ago

Started writing a draft in https://github.com/huitema/quic-shutdown

huitema commented 3 years ago

@victorstewart I invited you to edit that draft -- add your name in the authors' list, fill whatever section you can make progress on, etc.

victorstewart commented 2 years ago

even with the most elegant of shutdown dances, it is still possible at any time for connections to disconnect, so clients and servers must be bulletproof against data loss and replays. the necessity of such protections de facto makes elegant shutdowns an unnecessary frivolity, burning extra uptime and cycles when both parties can handle simply SIGKILL-ing the server process.