Graceful shutdown

toland commented 5 years ago

We have played fast and loose with application shutdown until now, but it is becoming clear that we need to think about how to achieve a graceful shutdown. Here are some things to consider:

1) Drain all HTTP connections (i.e., allow in-flight requests to finish while disallowing new requests).
2) Drain all websocket requests (may be tricky).
3) Shut down Dawdle pollers and unregister all handlers. Don't shut down until any messages that are in-flight have been handled.
4) Remove the node from the Swarm cluster and transfer all running processes to another node.

Essentially, we want to ensure that the node is idle before we allow the BEAM to shut down.
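To make "idle before shutdown" concrete, here is a minimal sketch of the drain pattern behind items 1 and 3: count in-flight work, refuse new work once draining starts, and unblock the caller only when the count reaches zero. The module `DrainSketch.Tracker` and its functions are hypothetical names for illustration only; they are not part of Wocky, Dawdle, or Phoenix, and a real implementation would likely lean on whatever the HTTP server and queue library already provide.

```elixir
defmodule DrainSketch.Tracker do
  @moduledoc "Hypothetical in-flight work tracker, for illustration only."
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Returns :ok if the work is admitted, {:error, :draining} once draining has begun.
  def checkin(server), do: GenServer.call(server, :checkin)

  # Marks one unit of work as finished.
  def checkout(server), do: GenServer.cast(server, :checkout)

  # Stops admitting work and blocks until nothing is in flight.
  # Exits with a timeout if the work does not finish in time.
  def drain(server, timeout \\ 5_000), do: GenServer.call(server, :drain, timeout)

  @impl true
  def init(:ok), do: {:ok, %{in_flight: 0, draining: false, waiters: []}}

  @impl true
  def handle_call(:checkin, _from, %{draining: true} = state) do
    {:reply, {:error, :draining}, state}
  end

  def handle_call(:checkin, _from, state) do
    {:reply, :ok, %{state | in_flight: state.in_flight + 1}}
  end

  # Already idle: reply immediately.
  def handle_call(:drain, _from, %{in_flight: 0} = state) do
    {:reply, :ok, %{state | draining: true}}
  end

  # Work still in flight: park the caller until the last checkout.
  def handle_call(:drain, from, state) do
    {:noreply, %{state | draining: true, waiters: [from | state.waiters]}}
  end

  @impl true
  def handle_cast(:checkout, state) do
    state = %{state | in_flight: max(state.in_flight - 1, 0)}

    if state.in_flight == 0 do
      # Release every caller blocked in drain/2.
      Enum.each(state.waiters, &GenServer.reply(&1, :ok))
      {:noreply, %{state | waiters: []}}
    else
      {:noreply, state}
    end
  end
end
```

In this sketch each handler would call `checkin/1` before starting work and `checkout/1` when done, and the shutdown path (for example an `Application.prep_stop/1` callback) would call `drain/2` before letting the supervision tree stop. Websockets and cluster hand-off (items 2 and 4) are not covered here.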

toland commented 5 years ago

Here is an article on connection draining in Phoenix. This would work for GQL/HTTP and REST requests, but probably doesn't help with websockets.

https://moosecode.nl/blog/implementing_connection_draining_phoenix

bernardd commented 5 years ago

The Dawdle issue is going to be tricky. Or, rather, the pre-Dawdle db_watcher step. There's no real way to guarantee that there are no more notifications to be triggered other than doing something like redirecting them through Lambda and having it enqueue the SQS messages... which I guess is something we could consider, but it feels a bit sledgehammer-walnutty.

toland commented 5 years ago

"There's no real way to guarantee that there are no more notifications to be triggered..."

Fair enough. I was mainly thinking about the local queue and handlers.

It is a bit hard to reason about the db watcher since I'm not sure how Swarm behaves in the different shutdown scenarios. It would be worthwhile to do some testing to see what happens when the BEAM goes down in an ordered shutdown vs a crash under load.

The Lambda solution, or just running a separate BEAM VM with the db watcher, might feel like overkill, but it also might make sense in terms of isolating the db watcher, which doesn't change very often, from the application code, which does.
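As a starting point for the ordered-shutdown-vs-crash testing mentioned above, here is a small sketch of why the two cases differ at the process level: cleanup in `terminate/2` runs on an ordered stop but is skipped when a process is killed outright. `ShutdownProbe` is a hypothetical module for illustration; it says nothing about how Swarm itself reacts, which would still need to be tested against a real cluster.

```elixir
defmodule ShutdownProbe do
  @moduledoc "Hypothetical probe that reports whether its cleanup ran."
  use GenServer

  def start(report_to), do: GenServer.start(__MODULE__, report_to)

  @impl true
  def init(report_to) do
    # Trapping exits lets terminate/2 run when a supervisor or linked
    # process sends an exit signal such as :shutdown.
    Process.flag(:trap_exit, true)
    {:ok, report_to}
  end

  @impl true
  def terminate(reason, report_to) do
    # Stand-in for real cleanup such as unregistering from a cluster.
    send(report_to, {:cleaned_up, reason})
    :ok
  end
end

# Ordered shutdown: terminate/2 is invoked.
{:ok, pid} = ShutdownProbe.start(self())
:ok = GenServer.stop(pid, :shutdown)

receive do
  {:cleaned_up, reason} -> IO.puts("ordered stop ran cleanup: #{inspect(reason)}")
after
  500 -> IO.puts("ordered stop: cleanup did not run")
end

# Simulated crash: :kill cannot be trapped, so terminate/2 never runs.
{:ok, pid} = ShutdownProbe.start(self())
ref = Process.monitor(pid)
Process.exit(pid, :kill)

receive do
  {:DOWN, ^ref, :process, _pid, _reason} -> :ok
end

receive do
  {:cleaned_up, _reason} -> IO.puts("kill ran cleanup")
after
  100 -> IO.puts("kill skipped cleanup")
end
```

A whole-VM crash behaves like the second case for every process at once, so anything the cluster needs to learn about the node has to come from the surviving nodes rather than from cleanup code on the dying one.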

bernardd commented 5 years ago

Shower thought: the suggested way (possibly the only way) to invoke Lambda directly from Postgres/Aurora is with a built-in Python function. If we're doing that, why not just post to SQS directly from that same function and bypass Lambda entirely? I think this is something I should have a play with so we have a bit more actual data to work with, but on the face of it it seems like a good idea.

bernardd commented 5 years ago

So after spending a while getting everything set up to test, it turns out that built-in Python functions are not a thing in RDS. So much for that idea. I did read about a plan elsewhere that uses the DB logs to achieve something similar. I'll have to look into that, but it potentially also comes with privacy issues to consider.

bernardd commented 5 years ago

So the log thing isn't a thing like I thought it was, which leaves us back where we started. I think what we have now is, for the moment, the best solution in terms of getting stuff from RDS into SQS. I am, however, reliably informed that direct Lambda calls from Aurora/Postgres are currently being worked on, so we'll probably want to revisit this in the future.

bernardd commented 4 years ago

The shutdown process should include cleanly removing the node from the Swarm cluster (or Horde if we manage to switch to that).

hippware / wocky

Graceful shutdown #2796