Handle SIGTERM gracefully

tilezen / tileserver

A lightweight tileserver to share code paths with tilequeue for tile generation.

MIT License


Handle SIGTERM gracefully #48

Open zerebubuth opened 7 years ago

zerebubuth commented 7 years ago

Upon receiving SIGTERM, tileserver should:

1. Start responding to health check requests with an error.
2. Wait a configurable grace period, or until all outstanding requests have finished.
3. Shut down.

This lets tileserver work with ELB connection draining or HAProxy, so that it can terminate without dropping any requests. If this is combined with staggered shutdowns / upgrades, so that only part of the cluster is down at any one time, then no requests are lost.
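The three steps above could be sketched roughly as follows. This is a minimal illustration, not tileserver's actual code: the `ShutdownCoordinator` class, its method names, and the 30 second default are all hypothetical, and it reads step 2 as "whichever comes first" of the grace period expiring or the last request finishing.

```python
import signal
import threading
import time


class ShutdownCoordinator:
    """Hypothetical helper: tracks in-flight requests and the SIGTERM flag."""

    def __init__(self, grace_period_seconds=30.0):
        self.grace_period_seconds = grace_period_seconds
        self.terminating = False  # flipped by the signal handler
        self._in_flight = 0
        self._idle = threading.Condition()

    def install(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def handle_sigterm(self, signum, frame):
        # Only set a flag here; the real work happens outside the handler.
        self.terminating = True

    def health_status(self):
        # Step 1: fail health checks so the load balancer stops routing here.
        return 503 if self.terminating else 200

    def request_started(self):
        with self._idle:
            self._in_flight += 1

    def request_finished(self):
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def wait_for_drain(self):
        """Step 2: block until idle or until the grace period runs out.

        Returns True if all outstanding requests finished in time.
        """
        deadline = time.monotonic() + self.grace_period_seconds
        with self._idle:
            while self._in_flight > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._idle.wait(remaining)
            return True
```

In this sketch the server's request handler would wrap each request in `request_started()` / `request_finished()`, the health check endpoint would return `health_status()`, and the main loop would call `wait_for_drain()` once `terminating` is set and then exit (step 3). A real deployment would probably also want to keep serving for at least a few health check intervals, since the load balancer only stops sending traffic after it has seen the failing checks.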

rmarianski commented 7 years ago

This is a good idea. On the one hand we can avoid this by rolling in new instances, but on the other, it's much easier to just run a deploy command in opsworks.

It might be worth considering pushing the scope of this problem outside into an opsworks tool that rolls in the deploy for us. That way it's solved for any service in an opsworks layer. Or, maybe there's a way to avoid rolling in deploys, but still handle this mostly outside the actual process. I wonder if we can unregister the instance from the elb, wait until it's unregistered, and then re-register it once it's restarted. I'm assuming the wait step here handles the connection draining for us, and that opsworks wouldn't fight us and try to re-register the instance because it's still in the layer in the interim.
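The unregister / wait / restart / re-register idea might look something like the sketch below. It assumes `elb` is a boto3 classic ELB client (`boto3.client("elb")`) or anything with the same three methods; the function name `restart_behind_elb`, the `restart` callback, and the timeouts are hypothetical. It also carries over the assumptions stated above: that polling for `OutOfService` is enough to cover connection draining, and that opsworks does not re-register the instance in the meantime.

```python
import time


def restart_behind_elb(elb, load_balancer_name, instance_id, restart,
                       poll_seconds=5.0, timeout_seconds=300.0):
    """Take one instance out of the ELB, restart the service, put it back."""
    instances = [{"InstanceId": instance_id}]

    def wait_for_state(wanted):
        # Poll the ELB's view of the instance until it reports `wanted`.
        deadline = time.monotonic() + timeout_seconds
        while time.monotonic() < deadline:
            health = elb.describe_instance_health(
                LoadBalancerName=load_balancer_name, Instances=instances)
            states = [s["State"] for s in health["InstanceStates"]]
            if states and all(s == wanted for s in states):
                return True
            time.sleep(poll_seconds)
        return False

    elb.deregister_instances_from_load_balancer(
        LoadBalancerName=load_balancer_name, Instances=instances)
    # Assumption: the instance only shows as OutOfService once draining is done.
    if not wait_for_state("OutOfService"):
        raise RuntimeError("instance did not leave the load balancer in time")

    restart()  # hypothetical callback, e.g. the deploy / service restart step

    elb.register_instances_with_load_balancer(
        LoadBalancerName=load_balancer_name, Instances=instances)
    if not wait_for_state("InService"):
        raise RuntimeError("instance did not become healthy in time")
```

Running this over the instances in a layer one at a time would give a rolling restart without adding new instances, provided the assumptions about draining and opsworks hold.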

zerebubuth commented 7 years ago

I think both mechanisms would be good to have.

Rolling the deploy requires outside tooling, which is great for anything compatible with it. But I wouldn't be confident that it covers 100% of all cases that the service could be stopped. Handling SIGTERM internally is then a safety net for those (hopefully rare) cases where tileserver is stopped outside of a rolling deploy.

rmarianski commented 7 years ago

From @iandees, http://docs.aws.amazon.com/opsworks/latest/userguide/best-deploy.html#best-deploy-rolling

rmarianski commented 7 years ago

> But I wouldn't be confident that it covers 100% of all cases that the service could be stopped.

Just curious, what kind of cases would this be?