pelias / wof-admin-lookup

Who's on First Admin Lookup for the Pelias Geocoder
https://pelias.io
MIT License
9 stars 24 forks source link

Handle PIP child_process `error` and `close` events #172

Closed orangejulius closed 7 years ago

orangejulius commented 7 years ago

Background

The multi-process model our admin lookup code uses has been susceptible to issues with the child processes shutting down unexpectedly for a long time.

This can happen for many reasons, some of which are:

In my testing today, the PIP service will shut down after a worker has died, but only after a request has come in that actually ends up hitting that layer. For higher, less common layers like ocean, this may not happen for a long time.

The importers, when using wof-admin-lookup locally, are in even worse shape, and will often hang indefinitely.

In either case, a worker shutting down before all other workers have loaded will cause an infinite hang.

Changes

This PR adds handlers for both the close and error events emitted by child processes. It determines if the event was emitted intentionally or not, for example because the importer or PIP service is shutting down and has told the workers to shut down, or unintentionally, in which case it's an error.

In the error case, the behavior is now to send the SIGTERM signal to all other workers and throw an exception so that the entire importer or PIP service quits immediately. It writes a clear error to stderr showing which worker shut down, which should help diagnose any issues.

By shutting down immediately, time spent in an invalid state is minimized.

For Mapzen Search production, as soon as the PIP service has fully shut down the instance will be taken out of the load balancer until the service has restarted, so again it's advantageous to shut down as quickly as possible.

Testing

I've tested this locally with both importers and the PIP service by manually sending SIGTERM to a worker. Worker shutdown both during loading and after they have all loaded is now handled well. Of course more testing would be appreciated.

Fixes #138

orangejulius commented 7 years ago

@trescube there must be a way. It's probably best tested as a more functional test, since trying to mock out the behavior of processes/signals/socket connections would probably be not only hard to do, but hard to model accurately after real world behavior.

I wonder if we could create a test something like this: A script runs the PIP service but via some sort of flag makes one worker quit after X seconds with an error code. The script then checks that the PIP service shut down within X+~1 seconds.

That's only the failure during loading case, but it would basically be an automated version of the testing I had been doing. Any thoughts on a better approach?