Unhandled error conditions in worker.py

philbudne commented 5 months ago

indexer/worker.py has two places with "# XXX fatal error?" comments. I think the first was the cause of the hung tqfetcher process (Pika thread exited, but all the other threads continued running, cut off from Pika communication) last week.

I don't think there's much value to a traceback (i.e., throwing an Exception), because the error likely happened elsewhere, so I'm leaning towards adding sys.exit(1), which I've checked raises an exception (SystemExit) that propagates up thru all "with" context managers, and exits with less noise...
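A minimal sketch of that check (not code from worker.py; the names tracked and run are hypothetical): sys.exit(1) raises SystemExit, which unwinds through each enclosing with block and runs its cleanup. One caveat worth keeping in mind while testing: sys.exit() only ends the whole process when called from the main thread; in any other thread, SystemExit just ends that thread.

```python
import sys
from contextlib import contextmanager

cleaned_up = []  # records which context managers ran their exit code


@contextmanager
def tracked(name):
    """Hypothetical stand-in for a resource held by a `with` block."""
    try:
        yield
    finally:
        cleaned_up.append(name)


def run():
    with tracked("outer"):
        with tracked("inner"):
            sys.exit(1)  # raises SystemExit(1); no traceback is printed


try:
    run()
except SystemExit as exc:  # caught here only so the result can be inspected
    exit_code = exc.code
```

Here both cleanup handlers run, innermost first, before SystemExit reaches the top level.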

I've self-assigned this, as I'd like to do some testing, in particular (but not exclusively) to see what happens when a stack is taken down with docker stack rm stackname, to make sure nothing ugly happens...

philbudne commented 5 months ago

Two different problems seen. The above description may be a conflation of them:

On 6/15 tqfetcher (using base class MultiThreadStoryWorker) lost its rabbitmq connection, and the Pika thread exited, sending kisses of death to the worker threads. The main thread exited with status zero, so the container was not (and could not be) restarted.

On 6/9 the csv-queuer (using base class Queuer) lost its rabbitmq connection, and continued to run (outputting error messages, and marking the work file as successfully queued). This could be addressed at several different levels (long inheritance path): QApp, Producer, StoryProducer, Queuer...
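For the second problem, one possible shape of a fix, as a sketch only: whichever level it lands at, a failed send has to stop the file from being marked as queued. Everything here (QueueError, queue_file, and the send and mark_done callables) is hypothetical and not taken from the Queuer classes.

```python
class QueueError(Exception):
    """Hypothetical exception for a failed publish."""


def queue_file(stories, send, mark_done):
    """Send every story; call mark_done() only if all sends succeeded."""
    for story in stories:
        try:
            send(story)
        except Exception as exc:
            # Stop at the first failure instead of logging and carrying on.
            raise QueueError(f"could not queue {story!r}") from exc
    mark_done()


# Simulate a lost connection: every send fails.
done = []
failed = False


def broken_send(story):
    raise ConnectionError("lost rabbitmq connection")


try:
    queue_file(["a", "b"], broken_send, lambda: done.append(True))
except QueueError:
    failed = True  # a real caller might exit non-zero here
```

With the failing send, mark_done is never called, so the work file would be left for a retry.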

philbudne commented 5 months ago

While testing a fix for the above two problems (by doing docker service scale ...._rabbitmq=0 and back to 1), I found a third problem:

In an ordinary Worker (parser and importer), the main thread blocks on _message_queue and doesn't see that the Pika thread has exited.
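One way to let the main thread notice, sketched under assumptions: wait on the queue with a timeout and check whether the Pika thread is still alive each time the wait expires. The function main_loop and its parameters are hypothetical; only the idea of a main thread blocked on a message queue comes from the description above.

```python
import queue
import threading


def main_loop(message_queue, pika_thread, handle, poll_seconds=0.1):
    """Process messages until the Pika thread dies; return an exit status."""
    while True:
        try:
            msg = message_queue.get(timeout=poll_seconds)
        except queue.Empty:
            if not pika_thread.is_alive():
                return 1  # non-zero status, so a restart policy can react
            continue
        handle(msg)


# Stand-in for the Pika thread: delivers one message, then exits.
q = queue.Queue()
seen = []
t = threading.Thread(target=lambda: q.put("story"))
t.start()
status = main_loop(q, t, seen.append)
```

A caller would pass the result to sys.exit() from the main thread. The timeout trades a small wake-up cost for not hanging forever; an alternative would be to have the Pika thread put a sentinel on the queue as it exits.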

mediacloud / story-indexer

Unhandled error conditions in worker.py #302