Two different problems seen. The above description may be a conflation of them:
On 6/15 tqfetcher (using base class `MultiThreadStoryWorker`) lost its RabbitMQ connection; the Pika thread exited, sending kisses of death to the worker threads. The main thread exited with status zero, so the container was not (and could not be) restarted.
On 6/9 the csv-queuer (using base class `Queuer`) lost its RabbitMQ connection and continued to run, outputting error messages and marking the work file as successfully queued. This could be addressed at several different levels (long inheritance path): `QApp`, `Producer`, `StoryProducer`, `Queuer`...
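For the second problem, the kind of fail-fast behavior I have in mind looks roughly like the sketch below. This is only an illustration, assuming a pika `BlockingConnection`-style publisher; the names (`queue_file`, the routing key, the mark-as-queued helper) are hypothetical and the actual `Queuer` plumbing differs:

```python
import logging
import sys

import pika
import pika.exceptions

logger = logging.getLogger(__name__)

def queue_file(chan: pika.adapters.blocking_connection.BlockingChannel,
               lines: list[bytes]) -> None:
    # Hypothetical publish loop: if the RabbitMQ connection is gone,
    # exit instead of logging errors and marking the work file queued.
    for body in lines:
        try:
            chan.basic_publish(exchange="", routing_key="stories", body=body)
        except pika.exceptions.AMQPError as e:
            logger.error("lost RabbitMQ connection: %r", e)
            sys.exit(1)  # fail fast; do NOT mark the work file as queued
    # only reached if every publish succeeded:
    # mark_file_queued(...)  # hypothetical helper
```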
While testing a fix for the above two problems (by doing `docker service scale ...._rabbitmq=0` and back to 1), I found a third problem:
In an ordinary `Worker` (parser and importer), the main thread blocks on `_message_queue` and doesn't see that the Pika thread has exited.
`indexer/worker.py` has two places with "# XXX fatal error?" comments. I think the first was the cause of last week's hung tqfetcher process (the Pika thread exited, but all the other threads continued running, cut off from Pika communication).
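To illustrate the symptom and one possible mitigation (generic names, not the actual `worker.py` structure): a plain `Queue.get()` never returns once the Pika thread dies, so polling with a timeout and checking the thread's liveness lets the main thread notice and exit non-zero.

```python
import queue
import sys
import threading
from typing import Any

def process(msg: Any) -> None:
    """Placeholder for real message handling."""
    print("processing", msg)

def main_loop(message_queue: "queue.Queue[Any]",
              pika_thread: threading.Thread) -> None:
    while True:
        try:
            msg = message_queue.get(timeout=30)
        except queue.Empty:
            if not pika_thread.is_alive():
                # Pika thread has exited: no more messages will ever
                # arrive, so exit non-zero and let the container restart.
                sys.exit(1)
            continue
        process(msg)
```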
I don't think there's much value in a traceback (i.e., throwing an Exception), because the error likely happened elsewhere, so I'm leaning towards adding `sys.exit(1)`, which I've checked propagates an exception up through all `with` context handlers, and exits with less noise.

I've self-assigned this, as I'd like to do some testing, in particular (but not exclusively) to see what happens when a stack is taken down with `docker stack rm stackname`, to make sure nothing ugly happens.
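For reference, a quick stand-alone check of the `sys.exit(1)` behavior mentioned above: `sys.exit()` raises `SystemExit`, so `with` blocks' exit handlers still run on the way out, and the process ends with status 1.

```python
import sys
from contextlib import contextmanager

@contextmanager
def resource(name: str):
    print(f"enter {name}")
    try:
        yield
    finally:
        print(f"exit {name}")  # runs even while SystemExit propagates

def worker() -> None:
    with resource("outer"), resource("inner"):
        sys.exit(1)  # raises SystemExit(1)

if __name__ == "__main__":
    worker()  # prints enter/exit pairs, then exits with status 1
```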