Open hroberts opened 6 years ago
The specific problem I'm seeing is that if a pool worker tries to add something to a queue while rabbit is down, that add_to_queue call and all future add_to_queue calls for that running process fail with an error message like this:
2018-08-28 09:17:28,196 MediaCloud.JobManager.Broker.RabbitMQ: Unable to declare queue 'MediaWords::Job::Facebook::FetchStoryStats': AMQP socket not connected at /home/mediacloud/.perlbrew/libs/perl-system@mediacloud/lib/perl5/MediaCloud/JobManager/Broker/RabbitMQ.pm line 290.
We are using the default retry strategy for add_to_queue, which only retries for a total of 0.4 seconds. We need to increase it a lot, to at least five minutes, so that restarting rabbit does not kill all of the jobs running on other servers. This has not been a problem in the past because all of our queueing jobs have been running on the same server as rabbitmq.
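To make the proposal concrete, here is a rough sketch of the kind of retry I have in mind: keep retrying with a capped exponential backoff until a total time budget (five minutes by default) runs out. It is written in Python purely for illustration and is not the JobManager code. The wrapper name `add_to_queue_with_retry` and its parameters are hypothetical, and it assumes the wrapped callable reconnects to rabbit on each attempt, which the current code may not do.

```python
import time


def add_to_queue_with_retry(add_to_queue, *args,
                            max_total_seconds=300.0,
                            initial_delay=1.0,
                            max_delay=30.0,
                            sleep=time.sleep,
                            clock=time.monotonic):
    """Call add_to_queue(*args), retrying until max_total_seconds elapse.

    Hypothetical helper: add_to_queue is any callable that raises on
    failure and is assumed to re-establish its broker connection itself.
    sleep and clock are injectable so the behaviour can be tested
    without actually waiting.
    """
    deadline = clock() + max_total_seconds
    delay = initial_delay
    while True:
        try:
            return add_to_queue(*args)
        except Exception:
            remaining = deadline - clock()
            if remaining <= 0:
                # Time budget exhausted: surface the last error to the caller.
                raise
            # Never sleep past the deadline.
            sleep(min(delay, remaining))
            # Double the wait each time, up to max_delay.
            delay = min(delay * 2, max_delay)
```

With these defaults a rabbit restart of a minute or two would be absorbed by the retries, while a longer outage still surfaces as an error after roughly five minutes.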
This looks like a pretty simple fix to me, but I'd rather you do it, @pypt, since you are more familiar with this code.
I have added some retries around the add_to_queue calls in TM::Mine.pm for the time being just so that topics don't crash and send error reports to users when we restart rabbit.