mlandauer opened this issue 8 years ago
Switched off auto-run for:
Something is going on with the queue now. We have a backlog of about 200 scrapers.
There don't seem to be many really long-running scrapers, but I did add one to the list (it's designed to run in a loop repeatedly).
One of the sidekiq processes is pegged at 100% CPU. The memory looks OK. There are only 3 runs on that process.
I thought the oldest one was the likely culprit. Manually stopping it didn't stop the run in sidekiq. I restarted sidekiq and the CPU's back to normal.
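For my own future reference, this is roughly how I'd check which jobs a pegged process is holding from the Rails console next time. Just a sketch using the plain Sidekiq API, nothing morph-specific:

```ruby
require "json"
require "sidekiq/api"

# One entry per sidekiq process: where it's running and how many jobs it has busy.
Sidekiq::ProcessSet.new.each do |process|
  puts "#{process['hostname']} pid=#{process['pid']} busy=#{process['busy']}"
end

# One entry per busy thread: which worker class/args each process is running
# and when the job started, so a stuck long runner should stand out.
Sidekiq::Workers.new.each do |process_id, _thread_id, work|
  payload = work["payload"]
  payload = JSON.parse(payload) if payload.is_a?(String) # shape varies between Sidekiq versions
  puts "#{process_id}: #{payload['class']} #{payload['args'].inspect} since #{Time.at(work['run_at'])}"
end
```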
Now there's 17 running containers but only 2 sidekiq jobs. Running `rake app:emergency:show_queue_run_inconsistencies` doesn't show any inconsistencies. Huh?
It's because they're on the retry queue. OK, I think I understand this Brave New World a little better.
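To see that next time without guessing, something like this would do (again just a sketch using the standard docker-api gem and Sidekiq API, not what the rake task actually does):

```ruby
require "docker"      # docker-api gem
require "sidekiq/api"

# What docker thinks is running vs what sidekiq thinks is busy.
container_names = Docker::Container.all.map { |c| c.info["Names"] }.flatten # running containers only
busy_count = Sidekiq::Workers.new.size
puts "#{container_names.size} running containers, #{busy_count} busy sidekiq jobs"

# The "missing" jobs are sitting in the retry set, each with a scheduled retry time.
retries = Sidekiq::RetrySet.new
puts "#{retries.size} jobs waiting to retry"
retries.each do |job|
  puts "#{job.klass} #{job.args.inspect} retry ##{job['retry_count']} at #{job.at} (#{job['error_message']})"
end
```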
Some of the longer running ones are retrying because of docker timeout errors. No idea why but thought I should note that here.
There are 13 scrapers that have been running for over 5 hours. A couple of them seem to routinely run for a day so I've disabled auto-run and added them to that list.
There are quite a few that normally don't take long to run. I don't understand what's gone wrong with them but I'm going to try and manually stop them.
> All this has now led to a huge starvation problem where long running scrapers are hogging the queue and are not letting in the short running jobs.
We're seeing this quite a lot at the moment.
> Some of the longer running ones are retrying because of docker timeout errors.
That happens and the jobs end up way back in the retry queue. Sometimes they get backed off to the point where there are hours before the next retry. I'm not sure why they don't become active when they hit the front of the queue.
This leaves only a few queue spots for the rest of the scrapers, so a big retries backlog takes a long time to run down.
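For context on how far back they get pushed: if I've got it right, Sidekiq's default retry delay is roughly `(retry_count ** 4) + 15` seconds plus some jitter, so after a handful of failures a job won't come back for hours unless someone kicks it manually:

```ruby
# Approximate default Sidekiq backoff (version-dependent):
# (retry_count ** 4) + 15 + (rand(30) * (retry_count + 1)) seconds.
def approx_retry_delay(retry_count)
  (retry_count**4) + 15 + (rand(30) * (retry_count + 1))
end

(0..14).each do |count|
  puts format("retry %d: ~%.1f hours", count, approx_retry_delay(count) / 3600.0)
end

# To push everything in the retry set straight back onto the queue:
#   require "sidekiq/api"; Sidekiq::RetrySet.new.retry_all
```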
> That happens and the jobs end up way back in the retry queue. Sometimes they get backed off to the point where there are hours before the next retry. I'm not sure why they don't become active when they hit the front of the queue.
I think this is the issue I've described in #1123.
Now that we've dealt with a number of problems that were causing issues for long-running scrapers and scrapers with large amounts of output, they're running again. Also, we've improved the control over the number of simultaneous scrapers that can run so that it actually stays close to the limit.
All this has now led to a huge starvation problem where long running scrapers are hogging the queue and are not letting in the short running jobs.
So, the solution here is now to look at a smarter scheduling algorithm rather than the "pick a job out of the hat" kind that we're effectively using now.
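One cheap way to get most of the benefit, as a sketch only (the worker class, queue names and threshold below are made up for illustration, not morph's actual code): route scrapers that usually run for a long time onto a separate, lower-weight Sidekiq queue so they can never occupy every worker slot.

```ruby
require "sidekiq"

# Hypothetical worker for illustration; morph's real job classes will differ.
class RunWorker
  include Sidekiq::Worker

  def perform(run_id)
    # ... start the container for run_id ...
  end
end

LONG_RUN_THRESHOLD = 60 * 60 # seconds; made-up cut-off

# Put scrapers that usually take a long time on a lower-priority queue.
def enqueue_run(run_id, average_runtime_seconds)
  queue = average_runtime_seconds > LONG_RUN_THRESHOLD ? "long_runs" : "default"
  RunWorker.set(queue: queue).perform_async(run_id)
end

# Then start sidekiq so the default queue is polled much more often, e.g.:
#   bundle exec sidekiq -q default,5 -q long_runs,1
```

That wouldn't make the scheduling smart, but it would at least stop long runs from eating every slot while we work out something better.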
In the meantime I'm going to switch off auto-run on a few of the really long-running scrapers and document them here as well. Then we can either switch auto-run back on once we have a smarter scheduling algorithm or, if that doesn't happen in time, email the owners of the scrapers.