openaustralia / morph

Take the hassle out of web scraping
https://morph.io
GNU Affero General Public License v3.0

Starvation for short running scrapers #1076

Open mlandauer opened 8 years ago

mlandauer commented 8 years ago

Now that we've dealt with a number of problems that were causing issues for long running scrapers and scrapers with large amounts of output, they're running again. Also, we've improved the control over the number of simultaneous scrapers that can run, so that it actually stays close to the limit.

All this has now led to a huge starvation problem where long running scrapers are hogging the queue and are not letting in the short running jobs.

So, the solution now is to look at a smarter scheduling algorithm rather than the "pick a job out of the hat" approach we're effectively using at the moment.
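For context, a minimal sketch of one cheap option, assuming we split runs across weighted Sidekiq queues by expected duration. The `RunWorker` class and the `scraper.average_duration` helper below are illustrative stand-ins, not morph's actual code:

```ruby
require "sidekiq"

# Hypothetical sketch (not morph's actual code): route each run onto a
# "short" or "long" Sidekiq queue based on the scraper's recent average
# duration, so a handful of day-long runs can't starve the quick ones.
class RunWorker
  include Sidekiq::Worker

  def perform(scraper_id)
    # ... start and supervise the container for this scraper ...
  end
end

SHORT_THRESHOLD = 15 * 60 # seconds; anything under 15 minutes counts as "short"

def enqueue_run(scraper)
  # `scraper.average_duration` (seconds) is an assumed helper, not a real morph method
  queue = scraper.average_duration.to_i < SHORT_THRESHOLD ? "short" : "long"
  RunWorker.set(queue: queue).perform_async(scraper.id)
end
```

With queue weights like `[short, 3]` and `[long, 1]` in the Sidekiq config, workers would poll the short queue roughly three times as often, so quick scrapers keep flowing even while the day-long ones grind away.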

In the meantime I'm going to switch off auto-run on a few of the really long running scrapers and document them here as well. Then we can either switch auto-run back on once we have a smarter scheduling algorithm or, if that doesn't happen in time, email the owners of the scrapers.

mlandauer commented 8 years ago

Switched off auto-run for:

henare commented 8 years ago

Something is going on with the queue now. We have a backlog of about 200 scrapers.

There don't seem to be many really long running scrapers, but I did add one to the list (it's designed to run in a loop repeatedly).

One of the sidekiq processes is pegged at 100% CPU. The memory looks OK. There are only 3 runs on that process.

I thought the oldest one was the likely culprit. Manually stopping it didn't stop the run in sidekiq. I restarted sidekiq and the CPU's back to normal.

Now there's 17 running containers but only 2 sidekiq jobs. Running rake app:emergency:show_queue_run_inconsistencies doesn't show any inconsistencies. Huh?
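For anyone retracing this, a rough way to compare those counts from a Rails console. The Sidekiq calls are standard API; the Docker call assumes the docker-api gem is available on the server, which is an assumption rather than a statement about morph's setup:

```ruby
require "sidekiq/api"
require "docker" # docker-api gem, assumed available on the server

# Threads currently executing a job, across every Sidekiq process
busy_jobs = Sidekiq::Workers.new.size

# Jobs still sitting on the default queue
queued = Sidekiq::Queue.new.size

# Containers Docker reports as running
running = Docker::Container.all.size

puts "containers=#{running} busy=#{busy_jobs} queued=#{queued}"
```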

henare commented 8 years ago

> Now there's 17 running containers but only 2 sidekiq jobs. Running rake app:emergency:show_queue_run_inconsistencies doesn't show any inconsistencies. Huh?

It's because they're on the retry queue. OK, I think I understand this Brave New World a little better.

Some of the longer running ones are retrying because of docker timeout errors. No idea why but thought I should note that here.
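For the record, the retry set itself is easy to inspect from a console with the standard Sidekiq API; each parked job records its error class, retry count and next scheduled attempt:

```ruby
require "sidekiq/api"

# Each entry in the retry set records why it failed, how many times it has
# been retried, and when Sidekiq will attempt it next.
Sidekiq::RetrySet.new.each do |job|
  puts "#{job.klass} args=#{job.args.inspect} " \
       "retries=#{job.item['retry_count']} " \
       "next_attempt=#{job.at} " \
       "error=#{job.item['error_class']}"
end
```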

henare commented 8 years ago

There are 13 scrapers that have been running for over 5 hours. A couple of them seem to routinely run for a day so I've disabled auto-run and added them to that list.

There are quite a few that normally don't take long to run. I don't understand what's gone wrong with them but I'm going to try and manually stop them.
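A hypothetical console sketch for finding those, assuming a `Run` model with `started_at` / `finished_at` columns and a `scraper` association; the names are illustrative stand-ins, not necessarily morph's actual schema:

```ruby
# Hypothetical Rails console sketch; Run, started_at, finished_at and the
# scraper association are illustrative stand-ins, not necessarily morph's schema.
long_runners = Run.where(finished_at: nil)
                  .where("started_at < ?", 5.hours.ago)

long_runners.each do |run|
  puts "run #{run.id} started #{run.started_at} (#{run.scraper&.full_name})"
end
```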

equivalentideas commented 7 years ago

> All this has now led to a huge starvation problem where long running scrapers are hogging the queue and are not letting in the short running jobs.

We're seeing this quite a lot at the moment.

> Some of the longer running ones are retrying because of docker timeout errors.

That happens and the jobs end up way back in the retry queue. Sometimes they get backed off to hours before the next retry. I'm not sure why they don't become active when they hit the front of the queue.

This leaves only a few queue spots for the rest of the scrapers, so a big backlog of retries takes a long time to run down.
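That backlog behaviour follows from Sidekiq's default retry schedule, which (jitter aside) grows roughly with the fourth power of the retry count; a quick sketch of the arithmetic:

```ruby
# Rough shape of Sidekiq's default retry backoff (jitter omitted):
# the delay grows with the fourth power of the retry count.
def approximate_retry_delay(count)
  (count ** 4) + 15 # seconds
end

(1..15).each do |count|
  puts format("retry %2d -> ~%.1f hours", count, approximate_retry_delay(count) / 3600.0)
end
```

By the tenth retry a job is scheduled nearly three hours out, and with Sidekiq's default of 25 retries a job that keeps failing on docker timeouts can stay parked for days.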

henare commented 7 years ago

> That happens and the jobs end up way back in the retry queue. Sometimes they get backed off to hours before the next retry. I'm not sure why they don't become active when they hit the front of the queue.

I think this is the issue I've described in #1123.