Prioritise manual runs - Githubissues

tmtmtmtm commented 8 years ago

(Slightly separate to #923, but triggered by it)

AIUI:

- there are a limited number of slots for scrapers that can be running simultaneously
- some scrapers can take a very long time to run, thus reducing the number of available slots
- scrapers can be triggered manually, or on a scheduled run
- these are not distinguished in terms of prioritisation
- scheduled scrapers are no more fine-grained than "every day", with no promises as to when a run might actually happen.

My workflow (as someone adding at least one, and often two or three scrapers per day) is that the first time I run a scraper, I'm much more interested in getting the results in as close to real-time as possible. Subsequent runs will flow into a process that's at least semi-automated at my end, but the first time through I'll need to check the output much more closely, potentially correct problems in the scraper, etc. Other people's workflow will likely differ, but I suspect that in many cases people will prefer to see the results from a manual run as quickly as possible.

Coming up with the optimal queuing strategy probably veers into extreme complexity rather quickly, so there's certainly a lot to be said for not even trying. But I would suggest that a very simple tweak of prioritising manual runs over scheduled ones would make it much less likely that people will have to wait around for those to finish, with negligible effect on scheduled scrapers, which can run at any time anyway.
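
To make the suggestion concrete, here is a minimal sketch of the kind of tweak I mean. It is not how morph's queue is actually implemented; `RunQueue`, `QueuedRun`, `enqueue` and `next_run` are hypothetical names made up for illustration. The only idea is that manual runs sort ahead of scheduled ones, and runs of the same kind stay in the order they were queued.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical priority classes: a lower value is served first.
MANUAL, SCHEDULED = 0, 1


@dataclass(order=True)
class QueuedRun:
    priority: int                        # MANUAL or SCHEDULED
    seq: int                             # arrival order, breaks ties (FIFO)
    scraper: str = field(compare=False)  # not used for ordering


class RunQueue:
    """Illustrative queue: manual runs first, FIFO within each class."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def enqueue(self, scraper, manual=False):
        priority = MANUAL if manual else SCHEDULED
        heapq.heappush(self._heap, QueuedRun(priority, next(self._counter), scraper))

    def next_run(self):
        # Returns the name of the next scraper to run, or None if nothing is waiting.
        return heapq.heappop(self._heap).scraper if self._heap else None


queue = RunQueue()
queue.enqueue("daily-a")                    # scheduled
queue.enqueue("daily-b")                    # scheduled
queue.enqueue("new-scraper", manual=True)   # manual, queued last
order = [queue.next_run() for _ in range(3)]
print(order)  # ['new-scraper', 'daily-a', 'daily-b']
```

The manual run jumps ahead even though it was queued last, while the two scheduled runs keep their relative order. A sustained stream of manual runs could in principle starve scheduled ones, which a real implementation would probably want to guard against.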

tmtmtmtm commented 8 years ago

Ah, just noticed that @henare added exactly the same issue (#924) whilst I was writing this! :)

henare commented 8 years ago

Thanks for taking the time to submit this @tmtmtmtm.

One thought I've had since last night was that this hasn't been a problem so far. So I think that means we should be very careful about implementing anything "smart" ;) I'm very keen to see if #923 recurs in the next day or so and if we can simply fix it by increasing the queue size.

tmtmtmtm commented 8 years ago

That depends on what you mean by 'a problem' :)

There haven't been a huge number of one hour plus waits other than in the last few days, but 5-10 minute waits aren't uncommon (both historically, and today). Those have never been bad enough to justify raising an issue about it, but it's just long enough to break the flow (especially when the scraper itself is only running against a single page, and so only takes about a minute to complete once it starts).

henare commented 8 years ago

> ...but 5-10 minute waits aren't uncommon...

Ahh, I had no idea - very useful information, thanks!

tmtmtmtm commented 8 years ago

This evening we're back to 1-3 hour+ waits for runs again…

henare commented 8 years ago

Thanks for the notification Tony.

There are almost 200 runs waiting right now. The actual queue is filled with 3 jobs - almost all are duplicates of just 2 jobs.

I'm out of band-aid solutions. Until we can fix #926 this will just keep happening.

henare commented 8 years ago

I've manually Pressed The Buttons. The queue is clearing now.

tmtmtmtm commented 8 years ago

Looks like this is happening again today…

tmtmtmtm commented 8 years ago

Actually, looking at https://morph.io/scrapers/running takes me to https://morph.io/detentiondb/guardian-alerts, which is complaining Error: SQLITE_FULL: database or disk is full

henare commented 8 years ago

@tmtmtmtm yep, we're out of disk space again and have hit a full disk a few times in the last few weeks. We need to fix #407.

We've got a huge backlog since this has been affecting the queue for the last 10 hours or so. It will take several hours to clear.

openaustralia / morph

Prioritise manual runs #925