Open tmtmtmtm opened 8 years ago
Ah, just noticed that @henare added exactly the same issue (#924) whilst I was writing this! :)
Thanks for taking the time to submit this @tmtmtmtm.
One thought I've had since last night was that this hasn't been a problem so far. So I think that means we should be very careful about implementing anything "smart" ;) I'm very keen to see if #923 recurs in the next day or so and if we can simply fix it by increasing the queue size.
That depends on what you mean by 'a problem' :)
There haven't been a huge number of one-hour-plus waits other than in the last few days, but 5-10 minute waits aren't uncommon (both historically and today). Those have never been bad enough to justify raising an issue about it, but they're just long enough to break the flow (especially when the scraper itself is only running against a single page, and so only takes about a minute to complete once it starts).
...but 5-10 minute waits aren't uncommon...
Ahh, I had no idea - very useful information, thanks!
This evening we're back to 1-3 hour+ waits for runs again…
Thanks for the notification Tony.
There are almost 200 runs waiting right now. The actual queue is filled with 3 jobs - almost all are duplicates of just 2 jobs.
I'm out of band-aid solutions. Until we can fix #926 this will just keep happening.
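One way to stop duplicates clogging the queue (as a sketch only, not how morph.io is actually implemented) is to refuse to enqueue a run for a scraper that already has one waiting. The `JobQueue` class and its method names below are hypothetical, purely for illustration:

```python
from collections import deque

class JobQueue:
    """Toy FIFO queue that keeps at most one waiting job per scraper."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()  # scrapers that already have a job waiting

    def enqueue(self, scraper_name):
        # Drop the duplicate: one waiting run per scraper is enough,
        # since a single run picks up all the work anyway.
        if scraper_name in self._pending:
            return False
        self._queue.append(scraper_name)
        self._pending.add(scraper_name)
        return True

    def dequeue(self):
        scraper_name = self._queue.popleft()
        self._pending.discard(scraper_name)
        return scraper_name
```

With this, 200 queued duplicates of 2 scrapers collapse to just 2 waiting jobs.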
I've manually Pressed The Buttons. The queue is clearing now.
Looks like this is happening again today…
Actually, looking at https://morph.io/scrapers/running takes me to https://morph.io/detentiondb/guardian-alerts, which is complaining `Error: SQLITE_FULL: database or disk is full`
@tmtmtmtm yep, we're out of disk space again and have hit a full disk a few times in the last few weeks. We need to fix #407.
We've got a huge backlog since this has been affecting the queue for the last 10 hours or so. It will take several hours to clear.
(Slightly separate to #923, but triggered by it)
As I understand it:
My workflow (as someone adding at least one, and often two or three, scrapers per day) is that the first time I run a scraper, I'm much more interested in getting the results in as close to real time as possible. Subsequent runs will flow into a process that's at least semi-automated at my end, but the first time through I'll need to check the output much more closely, potentially correct problems in the scraper, and so on. Other people's workflows will likely differ, but I suspect that in many cases people will prefer to see the results from a manual run as quickly as possible.
Coming up with the optimal queuing strategy probably veers into extreme complexity rather quickly, so there's certainly a lot to be said for not even trying. But I would suggest that a very simple tweak of prioritising manual runs over scheduled ones would make it much less likely that people have to wait around for those to finish, with negligible effect on scheduled scrapers, which can run at any time anyway.