open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Allow people to resume scrapers that have experienced intermittent errors #79

odscjames closed this issue 4 years ago

odscjames commented 6 years ago

Old title: Armenia download died after downloading 2316 files with a generic ["Connection error"] error.

Found during #100 work

The URL - https://armeps.am/ocds/release?limit=100&offset=1519400579553 - opens fine in a browser, so maybe it was just a temporary glitch?

odscjames commented 6 years ago

I manually edited the SQLite database and cleared the error and fetch_finished_datetime. I then ran it again, and it picked up fine and is now on file 2329. So it was just an intermittent error.
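For concreteness, a sketch of that manual fix using Python's sqlite3 module. The database path and the table name (file_status) are assumptions for illustration; only the error and fetch_finished_datetime columns come from the comment above, and the metadata database's actual schema isn't shown in this thread.

```python
import sqlite3

# Hypothetical sketch of the manual fix described above; table and path
# names are assumptions, not the project's actual schema.
conn = sqlite3.connect('metadata.sqlite')
conn.execute(
    "UPDATE file_status "
    "SET error = NULL, fetch_finished_datetime = NULL "
    "WHERE error IS NOT NULL"
)
conn.commit()
conn.close()
```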

So this issue could become: how can we stop intermittent errors from killing a download entirely?

We already have several retries, but maybe we need more, or more sleep between retries? Or maybe, after an operator has examined the problem, a tool so they can easily say "I think that was intermittent; just clear all fetch errors and try again"?
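As a sketch of the "more sleep between retries" idea, here is a minimal retry loop with exponential backoff using the requests library; the retry count and backoff schedule are illustrative values, not the project's actual settings.

```python
import time

import requests

def fetch_with_retries(url, retries=5, backoff=2):
    """Fetch a URL, sleeping progressively longer between failed attempts."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(backoff ** attempt)  # sleep 1s, 2s, 4s, 8s, ...
```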

robredpath commented 6 years ago

For relatively short-running scrapers, it's probably fine to just start over. For longer-running scrapers, I can see this being a really useful thing to have!

tian2992 commented 6 years ago

@yolile also commented about this:

I think a (different) approach would require the digestor to extract one item at a time, each of which would then be processed (inserted, validated, etc.). That seems like a significant refactor, but it would enable even more parallelism, while also adding complexity. It would merit its own issue, of course, if it were to be considered.

robredpath commented 6 years ago

I think the priority here is making sure that we're able to re-run scrapers and have them just 'fill in the gaps' if there was some kind of issue with the system that they're talking to. From the analyst's perspective, getting the data loaded and being confident that it's a true reflection of what the API/other system is serving are the important things. Not having to babysit too much is also important.

If we can make them faster (eg through more parallelism) for a relatively low cost at the same time, I'm happy for that to be in scope, but not if it's a large job.

odscjames commented 6 years ago

For this, I'm seeing a new sub command.

It would take a run that is currently in a fetch stage with some errors, and clear those errors from the meta DB.

The operator would run the new command to clear errors.

They would then run the normal run command again to try again.
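A hedged sketch of how that proposal might be wired up as an argparse subcommand; the command name ("clear-fetch-errors"), the database argument, and the table schema are invented for illustration, not the project's actual interface.

```python
import argparse
import sqlite3

def clear_fetch_errors(args):
    # The same kind of UPDATE as the manual fix sketched earlier in the
    # thread; the table and column names remain assumptions.
    conn = sqlite3.connect(args.database)
    conn.execute(
        "UPDATE file_status "
        "SET error = NULL, fetch_finished_datetime = NULL "
        "WHERE error IS NOT NULL"
    )
    conn.commit()
    conn.close()

parser = argparse.ArgumentParser(prog='kingfisher')
subparsers = parser.add_subparsers(dest='command', required=True)
clear = subparsers.add_parser(
    'clear-fetch-errors',
    help='clear fetch errors so the next run retries them')
clear.add_argument('database', help='path to the metadata SQLite database')
clear.set_defaults(func=clear_fetch_errors)

args = parser.parse_args()
args.func(args)
```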

P.S. Parallelism is a different issue.

jpmckinney commented 6 years ago

Why a new subcommand and not an option on the run subcommand?

For the use case of, "the run is broken – please just let me reset and start over", I can see the sense of a new subcommand (though, it can also just be an option on the run subcommand). This is separate from other use cases like, "I limited the initial run to 1,000 pages of API results, and I now want to resume from page 1001" or "I had to stop the run for some reason (closed my laptop, lost internet connection, etc.), and I now want to resume where it left off." In those cases, it seems better to add options to the run subcommand, since resuming a download is not fundamentally different from running a download (from scratch) in a user's conceptual framework.

The title of this issue is about 'resuming', but I think it's more accurately about 'starting over', in which case a separate 'reset' (or similar) subcommand might be a solution. We should probably rename this issue and split out 'resuming' into a separate issue (I haven't checked if one already exists).

jpmckinney commented 5 years ago

Is this possible in Scrapy?

jpmckinney commented 5 years ago

@odscjames @yolile With Scrapy, is it possible to resume scrapers that have experienced intermittent errors, or is this still an issue that needs to be resolved?

odscjames commented 5 years ago

The use case here is runs that have errors, where we try again after a day or so to see if those errors have cleared.

We would need to look into how that would work with Scrapy - this isn't something that works currently.

Some of our spiders already do things manually around this - https://github.com/open-contracting/kingfisher-scrape/blob/master/kingfisher_scrapy/spiders/colombia.py#L32 - but we should look into this further.
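For illustration, a hedged sketch of that kind of manual resume support: a Scrapy spider that accepts a starting page as a command-line argument, so an operator can pick up a crawl partway through. The spider name, URL, and page parameter are invented for the example; this is not the Colombia spider's actual code.

```python
import scrapy

class ExampleResumableSpider(scrapy.Spider):
    name = 'example_resumable'

    def __init__(self, page=1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page = int(page)

    def start_requests(self):
        # Resume partway through with: scrapy crawl example_resumable -a page=24
        yield scrapy.Request(f'https://example.com/api/releases?page={self.page}')

    def parse(self, response):
        yield {'body': response.body}
        if response.body:  # naive stop condition, for the sketch only
            self.page += 1
            yield scrapy.Request(f'https://example.com/api/releases?page={self.page}')
```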

jpmckinney commented 4 years ago

Assuming the source has a publication pattern that allows resuming:

If the source takes a long time to collect or contains a lot of data, another use case for resuming is to get any new releases since the last collection. This is relevant to a (non-helpdesk) data analyst who is working with the same source(s) over long periods of time and/or at frequent intervals (e.g. daily), and who doesn't want to, or can't, store multiple copies of the same data.

This makes a critical assumption: old releases aren't changed or deleted (this is required by the standard, but a source can be nonconformant).

If a second assumption holds – that only new releases (i.e. those with a more recent date) are added over time – then instead of re-compiling all releases, the previously compiled releases can simply be updated with the new releases (this can be done by putting the compiled release as the first release in the list before merging).
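A minimal sketch of that shortcut, assuming the ocdsmerge library; the exact API used here (Merger().create_compiled_release) should be checked against the library's documentation.

```python
import ocdsmerge

merger = ocdsmerge.Merger()

def update_compiled_release(previous_compiled, new_releases):
    # Putting the previously compiled release first lets the merge routine
    # layer only the new releases on top of it, instead of replaying the
    # entire release history from scratch.
    return merger.create_compiled_release([previous_compiled] + new_releases)
```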

jpmckinney commented 4 years ago

Some relevant Scrapy docs: https://docs.scrapy.org/en/latest/topics/jobs.html
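From those docs: Scrapy can persist a crawl's scheduled requests and spider state to disk via the JOBDIR setting (e.g. scrapy crawl somespider -s JOBDIR=crawls/somespider-1), and running the same command again resumes the crawl where it stopped.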

Can also look into how the following maintain state:

jpmckinney commented 4 years ago

Split into two issues above.