open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Add duplicate-checking pipeline #1055

Closed: jpmckinney closed this 5 months ago

jpmckinney commented 7 months ago

Inspired by #1054

Similar to the Sample pipeline, we could force the spider to stop once it reaches a threshold of, say, 5 duplicates of the same item. The Kingfisher extension should then check the close_spider reason and leave the collection open if the reason is 'duplicate'. That way, the data registry will not complete the job and auto-publish a bad crawl.

Sample code: https://docs.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter
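
A minimal sketch of what such a pipeline could look like, adapting the duplicates filter linked above (the threshold of 5, the `file_name` item key, and the 'duplicate' close reason are assumptions from this discussion, not a settled design):

```python
from scrapy.exceptions import DropItem


class DuplicateThresholdPipeline:
    """Drop duplicate items, and stop the crawl once any item repeats 5 times."""

    def __init__(self):
        self.seen = set()
        self.counts = {}

    def process_item(self, item, spider):
        key = item['file_name']  # assumed key: the Validate pipeline also checks file names
        if key in self.seen:
            self.counts[key] = self.counts.get(key, 0) + 1
            if self.counts[key] >= 5:
                # Close with a distinct reason, so that the Kingfisher extension
                # can tell this apart from a normal 'finished' crawl.
                spider.crawler.engine.close_spider(spider, reason='duplicate')
            raise DropItem(f'Duplicate item: {key}')
        self.seen.add(key)
        return item
```

On the extension side, the close reason arrives in the spider_closed signal handler. A hedged sketch (the real Kingfisher Process API calls are omitted; only the reason check is shown):

```python
import logging

from scrapy import signals

logger = logging.getLogger(__name__)


class KeepCollectionOpenExtension:
    @classmethod
    def from_crawler(cls, crawler):
        extension = cls()
        crawler.signals.connect(extension.spider_closed, signal=signals.spider_closed)
        return extension

    def spider_closed(self, spider, reason):
        if reason == 'duplicate':
            # Leave the collection open, so the data registry does not
            # complete the job and auto-publish the bad crawl.
            logger.warning('Crawl closed with reason %r; leaving collection open', reason)
            return
        # ... otherwise, tell Kingfisher Process to close the collection ...
```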

jpmckinney commented 5 months ago

We actually already have duplicate checking (on filename) in the Validate pipeline.

As in #1058, it may be hard to set a threshold, since:

  1. The number of files downloaded varies widely (from 1 to millions), so the threshold can't be a fixed number.
  2. We don't always know the total number of files that will be downloaded, so it would be hard to set a percentage threshold. We could do a sort of rolling percentage, but that can still lead to cases where an early share of requests fails while the majority at the end succeed, etc. (a rolling-window sketch follows this list).
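
For illustration, a rolling duplicate-rate check might look like the sketch below (the window size and threshold are arbitrary assumptions). It also exhibits exactly the weakness described in point 2: an early burst of duplicates trips the threshold regardless of how the rest of the crawl goes.

```python
from collections import deque


class RollingDuplicateRate:
    """Track the share of duplicates over the last `window` items."""

    def __init__(self, window=1000, threshold=0.1):
        self.window = deque(maxlen=window)  # stores True/False flags per item
        self.threshold = threshold

    def add(self, is_duplicate):
        self.window.append(is_duplicate)

    def exceeded(self):
        # Don't judge the rate until the window has enough samples.
        if len(self.window) < self.window.maxlen:
            return False
        return sum(self.window) / len(self.window) >= self.threshold
```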

Since we haven't encountered this issue often, and since this would only be an optimization over reading the log file of the full collection (#531), I will close.

Also, in Collect, we try not to parse response content where possible, so we aren't currently considering a duplicate checker at the data level (package, release or record).