open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Add duplicate-checking pipeline #1055

Closed: jpmckinney closed this 5 months ago

jpmckinney commented 7 months ago

Inspired by #1054

Similar to the Sample pipeline, we could force the spider to stop once it reaches a threshold of, say, 5 duplicates of the same item. The Kingfisher extension should then check the close_spider reason and leave the collection open if the reason is 'duplicate'. That way, the data registry will not complete the job and auto-publish a bad crawl.

Sample code: https://docs.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter
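
A minimal sketch of what such a pipeline could look like, adapting the duplicates filter linked above (the threshold of 5, the `file_name` item key, and the 'duplicate' close reason are assumptions from this discussion, not a settled design):

```python
from scrapy.exceptions import DropItem


class DuplicateThresholdPipeline:
    """Drop duplicate items, and stop the crawl once any item repeats 5 times."""

    def __init__(self):
        self.seen = set()
        self.counts = {}

    def process_item(self, item, spider):
        key = item['file_name']  # assumed key: the Validate pipeline also checks file names
        if key in self.seen:
            self.counts[key] = self.counts.get(key, 0) + 1
            if self.counts[key] >= 5:
                # Close with a distinct reason, so that the Kingfisher extension
                # can tell this apart from a normal 'finished' crawl.
                spider.crawler.engine.close_spider(spider, reason='duplicate')
            raise DropItem(f'Duplicate item: {key}')
        self.seen.add(key)
        return item
```

On the extension side, the close reason arrives in the spider_closed signal handler. A hedged sketch (the real Kingfisher Process API calls are omitted; only the reason check is shown):

```python
import logging

from scrapy import signals

logger = logging.getLogger(__name__)


class KeepCollectionOpenExtension:
    @classmethod
    def from_crawler(cls, crawler):
        extension = cls()
        crawler.signals.connect(extension.spider_closed, signal=signals.spider_closed)
        return extension

    def spider_closed(self, spider, reason):
        if reason == 'duplicate':
            # Leave the collection open, so the data registry does not
            # complete the job and auto-publish the bad crawl.
            logger.warning('Crawl closed with reason %r; leaving collection open', reason)
            return
        # ... otherwise, tell Kingfisher Process to close the collection ...
```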

jpmckinney commented 5 months ago

We actually already have duplicate checking (on filename) in the Validate pipeline.

As in #1058, it may be hard to set a threshold, since:

  1. The number of files downloaded varies widely (from 1 to millions), so the threshold can't be a fixed number.
  2. We don't always know the total number of files that will be downloaded, so it would be hard to set a percentage threshold. We could do a sort of rolling percentage, but that can still lead to cases where an early share of requests fails while the majority at the end succeed, etc. (a rolling-window sketch follows this list).
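
For illustration, a rolling duplicate-rate check might look like the sketch below (the window size and threshold are arbitrary assumptions). It also exhibits exactly the weakness described in point 2: an early burst of duplicates trips the threshold regardless of how the rest of the crawl goes.

```python
from collections import deque


class RollingDuplicateRate:
    """Track the share of duplicates over the last `window` items."""

    def __init__(self, window=1000, threshold=0.1):
        self.window = deque(maxlen=window)  # stores True/False flags per item
        self.threshold = threshold

    def add(self, is_duplicate):
        self.window.append(is_duplicate)

    def exceeded(self):
        # Don't judge the rate until the window has enough samples.
        if len(self.window) < self.window.maxlen:
            return False
        return sum(self.window) / len(self.window) >= self.threshold
```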

Since we haven't encountered this issue often, and since this would only be an optimization over reading the log file of the full collection (#531), I will close.

Also, in Collect, we try not to parse response content where possible, so we aren't currently considering a duplicate checker at the data level (package, release or record).