open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License
13 stars 12 forks source link

1058 filter invalid json files #1066

Closed yolile closed 5 months ago

yolile commented 5 months ago

closes #1058 closes #1036 closes #964 closes #963 closes #957 closes #886 closes #876 closes #645

jpmckinney commented 5 months ago

Did you do crawls for the individual spiders? I would be interested to know if the invalid JSON are a small or large proportion. (You can do a sample of 100 or something.)

jpmckinney commented 5 months ago

From #964

With https://github.com/open-contracting/kingfisher-collect/commit/f8d47ca45b3af4b020d1c1f8a235cb4c98259c55 this problem didn't happen again. However, as the publisher is working on a better approach to generate the files, I will keep this issue open so we can remove the download limitations from the spider when the new solution is implemented.

I assume you are okay with closing the issue and leaving those limits for now.

yolile commented 5 months ago

Did you do crawls for the individual spiders?

I tested it with two Nigerian states, one dropped 1 file, and the other 2. I could try with all the affected spiders if you want, or do it in the registry directly

yolile commented 5 months ago

This approach fails when the data is compressed, as the middleware runs before the data is decompressed, e.g. with croatia.

jpmckinney commented 5 months ago

I tested it with two Nigerian states, one dropped 1 file, and the other 2. I could try with all the affected spiders if you want, or do it in the registry directly

Sure, please check the registry logs when the time comes.

This approach fails when the data is compressed, as the middleware runs before the data is decompressed, e.g. with croatia.

Aha, for Croatia we had ReadDataMiddleware read the data (if it survived that long without being read). Fixed in d6e8a7fa (similar to AddPackageMiddleware)

jpmckinney commented 5 months ago

We can add ValidateJSONMiddleware to the test_bytes_or_file test to cover this.

yolile commented 5 months ago

Aha, for Croatia we had ReadDataMiddleware read the data

And also replace item.data with data.read(), otherwise, we get empty JSON files.

I tested locally the spiders that are small and faster enough with these results:

nigeria_anambra_state: 23/169 (13,6% missing) (one ocid per file) nigeria_osun_state: 2/164 (1% missing) (one ocid per file) nigeria_enugu_state: 1/7 (14% missing) (one year per file, 2020 invalid, but 2023 and 2024 without releases) croatia: 9/65 (13,8% missing) (one month per file) costa_rica_poder_judicial_releases: 1306/11991 (10% missing) (one ocid per file) pakistan_ppra_api is down panama_dgcp_records: too slow/big

yolile commented 5 months ago

@jpmckinney I've included the test, do you think we are ok to merge this now?