Closed yolile closed 5 months ago
Did you do crawls for the individual spiders? I would be interested to know if the invalid JSON are a small or large proportion. (You can do a sample of 100 or something.)
From #964
With https://github.com/open-contracting/kingfisher-collect/commit/f8d47ca45b3af4b020d1c1f8a235cb4c98259c55 this problem didn't happen again. However, as the publisher is working on a better approach to generate the files, I will keep this issue open so we can remove the download limitations from the spider when the new solution is implemented.
I assume you are okay with closing the issue and leaving those limits for now.
Did you do crawls for the individual spiders?
I tested it with two Nigerian states, one dropped 1 file, and the other 2. I could try with all the affected spiders if you want, or do it in the registry directly
This approach fails when the data is compressed, as the middleware runs before the data is decompressed, e.g. with croatia
.
I tested it with two Nigerian states, one dropped 1 file, and the other 2. I could try with all the affected spiders if you want, or do it in the registry directly
Sure, please check the registry logs when the time comes.
This approach fails when the data is compressed, as the middleware runs before the data is decompressed, e.g. with croatia.
Aha, for Croatia we had ReadDataMiddleware
read the data (if it survived that long without being read). Fixed in d6e8a7fa (similar to AddPackageMiddleware)
We can add ValidateJSONMiddleware to the test_bytes_or_file
test to cover this.
Aha, for Croatia we had ReadDataMiddleware read the data
And also replace item.data
with data.read()
, otherwise, we get empty JSON files.
I tested locally the spiders that are small and faster enough with these results:
nigeria_anambra_state: 23/169 (13,6% missing) (one ocid per file) nigeria_osun_state: 2/164 (1% missing) (one ocid per file) nigeria_enugu_state: 1/7 (14% missing) (one year per file, 2020 invalid, but 2023 and 2024 without releases) croatia: 9/65 (13,8% missing) (one month per file) costa_rica_poder_judicial_releases: 1306/11991 (10% missing) (one ocid per file) pakistan_ppra_api is down panama_dgcp_records: too slow/big
@jpmckinney I've included the test, do you think we are ok to merge this now?
closes #1058 closes #1036 closes #964 closes #963 closes #957 closes #886 closes #876 closes #645