open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Filter out invalid and incomplete JSON #1058

Closed jpmckinney closed 5 months ago

jpmckinney commented 7 months ago

If we can filter these out, then the registry can include more publications in which this issue affects only a small subset of the available files.

Typically, filtering is done in the item pipeline. However, spider middlewares run prior to the item pipeline, and we parse the JSON in these middlewares. (In some cases, we parse the JSON in the spider, but only when we have to in order to create URLs.)

Maybe we mark the item with a first middleware, and the other middlewares are skipped if that mark is present. The pipeline could then drop these marked items, and log the total.
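The mark-and-drop idea could be sketched roughly as below. All names here are hypothetical, and plain generators stand in for Scrapy's middleware and pipeline interfaces; in Scrapy itself, the pipeline would raise scrapy.exceptions.DropItem, which increments the item_dropped_count statistic automatically.

```python
import json


def mark_invalid_json(items):
    # Hypothetical first middleware: mark items whose raw bytes fail to parse.
    # Later middlewares would check this mark and skip the item.
    for item in items:
        data = item.get('data')
        if isinstance(data, bytes):
            try:
                json.loads(data)
            except json.JSONDecodeError:
                item['invalid_json'] = True
        yield item


def drop_marked_items(items, stats):
    # Hypothetical pipeline stage: drop marked items and count them.
    # In Scrapy, raising DropItem would do both.
    for item in items:
        if item.get('invalid_json'):
            stats['item_dropped_count'] = stats.get('item_dropped_count', 0) + 1
            continue
        yield item
```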

Related to #1055, we may want to set a threshold, such that the spider closes with a different reason if the number of invalid JSON files exceeds it.

yolile commented 5 months ago

I'm wondering whether we want this middleware to run after ConcatenatedJSONMiddleware and LineDelimitedMiddleware, in case only one item in a file is invalid

jpmckinney commented 5 months ago

Yes, that sounds best.

yolile commented 5 months ago

Related to https://github.com/open-contracting/kingfisher-collect/issues/1055, we may want to set a threshold, such that the spider closes with a different reason if the number of invalid JSON files exceeds it.

Hmm, I guess it would be hard to set an absolute number. Ideally, we could use a general percentage, e.g. 50% for all spiders, but we can only calculate that once the spider is complete (and already closed). Maybe the data registry should check the item_dropped_count and item_scraped_count statistics, and, if the dropped count is more than 50%, not continue processing the collection.
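A sketch of that registry-side check (the function name and threshold parameter are hypothetical; the stats keys are the item_dropped_count and item_scraped_count statistics that Scrapy reports at the end of a crawl):

```python
def should_continue_processing(stats, max_dropped_share=0.5):
    # Hypothetical data-registry check, run after the spider has closed:
    # skip the collection if too large a share of items were dropped.
    dropped = stats.get('item_dropped_count', 0)
    scraped = stats.get('item_scraped_count', 0)
    total = dropped + scraped
    if not total:
        return False  # nothing was collected at all
    return dropped / total <= max_dropped_share
```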

jpmckinney commented 5 months ago

Yeah, we can do #531 instead (and the related issues in the data registry).

jpmckinney commented 5 months ago

Some publications have invalid JSON, but their spiders don't require any deserialization (the JSON stays as bytes). For these, we don't need to filter them out in Kingfisher Collect, as Kingfisher Process can handle invalid JSON. I think the condition to stay as bytes is:

not spider.concatenated_json and (see next comment) not spider.root_path and item.data_type not in ('release', 'record') and not getattr(spider, 'resize_package', False)

So this new middleware can just yield the item and continue if the data is already deserialized (isinstance(data, (dict, list))) or if the above condition holds.
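As a rough sketch, that condition could be packaged into a helper (the function name and the attribute defaults are assumptions, and the demo spider is hypothetical):

```python
from types import SimpleNamespace


def can_stay_bytes(spider, item):
    # Hypothetical helper encoding the condition above: no middleware needs to
    # deserialize the data, so even invalid JSON can pass through to
    # Kingfisher Process as bytes.
    return (
        not getattr(spider, 'concatenated_json', False)
        and not getattr(spider, 'root_path', None)
        and item['data_type'] not in ('release', 'record')
        and not getattr(spider, 'resize_package', False)
    )


# Hypothetical example of a spider with no deserialization needs:
demo_spider = SimpleNamespace(concatenated_json=False, root_path='', resize_package=False)
```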

jpmckinney commented 5 months ago

Ah, going back to:

I wondering if we want this middleware to be run after ConcatenatedJSONMiddleware and LineDelimitedMiddleware, in case only one item of a file is invalid

ConcatenatedJSONMiddleware can yield items up to the invalid one, but I don't think ijson can continue past some invalid JSON text (in some scenarios, we can probably find where good data starts again, but I don't think there's a universal solution). So, for this issue, maybe we handle the exception in ConcatenatedJSONMiddleware (or, maybe we open a new issue, and for now require that concatenated JSON must be valid).
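To illustrate the "yield items up to the invalid one" behavior with the standard library (the real ConcatenatedJSONMiddleware streams with ijson, which raises once it hits invalid text; this stdlib sketch with json.JSONDecoder.raw_decode is only an illustration):

```python
import json


def iter_concatenated(data):
    # Yield each JSON value in a concatenated-JSON string, stopping at the
    # first invalid segment instead of crashing. Any good data after the bad
    # segment is lost, matching the "no universal solution" caveat above.
    decoder = json.JSONDecoder()
    index = 0
    while index < len(data):
        while index < len(data) and data[index].isspace():
            index += 1  # skip whitespace between values
        if index >= len(data):
            break
        try:
            value, index = decoder.raw_decode(data, index)
        except json.JSONDecodeError:
            break  # here the middleware could log, or close the spider
        yield value
```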

jpmckinney commented 5 months ago

Re: my last two comments, the behavior is opt-in in #1066, which is also fine, as it spares some deserialization and reserialization in cases where e.g. we set root_path but we know the JSON is always valid. This is a good approach, as invalid JSON should be rare.