Closed jpmckinney closed 5 months ago
I'm wondering if we want this middleware to be run after `ConcatenatedJSONMiddleware` and `LineDelimitedMiddleware`, in case only one item of a file is invalid.
Yes, that sounds best.
Related to https://github.com/open-contracting/kingfisher-collect/issues/1055, we may want to set a threshold such that the spider closes with a different reason if the threshold for invalid JSON files is exceeded.
Hmm, I guess it would be hard to set an absolute number. Ideally, we could have a general percentage, e.g. 50% for all spiders, but we can only calculate that once the spider is complete (and already closed). Maybe the data registry should check the `item_dropped_count` and `item_scraped_count` statistics and, if the dropped count is more than 50%, not continue processing the collection.
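A minimal sketch of that registry-side check, assuming the stats are available as plain integers (the function name and the 50% default threshold are illustrative, not part of the codebase):

```python
def should_continue_processing(stats, threshold=0.5):
    """Return False if too large a share of items were dropped as invalid.

    `stats` is a dict like Scrapy's spider stats, e.g.
    {'item_scraped_count': 90, 'item_dropped_count': 10}.
    """
    dropped = stats.get('item_dropped_count', 0)
    scraped = stats.get('item_scraped_count', 0)
    total = dropped + scraped
    if total == 0:
        return False  # nothing was collected, so there is nothing to process
    return dropped / total <= threshold
```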
Yeah, we can do #531 instead (and the related issues in the data registry).
Some publications have invalid JSON, but their spiders don't require any deserialization (the JSON stays as bytes). For these, we don't need to filter them out in Kingfisher Collect, as Kingfisher Process can handle invalid JSON. I think the condition to stay as bytes is:
```python
not spider.concatenated_json  # see next comment
and not spider.root_path
and item.data_type not in ('release', 'record')
and not getattr(spider, 'resize_package', False)
```
So this new middleware can just `yield item` and `continue` if the data is already deserialized (`isinstance(data, (dict, list))`) or the above condition holds.
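As a sketch, that yield-and-continue condition could look like the following (the helper name is made up; the attribute names follow the condition above, read with `getattr` defaults so spiders that don't set them pass through):

```python
def should_pass_through(spider, item, data):
    """True if the middleware can yield the item without touching the JSON.

    Either the data is already deserialized, or the spider never needs to
    deserialize it, so the JSON can stay as (possibly invalid) bytes.
    """
    if isinstance(data, (dict, list)):  # already deserialized
        return True
    return (
        not getattr(spider, 'concatenated_json', False)
        and not getattr(spider, 'root_path', None)
        and item.data_type not in ('release', 'record')
        and not getattr(spider, 'resize_package', False)
    )
```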
Ah, going back to:

> I'm wondering if we want this middleware to be run after `ConcatenatedJSONMiddleware` and `LineDelimitedMiddleware`, in case only one item of a file is invalid
`ConcatenatedJSONMiddleware` can yield items up to the invalid one, but I don't think ijson can continue past invalid JSON text (in some scenarios, we could probably find where good data starts again, but I don't think there's a universal solution). So, for this issue, maybe we handle the exception in `ConcatenatedJSONMiddleware` (or maybe we open a new issue and, for now, require that concatenated JSON be valid).
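To illustrate "yield items up to the invalid one": here is a stdlib sketch of the idea using `json.JSONDecoder.raw_decode` (the actual middleware uses ijson; this is only a stand-in to show that everything before the bad text can still be yielded):

```python
import json


def iter_concatenated(text):
    """Yield values from concatenated JSON until invalid text is hit.

    Raises json.JSONDecodeError (a ValueError) at the first position that
    cannot be decoded; everything before it has already been yielded.
    """
    decoder = json.JSONDecoder()
    index = 0
    length = len(text)
    while index < length:
        while index < length and text[index].isspace():
            index += 1  # skip whitespace between values
        if index >= length:
            return
        value, index = decoder.raw_decode(text, index)
        yield value
```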
Re: my last two comments, the behavior is opt-in in #1066, which is also fine, as it spares some deserialization and reserialization in cases where, e.g., we set `root_path` but know the JSON is always valid. This is a good approach, as invalid JSON should be rare.
If we can filter these out, then we can include more publications in the registry, in cases where this issue occurs in only a small subset of all available files.
Typically, filtering is done in the item pipeline. However, spider middlewares run prior to the item pipeline, and we parse the JSON in these middlewares. (In some cases, we parse the JSON in the spider, but only when we have to in order to create URLs.)
Maybe a first middleware marks the item, and the other middlewares skip it if that mark is present. The item pipeline could then drop the marked items and log the total.
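A sketch of that mark-and-drop split, using plain dict items and a stand-in `DropItem` exception in place of Scrapy's `scrapy.exceptions.DropItem` (class and key names are illustrative):

```python
import json


class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


class ValidateJSONMiddleware:
    """First spider middleware: mark items whose bytes are not valid JSON."""

    def mark(self, item):
        try:
            json.loads(item['data'])
        except ValueError:
            item['invalid_json'] = True  # later middlewares check this mark
        return item


class DropInvalidJSONPipeline:
    """Item pipeline: drop marked items and count them."""

    def __init__(self):
        self.dropped = 0

    def process_item(self, item):
        if item.get('invalid_json'):
            self.dropped += 1  # Scrapy would also increment item_dropped_count
            raise DropItem('invalid JSON')
        return item
```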