I think we just need to do in `parse_json_lines`:

```python
if isinstance(line, bytes):
    line = line.decode()
```

… since `ZipFile.read` returns bytes, not str. (We also call `parse_json_lines` after calling `TarFile.extractfile`, but I'm not sure if that returns bytes or str. Anyway, an if-statement seems safest.)
cc @aguilerapy
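For context, a minimal sketch of what the guarded function could look like (an illustration, not the project's actual implementation; here `f` is assumed to be any iterable of lines, whether from `open()`, `ZipFile.open()`, or `TarFile.extractfile()`):

```python
import json


def parse_json_lines(f):
    """Yield one parsed object per JSON line, accepting str or bytes lines."""
    for line in f:
        # ZipFile/TarFile file objects yield bytes; plain text files yield str.
        if isinstance(line, bytes):
            line = line.decode()
        yield json.loads(line)
```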
Also, I don't know why we set `items_dir` in `scrapyd.ini`, since this is disabled by default, and since we don't use the outputs to that folder. (Until #277 is resolved, the data in `items_dir` won't make much sense.) I made a commit (https://github.com/open-contracting/deploy/commit/9633ba64274dc724eb032504f7ceabf6aaa56381), which might also fix this issue, as it will turn off the feed export from which this exception is raised.
Confirming that disabling `items_dir` fixed the issue. Noting that I had to manually restart Scrapyd (see https://github.com/open-contracting/deploy/issues/134). I cancelled the run via the API, not realizing that sending two requests causes the spider to stop uncleanly. I'll document how to do it: https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-collect.html#collect-data-with-kingfisher-scrape
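For reference, cancelling a job through Scrapyd's `cancel.json` endpoint looks roughly like this (the host, project name, and job ID are placeholders; per the note above, send the request only once):

```python
import requests

# Cancel a running Scrapyd job. The first request asks the spider to stop
# gracefully; sending a second cancel makes it stop uncleanly.
response = requests.post(
    "http://localhost:6800/cancel.json",
    data={"project": "kingfisher", "job": "JOB_ID"},
)
print(response.json())  # e.g. {"status": "ok", "prevstate": "running"}
```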
Leaving open for code change (we should add a test).
With the new changes, this now happens for all spiders if they are deployed with Scrapyd and `items_dir` is not blank. Should we document that `items_dir` should be set to blank?
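If we do document it, the setting would look something like this in the Scrapyd config file (a sketch; section and key names follow Scrapyd's config format):

```ini
[scrapyd]
# A blank items_dir disables the on-disk item feed, avoiding the
# serialization error for items whose data is bytes.
items_dir =
```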
The problem is that the `data` key of items is bytes in most cases (though I see now that in `parse_json_lines` we decode to str). We could either: […] (if the `KingfisherFilesStore` extension isn't configured).

I now think `data` should always be bytes, since I/O operations are much faster that way. Having just one type for data will also make the code more consistent.
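To illustrate the I/O point, a hedged sketch of writing an item's `data` in binary mode (`write_data` is a hypothetical helper, not existing code); keeping bytes end-to-end skips an encode step on every write:

```python
def write_data(path, item):
    data = item["data"]
    if isinstance(data, str):  # tolerate either type at the boundary
        data = data.encode()
    # Binary mode avoids an implicit str-to-bytes encoding on write.
    with open(path, "wb") as f:
        f.write(data)
```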
We should also still make the note about items_dir, because the items don't really have any value – the extensions convert them to useful outputs. That said, we can also configure a default Scrapy exporter that can serialize bytes.
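As a sketch of that last idea, assuming Scrapy's `ScrapyJSONEncoder` (which the default JSON exporters use; the subclass name here is made up):

```python
from scrapy.utils.serialize import ScrapyJSONEncoder


class BytesFriendlyEncoder(ScrapyJSONEncoder):
    """JSON encoder that serializes bytes values instead of raising."""

    def default(self, o):
        if isinstance(o, bytes):
            return o.decode("utf-8", errors="replace")
        return super().default(o)
```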
Regarding my I/O comment, we use `json.dump`, which is documented as:

> The json module always produces str objects, not bytes objects
If we wanted faster writes, we could switch to orjson, which serializes to bytes. However, we haven't confirmed whether this is a bottleneck (generally, the network is the bottleneck).
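A quick comparison of the two (orjson is a third-party package, and the sample object is arbitrary; this just shows the type difference, not a benchmark):

```python
import json

import orjson

obj = {"ocid": "ocds-213czf-1"}

assert isinstance(json.dumps(obj), str)      # needs .encode() before a binary write
assert isinstance(orjson.dumps(obj), bytes)  # can go straight to a binary file

with open("out.json", "wb") as f:
    f.write(orjson.dumps(obj))
```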
In terms of whether to use str or bytes, I think our code is fine as-is, as it only comes up in a few places, where we can handle either alternative. I'll suggest a little cleanup in #609, however.
Seems to happen once per line, as it tries to send each line to the server.
Observed with the Cyprus scraper.