open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Have consistent Python type for items' data #371

Closed: odscjames closed this issue 3 years ago

odscjames commented 4 years ago
2020-04-27 17:50:34 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method FeedExporter.item_scraped of <scrapy.extensions.feedexport.FeedExporter object at 0x7fc2e8dbb860>>
Traceback (most recent call last):
  File "/home/ocdskfs/scrapyd/.ve/lib/python3.6/site-packages/twisted/internet/defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "/home/ocdskfs/scrapyd/.ve/lib/python3.6/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/home/ocdskfs/scrapyd/.ve/lib/python3.6/site-packages/scrapy/extensions/feedexport.py", line 268, in item_scraped
    slot.exporter.export_item(item)
  File "/home/ocdskfs/scrapyd/.ve/lib/python3.6/site-packages/scrapy/exporters.py", line 93, in export_item
    data = self.encoder.encode(itemdict) + '\n'
  File "/usr/lib/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/home/ocdskfs/scrapyd/.ve/lib/python3.6/site-packages/scrapy/utils/serialize.py", line 36, in default
    return super(ScrapyJSONEncoder, self).default(o)
  File "/usr/lib/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'bytes' is not JSON serializable

This seems to happen once per line, as each item is sent to the server.

Cyprus scraper

jpmckinney commented 4 years ago

I think we just need to do the following in parse_json_lines:

if isinstance(line, bytes):
    line = line.decode()

… since ZipFile.read returns bytes, not str. (We also call parse_json_lines after calling TarFile.extractfile, but I'm not sure if that returns bytes or str. Anyway, an if-statement seems safest.)
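
For context, a minimal sketch of how that guard could sit in parse_json_lines (the method body and item shape are simplified assumptions, not the actual spider code):

def parse_json_lines(self, f):
    # f is assumed to be an iterable of lines, which are bytes when the
    # source is an archive member (e.g. opened via ZipFile or TarFile).
    for line in f:
        if isinstance(line, bytes):
            line = line.decode()
        yield {'data': line}  # hypothetical item shape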

jpmckinney commented 4 years ago

cc @aguilerapy

jpmckinney commented 4 years ago

Also, I don't know why we set items_dir in scrapyd.ini, since it is disabled by default and we don't use the output written to that folder. (Until #277 is resolved, the data in items_dir won't make much sense.) I made a commit, https://github.com/open-contracting/deploy/commit/9633ba64274dc724eb032504f7ceabf6aaa56381, which might also fix this issue, as it turns off the feed export from which this exception is raised.

jpmckinney commented 4 years ago

Confirming that disabling items_dir fixed the issue. Noting that I had to manually restart Scrapyd (see https://github.com/open-contracting/deploy/issues/134). I cancelled the run via the API, not realizing that sending two requests causes the spider to stop uncleanly. I'll document how to do it: https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-collect.html#collect-data-with-kingfisher-scrape
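
For reference, cancelling a run is a single POST to Scrapyd's cancel.json endpoint; a minimal sketch (the project name and job ID are placeholders). As noted above, sending the request twice makes the spider stop uncleanly:

import requests

# One request asks Scrapyd to stop the job gracefully; a second request
# for the same job forces an unclean stop.
requests.post('http://localhost:6800/cancel.json',
              data={'project': 'kingfisher', 'job': 'JOB_ID'})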

Leaving open for code change (we should add a test).

yolile commented 4 years ago

With the new changes, this now happens for all spiders when they are deployed with Scrapyd and items_dir is not blank. Should we document that items_dir should be set to blank?
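
If so, the relevant change is one line in the Scrapyd configuration; a minimal sketch (the section and option names are Scrapyd's; surrounding options omitted):

[scrapyd]
# Leave items_dir blank so Scrapyd does not enable the feed export
# that raises this serialization error.
items_dir =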

jpmckinney commented 4 years ago

The problem is that the data key of items is bytes in most cases (though I see now that in parse_json_lines we decode to str). We could either:

  1. Change it so that it is always bytes (or always decoded to str)
  2. Do as you suggest, since there is no value in storing the items (they are already written as files, as long as the KingfisherFilesStore extension is configured)

jpmckinney commented 4 years ago

I now think data should always be bytes, since I/O operations are much faster that way. Having just one type for data will also make the code more consistent.

We should also still make the note about items_dir, because the items don't really have any value – the extensions convert them to useful outputs. That said, we can also configure a default Scrapy exporter that can serialize bytes.
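
A minimal sketch of that last idea (the class names are hypothetical, not existing Kingfisher Collect code); it could be registered through Scrapy's FEED_EXPORTERS setting:

from scrapy.exporters import JsonLinesItemExporter
from scrapy.utils.serialize import ScrapyJSONEncoder

class BytesSafeJSONEncoder(ScrapyJSONEncoder):
    def default(self, o):
        # Decode bytes so items whose data field is bytes can be serialized.
        if isinstance(o, bytes):
            return o.decode()
        return super().default(o)

class BytesSafeJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        super().__init__(file, **kwargs)
        # Swap in the bytes-aware encoder in place of the default one.
        self.encoder = BytesSafeJSONEncoder()

# In settings.py (module path is a placeholder):
# FEED_EXPORTERS = {'jsonlines': 'kingfisher_scrapy.exporters.BytesSafeJsonLinesItemExporter'}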

jpmckinney commented 3 years ago

Regarding my I/O comment, we use json.dump, whose documentation says:

The json module always produces str objects, not bytes objects

If we wanted faster writes, we could switch to orjson, which serializes to bytes. However, we haven't confirmed whether this is a bottleneck (generally, the network is the bottleneck).
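
For illustration, a hedged sketch of the difference (orjson is a third-party package, not currently a dependency; file names and data are placeholders):

import json

import orjson

record = {'ocid': 'ocds-213czf-000-00001'}  # placeholder OCDS-style data

with open('str.jsonl', 'w') as f:
    f.write(json.dumps(record) + '\n')     # json produces str

with open('bytes.jsonl', 'wb') as f:
    f.write(orjson.dumps(record) + b'\n')  # orjson produces bytes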

jpmckinney commented 3 years ago

In terms of whether to use str or bytes, I think our code is fine as-is, as it only comes up in a few places, where we can handle either alternative. I'll suggest a little cleanup in #609, however.