scrapinghub / python-scrapinghub

A client interface for Scrapinghub's API
https://python-scrapinghub.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

UnicodeDecodeError while fetching items #154

Closed mijamo closed 3 years ago

mijamo commented 3 years ago

It seems like I randomly get errors like this:

 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xde in position 174: invalid continuation byte

        at msgpack._cmsgpack.Unpacker._unpack (_unpacker.pyx:443)
        at msgpack._cmsgpack.Unpacker.__next__ (_unpacker.pyx:518)
        at mpdecode (/usr/local/lib/python3.7/site-packages/scrapinghub/hubstorage/serialization.py:33)
        at iter (/usr/local/lib/python3.7/site-packages/scrapinghub/client/proxy.py:115) 

This happens while iterating over the items with last_job.items.iter(). It seems to happen about 50% of the time. I scrape the same website every day and run that function; sometimes it works fine, and sometimes it raises that error. I am not sure whether this is an issue with this library or with the ScrapingHub API, but it is very problematic.

This happens on the latest version (2.3.1).

Gallaecio commented 3 years ago

Could https://github.com/scrapinghub/python-scrapinghub/issues/151 be the answer to this?

mijamo commented 3 years ago

I am using msgpack v1.0.2, so I don't think this is the issue.

Gallaecio commented 3 years ago

https://github.com/scrapinghub/python-scrapinghub/issues/121 also seems related. I would try uninstalling msgpack to see if that makes any difference.

mijamo commented 3 years ago

After checking my logs, it seems that when the error occurs I have received strings from the ScrapingHub API that are well-formed UTF-8 apart from \xde\x00\x18\xa4, \xde\x00\x16\xa4 or \xde\x00\x19\xa4. Those sequences seem to be inserted between some properties (for instance, in my case I have a description field that I get correctly, and then that sequence gets inserted before the next property starts). The weird thing is that I cannot trigger the error manually: every time I fetch the items through the command line it works, and the source data seems correct.

mijamo commented 3 years ago

After looking even deeper into the logs, it seems that those sequences are not randomly inserted. Instead, it looks like the description field is "cut" at some point in those cases, then the weird sequence is inserted, and then the data moves on to another field.

After seeing that, I suspect this might not be a problem with this library but rather with the ScrapingHub API?
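
One possible reading of those byte sequences, assuming they are msgpack markers (the traceback shows the items being decoded by msgpack): in the msgpack format, 0xde opens a "map 16" whose entry count follows as a 2-byte big-endian integer, and bytes 0xa0 to 0xbf open a short string ("fixstr") whose length is in the low 5 bits. Under that assumption, the sequence would be the header of a new item starting in the middle of a truncated string, which matches the "cut" description field. The helper name below is hypothetical and is only a sketch for inspecting such bytes.

```python
import struct


def describe_msgpack_header(data: bytes) -> str:
    """Describe the leading msgpack marker of `data` (map16 only, for illustration)."""
    marker = data[0]
    if marker == 0xDE:  # map 16: the next two bytes are a big-endian entry count
        (count,) = struct.unpack(">H", data[1:3])
        description = f"map16 with {count} entries"
        rest = data[3:]
        if rest and 0xA0 <= rest[0] <= 0xBF:  # fixstr: length is in the low 5 bits
            description += f", first key is a fixstr of length {rest[0] & 0x1F}"
        return description
    return f"unrecognized marker 0x{marker:02x}"


# The three sequences reported in the logs above.
for seq in (b"\xde\x00\x18\xa4", b"\xde\x00\x16\xa4", b"\xde\x00\x19\xa4"):
    print(seq, "->", describe_msgpack_header(seq))

# Read as UTF-8 text instead, 0xde is the lead byte of a two-byte character,
# and 0x00 is not a valid continuation byte, so decoding fails.
try:
    b"\xde\x00\x18\xa4".decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

If this interpretation is right, the decoder is still inside a string when the next item's header arrives, so it tries to decode the header bytes as text and raises the error shown in the traceback.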

Gallaecio commented 3 years ago

I think this is https://github.com/scrapinghub/python-scrapinghub/issues/121. Using iter for a long time can be an issue. #121 explains it better, including how to work around the issue.
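
This thread does not reproduce the workaround from #121, so the following is only a generic sketch of one way to avoid holding a single long-lived stream open: read the data in bounded pages and finish each request before processing its results. The names iter_in_chunks and fetch_page are hypothetical and not part of python-scrapinghub; fetch_page stands in for whatever call returns a fully materialized list of at most `count` items starting at `offset`.

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_in_chunks(
    fetch_page: Callable[[int, int], List[T]],
    chunk_size: int = 1000,
) -> Iterator[T]:
    """Yield items page by page; each page is fully fetched before it is consumed."""
    offset = 0
    while True:
        page = fetch_page(offset, chunk_size)  # one short, bounded request
        yield from page
        if len(page) < chunk_size:  # a short (or empty) page means no more data
            return
        offset += len(page)


# Stand-in data source for demonstration only.
data = list(range(2500))


def fake_fetch(offset: int, count: int) -> List[int]:
    return data[offset:offset + count]


items = list(iter_in_chunks(fake_fetch, chunk_size=1000))
```

With the library itself, the documented items.list_iter(chunksize=...) method serves a similar purpose by yielding lists of items rather than a single long-running iterator; whether that is the exact fix described in #121 is an assumption here.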

mijamo commented 3 years ago

Thank you, I will try that solution and close this issue if it fixes the problem. It might take a few days, though, since as I mentioned the error doesn't happen every day.

mijamo commented 3 years ago

It does seem to have fixed the issue, thank you for the help.

This might be worth mentioning somewhere in the documentation, though, because the error message doesn't make the problem easy to understand.