Closed: mijamo closed this issue 3 years ago
Could https://github.com/scrapinghub/python-scrapinghub/issues/151 be the answer to this?
I am using msgpack v1.0.2, so I don't think this is the issue.
https://github.com/scrapinghub/python-scrapinghub/issues/121 also seems related. I would try uninstalling msgpack and see if that makes any difference.
After checking my logs, it seems that when the error occurs I have received strings from the ScrapingHub API that are well-formed UTF-8 apart from `\xde\x00\x18\xa4`, `\xde\x00\x16\xa4`, or `\xde\x00\x19\xa4`. Those sequences seem to be inserted between some properties (for instance, in my case I have a `description` field that I get correctly, and then the sequence is inserted before the next property starts). The weird thing is that I cannot manage to trigger the error manually: every time I fetch the items through the command line it works, and the source data seems correct.
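For what it's worth, those bytes don't look like random corruption. If I read the msgpack format spec correctly, `0xde` is the map16 marker, the next two bytes are a big-endian entry count, and `0xa4` is a fixstr header for a 4-character key, i.e. the header of a whole new item. A quick check of that reading (my own interpretation, nothing from the library):

```python
# Interpreting one of the stray sequences against the msgpack spec.
header = b"\xde\x00\x18\xa4"

assert header[0] == 0xDE                      # map16: map with 16-bit entry count
entries = int.from_bytes(header[1:3], "big")  # 0x0018 -> 24 key/value pairs
key_len = header[3] & 0x1F                    # 0xa4 -> fixstr of length 4

print(entries, key_len)  # -> 24 4, i.e. the header of a new 24-field item
```

So the stray sequence would be the start of the next item's map leaking into the middle of the current one, which would match the "cut" behaviour I describe next.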
After looking even deeper into the logs, it seems that those sequences are not randomly inserted. Instead, it looks like the `description` field is in those cases "cut" at some point, the weird sequence is inserted, and then it moves on to another field.
After seeing that, I suspect this might not be a problem with this library but rather with the ScrapingHub API?
I think this is https://github.com/scrapinghub/python-scrapinghub/issues/121. Using `iter` for a long time can be an issue. #121 explains it better, including how to work around the issue.
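For anyone who finds this later, here is a minimal sketch of the chunked approach along the lines of what #121 discusses: fetch items in fixed-size pages and resume from an explicit offset instead of holding one long-lived `iter()` open. It assumes `iter()` accepts the `start`/`count` parameters described in the client docs; the job key, `CHUNK`, and the `print` call are placeholders.

```python
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("APIKEY")   # your Scrapy Cloud API key
job = client.get_job("123456/1/1")     # <project>/<spider>/<job>

CHUNK = 1000
offset = 0
while True:
    # Item keys look like '<project>/<spider>/<job>/<n>', so we can resume
    # from an explicit offset rather than keeping one connection open.
    page = list(job.items.iter(start=f"{job.key}/{offset}", count=CHUNK))
    if not page:
        break
    for item in page:
        print(item)                    # stand-in for real item handling
    offset += len(page)
```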
Thank you, I will try that solution and close this issue if it fixes it. It might take a few days, though, as the error doesn't happen every day, as I mentioned.
It does seem like it fixed the issue, thank you for the help.
This might be worth mentioning somewhere in the documentation, though, because the error doesn't make the problem easy to understand.
It seems like I randomly get errors like this:
This happens while iterating the items through `last_job.items.iter()`.
It seems to happen about 50% of the time from what I see. I scrape the same website every day and run that function; sometimes it works fine, sometimes it raises that error. I am not sure whether this is an issue with this library or with the ScrapingHub API, but it is very problematic. This happens on the latest (2.3.1) version.
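For context, the surrounding code looks roughly like this; the project and spider names are placeholders, and the job lookup is my best reading of the client docs:

```python
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("APIKEY")
project = client.get_project(123456)

# Grab the most recent finished job for the spider; jobs.iter() yields
# summary dicts that carry the job key.
summary = next(project.jobs.iter(spider="myspider", state="finished", count=1))
last_job = client.get_job(summary["key"])

for item in last_job.items.iter():  # fails on roughly every other run
    print(item)                     # stand-in for real item handling
```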