scrapfly / python-scrapfly

Scrapfly Python SDK for headless browsers and proxy rotation
https://scrapfly.io/docs/sdk/python

Error on api_response #10

Closed: farovictor closed this issue 9 months ago

farovictor commented 10 months ago

I'm having the following issue when scraping by yielding ScrapflyScrapyRequest:

ERROR scraper.py:246 Error downloading <GET https://immobilienscout24.de/expose/146870274>
Traceback (most recent call last):
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/api_response.py", line 105, in __call__
    return self.content_loader(content)
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/api_response.py", line 51, in _date_parser
    value[k] = _date_parser(v)
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/api_response.py", line 53, in _date_parser
    value[k] = v
TypeError: 'bytes' object does not support item assignment

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 75, in process_exception
    response = yield deferred_from_coro(method(request=request, exception=exception, spider=spider))
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/scrapy/middleware.py", line 70, in process_exception
    raise exception
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/twisted/internet/defer.py", line 1693, in _inlineCallbacks
    result = context.run(
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/scrapy/downloader.py", line 82, in on_body_downloaded
    scrapfly_api_response:ScrapeApiResponse = spider.scrapfly_client._handle_response(
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/client.py", line 295, in _handle_response
    api_response = self._handle_api_response(
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/client.py", line 453, in _handle_api_response
    body = self.body_handler(response.content)
  File "/Users/dev/scraper/.venv/lib/python3.9/site-packages/scrapfly/api_response.py", line 107, in __call__
    raise EncoderError(content=content.decode('utf-8')) from e
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 0: invalid start byte
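
For context on the final exception: 0x84 is the msgpack header byte for a four-entry fixmap and is not a valid UTF-8 start byte, so the error handler's own attempt to decode the raw body for the EncoderError message fails, masking the original TypeError from _date_parser. A minimal sketch (not the SDK's actual code) reproducing just that decode failure:

    payload = b"\x84"  # first byte of a binary msgpack body (fixmap with 4 entries)
    try:
        payload.decode("utf-8")  # binary msgpack is not valid UTF-8
    except UnicodeDecodeError as exc:
        print(exc)  # 'utf-8' codec can't decode byte 0x84 in position 0: invalid start byte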

This error occurs in about 2% of my total requests and is completely random: some URLs hit it within a few tries, but in most cases the failure doesn't repeat.
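
Not part of the original report, but as a stopgap for a transient failure like this, Scrapy's standard retry settings can be tuned in settings.py; whether this particular exception actually gets retried depends on how the downloader middleware surfaces it:

    RETRY_ENABLED = True  # standard Scrapy setting, enabled by default
    RETRY_TIMES = 3       # retry each failing download up to 3 times (default is 2)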

Environment Setup:

jjsaunier commented 10 months ago

I have batched 1k requests against your target, with no issue.

I have added better support for catching the text representation of this faulty binary payload: https://github.com/scrapfly/python-scrapfly/commit/712b37a6ec843d51d59e4154923cfdaa52664337#diff-42acd60fcbec2f0f0da4cdc1f17124c2696319b3868b29b7026c76b25d675dfeR111. If you hit it again, you can share the base64 with me.
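
For illustration, the fallback pattern described in the linked commit might look roughly like this sketch (names are illustrative, not the SDK's actual code): decode the body as UTF-8 when possible, and fall back to base64 for binary payloads so the error report stays readable and shareable:

    import base64

    def safe_text(content: bytes) -> str:
        # binary (e.g. msgpack) bodies are not valid UTF-8, so fall back
        # to base64 rather than raising a second UnicodeDecodeError
        try:
            return content.decode("utf-8")
        except UnicodeDecodeError:
            return base64.b64encode(content).decode("ascii")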

If you can, share a minimal setup that reproduces your conditions (with a poetry.lock or requirements.txt pinning exact versions). Since your stack trace includes /twisted/internet/, I guess Scrapy is involved? (I tested both the regular SDK and the Scrapy integration.)
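
A minimal reproduction of the kind being asked for might look like the sketch below; the class and parameter names follow the SDK's documented Scrapy integration, but the surrounding setup (notably where the API key lives) is an assumption:

    from scrapfly import ScrapeConfig
    from scrapfly.scrapy import ScrapflyScrapyRequest, ScrapflyScrapyResponse, ScrapflySpider

    class ExposeSpider(ScrapflySpider):
        # assumes SCRAPFLY_API_KEY is configured in the project's settings.py
        name = "expose"

        def start_requests(self):
            # URL taken from the error report above
            yield ScrapflyScrapyRequest(
                scrape_config=ScrapeConfig(url="https://immobilienscout24.de/expose/146870274"),
                callback=self.parse,
            )

        def parse(self, response: ScrapflyScrapyResponse):
            yield {"url": response.url, "status": response.status}

Pinning exact versions in a requirements.txt alongside a spider like this would either make the flaky 2% reproducible or rule the environment out.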

farovictor commented 9 months ago

We have moved to a more up-to-date version of Scrapy and have had no occurrences of this since, so I will close this issue.