scrapfly / python-scrapfly

Scrapfly Python SDK for headless browsers and proxy rotation
https://scrapfly.io/docs/sdk/python

Exceptions from concurrent_scrape? #17

Open williamhakim10 opened 7 months ago

williamhakim10 commented 7 months ago

We have code that looks like this:

        from scrapfly import ScrapeConfig, ScrapflyClient, ScrapflyError

        scrapfly = ScrapflyClient(key=self.__scrapfly_api_key, max_concurrency=15)
        targets = [
            ScrapeConfig(
                url=url,
                render_js=True,
                raise_on_upstream_error=False,
                country="us",
                asp=True,
            )
            for url in urls
        ]
        async for result in scrapfly.concurrent_scrape(scrape_configs=targets):
            self.__logger.info(f"Got result: {result}")  # when this code explodes, no log appears
            if isinstance(result, ScrapflyError):  # error from scrapfly itself
                ...
            elif result.error:  # error from upstream
                ...
            else:  # success
                ...

However, this code sometimes explodes on the async iterator itself, which throws an error like the following without returning a result at all:

<-- 422 | ERR::PROXY::TIMEOUT - Proxy connection or website was too slow and timeout - Proxy or website do not respond after 15s - Check if the website is online or geoblocking, if you are using session, rotate it..Checkout the related doc: https://scrapfly.io/docs/scrape-api/error/ERR::PROXY::TIMEOUT

It seems there's some kind of bug where the async iterator can itself throw rather than return the exception, which means the entire process blows up. Any ideas on how we might go about fixing this?
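In the meantime, one possible workaround (a minimal sketch, not part of the SDK) is to drive the iterator manually with `__anext__` and convert an exception raised by the iterator itself into a yielded result. Note that an async generator is closed once it raises, so any remaining targets are lost; this only keeps the process from blowing up:

    from scrapfly import ScrapeConfig, ScrapflyClient, ScrapflyError

    async def safe_concurrent_scrape(client: ScrapflyClient, configs):
        # Pull results manually so an exception raised by the iterator
        # itself can be yielded instead of propagating to the caller.
        iterator = client.concurrent_scrape(scrape_configs=configs).__aiter__()
        while True:
            try:
                result = await iterator.__anext__()
            except StopAsyncIteration:
                break
            except ScrapflyError as exc:  # may need a broader except clause
                yield exc
                # an async generator cannot be resumed after it raises,
                # so any targets still in flight are lost at this point
                break
            yield result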

As an aside, I wanted to point out that the inconsistent use of typing throughout the library makes it very hard to debug what's actually going on and to reason about which errors can happen and when.

jjsaunier commented 7 months ago

I will run some load tests to check. TBH the implementation is quite old and probably not very efficient, as it relies on a thread pool for async behind the scenes. I think it's simpler to build a new implementation on top of the native async client and leverage out-of-the-box asyncio features.
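For illustration, a native-asyncio version could look roughly like this. This is only a sketch: it assumes a hypothetical per-request coroutine `async_scrape` on the client, and the real SDK method may be named or shaped differently:

    import asyncio
    from typing import List

    from scrapfly import ScrapeConfig, ScrapflyError

    async def concurrent_scrape_native(client, configs: List[ScrapeConfig],
                                       max_concurrency: int = 15):
        # Bound concurrency with a semaphore instead of a thread pool.
        semaphore = asyncio.Semaphore(max_concurrency)

        async def bounded(config: ScrapeConfig):
            async with semaphore:
                try:
                    # hypothetical coroutine; the real SDK call may differ
                    return await client.async_scrape(config)
                except ScrapflyError as exc:
                    return exc  # surface errors as results, never raise

        tasks = [asyncio.create_task(bounded(c)) for c in configs]
        # yield results as they complete rather than in submission order
        for task in asyncio.as_completed(tasks):
            yield await task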

> As an aside, I wanted to point out that the inconsistent use of typing throughout the library makes it very hard to debug what's actually going on and to reason about which errors can happen and when.

Yeah, that's due to the current implementation: we can't really throw in an async way from `async for`, so the only option is to return the exception as a result. I also made the choice to disable throwing in concurrency mode: https://github.com/scrapfly/python-scrapfly/blob/master/scrapfly/client.py#L369 (asyncio.gather has the same issue: it requires `return_exceptions=True` to keep one failure from stopping everything, and it results in the same typing experience).
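For comparison, here is the asyncio.gather behaviour being referred to. With `return_exceptions=True` a single failure no longer cancels the batch, but the result list becomes a mix of values and exceptions that the caller has to type-check, which is the same experience as concurrent_scrape yielding errors as results (a standalone sketch, independent of the SDK):

    import asyncio

    async def fetch(i: int) -> str:
        if i == 2:
            raise TimeoutError(f"task {i} timed out")
        return f"result {i}"

    async def main() -> None:
        # return_exceptions=True keeps one failure from cancelling the
        # rest, but results are then a mix of values and exceptions
        results = await asyncio.gather(*(fetch(i) for i in range(4)),
                                       return_exceptions=True)
        for r in results:
            if isinstance(r, Exception):
                print("failed:", r)
            else:
                print("ok:", r)

    asyncio.run(main())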