scrapy-plugins / scrapy-zyte-api

Zyte API integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Unsupported URL scheme 'https': The object should be created from async function #52

Closed ttilberg closed 2 years ago

ttilberg commented 2 years ago

Following the notes for the settings file, we are experiencing an issue where the http and https handlers are not loading as expected. Specifically, we are receiving the exception: The object should be created from async function.

The log mentions asyncio, and aiohttp paths are referenced, so it seems like asyncio is loading successfully. Do you have any thoughts on what could be causing this?
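For context, our settings.py follows the README; here is a sketch of the relevant entries (our real file contains more, and the key is redacted):

# settings.py -- sketch of the scrapy-zyte-api entries per the README
DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
ZYTE_API_KEY = "YOUR_API_KEY"  # redacted placeholder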

Relevant log lines:

2022-09-21 15:20:17 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: wheel_pricing)
2022-09-21 15:20:17 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.8.0, Python 3.9.4 (default, Sep 20 2022, 14:25:08) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 38.0.1, Platform Linux-5.4.0-94-generic-x86_64-with-glibc2.27

2022-09-21 15:20:17 [scrapy.crawler] INFO: Overridden settings: {..., 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-09-21 15:20:17 [asyncio] DEBUG: Using selector: EpollSelector
2022-09-21 15:20:17 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-09-21 15:20:17 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop

2022-09-21 15:20:17 [scrapy.core.downloader.handlers] ERROR: Loading "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler" for scheme "https"
Traceback (most recent call last):
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy/core/downloader/handlers/__init__.py", line 52, in _load_handler
    dh = create_instance(
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy/utils/misc.py", line 166, in create_instance
    instance = objcls.from_crawler(crawler, *args, **kwargs)
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy/core/downloader/handlers/http11.py", line 53, in from_crawler
    return cls(crawler.settings, crawler)
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy_zyte_api/handler.py", line 58, in __init__
    self._session = create_session(connection_pool_size=self._client.n_conn)
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/zyte_api/aio/client.py", line 32, in create_session
    kwargs["connector"] = TCPConnector(limit=connection_pool_size)
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/aiohttp/connector.py", line 708, in __init__
    super().__init__(keepalive_timeout=keepalive_timeout,
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/aiohttp/connector.py", line 207, in __init__
    loop = get_running_loop()
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/aiohttp/helpers.py", line 276, in get_running_loop
    raise RuntimeError("The object should be created from async function")
RuntimeError: The object should be created from async function

2022-09-21 15:20:18 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.walmart.com/browse/auto-tires/wheels-and-rims/91083_4375198>
Traceback (most recent call last):
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
    result = context.run(
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy/utils/defer.py", line 67, in mustbe_deferred
    result = f(*args, **kw)
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy/core/downloader/handlers/__init__.py", line 74, in download_request
    raise NotSupported(f"Unsupported URL scheme '{scheme}': {self._notconfigured[scheme]}")
scrapy.exceptions.NotSupported: Unsupported URL scheme 'https': The object should be created from async function

When you trace the code, you find that after the first exception the http and https schemes are recorded as not configured, so the downloader no longer has handlers for those schemes, and the second exception is raised with the cached error message; see the sketch below.
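For reference, a simplified paraphrase of the relevant logic in scrapy/core/downloader/handlers/__init__.py (Scrapy 2.6, heavily abridged, not the verbatim source):

import logging

from scrapy.exceptions import NotSupported
from scrapy.utils.httpobj import urlparse_cached
from scrapy.utils.misc import create_instance, load_object

logger = logging.getLogger(__name__)

class DownloadHandlers:
    def _load_handler(self, scheme):
        path = self._schemes[scheme]
        try:
            dhcls = load_object(path)
            dh = create_instance(objcls=dhcls, settings=None, crawler=self._crawler)
        except Exception as ex:
            # First traceback above: the handler's __init__ raised, so the
            # scheme is marked not configured and the message is cached.
            logger.error('Loading "%s" for scheme "%s"', path, scheme, exc_info=True)
            self._notconfigured[scheme] = str(ex)
            return None
        self._handlers[scheme] = dh
        return dh

    def download_request(self, request, spider):
        scheme = urlparse_cached(request).scheme
        handler = self._get_handler(scheme)
        if not handler:
            # Second traceback above: the cached message is appended verbatim.
            raise NotSupported(
                f"Unsupported URL scheme '{scheme}': {self._notconfigured[scheme]}"
            )
        return handler.download_request(request, spider)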

Gallaecio commented 2 years ago

Can you provide a minimal reproducible example?

For example, the following minimal spider code should work:

from scrapy import Request, Spider

class MinimalSpider(Spider):
    name = "minimal"

    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler",
            "https": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler",
        },
        "ZYTE_API_KEY": "YOUR_API_KEY",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        yield Request(
            "https://toscrape.com",
            meta={"zyte_api": {"httpResponseBody": True}},
        )

    def parse(self, response):
        pass

You should be able to store this code in a file, set your API key, and run that file with scrapy runspider (e.g. scrapy runspider minimal.py).

Assuming it works as expected, what do you need to change to make it fail the way your actual code is failing?

ttilberg commented 2 years ago

Thanks for taking the time. Someone noticed that we had allow_prereleases = true in our Pipfile, and removing it has cleared whatever dependency issue was causing this. The Pipfile.lock diff showed several packages rolling back, including a major version of aiohttp, while multidict actually jumped forward three major versions. Unfortunately I can't speak to exactly which dependency did the trick.
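If it helps anyone, the constructor requirement can be sketched outside Scrapy entirely (this assumes the culprit was an aiohttp 4.0 pre-release; under aiohttp 3.x the same call only emits a DeprecationWarning):

import aiohttp

# With a 4.0 pre-release installed, constructing a connector outside a
# running event loop raises:
#   RuntimeError: The object should be created from async function
# which is the exact point where ScrapyZyteAPIDownloadHandler.__init__
# failed in the first traceback.
connector = aiohttp.TCPConnector(limit=100)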

If anyone else comes across this issue, the likely cause is transitive dependencies in conflict, pulled in as pre-releases. For us, removing allow_prereleases = true and running pipenv update did the trick.
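If you want to confirm what you have installed before changing anything, here is a quick diagnostic sketch (the package list is a guess at the likely culprits, not something from this thread):

# Print installed versions of the packages involved in the traceback, so
# pre-release installs (e.g. a 4.0.0aN aiohttp) stand out.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("scrapy", "scrapy-zyte-api", "zyte-api", "aiohttp", "multidict"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")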