scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy

TypeError: Passing coroutines is forbidden, use tasks explicitly. #182

Closed: diefergil closed this issue 1 year ago

diefergil commented 1 year ago

Code:

import scrapy

class CommercialSale(scrapy.Spider):
    name = "sample"

    # crawler's entry point
    def start_requests(self):
        url = "https://www.google.es/"
        yield scrapy.Request(
            url,
            callback=self.parse_links,
            # errback is a Request argument, not a meta key
            errback=self.errback_close_page,
            meta=dict(
                playwright=True,
                # expose the Playwright Page object in response.meta
                playwright_include_page=True,
            ),
        )

    async def parse_links(self, response):
        page = response.meta["playwright_page"]
        print(page)
        await page.close()

    async def errback_close_page(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()

# main driver
if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    process.crawl(CommercialSale)
    process.start()
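
For reference, the snippet relies on get_project_settings() pulling in the scrapy-playwright configuration; it is not shown above, but the log below confirms the reactor and download handler were active. The standard setup from the scrapy-playwright README looks like this (assumed to live in the project's settings.py):

# Enable the scrapy-playwright download handler for both schemes and
# the asyncio-based Twisted reactor it requires.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"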

Error:

2023-03-11 17:26:23 [scrapy.utils.log] INFO: Scrapy 2.6.3 started (bot: crawlers)
2023-03-11 17:26:24 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.4, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.11.1 (main, Dec 23 2022, 09:40:27) [Clang 14.0.0 (clang-1400.0.29.202)], pyOpenSSL 23.0.0 (OpenSSL 3.0.8 7 Feb 2023), cryptography 39.0.2, Platform macOS-12.4-x86_64-i386-64bit
2023-03-11 17:26:24 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'crawlers',
 'NEWSPIDER_MODULE': 'crawlers.spiders',
 'SPIDER_MODULES': ['crawlers.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-03-11 17:26:24 [asyncio] DEBUG: Using selector: KqueueSelector
2023-03-11 17:26:24 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-03-11 17:26:24 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2023-03-11 17:26:24 [scrapy.extensions.telnet] INFO: Telnet Password:
2023-03-11 17:26:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2023-03-11 17:26:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-03-11 17:26:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-03-11 17:26:24 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-03-11 17:26:24 [scrapy.core.engine] INFO: Spider opened
2023-03-11 17:26:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-03-11 17:26:24 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-03-11 17:26:24 [scrapy-playwright] INFO: Starting download handler
2023-03-11 17:26:24 [scrapy-playwright] INFO: Starting download handler
2023-03-11 17:26:29 [scrapy-playwright] INFO: Launching browser chromium
2023-03-11 17:26:29 [scrapy-playwright] INFO: Browser chromium launched
2023-03-11 17:26:29 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False)
2023-03-11 17:26:30 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2023-03-11 17:26:30 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.google.es/> (resource type: document, referrer: None)
2023-03-11 17:26:30 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-11' coro=<ScrapyPlaywrightDownloadHandler._make_request_handler.<locals>._request_handler() done, defined at /Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/scrapy_playwright/handler.py:462> exception=TypeError('Passing coroutines is forbidden, use tasks explicitly.')>
Traceback (most recent call last):
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 509, in _request_handler
    await route.continue_(**overrides)
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/playwright/async_api/_generated.py", line 754, in continue_
    await self._async(
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/playwright/_impl/_network.py", line 251, in continue_
    await self._race_with_page_close(
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/playwright/_impl/_network.py", line 270, in _race_with_page_close
    await asyncio.wait(
  File "/usr/local/Cellar/python@3.11/3.11.1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/tasks.py", line 415, in wait
    raise TypeError("Passing coroutines is forbidden, use tasks explicitly.")
TypeError: Passing coroutines is forbidden, use tasks explicitly.
2023-03-11 17:26:30 [py.warnings] WARNING: /usr/local/Cellar/python@3.11/3.11.1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py:1904: RuntimeWarning: coroutine 'Channel.send' was never awaited
  handle = self._ready.popleft()

2023-03-11 17:27:00 [scrapy-playwright] WARNING: Closing page due to failed request: <GET https://www.google.es/> exc_type=<class 'playwright._impl._api_types.TimeoutError'> exc_msg=Timeout 30000ms exceeded.
=========================== logs ===========================
navigating to "https://www.google.es/", waiting until "load"
============================================================
2023-03-11 17:27:00 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.google.es/>
Traceback (most recent call last):
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 1693, in _inlineCallbacks
    result = context.run(
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 1065, in adapt
    extracted = result.result()
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 297, in _download_request
    result = await self._download_request_with_page(request, page, spider)
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 331, in _download_request_with_page
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/playwright/async_api/_generated.py", line 7486, in goto
    await self._async(
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/playwright/_impl/_page.py", line 484, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/playwright/_impl/_frame.py", line 122, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 39, in send
    return await self.inner_send(method, params, False)
  File "/Users/diegofernandezgil/projects/personal/inmo-scraper/.venv/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
navigating to "https://www.google.es/", waiting until "load"
============================================================
2023-03-11 17:27:00 [scrapy.core.engine] INFO: Closing spider (finished)
2023-03-11 17:27:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/playwright._impl._api_types.TimeoutError': 1,
 'downloader/request_bytes': 213,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'elapsed_time_seconds': 36.272757,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 3, 11, 16, 27, 0, 593339),
 'log_count/DEBUG': 6,
 'log_count/ERROR': 2,
 'log_count/INFO': 14,
 'log_count/WARNING': 2,
 'memusage/max': 115617792,
 'memusage/startup': 115617792,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/non_persistent': 1,
 'playwright/page_count': 1,
 'playwright/page_count/closed': 1,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 1,
 'playwright/request_count/method/GET': 1,
 'playwright/request_count/navigation': 1,
 'playwright/request_count/resource_type/document': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2023, 3, 11, 16, 26, 24, 320582)}
2023-03-11 17:27:00 [scrapy.core.engine] INFO: Spider closed (finished)
2023-03-11 17:27:00 [scrapy-playwright] INFO: Closing download handler
2023-03-11 17:27:00 [scrapy-playwright] INFO: Closing download handler
2023-03-11 17:27:00 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False)
2023-03-11 17:27:00 [scrapy-playwright] INFO: Closing browser
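
For context on the TypeError in the traceback above: starting with Python 3.11, asyncio.wait() refuses bare coroutines and raises exactly this error; callers must wrap them in tasks first. The traceback shows the installed playwright build passing coroutines to asyncio.wait() inside _race_with_page_close, so the failure comes from the resolved playwright version rather than from the spider code (note also the matching "coroutine 'Channel.send' was never awaited" warning in the log). A minimal standalone sketch of the behavior, no Scrapy or Playwright required:

import asyncio

async def work():
    await asyncio.sleep(0.1)

async def main():
    coro = work()
    try:
        # On Python 3.11+ this raises:
        # TypeError: Passing coroutines is forbidden, use tasks explicitly.
        await asyncio.wait({coro})
    except TypeError as exc:
        print(exc)
        coro.close()  # avoid the "never awaited" RuntimeWarning

    # Wrapping the coroutine in a Task works on all supported versions:
    await asyncio.wait({asyncio.create_task(work())})

asyncio.run(main())

Newer playwright releases create tasks before calling asyncio.wait(), which is presumably why rebuilding the environment (see the final comment below) resolved the issue.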

pyproject.toml:

[tool.poetry]
name = "inmo-scraper"
version = "0.1.0"
description = ""

[tool.poetry.dependencies]
python = "^3.9"
Scrapy = "2.6.3"
Faker = "^15.3.1"
fake-useragent = "1.1.1"
requests-html = "^0.10.0"
lxml = "^4.9.1"
pandas = "^1.5.1"
scrapy-rotating-proxies = "^0.6.2"
pydantic = "^1.10.2"
sqlalchemy = "^1.4.43"
psycopg2 = "^2.9.5"
httpx = {extras = ["http2"], version = "^0.23.0"}
parsel = "^1.7.0"
tenacity = "^8.1.0"
install = "^1.3.5"
ujson = "^5.5.0"
certifi = "^2022.9.24"
scrapy-splash = "^0.8.0"
scrapy-zyte-smartproxy = "^2.2.0"
pip = "^23.0.1"
scrapy-playwright = "^0.0.26"

[tool.poetry.dev-dependencies]
pytest = "^5.2"
autopep8 = "^2.0.0"
black = "^22.10.0"
ipykernel = "^6.17.0"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

Python version:

Python 3.11.1 (main, Dec 23 2022, 09:40:27) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
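
Worth noting: the environment runs Python 3.11.1 even though pyproject.toml only requires python = "^3.9", and scrapy-playwright = "^0.0.26" leaves the underlying playwright release up to the resolver, so an old lock file can pair Python 3.11 with a playwright build that predates 3.11 support. A quick stdlib-only sketch to check what actually got installed in the active virtualenv:

# Print the resolved package versions (importlib.metadata is stdlib).
from importlib.metadata import version

for pkg in ("playwright", "scrapy-playwright", "scrapy"):
    print(pkg, version(pkg))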
diefergil commented 1 year ago

Solved: I've reinstalled the environment.