Scrapy hangs when an exception is raised in applying page method

gurmitteotia commented 7 months ago

Scarpy is hanging when an exception is raised in applying page methods. Though scrapy is not reporting any exception when applying page method but I have verified by writing code using playwright. Exception is raised because page is being redirected when calling page method.

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://www.cable.co.uk/go/bt/fibre-1/")
        await page.wait_for_load_state("load")
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_load_state("load")
        await page.evaluate("()=>{ return new Promise((resolve) => setTimeout(resolve, 2000)); }")
        #i = input("Enter some text to stop the browser")
        with open("output.html", "w") as f:
            f.write(await page.content())
        await browser.close()

asyncio.run(main())

Below is sample spider code to reproduce the error

import scrapy
from scrapy_playwright.page import PageMethod

class TestSpider(scrapy.Spider):
    name = "test"
    def start_requests(self):
        url = "https://www.cable.co.uk"
        yield self._build_request(url, True)

    async def parse(self, response):
        self.logger.info("Response url %s request url %s", response.url, response.request.url)
        if response.meta['is_home_page']:
            new_urls = ['https://www.cable.co.uk/broadband/check-my-area/',
                        'https://www.cable.co.uk/go/bt/fibre-1/',
                        'https://www.cable.co.uk/go/virgin-media/broadband-and-tv/']
            for url in new_urls:
                yield self._build_request(url, False)
        self._cleanup_page(response.meta)

    def _build_request(self, url, is_home_page):
        request = scrapy.Request(url = url, callback=self.parse, errback=self.close_context_on_error, meta={'playwright': True, "playwright_page_methods": [
                PageMethod("wait_for_load_state", "load"), 
                PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
                PageMethod("wait_for_load_state", "load"), 
                PageMethod("evaluate","()=>{ return new Promise((resolve) => setTimeout(resolve, 300)); }")
            ], 
            "playwright_include_page": True
            }
            )
        request.meta['is_home_page'] = is_home_page
        return request

    async def close_context_on_error(self, failure):
        self.logger.info("Failing the request for %r, and exception=%r", failure.request.url, failure.value)
        await self._cleanup_page(failure.request.meta)

    async def _cleanup_page(self, meta):
        if 'playwright_page' in meta:
            page = meta['playwright_page']
            await page.close()

When the error occurred scrapy ends up in the hang state and it does not exit.

elacuesta commented 7 months ago

I cannot reproduce with scrapy-playwright 0.0.33, Playwright 1.39.0 & Scrapy 2.11

scrapy version -v
Scrapy       : 2.11.0
lxml         : 4.9.3.0
libxml2      : 2.10.3
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 22.10.0
Python       : 3.10.0 (default, Oct  8 2021, 09:55:22) [GCC 7.5.0]
pyOpenSSL    : 23.2.0 (OpenSSL 3.1.2 1 Aug 2023)
cryptography : 41.0.3
Platform     : Linux-5.15.0-79-generic-x86_64-with-glibc2.35

2023-11-20 09:43:31 [scrapy.core.engine] INFO: Spider opened
2023-11-20 09:43:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-20 09:43:31 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-20 09:43:31 [scrapy-playwright] INFO: Starting download handler
2023-11-20 09:43:36 [scrapy-playwright] INFO: Launching browser chromium
2023-11-20 09:43:36 [scrapy-playwright] INFO: Browser chromium launched
2023-11-20 09:43:44 [test] INFO: Response url https://www.cable.co.uk/ request url https://www.cable.co.uk
2023-11-20 09:43:44 [py.warnings] WARNING: /scrapy-playwright/pdf.py:85: RuntimeWarning: coroutine 'TestSpider._cleanup_page' was never awaited
  self._cleanup_page(response.meta)

2023-11-20 09:43:47 [test] INFO: Failing the request for 'https://www.cable.co.uk/go/bt/fibre-1/', and exception=Error('Execution context was destroyed, most likely because of a navigation')
2023-11-20 09:43:47 [test] INFO: Failing the request for 'https://www.cable.co.uk/go/virgin-media/broadband-and-tv/', and exception=Error('Execution context was destroyed, most likely because of a navigation')
2023-11-20 09:43:49 [test] INFO: Response url https://www.cable.co.uk/broadband/check-my-area/ request url https://www.cable.co.uk/broadband/check-my-area/
2023-11-20 09:43:49 [py.warnings] WARNING: /scrapy-playwright/pdf.py:85: RuntimeWarning: coroutine 'TestSpider._cleanup_page' was never awaited
  self._cleanup_page(response.meta)

2023-11-20 09:43:49 [scrapy.core.engine] INFO: Closing spider (finished)
2023-11-20 09:43:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/playwright._impl._api_types.Error': 2,
 'downloader/request_bytes': 1056,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 1214027,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 17.991774,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 11, 20, 12, 43, 49, 517280, tzinfo=datetime.timezone.utc),
 'log_count/INFO': 17,
 'log_count/WARNING': 3,
 'memusage/max': 70094848,
 'memusage/startup': 70094848,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/persistent/False': 1,
 'playwright/context_count/remote/False': 1,
 'playwright/page_count': 4,
 'playwright/page_count/max_concurrent': 4,
 'playwright/request_count': 140,
 'playwright/request_count/method/GET': 132,
 'playwright/request_count/method/POST': 8,
 'playwright/request_count/navigation': 11,
 'playwright/request_count/resource_type/document': 11,
 'playwright/request_count/resource_type/image': 81,
 'playwright/request_count/resource_type/other': 4,
 'playwright/request_count/resource_type/ping': 4,
 'playwright/request_count/resource_type/script': 35,
 'playwright/request_count/resource_type/stylesheet': 5,
 'playwright/response_count': 107,
 'playwright/response_count/method/GET': 99,
 'playwright/response_count/method/POST': 8,
 'playwright/response_count/resource_type/document': 11,
 'playwright/response_count/resource_type/image': 80,
 'playwright/response_count/resource_type/other': 4,
 'playwright/response_count/resource_type/ping': 4,
 'playwright/response_count/resource_type/script': 8,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2023, 11, 20, 12, 43, 31, 525506, tzinfo=datetime.timezone.utc)}
2023-11-20 09:43:49 [scrapy.core.engine] INFO: Spider closed (finished)
2023-11-20 09:43:49 [scrapy-playwright] INFO: Closing download handler
2023-11-20 09:43:49 [scrapy-playwright] INFO: Closing browser

I also ran the following example which I think is closest to the Playwright script in the sense that it only requests a single page and the specified timeout is the same (2000, not 300):

import scrapy
from scrapy_playwright.page import PageMethod

class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.cable.co.uk/go/bt/fibre-1/",
            callback=self.parse,
            errback=self.close_context_on_error,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_load_state", "load"),
                    PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
                    PageMethod("wait_for_load_state", "load"),
                    PageMethod(
                        "evaluate",
                        "()=>{ return new Promise((resolve) => setTimeout(resolve, 2000)); }",
                    ),
                ],
            },
        )

    async def parse(self, response):
        self.logger.info("Response url %s request url %s", response.url, response.request.url)

    async def close_context_on_error(self, failure):
        self.logger.info(
            "Failing the request for %r, and exception=%r", failure.request.url, failure.value
        )

2023-11-20 09:48:55 [scrapy.core.engine] INFO: Spider opened
2023-11-20 09:48:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-20 09:48:55 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-20 09:48:55 [scrapy-playwright] INFO: Starting download handler
2023-11-20 09:49:00 [scrapy-playwright] INFO: Launching browser chromium
2023-11-20 09:49:00 [scrapy-playwright] INFO: Browser chromium launched
2023-11-20 09:49:05 [scrapy-playwright] WARNING: Closing page due to failed request: <GET https://www.cable.co.uk/go/bt/fibre-1/> exc_type=<class 'playwright._impl._api_types.Error'> exc_msg=Execution context was destroyed, most likely because of a navigation
Traceback (most recent call last):
  File "/home/eugenio/zyte/scrapy-playwright/scrapy_playwright/handler.py", line 324, in _download_request
    return await self._download_request_with_page(request, page, spider)
  File "/home/eugenio/zyte/scrapy-playwright/scrapy_playwright/handler.py", line 372, in _download_request_with_page
    await self._apply_page_methods(page, request, spider)
  File "/home/eugenio/zyte/scrapy-playwright/scrapy_playwright/handler.py", line 485, in _apply_page_methods
    pm.result = await _maybe_await(method(*pm.args, **pm.kwargs))
  File "/home/eugenio/zyte/scrapy-playwright/scrapy_playwright/_utils.py", line 16, in _maybe_await
    return await obj
  File "/home/eugenio/zyte/scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 8665, in evaluate
    await self._impl_obj.evaluate(
  File "/home/eugenio/zyte/scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_page.py", line 389, in evaluate
    return await self._main_frame.evaluate(expression, arg)
  File "/home/eugenio/zyte/scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 271, in evaluate
    await self._channel.send(
  File "/home/eugenio/zyte/scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 61, in send
    return await self._connection.wrap_api_call(
  File "/home/eugenio/zyte/scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 490, in wrap_api_call
    return await cb()
  File "/home/eugenio/zyte/scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 99, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Execution context was destroyed, most likely because of a navigation
2023-11-20 09:49:05 [test] INFO: Failing the request for 'https://www.cable.co.uk/go/bt/fibre-1/', and exception=Error('Execution context was destroyed, most likely because of a navigation')
2023-11-20 09:49:05 [scrapy.core.engine] INFO: Closing spider (finished)
2023-11-20 09:49:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/playwright._impl._api_types.Error': 1,
 'downloader/request_bytes': 234,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'elapsed_time_seconds': 10.118541,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 11, 20, 12, 49, 5, 828729, tzinfo=datetime.timezone.utc),
 'log_count/INFO': 14,
 'log_count/WARNING': 2,
 'memusage/max': 70017024,
 'memusage/startup': 70017024,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/persistent/False': 1,
 'playwright/context_count/remote/False': 1,
 'playwright/page_count': 1,
 'playwright/page_count/closed': 1,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 8,
 'playwright/request_count/method/GET': 8,
 'playwright/request_count/navigation': 4,
 'playwright/request_count/resource_type/document': 4,
 'playwright/request_count/resource_type/image': 2,
 'playwright/request_count/resource_type/script': 1,
 'playwright/request_count/resource_type/stylesheet': 1,
 'playwright/response_count': 6,
 'playwright/response_count/method/GET': 6,
 'playwright/response_count/resource_type/document': 4,
 'playwright/response_count/resource_type/image': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2023, 11, 20, 12, 48, 55, 710188, tzinfo=datetime.timezone.utc)}
2023-11-20 09:49:05 [scrapy.core.engine] INFO: Spider closed (finished)
2023-11-20 09:49:05 [scrapy-playwright] INFO: Closing download handler
2023-11-20 09:49:05 [scrapy-playwright] INFO: Closing browser

gurmitteotia commented 7 months ago

Sorry I did not give environment information before. Here is details about my environment:

scrapy version -v

Scrapy       : 2.11.0
lxml         : 4.9.2.0
libxml2      : 2.9.14
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 22.10.0
Python       : 3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]
pyOpenSSL    : 23.2.0 (OpenSSL 3.1.4 24 Oct 2023)
cryptography : 41.0.3
Platform     : Linux-5.10.0-26-amd64-x86_64-with-glibc2.31

and scrapy-playwright 0.0.33 , playwright 1.39.0

One of the major difference I see is that I'm using python 3.11. I created new conda environment with python 3.10 and I can still reproduce error. Here is my python 3.10 environment looks like with scrapy-playwright 0.0.33 and playwright 1.39.0-

scrapy version -v

Scrapy       : 2.11.0
lxml         : 4.9.3.0
libxml2      : 2.10.3
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 22.10.0
Python       : 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]
pyOpenSSL    : 23.3.0 (OpenSSL 3.1.4 24 Oct 2023)
cryptography : 41.0.5
Platform     : Linux-5.10.0-26-amd64-x86_64-with-glibc2.31

Below is the output of running same spider in python 3.10 enviornment-


2023-11-20 13:32:03 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: play_test)
2023-11-20 13:32:03 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0], pyOpenSSL 23.3.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.5, Platform Linux-5.10.0-26-amd64-x86_64-with-glibc2.31
2023-11-20 13:32:03 [scrapy.addons] INFO: Enabled addons:
[]
2023-11-20 13:32:03 [py.warnings] WARNING: /home/gurmit/anaconda3/envs/playwright/lib/python3.10/site-packages/scrapy/utils/request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2023-11-20 13:32:03 [scrapy.extensions.telnet] INFO: Telnet Password: be4738b83782bb6f
2023-11-20 13:32:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2023-11-20 13:32:03 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'play_test',
 'CONCURRENT_REQUESTS': 2,
 'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
 'CONCURRENT_REQUESTS_PER_IP': 1,
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'play_test.spiders',
 'SPIDER_MODULES': ['play_test.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-11-20 13:32:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-11-20 13:32:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-11-20 13:32:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-11-20 13:32:03 [scrapy.core.engine] INFO: Spider opened
2023-11-20 13:32:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-20 13:32:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-20 13:32:03 [scrapy-playwright] INFO: Starting download handler
2023-11-20 13:32:08 [scrapy-playwright] INFO: Launching browser chromium
2023-11-20 13:32:08 [scrapy-playwright] INFO: Browser chromium launched
2023-11-20 13:32:09 [test] INFO: Response url https://www.cable.co.uk/ request url https://www.cable.co.uk
2023-11-20 13:32:09 [py.warnings] WARNING: /home/gurmit/source/rd/playwright-test/play_test/play_test/spiders/test_spider.py:60: RuntimeWarning: coroutine 'TestSpider._cleanup_page' was never awaited
  self._cleanup_page(response.meta)

2023-11-20 13:32:11 [test] INFO: Response url https://www.cable.co.uk/broadband/check-my-area/ request url https://www.cable.co.uk/broadband/check-my-area/
2023-11-20 13:32:11 [py.warnings] WARNING: /home/gurmit/source/rd/playwright-test/play_test/play_test/spiders/test_spider.py:60: RuntimeWarning: coroutine 'TestSpider._cleanup_page' was never awaited
  self._cleanup_page(response.meta)

2023-11-20 13:33:03 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2023-11-20 13:34:03 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
^C2023-11-20 13:34:06 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force 
2023-11-20 13:34:06 [scrapy.core.engine] INFO: Closing spider (shutdown)
2023-11-20 13:34:06 [test] INFO: Failing the request for 'https://www.cable.co.uk/go/bt/fibre-1/', and exception=Error('Browser closed.\n==================== Browser output: ====================\n<launching> /home/gurmit/.cache/ms-playwright/chromium-1084/chrome-linux/chrome --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --disable-search-engine-choice-screen --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=/tmp/playwright_chromiumdev_profile-7xPIY6 --remote-debugging-pipe --no-startup-window\n<launched> pid=61440\n[pid=61440][err] [1120/133208.594703:WARNING:bluez_dbus_manager.cc(247)] Floss manager not present, cannot set Floss enable/disable.\n[pid=61440][err] [1120/133208.645882:WARNING:angle_platform_impl.cc(49)] RendererVk.cpp:1962 (initialize): ANGLE VMA version: 2003000\n[pid=61440][err] [1120/133208.653027:WARNING:sandbox_linux.cc(393)] InitializeSandbox() called with multiple threads in process gpu-process.\n[pid=61440][err] [1120/133208.663658:ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.\n[pid=61440][err] [1120/133208.788203:ERROR:nss_util.cc(357)] After loading Root Certs, loaded==false: NSS error code: -8018\n[pid=61440] <gracefully close start>')
2023-11-20 13:34:06 [test] INFO: Failing the request for 'https://www.cable.co.uk/go/virgin-media/broadband-and-tv/', and exception=Error('Target page, context or browser has been closed')
2023-11-20 13:34:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/playwright._impl._api_types.Error': 2,
 'downloader/request_bytes': 1040,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 1214028,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 123.20912,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2023, 11, 20, 13, 34, 6, 619059, tzinfo=datetime.timezone.utc),
 'log_count/INFO': 20,
 'log_count/WARNING': 3,
 'memusage/max': 90038272,
 'memusage/startup': 65658880,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/persistent/False': 1,
 'playwright/context_count/remote/False': 1,
 'playwright/page_count': 2,
 'playwright/page_count/max_concurrent': 2,
 'playwright/request_count': 101,
 'playwright/request_count/method/GET': 88,
 'playwright/request_count/method/POST': 13,
 'playwright/request_count/navigation': 2,
 'playwright/request_count/resource_type/document': 2,
 'playwright/request_count/resource_type/image': 78,
 'playwright/request_count/resource_type/other': 6,
 'playwright/request_count/resource_type/ping': 7,
 'playwright/request_count/resource_type/script': 8,
 'playwright/response_count': 101,
 'playwright/response_count/method/GET': 88,
 'playwright/response_count/method/POST': 13,
 'playwright/response_count/resource_type/document': 2,
 'playwright/response_count/resource_type/image': 78,
 'playwright/response_count/resource_type/other': 6,
 'playwright/response_count/resource_type/ping': 7,
 'playwright/response_count/resource_type/script': 8,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2023, 11, 20, 13, 32, 3, 409939, tzinfo=datetime.timezone.utc)}
2023-11-20 13:34:06 [scrapy.core.engine] INFO: Spider closed (shutdown)
2023-11-20 13:34:06 [scrapy-playwright] INFO: Closing download handler
2023-11-20 13:34:06 [scrapy-playwright] INFO: Closing browser
2023-11-20 13:34:06 [asyncio] ERROR: Future exception was never retrieved
future: <Future finished exception=Exception('Connection closed while reading from the driver')>
Exception: Connection closed while reading from the driver

elacuesta commented 7 months ago

I still cannot reproduce with Python 3.11 (Playwright 1.39.0, Scrapy 2.11, scrapy-playwright 0.0.33). Sorry, I have no idea if the fact that you're using Conda is a factor here, I'm not a Conda user myself.

$ scrapy version -v                           
Scrapy       : 2.11.0
lxml         : 4.9.3.0
libxml2      : 2.10.3
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 22.10.0
Python       : 3.11.0 (main, Dec 20 2022, 02:21:35) [GCC 9.4.0]
pyOpenSSL    : 23.3.0 (OpenSSL 3.1.4 24 Oct 2023)
cryptography : 41.0.5
Platform     : Linux-5.15.0-79-generic-x86_64-with-glibc2.35

2023-11-20 13:02:27 [scrapy.core.engine] INFO: Spider opened
2023-11-20 13:02:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-20 13:02:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-20 13:02:28 [scrapy-playwright] INFO: Starting download handler
2023-11-20 13:02:33 [scrapy-playwright] INFO: Launching browser chromium
2023-11-20 13:02:33 [scrapy-playwright] INFO: Browser chromium launched
2023-11-20 13:02:38 [test] INFO: Response url https://www.cable.co.uk/ request url https://www.cable.co.uk
2023-11-20 13:02:38 [py.warnings] WARNING: /scrapy-playwright/silent_exception.py:26: RuntimeWarning: coroutine 'TestSpider._cleanup_page' was never awaited
  self._cleanup_page(response.meta)

2023-11-20 13:02:42 [test] INFO: Failing the request for 'https://www.cable.co.uk/go/bt/fibre-1/', and exception=Error('Execution context was destroyed, most likely because of a navigation')
2023-11-20 13:02:42 [test] INFO: Response url https://www.cable.co.uk/broadband/check-my-area/ request url https://www.cable.co.uk/broadband/check-my-area/
2023-11-20 13:02:42 [py.warnings] WARNING: /scrapy-playwright/silent_exception.py:26: RuntimeWarning: coroutine 'TestSpider._cleanup_page' was never awaited
  self._cleanup_page(response.meta)

2023-11-20 13:02:43 [test] INFO: Failing the request for 'https://www.cable.co.uk/go/virgin-media/broadband-and-tv/', and exception=Error('Execution context was destroyed, most likely because of a navigation')
2023-11-20 13:02:43 [scrapy.core.engine] INFO: Closing spider (finished)
2023-11-20 13:02:43 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/playwright._impl._api_types.Error': 2,
 'downloader/request_bytes': 1040,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 1214028,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 15.334106,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 11, 20, 16, 2, 43, 336733, tzinfo=datetime.timezone.utc),
 'log_count/INFO': 17,
 'log_count/WARNING': 3,
 'memusage/max': 68132864,
 'memusage/startup': 68132864,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/persistent/False': 1,
 'playwright/context_count/remote/False': 1,
 'playwright/page_count': 4,
 'playwright/page_count/max_concurrent': 4,
 'playwright/request_count': 140,
 'playwright/request_count/method/GET': 132,
 'playwright/request_count/method/POST': 8,
 'playwright/request_count/navigation': 11,
 'playwright/request_count/resource_type/document': 11,
 'playwright/request_count/resource_type/image': 81,
 'playwright/request_count/resource_type/other': 4,
 'playwright/request_count/resource_type/ping': 4,
 'playwright/request_count/resource_type/script': 35,
 'playwright/request_count/resource_type/stylesheet': 5,
 'playwright/response_count': 108,
 'playwright/response_count/method/GET': 100,
 'playwright/response_count/method/POST': 8,
 'playwright/response_count/resource_type/document': 11,
 'playwright/response_count/resource_type/image': 80,
 'playwright/response_count/resource_type/other': 4,
 'playwright/response_count/resource_type/ping': 4,
 'playwright/response_count/resource_type/script': 9,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2023, 11, 20, 16, 2, 28, 2627, tzinfo=datetime.timezone.utc)}
2023-11-20 13:02:43 [scrapy.core.engine] INFO: Spider closed (finished)
2023-11-20 13:02:43 [scrapy-playwright] INFO: Closing download handler
2023-11-20 13:02:43 [scrapy-playwright] INFO: Closing browser

gurmitteotia commented 7 months ago

I can reproduce it in non conda environment too. Are you keeping the request scheduling as I have shown in my spider

    def start_requests(self):
        url = "https://www.cable.co.uk"
        yield self._build_request(url, True)

    async def parse(self, response):
        self.logger.info("Response url %s request url %s", response.url, response.request.url)
        if response.meta['is_home_page']:
            new_urls = ['https://www.cable.co.uk/broadband/check-my-area/',
                        'https://www.cable.co.uk/go/bt/fibre-1/',
                        'https://www.cable.co.uk/go/virgin-media/broadband-and-tv/']
            for url in new_urls:
                yield self._build_request(url, False)
        self._cleanup_page(response.meta)

On important thing to note is that scrapy end up in hang state only when the pages, which are doing redirection are last in the queue and there are no more requests to be scheduled. Only following requests are doing redirections-

 https://www.cable.co.uk/go/bt/fibre-1/
 https://www.cable.co.uk/go/virgin-media/broadband-and-tv/

If I send above URLs in the start requests then I don't get this issue any my scrapy logs matches with your scrapy log.

elacuesta commented 7 months ago

Yes, I'm running the spider that you shared unmodified. Notice the 'downloader/request_count': 4, in the stats and also the warning about self._cleanup_page(response.meta) not being awaited.

scrapy-plugins / scrapy-playwright

Scrapy hangs when an exception is raised in applying page method #244