scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy

Add support for handling browserless session timeout #279

Open mhdzumair opened 1 week ago

mhdzumair commented 1 week ago

Description

browserless has a default browser session timeout of 30s (https://docs.browserless.io/Docker/docker#connection-timeout), which can be turned off by setting -1 in the browserless container's env variable. However, when the timeout is enabled and the session gets terminated after the timeout period, scrapy-playwright is unable to make any new connection. It raises the following error:

playwright._impl._errors.TargetClosedError: Browser.new_context: Target page, context or browser has been closed

browserless logs

  browserless.io:limiter:warn  Job has hit timeout after 599,999ms of activity. +0ms
  browserless.io:limiter:info  Calling timeout handler +10m
  browserless.io:router:error  Websocket job has timedout, sending 429 response +0ms
  browserless.io:limiter:info  (Running: 0, Pending: 0) All jobs complete.  +0ms
  browserless.io:router:trace  WebSocket Request handler has finished. +10m
  browserless.io:browser-manager:info  0 Client(s) are currently connected, Keep-until: 0 +32m
  browserless.io:browser-manager:info  Closing browser session +0ms
  browserless.io:browser-manager:info  Deleting "/tmp/browserless-data-dirs/browserless-data-dir-067abc88-1797-4bc2-9da0-090e23cc0e6a" user-data-dir and session from memory +0ms
  browserless.io:ChromiumCDPWebSocketRoute:info 172.17.0.1 Closing ChromiumCDP process and all listeners +10m
  browserless.io:server:trace  Websocket connection complete +10m
  browserless.io:browser-manager:info  Deleting data directory "/tmp/browserless-data-dirs/browserless-data-dir-067abc88-1797-4bc2-9da0-090e23cc0e6a" +11ms

Worst case scenario: when I set up the following config, the scrapy process hangs without raising any errors.

{
    "PLAYWRIGHT_MAX_CONTEXTS": 1,
    "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 1,
}

Expected behavior

scrapy-playwright should handle this by creating a new session if the existing one has already been closed.
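
To illustrate the idea, a rough sketch only (not working code): a handler subclass, registered through DOWNLOAD_HANDLERS in place of the stock one, that catches the error and re-establishes the connection. The TargetClosedError import path is the one from the error above, _create_page is an internal scrapy-playwright method, and _reconnect() is a hypothetical helper that does not exist in scrapy-playwright.

# Sketch of the expected behavior, not something scrapy-playwright does today.
from playwright._impl._errors import TargetClosedError  # path shown in the error above
from scrapy_playwright.handler import ScrapyPlaywrightDownloadHandler


class ReconnectingHandler(ScrapyPlaywrightDownloadHandler):
    async def _create_page(self, request, spider):
        try:
            return await super()._create_page(request=request, spider=spider)
        except TargetClosedError:
            # browserless closed the remote browser: reconnect and retry once
            spider.logger.info("Remote browser was closed, reconnecting over CDP")
            await self._reconnect()  # hypothetical helper: re-run the CDP connection logic
            return await super()._create_page(request=request, spider=spider)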

elacuesta commented 1 week ago

I assume you are using PLAYWRIGHT_CDP_URL, correct? Please share a Minimal, Reproducible Example.

elacuesta commented 1 week ago

Also, this seems related to #167 and #189.

mhdzumair commented 1 week ago

Here is my complete spider script. https://github.com/mhdzumair/MediaFusion/blob/ba60d58aad96a278a61249b6d708fd82b7bf5d81/mediafusion_scrapy/spiders/tgx.py#L252-L266

elacuesta commented 6 days ago

This sounds like an actual issue, but since it's only about connecting to the browser, a few simple requests and some waiting to exhaust the timeout should suffice: it should be possible to reproduce it with around 30 lines of code. Please don't point to a full spider with pipelines, a Redis connection, item processing, etc.

Ehsan-U commented 6 days ago

@mhdzumair The browserless timeout can also be turned off by setting timeout=0 in the connection string: PLAYWRIGHT_CDP_URL = 'ws://localhost:3000/playwright/firefox?token=12345&timeout=0'
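
For example, appending the parameter to the CDP URL already used by the spider in this thread (a sketch; blockAds/stealth come from that spider's settings, and whether 0 or -1 disables the timeout is per the comments here):

import scrapy


class MinimalSpider(scrapy.Spider):
    name = "minimal"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        # timeout=0 added to the existing connection string
        "PLAYWRIGHT_CDP_URL": "ws://localhost:3000?blockAds=true&stealth=true&timeout=0",
    }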

mhdzumair commented 6 days ago

Yes, I'm aware of that timeout setting, as I mentioned in the issue description; I'm setting it as -e "TIMEOUT='-1'".

So here is the minimal code to reproduce the error.

Run browserless in a container with docker run -e "TIMEOUT=60000" -p 3000:3000 ghcr.io/browserless/chromium. I'm setting the timeout to 60s here, so after 60 seconds it will raise the error shown in the description. Alternatively, you can set -e "TIMEOUT='-1'" or -e "TIMEOUT=0" to disable it.

import scrapy
from scrapy_playwright.page import PageMethod


class TGXSpider(scrapy.Spider):
    name = "tgx_spider"
    start_urls = [
        "https://torrentgalaxy.to/profile/F1Carreras/torrents/0",
    ]

    custom_settings = {
        "PLAYWRIGHT_BROWSER_TYPE": "chromium",
        "PLAYWRIGHT_CDP_URL": "ws://localhost:3000?blockAds=true&stealth=true",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_MAX_CONTEXTS": 1,
        "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 1,
        "LOG_LEVEL": "DEBUG",
    }

    def parse(self, response, **kwargs):
        # Extract torrent links from the profile page

        torrent_links = response.css("div.tgxtablecell a::attr(href)").getall()
        for link in torrent_links:
            if "/torrent/" not in link:
                continue

            tgx_unique_id = link.split("/")[-2]
            torrent_page_link = response.urljoin(link)

            yield response.follow(
                torrent_page_link,
                self.parse_torrent_details,
                meta={
                    "playwright": True,
                    "playwright_page_goto_kwargs": {
                        "wait_until": "domcontentloaded",
                        "timeout": 60000,
                    },
                    "playwright_page_methods": [
                        PageMethod(
                            "wait_for_selector", "#smallguestnav", timeout=60000
                        ),
                    ],
                },
            )

    def parse_torrent_details(self, response):
        title = response.css("title::text").get()
        file_details = response.css(
            "table.table-striped tr td.table_col1::text"
        ).getall()

        data = {
            "title": title,
            "file_details": file_details,
        }
        print(data)
        yield data

Ehsan-U commented 6 days ago

Yes, I can reproduce it in a different way.

Browserless automatically closes the page on the server side after its timeout. If the page is then accessed on the client side, the TargetClosedError is raised by scrapy-playwright.
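
A sketch of how this can be observed from the client side, assuming the request was sent with playwright_include_page so the callback receives the Page object, and using Playwright's is_closed()/is_connected() checks:

# Spider callback sketch; the request was sent with
# meta={"playwright": True, "playwright_include_page": True}
async def parse(self, response):
    page = response.meta["playwright_page"]
    # is_closed()/is_connected() return plain booleans in Playwright's async API
    if page.is_closed() or not page.context.browser.is_connected():
        # browserless hit its timeout and closed the session; any further
        # page/context call would raise TargetClosedError
        self.logger.warning("Remote browser session was closed by browserless")
        return
    await page.close()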

elacuesta commented 1 day ago

I can reproduce with the following:

# separate terminal
$ docker run -e "TIMEOUT=5000" -p 3000:3000 ghcr.io/browserless/chromium
# timeout.py
import asyncio
import scrapy

class TimeoutTest(scrapy.Spider):
    name = "timeout"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_CDP_URL": "ws://0.0.0.0:3000",
    }

    def start_requests(self):
        yield scrapy.Request("https://example.org", meta={"playwright": True})

    async def parse(self, response):
        await asyncio.sleep(6)
        print(response.url)
        yield scrapy.Request("https://example.com", meta={"playwright": True})
$ scrapy runspider timeout.py
(...)
2024-07-03 17:38:08 [scrapy.core.engine] INFO: Spider opened
2024-07-03 17:38:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-03 17:38:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-03 17:38:08 [scrapy-playwright] INFO: Starting download handler
2024-07-03 17:38:13 [scrapy-playwright] INFO: Connecting using CDP: ws://0.0.0.0:3000
2024-07-03 17:38:14 [scrapy-playwright] INFO: Connected using CDP: ws://0.0.0.0:3000
2024-07-03 17:38:14 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=True)
2024-07-03 17:38:14 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-07-03 17:38:14 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://example.org/> (resource type: document)
2024-07-03 17:38:14 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://example.org/>
2024-07-03 17:38:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None) ['playwright']
2024-07-03 17:38:18 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False, remote=True)
https://example.org/
2024-07-03 17:38:21 [scrapy.core.scraper] ERROR: Error downloading <GET https://example.com>
Traceback (most recent call last):
  File "/.../lib/python3.10/site-packages/twisted/internet/defer.py", line 1996, in _inlineCallbacks
    result = context.run(
  File "/.../lib/python3.10/site-packages/twisted/python/failure.py", line 519, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/.../lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/.../lib/python3.10/site-packages/twisted/internet/defer.py", line 1248, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/home/eugenio/zyte/scrapy-playwright/scrapy_playwright/handler.py", line 319, in _download_request
    page = await self._create_page(request=request, spider=spider)
  File "/home/eugenio/zyte/scrapy-playwright/scrapy_playwright/handler.py", line 240, in _create_page
    ctx_wrapper = await self._create_browser_context(
  File "/home/eugenio/zyte/scrapy-playwright/scrapy_playwright/handler.py", line 195, in _create_browser_context
    context = await self.browser.new_context(**context_kwargs)
  File "/.../lib/python3.10/site-packages/playwright/async_api/_generated.py", line 13460, in new_context
    await self._impl_obj.new_context(
  File "/.../lib/python3.10/site-packages/playwright/_impl/_browser.py", line 127, in new_context
    channel = await self._channel.send("newContext", params)
  File "/.../lib/python3.10/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/.../lib/python3.10/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.TargetClosedError: Browser.new_context: Target page, context or browser has been closed
2024-07-03 17:38:21 [scrapy.core.engine] INFO: Closing spider (finished)
(...)