mhdzumair opened this issue 1 week ago
I assume you are using PLAYWRIGHT_CDP_URL, correct? Please share a Minimal, Reproducible Example.
Also, this seems related to #167 and #189.
Here is my complete spider script. https://github.com/mhdzumair/MediaFusion/blob/ba60d58aad96a278a61249b6d708fd82b7bf5d81/mediafusion_scrapy/spiders/tgx.py#L252-L266
This sounds like an actual issue, but it's only about connecting to the browser; a few simple requests and some waiting to exhaust the timeout should suffice, so it should be possible to reproduce it with around 30 lines of code. Please don't point to a full spider with pipelines, a Redis connection, item processing, etc.
@mhdzumair
The browserless timeout can be turned off by setting timeout=0 in the connection string:
PLAYWRIGHT_CDP_URL = 'ws://localhost:3000/playwright/firefox?token=12345&timeout=0'
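In a spider's settings this would just mean appending the parameter to the connection string, e.g. (a sketch, reusing the query parameters from the spider below and assuming browserless honours timeout=0 on that endpoint as well):
# sketch: disable the browserless session timeout from the client side
custom_settings = {
    "PLAYWRIGHT_CDP_URL": "ws://localhost:3000?blockAds=true&stealth=true&timeout=0",
}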
Yes, I'm aware of that timeout setting, as I mentioned in the issue description; I'm setting it as -e "TIMEOUT='-1'".
So here is the minimal code to reproduce the error.
Run browserless in a container: docker run -e "TIMEOUT=60000" -p 3000:3000 ghcr.io/browserless/chromium
I'm setting the timeout to 60 s here, so after 60 seconds it raises the error shown in the description. Additionally, you can set -e "TIMEOUT='-1'" or -e "TIMEOUT=0" to disable it.
import scrapy
from scrapy_playwright.page import PageMethod


class TGXSpider(scrapy.Spider):
    name = "tgx_spider"
    start_urls = [
        "https://torrentgalaxy.to/profile/F1Carreras/torrents/0",
    ]
    custom_settings = {
        "PLAYWRIGHT_BROWSER_TYPE": "chromium",
        "PLAYWRIGHT_CDP_URL": "ws://localhost:3000?blockAds=true&stealth=true",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_MAX_CONTEXTS": 1,
        "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 1,
        "LOG_LEVEL": "DEBUG",
    }

    def parse(self, response, **kwargs):
        # Extract torrent links from the profile page
        torrent_links = response.css("div.tgxtablecell a::attr(href)").getall()
        for link in torrent_links:
            if "/torrent/" not in link:
                continue
            tgx_unique_id = link.split("/")[-2]
            torrent_page_link = response.urljoin(link)
            yield response.follow(
                torrent_page_link,
                self.parse_torrent_details,
                meta={
                    "playwright": True,
                    "playwright_page_goto_kwargs": {
                        "wait_until": "domcontentloaded",
                        "timeout": 60000,
                    },
                    "playwright_page_methods": [
                        PageMethod(
                            "wait_for_selector", "#smallguestnav", timeout=60000
                        ),
                    ],
                },
            )

    def parse_torrent_details(self, response):
        title = response.css("title::text").get()
        file_details = response.css(
            "table.table-striped tr td.table_col1::text"
        ).getall()
        data = {
            "title": title,
            "file_details": file_details,
        }
        print(data)
        yield data
Yes, I can reproduce it in a different way.
Browserless automatically closes the page on the server side after its timeout. If the page is accessed afterwards on the client side, scrapy-playwright then raises the "Target page, context or browser has been closed" error.
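For reference, the same failure mode can be sketched with plain Playwright, outside Scrapy (assuming browserless is running locally with a short timeout, e.g. docker run -e "TIMEOUT=5000" -p 3000:3000 ghcr.io/browserless/chromium):
# plain-Playwright sketch of the same failure mode, no Scrapy involved:
# browserless drops the session after TIMEOUT, and the next call on the
# already-connected browser raises TargetClosedError.
import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.connect_over_cdp("ws://localhost:3000")
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://example.org")
        # wait past the 5 s browserless TIMEOUT so the server closes the session
        await asyncio.sleep(6)
        # this call now fails with "Target page, context or browser has been closed"
        await browser.new_context()


asyncio.run(main())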
I can reproduce with the following:
# separate terminal
$ docker run -e "TIMEOUT=5000" -p 3000:3000 ghcr.io/browserless/chromium
# timeout.py
import asyncio

import scrapy


class TimeoutTest(scrapy.Spider):
    name = "timeout"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_CDP_URL": "ws://0.0.0.0:3000",
    }

    def start_requests(self):
        yield scrapy.Request("https://example.org", meta={"playwright": True})

    async def parse(self, response):
        await asyncio.sleep(6)
        print(response.url)
        yield scrapy.Request("https://example.com", meta={"playwright": True})
$ scrapy runspider timeout.py
(...)
2024-07-03 17:38:08 [scrapy.core.engine] INFO: Spider opened
2024-07-03 17:38:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-03 17:38:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-03 17:38:08 [scrapy-playwright] INFO: Starting download handler
2024-07-03 17:38:13 [scrapy-playwright] INFO: Connecting using CDP: ws://0.0.0.0:3000
2024-07-03 17:38:14 [scrapy-playwright] INFO: Connected using CDP: ws://0.0.0.0:3000
2024-07-03 17:38:14 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=True)
2024-07-03 17:38:14 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-07-03 17:38:14 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://example.org/> (resource type: document)
2024-07-03 17:38:14 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://example.org/>
2024-07-03 17:38:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None) ['playwright']
2024-07-03 17:38:18 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False, remote=True)
https://example.org/
2024-07-03 17:38:21 [scrapy.core.scraper] ERROR: Error downloading <GET https://example.com>
Traceback (most recent call last):
File "/.../lib/python3.10/site-packages/twisted/internet/defer.py", line 1996, in _inlineCallbacks
result = context.run(
File "/.../lib/python3.10/site-packages/twisted/python/failure.py", line 519, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/.../lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
return (yield download_func(request=request, spider=spider))
File "/.../lib/python3.10/site-packages/twisted/internet/defer.py", line 1248, in adapt
extracted: _SelfResultT | Failure = result.result()
File "/home/eugenio/zyte/scrapy-playwright/scrapy_playwright/handler.py", line 319, in _download_request
page = await self._create_page(request=request, spider=spider)
File "/home/eugenio/zyte/scrapy-playwright/scrapy_playwright/handler.py", line 240, in _create_page
ctx_wrapper = await self._create_browser_context(
File "/home/eugenio/zyte/scrapy-playwright/scrapy_playwright/handler.py", line 195, in _create_browser_context
context = await self.browser.new_context(**context_kwargs)
File "/.../lib/python3.10/site-packages/playwright/async_api/_generated.py", line 13460, in new_context
await self._impl_obj.new_context(
File "/.../lib/python3.10/site-packages/playwright/_impl/_browser.py", line 127, in new_context
channel = await self._channel.send("newContext", params)
File "/.../lib/python3.10/site-packages/playwright/_impl/_connection.py", line 59, in send
return await self._connection.wrap_api_call(
File "/.../lib/python3.10/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.TargetClosedError: Browser.new_context: Target page, context or browser has been closed
2024-07-03 17:38:21 [scrapy.core.engine] INFO: Closing spider (finished)
(...)
Description
Browserless has a default browser session timeout of 30 s (https://docs.browserless.io/Docker/docker#connection-timeout); it can be turned off by setting the browserless container env variable TIMEOUT to -1. However, when the timeout is enabled, the session gets terminated after the timeout period and scrapy-playwright is then unable to make any new connection; it raises the TargetClosedError shown in the traceback above.
In the worst case, with the config I set up, the Scrapy process simply hangs without raising any errors.
Expected behavior
This should be handled by creating a new session if the existing one has already been closed.
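For illustration only, here is a rough sketch of what such handling could look like around the context-creation step. This is not scrapy-playwright's actual code; the handler attributes and helper name are hypothetical, and it assumes a Playwright version that exports TargetClosedError from playwright.async_api:
# hypothetical sketch of reconnect-on-closed handling; not scrapy-playwright code
from playwright.async_api import TargetClosedError


async def _new_context_or_reconnect(handler, **context_kwargs):
    # if browserless already terminated the remote session, reconnect first
    if handler.browser is None or not handler.browser.is_connected():
        handler.browser = await handler.browser_type.connect_over_cdp(handler.cdp_url)
    try:
        return await handler.browser.new_context(**context_kwargs)
    except TargetClosedError:
        # the session died between the check and the call: reconnect and retry once
        handler.browser = await handler.browser_type.connect_over_cdp(handler.cdp_url)
        return await handler.browser.new_context(**context_kwargs)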