Open NiuBlibing opened 1 year ago
It seems the handler needs to deal with the browser-closed event.
> `on("disconnected")` — Emitted when Browser gets disconnected from the browser application. This might happen because of one of the following:
> - Browser application is closed or crashed.
> - The `browser.close()` method was called.
Is this patch ok?
---
scrapy_playwright/handler.py | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/scrapy_playwright/handler.py b/scrapy_playwright/handler.py
index 36c96cd..428f71d 100644
--- a/scrapy_playwright/handler.py
+++ b/scrapy_playwright/handler.py
@@ -132,6 +132,7 @@ class ScrapyPlaywrightDownloadHandler(HTTPDownloadHandler):
         if not hasattr(self, "browser"):
             logger.info("Launching browser %s", self.browser_type.name)
             self.browser: Browser = await self.browser_type.launch(**self.launch_options)
+            self.browser.on("disconnected", self.__make_close_browser_callback())
             logger.info("Browser %s launched", self.browser_type.name)
 
     async def _create_browser_context(
@@ -447,6 +448,12 @@ class ScrapyPlaywrightDownloadHandler(HTTPDownloadHandler):
         return close_browser_context_callback
 
+    def __make_close_browser_callback(self) -> Callable:
+        def close_browser_call() -> None:
+            logger.debug("Browser closed")
+            del self.browser
+        return close_browser_call
+
     def _make_request_handler(
         self,
         context_name: str,
--
2.39.1
> Is this patch ok?
Looks like a good start, however it might be necessary to also close the contexts, like in https://github.com/scrapy-plugins/scrapy-playwright/blob/v0.0.26/scrapy_playwright/handler.py#L260-L261. This needs some research; I guess contexts might be implicitly closed because of the browser crash, but in any case I'd like to make sure all related contexts are closed and removed from the `context_wrappers` dict (another detail to consider is that persistent contexts are not tied to the browser instance, and there doesn't seem to be a way to listen to the `disconnected` event at the context level).
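The cleanup described above might look roughly like the sketch below. `FakeContext` is a stand-in for a Playwright `BrowserContext` (which exposes an async `close()` the same way), and `context_wrappers` is a simplified version of the handler's dict, so the snippet runs on its own:

```python
import asyncio


class FakeContext:
    """Stand-in for a Playwright BrowserContext; the real one also has an async close()."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


async def close_all_contexts(context_wrappers: dict) -> None:
    # Close every tracked context and forget it, so a later request
    # cannot reuse a context tied to the dead browser.
    for name, ctx in list(context_wrappers.items()):
        try:
            await ctx.close()
        except Exception:
            pass  # the context may already be gone after a crash
        context_wrappers.pop(name, None)


context_wrappers = {"default": FakeContext(), "persistent": FakeContext()}
asyncio.run(close_all_contexts(context_wrappers))
```

The `try`/`except` matters here: if the browser process died, closing an already-dead context may raise, and that should not abort the rest of the cleanup.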
This all assumes we would like the crawl to continue if the browser crashes: deleting the `browser` attribute would cause any subsequent request to try to launch a new one. I'm actually more inclined to close the engine and stop everything if the browser crashes, but I'm willing to be proven wrong.
There is another problem: when the driver crashes it may not trigger the crash event, and `page.goto` blocks forever without ever timing out.
/usr/local/lib/python3.10/site-packages/playwright/driver/package/lib/server/chromium/crPage.js:378
this._firstNonInitialNavigationCommittedReject(new Error('Page closed'));
^
Error: Page closed
at CRSession.<anonymous> (/usr/local/lib/python3.10/site-packages/playwright/driver/package/lib/server/chromium/crPage.js:378:54)
at Object.onceWrapper (node:events:627:28)
at CRSession.emit (node:events:525:35)
at /usr/local/lib/python3.10/site-packages/playwright/driver/package/lib/server/chromium/crConnection.js:211:39
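One possible workaround for the hang, sketched under the assumption that the await itself never resolves: wrap the navigation in `asyncio.wait_for` so a dead driver cannot block the crawl forever. `hang_forever` stands in for a `page.goto` call on a crashed browser:

```python
import asyncio


async def hang_forever():
    # Simulates a navigation that never resolves: this event is never set.
    await asyncio.Event().wait()


async def goto_with_timeout(nav_coro, timeout: float):
    try:
        return await asyncio.wait_for(nav_coro, timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the coroutine on timeout; treat it as a failed
        # navigation so the caller can retry or abort the request.
        return None


result = asyncio.run(goto_with_timeout(hang_forever(), timeout=0.1))
```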
When Chrome is killed or crashes, the context keeps trying to create new pages and throws an exception:
I use this code: