scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy

Unhandled browser crash event #167

Open · NiuBlibing opened this issue 1 year ago

NiuBlibing commented 1 year ago

When Chrome is killed or crashes, the handler still tries to create new contexts and pages on the dead browser and throws an exception:

2023-01-31 19:29:51 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.baidu.com>
Traceback (most recent call last):
  File "/home/test/source/test/venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
    result = current_context.run(
  File "/home/test/source/test/venv/lib/python3.10/site-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/test/source/test/venv/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/test/source/test/venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1030, in adapt
    extracted = result.result()
  File "/home/test/source/test/venv/lib/python3.10/site-packages/scrapy_playwright/handler.py", line 261, in _download_request
    page = await self._create_page(request)
  File "/home/test/source/test/venv/lib/python3.10/site-packages/scrapy_playwright/handler.py", line 187, in _create_page
    context = await self._create_browser_context(
  File "/home/test/source/test/venv/lib/python3.10/site-packages/scrapy_playwright/handler.py", line 163, in _create_browser_context
    context = await self.browser.new_context(**context_kwargs)
  File "/home/test/source/test/venv/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 13847, in new_context
    await self._impl_obj.new_context(
  File "/home/test/source/test/venv/lib/python3.10/site-packages/playwright/_impl/_browser.py", line 127, in new_context
    channel = await self._channel.send("newContext", params)
  File "/home/test/source/test/venv/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 44, in send
    return await self._connection.wrap_api_call(
  File "/home/test/source/test/venv/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 419, in wrap_api_call
    return await cb()
  File "/home/test/source/test/venv/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 79, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed

I use this code to reproduce it:

import os
from signal import SIGKILL

import psutil
import scrapy


class Debug1Spider(scrapy.Spider):
    name = 'debug1'
    custom_settings = {
        "PLAYWRIGHT_CONTEXTS": {
            "default": {
                "ignore_https_errors": True,
            }
        }
    }

    def start_requests(self):
        # Two Playwright requests, handled by the default callback (parse).
        yield scrapy.Request(
            'https://www.httpbin.org/get',
            meta={"playwright": True, "playwright_include_page": False},
        )
        yield scrapy.Request(
            'https://www.httpbin.org/',
            meta={"playwright": True, "playwright_include_page": False},
        )
        # Simulate a browser crash: SIGKILL every running Chrome process.
        for proc in psutil.process_iter(['pid', 'name']):
            if proc.info["name"] == "chrome":
                os.kill(proc.info["pid"], SIGKILL)

    async def parse(self, response):
        print("request: {}".format(response.request.url))
NiuBlibing commented 1 year ago

It seems the handler needs to deal with the browser's disconnected event. From the Playwright docs:

on("disconnected"): Emitted when the Browser gets disconnected from the browser application. This might happen because of one of the following:

  • Browser application is closed or crashed.
  • The browser.close() method was called.
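
For illustration, a minimal standalone sketch of registering such a handler with playwright.async_api (the handler body here is just a print; it is not the plugin's code):

import asyncio

from playwright.async_api import Browser, async_playwright


async def main() -> None:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()

        def on_disconnected(browser: Browser) -> None:
            # Fires when the browser application closes or crashes,
            # or when browser.close() is called.
            print("browser disconnected")

        browser.on("disconnected", on_disconnected)
        page = await browser.new_page()
        await page.goto("https://example.org")
        await browser.close()  # also triggers the handler


asyncio.run(main())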
NiuBlibing commented 1 year ago

Is this patch ok?

---
 scrapy_playwright/handler.py | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/scrapy_playwright/handler.py b/scrapy_playwright/handler.py
index 36c96cd..428f71d 100644
--- a/scrapy_playwright/handler.py
+++ b/scrapy_playwright/handler.py
@@ -132,6 +132,7 @@ class ScrapyPlaywrightDownloadHandler(HTTPDownloadHandler):
             if not hasattr(self, "browser"):
                 logger.info("Launching browser %s", self.browser_type.name)
                 self.browser: Browser = await self.browser_type.launch(**self.launch_options)
+                self.browser.on("disconnected", self.__make_close_browser_callback())
                 logger.info("Browser %s launched", self.browser_type.name)

     async def _create_browser_context(
@@ -447,6 +448,12 @@ class ScrapyPlaywrightDownloadHandler(HTTPDownloadHandler):

         return close_browser_context_callback

+    def __make_close_browser_callback(self) -> Callable:
+        def close_browser_call() -> None:
+            logger.debug("Browser closed")
+            del self.browser
+        return close_browser_call
+
     def _make_request_handler(
         self,
         context_name: str,
-- 
2.39.1
elacuesta commented 1 year ago

> Is this patch ok?

Looks like a good start; however, it might be necessary to also close the contexts, as in https://github.com/scrapy-plugins/scrapy-playwright/blob/v0.0.26/scrapy_playwright/handler.py#L260-L261. This needs some research: contexts might be implicitly closed by the browser crash, but in any case I'd like to make sure all related contexts are closed and removed from the context_wrappers dict. Another detail to consider is that persistent contexts are not tied to the browser instance, and there doesn't seem to be a way to listen for the disconnected event at the context level.
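
A rough sketch of what that fuller cleanup could look like, extending the patch above (context_wrappers is named in the comment; the wrapper attribute and the defensive close loop are assumptions, not the plugin's actual code):

def __make_close_browser_callback(self) -> Callable:
    async def close_browser_callback(browser) -> None:
        logger.debug("Browser disconnected")
        # Contexts tied to the crashed browser are probably gone already,
        # but close them defensively and drop them from the bookkeeping dict.
        for name, wrapper in list(self.context_wrappers.items()):
            try:
                await wrapper.context.close()
            except Exception:
                pass  # the context may already be closed
            self.context_wrappers.pop(name, None)
        if hasattr(self, "browser"):
            del self.browser

    return close_browser_callback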

This all assumes we want the crawl to continue if the browser crashes: deleting the browser attribute causes any subsequent request to launch a new one. I'm actually more inclined to close the engine and stop everything if the browser crashes, but I'm willing to be proven wrong.
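
And a sketch of that alternative, stopping the whole crawl instead of relaunching; it assumes the handler kept a reference to the crawler when it was created, and uses Scrapy's Crawler.stop() (again an illustration, not the maintainers' implementation):

def __make_close_browser_callback(self) -> Callable:
    def close_browser_callback(browser) -> None:
        logger.warning("Browser disconnected, stopping the crawl")
        # Crawler.stop() shuts the crawl down gracefully; self.crawler
        # is assumed to have been stored by the handler at init time.
        self.crawler.stop()

    return close_browser_callback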

NiuBlibing commented 1 year ago

There is another problem: when the driver crashes, it may not trigger the disconnected event at all, and the pending page.goto call blocks forever without ever timing out.

/usr/local/lib/python3.10/site-packages/playwright/driver/package/lib/server/chromium/crPage.js:378
      this._firstNonInitialNavigationCommittedReject(new Error('Page closed'));
                                                     ^

Error: Page closed
    at CRSession.<anonymous> (/usr/local/lib/python3.10/site-packages/playwright/driver/package/lib/server/chromium/crPage.js:378:54)
    at Object.onceWrapper (node:events:627:28)
    at CRSession.emit (node:events:525:35)
    at /usr/local/lib/python3.10/site-packages/playwright/driver/package/lib/server/chromium/crConnection.js:211:39
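
One way to keep a dead driver from hanging the crawl is to bound the navigation with an outer asyncio deadline, so the call fails with asyncio.TimeoutError instead of blocking forever. A minimal sketch (the helper name and default deadline are made up; this is not how scrapy-playwright currently wraps goto):

import asyncio
from typing import Optional

from playwright.async_api import Page, Response


async def goto_with_deadline(
    page: Page, url: str, deadline: float = 60.0
) -> Optional[Response]:
    # page.goto enforces its own timeout inside the driver; if the driver
    # process itself is dead, only an outer asyncio deadline can still fire.
    return await asyncio.wait_for(page.goto(url), timeout=deadline)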