scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy

PLAYWRIGHT_RESTART_DISCONNECTED_BROWSER not working on local browser #304

Closed: elacuesta closed this issue 4 months ago

elacuesta commented 4 months ago

The handler is not allowing enough time for the new browser to launch after a crash.

Sample spider, adapted from #167:

```python
# crash.py
import os
from signal import SIGKILL

import psutil
import scrapy

class CrashSpider(scrapy.Spider):
    name = "crash"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request("https://httpbin.org/get", meta={"playwright": True})

    def parse(self, response):
        print("request:{}".format(response.request.url))
        # Kill every Chrome process to simulate a browser crash mid-crawl.
        for proc in psutil.process_iter(["pid", "name"]):
            if proc.info["name"] == "chrome":
                os.kill(proc.info["pid"], SIGKILL)
        # This request arrives while the browser is down and should trigger
        # a restart of the disconnected browser.
        yield scrapy.Request("https://httpbin.org/headers", meta={"playwright": True})
```
```
$ scrapy runspider crash.py

(...)
2024-07-16 14:55:09 [scrapy.core.engine] INFO: Spider opened
2024-07-16 14:55:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-16 14:55:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-16 14:55:09 [scrapy-playwright] INFO: Starting download handler
2024-07-16 14:55:14 [scrapy-playwright] INFO: Launching browser chromium
2024-07-16 14:55:14 [scrapy-playwright] INFO: Browser chromium launched
2024-07-16 14:55:14 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-07-16 14:55:14 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-07-16 14:55:14 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-07-16 14:55:14 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://httpbin.org/get>
2024-07-16 14:55:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None) ['playwright']
Response: <200 https://httpbin.org/get>
2024-07-16 14:55:14 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False, remote=False)
2024-07-16 14:55:14 [scrapy-playwright] DEBUG: Browser disconnected
2024-07-16 14:55:15 [scrapy.core.scraper] ERROR: Error downloading <GET https://httpbin.org/headers>
Traceback (most recent call last):
  File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/twisted/internet/defer.py", line 1996, in _inlineCallbacks
    result = context.run(
  File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/twisted/python/failure.py", line 519, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/twisted/internet/defer.py", line 1248, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/.../scrapy_playwright/handler.py", line 358, in _download_request
    page = await self._create_page(request=request, spider=spider)
  File "/.../scrapy_playwright/handler.py", line 286, in _create_page
    page = await ctx_wrapper.context.new_page()
  File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 12379, in new_page
    return mapping.from_impl(await self._impl_obj.new_page())
  File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_browser_context.py", line 294, in new_page
    return from_channel(await self._channel.send("newPage"))
  File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/.../venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.TargetClosedError: BrowserContext.new_page: Target page, context or browser has been closed
Browser logs:

<launching> /home/eugenio/.cache/ms-playwright/chromium-1117/chrome-linux/chrome --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate,HttpsUpgrades,PaintHolding --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --disable-search-engine-choice-screen --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=/tmp/playwright_chromiumdev_profile-XXXXXXTy2tU6 --remote-debugging-pipe --no-startup-window
<launched> pid=59155
[pid=59155][err] [0716/145514.301003:INFO:config_dir_policy_loader.cc(118)] Skipping mandatory platform policies because no policy file was found at: /etc/chromium/policies/managed
[pid=59155][err] [0716/145514.301041:INFO:config_dir_policy_loader.cc(118)] Skipping recommended platform policies because no policy file was found at: /etc/chromium/policies/recommended
[pid=59155][err] [0716/145514.308584:WARNING:bluez_dbus_manager.cc(248)] Floss manager not present, cannot set Floss enable/disable.
[pid=59155][err] [0716/145514.343012:WARNING:sandbox_linux.cc(436)] InitializeSandbox() called with multiple threads in process gpu-process.
2024-07-16 14:55:15 [scrapy.core.engine] INFO: Closing spider (finished)
(...)
```
```
$ scrapy version -v
Scrapy       : 2.11.1
lxml         : 5.1.0.0
libxml2      : 2.12.3
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 23.10.0
Python       : 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0]
pyOpenSSL    : 24.0.0 (OpenSSL 3.2.1 30 Jan 2024)
cryptography : 42.0.5
Platform     : Linux-6.5.0-41-generic-x86_64-with-glibc2.35
```

```
$ python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
0.0.39
```

I don't think this can be handled with locking or other synchronization primitives, as the browser crash could happen at any time. Retrying seems like the most sensible way.
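
As a sketch of that retry idea (the helper name, retry policy, and call site below are hypothetical assumptions, not scrapy-playwright's actual code), page creation could be wrapped so that a Playwright error triggers a short backoff while the relaunched browser comes up:

```python
# Hypothetical sketch only: the helper name, retry policy, and usage are
# assumptions, not part of scrapy-playwright's API.
import asyncio

from playwright.async_api import Error as PlaywrightError  # TargetClosedError subclasses this

async def retry_while_browser_restarts(coro_factory, attempts=3, delay=1.0):
    for attempt in range(1, attempts + 1):
        try:
            return await coro_factory()
        except PlaywrightError:
            if attempt == attempts:
                raise
            # Give the relaunched browser time to come up before retrying.
            await asyncio.sleep(delay * attempt)

# Possible usage inside the download handler (assumed call site):
#   page = await retry_while_browser_restarts(ctx_wrapper.context.new_page)
```

Note that after a crash the old context wrapper is gone with the browser, so in practice the handler would also need to re-acquire a fresh context before retrying `new_page`, which is presumably why a plain retry loop around the existing call is not enough on its own.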

gelodefaultbrain commented 4 months ago

Hi @elacuesta, in relation to my issue here: https://github.com/scrapy-plugins/scrapy-playwright/issues/294

I think the update you made to restart the browser worked for me! However, I have a retry middleware enabled, and I find it odd that when the browser crashes and comes back up, my retry middleware no longer seems to retry that specific request. I'm not sure why; it could be my middleware, but I'm letting you know just in case. Let me know if you need additional info. Thank you.
I think the update you made to have the browser restarted worked for me! However I have this retry middleware enabled. I just find it weird that when the browser crashes, it does show up again but it seems that the retry middleware I've made doesn't necessarily retry that specific request anymore. I'm not sure why , it could be my middleware but just letting you know just in case. Let me know if you need additional info. Thank you.