scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy

Scrapy hangs with no exception raised #266

Closed. Zhou-Haowei closed this issue 4 months ago

Zhou-Haowei commented 8 months ago

Configurations

Configuration 1:
  Model Name: MacBook Pro
  Chip: Apple M1 Max
  Cores: 10 (8 performance and 2 efficiency)
  Memory: 32GB
  System Version: macOS Sonoma 14.2.1
  Playwright Version: 1.42.0
  Python Version: Python 3.10.14
  Browser: Chromium 123.0.6312.4
  Browser: Firefox 123.0

Configuration 2:
  Chip: Intel Xeon
  Cores: 4 logical cores
  Memory: 8GB
  System Version: Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-76-generic x86_64)
  Docker Version: 25.0.4
  Playwright Version: 1.42.0
  Python Version: Python 3.10.13
  Browser: Chromium 123.0.6312.4
  Browser: Firefox 123.0

Description

I'm experiencing a problem where my Scrapy spider hangs indefinitely on a random URL during the crawl, without throwing any exceptions. The spider uses scrapy-playwright for page rendering. I have tried running my project on both of the above configurations, and the same problem occurs on each.

Actual Behavior

Both Chromium and Firefox hang on scraping tasks, each on different URLs.

2024-03-23 10:13:07 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: webspider)
2024-03-23 10:13:07 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.12.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 24.3.0, Python 3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.5, Platform macOS-14.2.1-arm64-arm-64bit
2024-03-23 10:13:07 [scrapy.addons] INFO: Enabled addons:
[]
2024-03-23 10:13:07 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-03-23 10:13:07 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-03-23 10:13:07 [scrapy.extensions.telnet] INFO: Telnet Password: 231864415aa317a2
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2024-03-23 10:13:07 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'webspider',
 'CONCURRENT_REQUESTS': 32,
 'DNS_TIMEOUT': 6,
 'DOWNLOAD_DELAY': 0.25,
 'DOWNLOAD_TIMEOUT': 30,
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'webspider.spiders',
 'REACTOR_THREADPOOL_MAXSIZE': 16,
 'REDIRECT_MAX_TIMES': 5,
 'REFERER_ENABLED': False,
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'RETRY_ENABLED': False,
 'SPIDER_MODULES': ['webspider.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'webspider.middlewares.RandomUserAgentMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-03-23 10:13:07 [scrapy.core.engine] INFO: Spider opened
2024-03-23 10:13:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:13:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-03-23 10:13:07 [scrapy-playwright] INFO: Starting download handler
2024-03-23 10:13:07 [scrapy-playwright] INFO: Starting download handler
2024-03-23 10:13:12 [scrapy-playwright] INFO: Launching browser chromium
2024-03-23 10:13:13 [scrapy-playwright] INFO: Browser chromium launched
2024-03-23 10:13:13 [scrapy-playwright] DEBUG: Browser context started: 'm-3baa5cd4-1614-4a6a-af55-5dbbd9682f29' (persistent=False, remote=False)
2024-03-23 10:13:13 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] New page created, page count is 1 (1 for all contexts)
......
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 http://bdimg.share.baidu.com/static/api/js/view/view_base.js>
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 http://bdimg.share.baidu.com/static/api/js/share/api_base.js>
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Request: <GET http://bdimg.share.baidu.com/static/api/css/share_style0_16.css?v=8105b07e.css> (resource type: stylesheet)
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 http://bdimg.share.baidu.com/static/api/css/share_style0_16.css?v=8105b07e.css>
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 https://nsclick.baidu.com/v.gif?pid=307&type=3071&sign=&desturl=https%253A%252F%252Fikikiuu.online%252F&linkid=lu3gj3suojw&apitype=0>
2024-03-23 10:14:41 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 https://api.share.baidu.com/v.gif>
2024-03-23 10:15:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:16:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:17:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:18:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:19:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:20:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

Code

#  SETTINGS
@classmethod
def update_settings(cls, settings: BaseSettings) -> None:
    super().update_settings(settings)
    # spider
    settings.set("REFERER_ENABLED", False)
    settings.set("TWISTED_REACTOR", "twisted.internet.asyncioreactor.AsyncioSelectorReactor")
    settings.set("DOWNLOAD_HANDLERS", {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    })
    settings.set("SPIDER_MIDDLEWARES", {
        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    })
    settings.set("DOWNLOADER_MIDDLEWARES", {
        "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
        "webspider.middlewares.RandomUserAgentMiddleware": 300,
    })

    settings.set("ROBOTSTXT_OBEY", False)
    settings.set("RETRY_ENABLED", False)
    settings.set("REDIRECT_MAX_TIMES", 5)
    settings.set("DOWNLOAD_TIMEOUT", 30)
    settings.set("DOWNLOAD_DELAY", 0.25)

    # playwright
    settings.set("PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT", "30000")
    settings.set("PLAYWRIGHT_BROWSER_TYPE", "chromium")
    settings.set("PLAYWRIGHT_MAX_CONTEXTS", 1)
    settings.set("PLAYWRIGHT_MAX_PAGES_PER_CONTEXT", 1)
    settings.set("PLAYWRIGHT_DEFAULT_HEADERS", {'Cache-Control': 'no-cache'})

    if settings.get("PLAYWRIGHT_BROWSER_TYPE") == "firefox":
        prefs = {
            "security.mixed_content.block_active_content": False,
            "webgl.disabled": True,
            "services.sync.engine.addons": False,
            "services.sync.engine.bookmarks": False,
            "services.sync.engine.history": False,
            "services.sync.engine.passwords": False,
            "services.sync.engine.prefs": False,
            "dom.webaudio.enabled": False,
            "extensions.abuseReport.enabled": False,
            "dom.webnotifications.enabled": False,
            "browser.cache.disk.enable": False,
        }
        browser_args = [
            "-private",
        ]
        settings.set("PLAYWRIGHT_LAUNCH_OPTIONS", {
            "headless": False,
            "timeout": 30 * 1000,
            "devtools": False,
            "handle_sigint": False,
            "handle_sigterm": False,
            "handle_sighup": False,
            "args": browser_args,
            "firefox_user_prefs": prefs,
        })

    if settings.get("PLAYWRIGHT_BROWSER_TYPE") == "chromium":
        browser_args = [
            "--incognito",
            "--disable-gpu",
            "--disable-sync",
            "--disable-apps",
            "--disable-audio",
            "--disable-plugins",
            "--disable-infobars",
            "--disable-extensions",
            "--disable-translate",
            "--disable-geolocation",
            "--disable-notifications",
            "--disable-winsta",
            "--disable-dev-shm-usage",
            "--disable-webgl",
            "--disable-cache",
            "--disable-popup-blocking",
            "--disable-back-forward-cache",
            "--arc-disable-gms-core-cache",
            "--process-per-site",
            "--disable-offline-load-stale-cache",
            "--disk-cache-size=0",
            "--no-sandbox",
            "--disable-client-side-phishing-detection",
            "--disable-breakpad",
            "--ignore-certificate-errors",
            "--ignore-urlfetcher-cert-requests",
            "--disable-blink-features=AutomationControlled",
            "--disable-web-security",
            "--allow-running-insecure-content",
        ]
        settings.set("PLAYWRIGHT_LAUNCH_OPTIONS", {
            "headless": False,
            "timeout": 30 * 1000,
            "devtools": False,
            "handle_sigint": False,
            "handle_sigterm": False,
            "handle_sighup": False,
            "args": browser_args,
        })
def start_requests(self) -> Iterable[Request]:
    logging.info("start url: %s" % self.start_urls[0])
    task_item = WebTaskItem(
                task_id="m-" + str(uuid.uuid4()),
                url=self.start_urls[0],
            )
    yield Request(
        url=self.start_urls[0],
        meta={
            TASK_ITEM_META_KEY: task_item,
            "spider": self.name,
            RandomUserAgentMiddleware.use_mobile_ua_flag: task_item.is_mobile,
            "playwright": True,
            "playwright_include_page": True,
            "playwright_context": task_item.task_id,
            "playwright_context_kwargs": {
                "ignore_https_errors": True,
                "is_mobile": task_item.is_mobile,
                "bypass_csp": True,
            },
            "playwright_page_init_callback": self.init_page,
            "playwright_page_goto_kwargs": {
                "wait_until": "load",
                "timeout": 30000
            },
            "handle_httpstatus_all": True,
        },
        dont_filter=False,
        callback=self.response_callback,
        errback=self.failure_callback
    )

async def response_callback(self, response: Response, **kwargs) -> WebTaskResultItem:
    page: Page = response.meta.get("playwright_page", None)
    if not page:
        raise ValueError("PlaywrightSpider.response_callback: page is None")

    try:
        info = PageInfoItem()
        for processor in self.processors:
            # just fetch html / get title / take screenshots from page object
            await processor.process(response=response, info=info)

        result = WebTaskResultItem(
            task_id=response.request.meta[TASK_ITEM_META_KEY].task_id,
            page_info=info,
            error_message=""
        )

        logging.info("result: %s" % result.model_dump_json())
        yield result
    except Exception as e:
        logging.warning(f"PlaywrightSpider.response_callback: ex={e}")
        result = WebTaskResultItem(
            task_id=response.request.meta[TASK_ITEM_META_KEY].task_id,
            error_message=str(e)
        )
        logging.info("result: %s" % result.model_dump_json())
    finally:
        await page.close()
        await page.context.close()

async def failure_callback(self, failure, **kwargs) -> WebTaskResultItem:
    page: Page = failure.request.meta.get("playwright_page", None)
    if page:
        try:
            await page.close()
            await page.context.close()
        except Exception as e:
            self.logger.warning(f"stop error:{e}")

    result = WebTaskResultItem(
        task_id=failure.request.meta[TASK_ITEM_META_KEY].task_id,
        error_message=failure.getErrorMessage()
    )
    logging.info("result: %s" % result.model_dump_json())
    yield result

Troubleshooting Steps Taken

Debugging

I used the Python debugger to step through the code and found the following:

  1. The requests were added to the task queue.
  2. The debugger shows that the request was processed by scrapy-playwright.
  3. The browser started correctly.
  4. The browser rendered the page correctly; the goto function was called.
  5. Then the spider hangs for no apparent reason, ignoring every timeout setting.
  6. No further tasks are processed; the spider only keeps printing crawl stats.
  7. Neither callback=self.response_callback nor errback=self.failure_callback was ever called.
  8. The main thread loops in the event selector (a small watchdog sketch for inspecting the stuck state is included below).
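
For completeness, this is roughly how the stuck state can be inspected without attaching a debugger: a small watchdog coroutine that periodically lists the asyncio tasks that are still pending and where each one is suspended. It is only a sketch; the 60-second interval and the logger name are arbitrary, and the coroutine has to be scheduled on the reactor's event loop, for example with asyncio.ensure_future(dump_pending_tasks()) from start_requests.

import asyncio
import logging

watchdog_logger = logging.getLogger("hang-watchdog")

async def dump_pending_tasks(interval: float = 60.0) -> None:
    # Every `interval` seconds, list the asyncio tasks that are still pending
    # and print the stack each one is suspended at, to see what is being
    # awaited when the spider stops making progress.
    while True:
        await asyncio.sleep(interval)
        tasks = [
            task
            for task in asyncio.all_tasks()
            if task is not asyncio.current_task() and not task.done()
        ]
        watchdog_logger.info("pending asyncio tasks: %d", len(tasks))
        for task in tasks:
            task.print_stack(limit=5)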

Huge thanks for any ideas!

elacuesta commented 7 months ago

This report is not actionable as it is and should be distilled further; several of the things in it are likely unrelated. Additionally, there are missing symbols and there is no actual URL to test (it is relevant whether or not the issue occurs in every case, in which case a generic URL like https://example.org could be used). Please see https://stackoverflow.com/help/minimal-reproducible-example. I suspect the issue you're observing is related to the PLAYWRIGHT_MAX_CONTEXTS=1 & PLAYWRIGHT_MAX_PAGES_PER_CONTEXT=1 settings you have defined.
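
For reference, a minimal example means a single self-contained spider file that still shows the hang, something along these lines (a sketch only: the URL is the generic placeholder mentioned above, and the settings are trimmed to the ones relevant to scrapy-playwright):

import scrapy

class MinimalSpider(scrapy.Spider):
    name = "minimal"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_MAX_CONTEXTS": 1,
        "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 1,
        "DOWNLOAD_TIMEOUT": 30,
    }

    def start_requests(self):
        # Single request routed through Playwright, keeping the page object
        # so the callback can interact with it.
        yield scrapy.Request(
            "https://example.org",
            meta={"playwright": True, "playwright_include_page": True},
            callback=self.parse,
            errback=self.errback,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        try:
            title = await page.title()
        finally:
            await page.close()
        return {"url": response.url, "title": title}

    async def errback(self, failure):
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()

If a file like this, run with scrapy runspider, reproduces the hang on its own, the issue becomes actionable.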

Zhou-Haowei commented 7 months ago

I don't think this problem is related to PLAYWRIGHT_MAX_CONTEXTS and PLAYWRIGHT_MAX_PAGES_PER_CONTEXT. Their original values were 16 and 1; I deliberately set them to 1. A higher PLAYWRIGHT_MAX_CONTEXTS value only increases the time it takes for all contexts to hang. As for the URL to test, it's NSFW and I don't know whether posting it here is against the rules. Just to be clear, I am developing a content filter, so I need to collect some datasets. My temporary workaround is to create a process pool that runs scrapy crawl tasks URL by URL and kills any crawl process that exceeds a timeout, since exceeding it almost certainly means the process has hung.
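
A rough sketch of that workaround (the spider name webspider, the url spider argument and the 120-second deadline are placeholders; the loop is sequential here for brevity, but crawl_one can equally be mapped over a pool):

import subprocess

def crawl_one(url: str, timeout: float = 120.0) -> bool:
    # Run a single URL in its own "scrapy crawl" subprocess; subprocess.run
    # kills the child process if the timeout expires.
    try:
        subprocess.run(
            ["scrapy", "crawl", "webspider", "-a", f"url={url}"],
            timeout=timeout,
            check=True,
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False

def crawl_all(urls):
    for url in urls:
        status = "ok" if crawl_one(url) else "hung or failed"
        print(f"{status}: {url}")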

akkuman commented 6 months ago

Same problem here. I can't figure out why, because I can't reproduce it reliably.

Even when this occurs, sending a SIGINT signal does not terminate the crawler.

akkuman commented 6 months ago

This is the traceback when it hangs:

>>> Signal received : entering python shell.
Traceback:
  File "/usr/local/bin/scrapy", line 8, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 161, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 114, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 169, in _run_command
    cmd.run(args, opts)
  File "/app/info_scraper/commands/crawl.py", line 24, in run
    self.crawler_process.start()
  File "/usr/local/lib/python3.10/site-packages/scrapy/crawler.py", line 429, in start
    reactor.run(installSignalHandlers=install_signal_handlers)  # blocking call
  File "/usr/local/lib/python3.10/site-packages/twisted/internet/asyncioreactor.py", line 253, in run
    self._asyncioEventloop.run_forever()
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/local/lib/python3.10/selectors.py", line 469, in select
Signal received : entering python shell.
    fd_event_list = self._selector.poll(timeout, max_ev)
Traceback:
  File "/usr/local/bin/scrapy", line 8, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 161, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 114, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 169, in _run_command
    cmd.run(args, opts)
  File "/app/info_scraper/commands/crawl.py", line 24, in run
    self.crawler_process.start()
  File "/usr/local/lib/python3.10/site-packages/scrapy/crawler.py", line 429, in start
    reactor.run(installSignalHandlers=install_signal_handlers)  # blocking call
  File "/usr/local/lib/python3.10/site-packages/twisted/internet/asyncioreactor.py", line 253, in run
    self._asyncioEventloop.run_forever()
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/local/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
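
For anyone who wants to capture the same kind of traceback, a handler along these lines can be registered before the crawl starts. This is only a sketch: the choice of SIGUSR1 is arbitrary, and Scrapy's built-in scrapy.extensions.debug.StackTraceDump extension provides similar output.

import signal
import traceback

def dump_stack(signum, frame):
    # Print the interrupted (main) thread's stack so a hang shows where it is blocked.
    print(f"Signal {signum} received: dumping stack")
    traceback.print_stack(frame)

signal.signal(signal.SIGUSR1, dump_stack)

After starting the crawl, kill -USR1 <pid> prints the main thread's stack even while the process appears stuck.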

elacuesta commented 5 months ago

I'm revisiting this issue and the available code is still both too long (e.g. including unrelated settings and classes) and incomplete (e.g. not containing a full working spider). Please provide a minimal, reproducible example; otherwise, the issue will be closed.

elacuesta commented 4 months ago

Closing due to inactivity.

shivam-patil-DAX commented 4 months ago

@akkuman @Zhou-Haowei has the problem been solved? I am facing the same issue where the crawler gets stuck, and to continue the process I have to press Cmd+C manually.