Closed d-balaskas closed 4 months ago
I built a spider that crawls sites, but when I add Playwright, specifically this setting:
'DOWNLOAD_HANDLERS': {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},
the script stops making progress at:
[scrapy.extensions.telnet] INFO: Telnet console listening on 127.0. ...
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
. . .
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
A very simple reproduction script is this:
import scrapy
from scrapy.spiders import SitemapSpider
from scrapy.crawler import CrawlerProcess
from scrapy.http import Response
from typing import Any


class MySpider(SitemapSpider):
    name = "MySpider"
    forbiden_calls = set()

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://<website>']
        self.scrapy_meta = {
            "playwright": True
        }

    def start_requests(self):
        for url in self.start_urls:
            print("Start URL", url)
            yield scrapy.Request(url, self.parse, errback=self.onError, meta=self.scrapy_meta)

    def parse(self, response: Response) -> Any:
        print("Parsing", response.url, response.body, response.status)

    def onError(self, failure):
        print(f"Failed to response parsing {failure.request.url}: {failure.value}")


def main():
    process = CrawlerProcess({
        # 'LOG_LEVEL': 'ERROR',
        'DOWNLOAD_HANDLERS': {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        'TWISTED_REACTOR': "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        'PLAYWRIGHT_BROWSER_TYPE': 'chromium',
        'PLAYWRIGHT_LAUNCH_OPTIONS': {
            'headless': False,
        }
    })
    process.crawl(MySpider)
    process.start()


if __name__ == '__main__':
    main()
What could cause this behavior?
Closing as duplicate of #290. There's a workaround at https://github.com/scrapy-plugins/scrapy-playwright/issues/290#issuecomment-2215291430 and a fix in progress at #299.