scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Using Scrapy-Playwright on Windows DOWNLOAD_HANDLERS not working #302

Closed by d-balaskas 4 months ago

d-balaskas commented 4 months ago

I built a spider that crawls sites, but when I add scrapy-playwright, specifically these settings:

'DOWNLOAD_HANDLERS': {
   "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
   "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},

the crawl stalls and keeps logging:

[scrapy.extensions.telnet] INFO: Telnet console listening on 127.0. ...
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
.
.
.
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
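
For reference, the scrapy-playwright README pairs these download handlers with the asyncio-based Twisted reactor; a minimal settings sketch (the full reproduction script below sets the same values through CrawlerProcess):

# Settings required by scrapy-playwright, per its README.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"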

Here is a minimal reproduction script:

import scrapy

from scrapy.spiders import SitemapSpider
from scrapy.crawler import CrawlerProcess
from scrapy.http import Response
from typing import Any

class MySpider(SitemapSpider):
    name = "MySpider"
    forbidden_calls = set()  # not used in this minimal reproduction

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = ['http://<website>']
        self.scrapy_meta = {
            "playwright": True  # route the request through Playwright
        }

    def start_requests(self):
        for url in self.start_urls:
            print("Start URL", url)
            yield scrapy.Request(url, self.parse, errback=self.onError, meta=self.scrapy_meta)

    def parse(self, response: Response) -> Any:
        print("Parsing", response.url, response.body, response.status)

    def onError(self, failure):
        print(f"Failed to parse response from {failure.request.url}: {failure.value}")

def main():
    process = CrawlerProcess({
        # 'LOG_LEVEL': 'ERROR',
        'DOWNLOAD_HANDLERS': {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        'TWISTED_REACTOR': "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        'PLAYWRIGHT_BROWSER_TYPE': 'chromium',
        'PLAYWRIGHT_LAUNCH_OPTIONS': {
            'headless': False,
        }        
    })

    process.crawl(MySpider)
    process.start()

if __name__ == '__main__':
    main()

What could be causing this?

elacuesta commented 4 months ago

Closing as duplicate of #290. There's a workaround at https://github.com/scrapy-plugins/scrapy-playwright/issues/290#issuecomment-2215291430 and a fix in progress at #299.
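
For background, a sketch of the underlying asyncio limitation (not necessarily the exact workaround in the linked comment): Twisted's AsyncioSelectorReactor requires asyncio's SelectorEventLoop, and on Windows that loop type cannot spawn subprocesses, so Playwright never manages to launch the browser and Scrapy keeps reporting 0 pages/min. A minimal, self-contained illustration of the difference between the two Windows loop types:

# Illustration only: on Windows, asyncio.SelectorEventLoop (the loop type used
# with Twisted's AsyncioSelectorReactor) does not support subprocesses, which
# is what Playwright needs to start the browser. ProactorEventLoop does.
import asyncio
import sys

async def try_subprocess(loop_name: str) -> None:
    # Spawning any subprocess is enough to show the difference;
    # "python --version" is just a harmless placeholder command.
    try:
        proc = await asyncio.create_subprocess_exec(sys.executable, "--version")
        await proc.wait()
        print(f"{loop_name}: subprocess OK")
    except NotImplementedError:
        print(f"{loop_name}: subprocesses not supported (a browser launch would hang the crawl)")

if sys.platform == "win32":
    # SelectorEventLoop: create_subprocess_exec raises NotImplementedError.
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    asyncio.run(try_subprocess("SelectorEventLoop"))

    # ProactorEventLoop (the Windows default since Python 3.8) supports it.
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
    asyncio.run(try_subprocess("ProactorEventLoop"))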