scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy

Why can't my Playwright run concurrently? #282

Closed. dream2333 closed this issue 1 day ago

dream2333 commented 4 days ago

My spider only starts the next request after the previous one has finished, as if the requests were blocking.

When I crawl web pages without using Playwright, the Request objects generated by start_requests are downloaded in parallel by the downloader.

However, when I use Playwright for downloading, the requests are downloaded one at a time instead of in parallel: a new browser page is opened only after the previous page has finished loading. This happens on both Windows and WSL, and it is more evident on slower websites. How can I make the Request objects from start_requests download in parallel?

import scrapy
from scrapy import Request
from scrapy.http import HtmlResponse


class HttpbinTest(scrapy.Spider):
    name = "httpbin"
    custom_settings = {
        # scrapy-playwright needs the asyncio-based Twisted reactor
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        # Route both HTTP and HTTPS downloads through Playwright
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 16,
        "CONCURRENT_REQUESTS": 32,
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": False},
    }

    def start_requests(self):
        # Test Playwright concurrency: 32 identical requests,
        # with the duplicate filter bypassed via dont_filter
        for _ in range(32):
            yield Request(
                "https://httpbin.org/headers",
                meta={"playwright": True},
                dont_filter=True,
            )

    def parse(self, response: HtmlResponse):
        print(response.text)
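To confirm that the requests from the spider above are serialized rather than overlapping, one option is to record a start and finish timestamp for each page load and compute the peak number of loads in flight. The following is a minimal sketch of that calculation only. The helper name `peak_in_flight` and the idea of collecting `(start, end)` pairs are assumptions for illustration, not part of Scrapy or scrapy-playwright; how the timestamps are gathered (log lines, a stopwatch around each response) is left open.

```python
from typing import Iterable, Tuple


def peak_in_flight(intervals: Iterable[Tuple[float, float]]) -> int:
    """Return the maximum number of intervals that overlap at any instant.

    Each interval is a hypothetical (start, end) timestamp pair for one
    page load, in seconds.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a load begins
        events.append((end, -1))    # a load finishes
    # Sort by time; at equal timestamps, finishes (-1) sort before starts (+1),
    # so back-to-back loads are not counted as overlapping.
    events.sort(key=lambda event: (event[0], event[1]))

    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Made-up timings: strictly sequential loads versus overlapping loads.
sequential = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
overlapping = [(0.0, 3.0), (0.1, 3.1), (0.2, 2.9)]
```

With these made-up timings, a peak of 1 for `sequential` matches the blocking behaviour described above, while a peak equal to the number of requests would indicate fully parallel downloads.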

elacuesta commented 4 days ago

https://github.com/scrapy-plugins/scrapy-playwright#reporting-issues

dream2333 commented 4 days ago

Platform:   linux
OS:         posix
Python:     3.12.3
========================
scrapy_playwright : 0.0.36
playwright        : 1.44.0
========================
Scrapy       : 2.11.2
lxml         : 5.2.2.0
libxml2      : 2.12.6
cssselect    : 1.2.0
parsel       : 1.9.1
w3lib        : 2.2.1
Twisted      : 24.3.0
Python       : 3.12.3 (main, May 14 2024, 07:44:45) [GCC 10.2.1 20210110]
pyOpenSSL    : 24.1.0 (OpenSSL 3.2.2 4 Jun 2024)
cryptography : 42.0.8
Platform     : Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.31

elacuesta commented 4 days ago

I see, thanks for the update. This is an issue with the recent Windows implementation; I'll look into it.

elacuesta commented 1 day ago

Should be fixed as of v0.0.37
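For anyone checking whether their environment already includes this fix, here is a minimal sketch that compares the installed version against 0.0.37. It assumes the package is installed under the distribution name `scrapy-playwright` and uses a plain numeric `X.Y.Z` version string; the helper names `parse_version` and `includes_fix` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# First release reported in this thread as containing the fix.
FIXED_IN = (0, 0, 37)


def parse_version(text: str) -> tuple:
    """Turn a plain 'X.Y.Z' string into a tuple of ints (assumes numeric parts)."""
    return tuple(int(part) for part in text.split(".")[:3])


def includes_fix(installed: str) -> bool:
    """Return True if the given version is at or after the fixed release."""
    return parse_version(installed) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = version("scrapy-playwright")
    except PackageNotFoundError:
        print("scrapy-playwright is not installed")
    else:
        status = "includes the fix" if includes_fix(installed) else "predates the fix"
        print(f"scrapy-playwright {installed} {status}")
```

The version reported earlier in this thread, 0.0.36, would be flagged as predating the fix.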