scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Allow custom PageMethod callbacks #318

Closed jdemaeyer closed 2 weeks ago

jdemaeyer commented 2 months ago

Hi @elacuesta, still loving this library! :)

I often find myself having to deal with the Playwright page in my request callback because I need to perform page actions involving loops or conditionals, which can't currently be expressed with the playwright_page_methods list. For example, this "click the 'load more' button while it's visible" logic mixes parsing with response preparation:

import scrapy
from playwright.async_api import expect

class PageActionSpider(scrapy.Spider):
    name = "pageaction"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/",
            meta={
                "playwright": True,
                "playwright_include_page": True,
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        load_button = page.locator(".loadMore")
        loading_overlay = page.locator(".loadingOverlay")
        while (await load_button.is_visible()):
            await load_button.click()
            await expect(loading_overlay).to_be_hidden()
        sel = scrapy.Selector(text=await page.content())
        await page.close()
        print(sel.css(".interestingData").getall())

This PR allows setting a callable instead of a string as PageMethod.method. The callable is then called with the page as its first argument, so all page-related async actions are handled by the download handler again, and I don't have to worry about closing the page myself or building a custom Selector instead of using response.css:

import scrapy
from playwright.async_api import expect
from scrapy_playwright.page import PageMethod

class PageActionSpider(scrapy.Spider):
    name = "pageaction"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod(self.extend_feed),
                ],
            },
        )

    async def extend_feed(self, page):
        load_button = page.locator(".loadMore")
        loading_overlay = page.locator(".loadingOverlay")
        while (await load_button.is_visible()):
            await load_button.click()
            await expect(loading_overlay).to_be_hidden()

    def parse(self, response):
        print(response.css(".interestingData").getall())
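
To illustrate the idea, here is a minimal sketch of how a handler might treat the two kinds of PageMethod.method: look up a string by name on the page, or call a callable with the page as its first argument. This is not scrapy-playwright's actual code. SketchPageMethod, FakePage, apply_page_methods, and click_three_times are hypothetical names made up for the example, and FakePage stands in for a real Playwright page so the sketch runs without a browser.

```python
import asyncio
import inspect
from functools import partial


class SketchPageMethod:
    """Hypothetical stand-in for PageMethod: holds a method name or a callable."""

    def __init__(self, method, *args, **kwargs):
        self.method = method
        self.args = args
        self.kwargs = kwargs
        self.result = None


class FakePage:
    """Hypothetical stand-in for a Playwright page, so the sketch runs offline."""

    def __init__(self):
        self.clicks = 0

    async def title(self):
        return "Example Domain"


async def apply_page_methods(page, page_methods):
    """Run each page method in order, resolving strings and callables."""
    for pm in page_methods:
        if callable(pm.method):
            # Callable: bind the page as the first positional argument.
            method = partial(pm.method, page)
        else:
            # String: look the method up on the page object by name.
            method = getattr(page, pm.method)
        result = method(*pm.args, **pm.kwargs)
        # Accept both coroutine functions and plain functions.
        if inspect.isawaitable(result):
            result = await result
        pm.result = result


async def click_three_times(page):
    # Illustrative custom callback containing a loop.
    while page.clicks < 3:
        page.clicks += 1
    return page.clicks


page = FakePage()
methods = [SketchPageMethod("title"), SketchPageMethod(click_three_times)]
asyncio.run(apply_page_methods(page, methods))
print([pm.result for pm in methods])  # ['Example Domain', 3]
```

Binding the page with partial keeps the call site identical for both cases, so any extra positional or keyword arguments stored on the object would be forwarded the same way.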

elacuesta commented 2 months ago

Amazing, thank you for the contribution @jdemaeyer :smile:

I've added a simple test, and I'll also mention it in the docs shortly.

elacuesta commented 2 weeks ago

Thank you @jdemaeyer!

jdemaeyer commented 2 weeks ago

No, thank you!