I often find myself having to deal with the Playwright page in my request callback because I need to perform page actions involving loops or conditionals, which can't currently be expressed with the playwright_page_methods list. For example, this "click the 'load more' button while it's visible" logic mixes parsing with response preparation:

    import scrapy
    from playwright.async_api import expect


    class PageActionSpider(scrapy.Spider):
        name = "pageaction"

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com/",
                meta={
                    "playwright": True,
                    # Hand the page to the callback so it can drive it.
                    "playwright_include_page": True,
                },
            )

        async def parse(self, response):
            page = response.meta["playwright_page"]
            load_button = page.locator(".loadMore")
            loading_overlay = page.locator(".loadingOverlay")
            # Keep loading more items until the button disappears.
            while await load_button.is_visible():
                await load_button.click()
                await expect(loading_overlay).to_be_hidden()
            # The response body is stale by now, so re-parse the live page.
            sel = scrapy.Selector(text=await page.content())
            await page.close()
            print(sel.css(".interestingData").getall())

This PR allows setting a callable instead of a string as PageMethod.method. The callable is then called with the page as its first argument, so all page-related async actions are again handled by the download handler, and I don't have to worry about closing the page myself or using a custom Selector instead of response.css:

    import scrapy
    from playwright.async_api import expect
    from scrapy_playwright.page import PageMethod


    class PageActionSpider(scrapy.Spider):
        name = "pageaction"

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com/",
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        # A callable instead of a method name; it receives the page.
                        PageMethod(self.extend_feed),
                    ],
                },
            )

        async def extend_feed(self, page):
            load_button = page.locator(".loadMore")
            loading_overlay = page.locator(".loadingOverlay")
            # Keep loading more items until the button disappears.
            while await load_button.is_visible():
                await load_button.click()
                await expect(loading_overlay).to_be_hidden()

        def parse(self, response):
            # The response already reflects the fully extended feed.
            print(response.css(".interestingData").getall())

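
To make the proposed behaviour concrete, here is a minimal sketch of how a handler could dispatch on a string versus a callable method. It is not the actual scrapy-playwright implementation: `PageMethodSketch`, `apply_page_method`, `FakePage`, `click_three_times` and `demo` are hypothetical names made up for illustration, and the fake page only exists so the snippet runs without a browser.

```python
import asyncio
import inspect
from typing import Any, Callable, Union


class PageMethodSketch:
    """Hypothetical stand-in for scrapy_playwright.page.PageMethod."""

    def __init__(self, method: Union[str, Callable], *args: Any, **kwargs: Any) -> None:
        self.method = method
        self.args = args
        self.kwargs = kwargs
        self.result = None


async def apply_page_method(page: Any, pm: PageMethodSketch) -> Any:
    """Run one page method against `page`, accepting a name or a callable."""
    if callable(pm.method):
        # Callable: pass the page as the first positional argument.
        outcome = pm.method(page, *pm.args, **pm.kwargs)
    else:
        # String: look the method up on the page object.
        outcome = getattr(page, pm.method)(*pm.args, **pm.kwargs)
    # Await coroutines and other awaitables; plain values pass through.
    pm.result = await outcome if inspect.isawaitable(outcome) else outcome
    return pm.result


class FakePage:
    """Minimal fake page (an assumption) so the sketch runs without a browser."""

    def __init__(self) -> None:
        self.clicks = 0

    async def click(self, selector: str) -> None:
        self.clicks += 1


async def click_three_times(page: FakePage) -> int:
    # A loop like this cannot be expressed as a single string method name.
    for _ in range(3):
        await page.click(".loadMore")
    return page.clicks


async def demo() -> int:
    page = FakePage()
    await apply_page_method(page, PageMethodSketch("click", ".loadMore"))
    await apply_page_method(page, PageMethodSketch(click_three_times))
    return page.clicks
```

Running `asyncio.run(demo())` exercises both paths against the same fake page. Checking `callable()` first keeps existing string-based page methods working unchanged.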
Hi @elacuesta, still loving this library! :)