scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Combine with SitemapSpider #163

Closed Cj-Malone closed 1 year ago

Cj-Malone commented 1 year ago

I know the trick to use Playwright with a CrawlSpider via process_request, but is there a way to use it with a SitemapSpider?

elacuesta commented 1 year ago

From what I can see in the docs and the source, there is no way to modify sitemap requests using the public API. One possibility would be to request the addition of such a feature in upstream Scrapy.

I've also been thinking about enabling Playwright for all requests for spiders that define a specific attribute or setting, e.g.:

from scrapy import Spider

class PlaywrightArgumentSpider(Spider):
    name = "playwright_argument"
    playwright = True

class PlaywrightSettingSpider(Spider):
    name = "playwright_setting"
    custom_settings = {"PLAYWRIGHT_ENABLED": True}

or via CLI invocation:

scrapy crawl playwright_argument -a playwright=1
scrapy crawl playwright_setting -s PLAYWRIGHT_ENABLED=1

For the moment it's just an idea, though. In the meantime, a hacky workaround would be to override _parse_sitemap (and start_requests as well). This is not pretty: _parse_sitemap is a private method and its implementation could change at any time, so use at your own risk.

from scrapy.spiders.sitemap import SitemapSpider

class PlaywrightSitemapSpider(SitemapSpider):
    name = "playwright_sitemap"

    def start_requests(self):
        for request in super().start_requests():
            request.meta["playwright"] = True
            yield request

    def _parse_sitemap(self, response):
        for request in super()._parse_sitemap(response):
            request.meta["playwright"] = True
            yield request

Cj-Malone commented 1 year ago

> thinking about enabling Playwright for all requests for spiders

I think this is the solution, it could clean up the CrawlSpider connection and any other base spiders that exist or may exist in the future.

Thanks for your overrides, I had basically reimplemented SitemapSpider in my spider but I'll switch to the overrides for now. Hopefully the spider level setting comes soon.

Cj-Malone commented 1 year ago

I've written a little middleware to enable Playwright for all spiders with is_playwright_spider = True.
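
The middleware itself is not reproduced in this thread. As a rough illustration only, a minimal sketch of the idea could look like the following. It assumes a downloader middleware is sufficient, since downloader middlewares run before the request reaches the download handler. The class name PlaywrightFlagMiddleware is hypothetical, and this is not necessarily how Cj-Malone's version works.

```python
class PlaywrightFlagMiddleware:
    """Hypothetical downloader middleware (a sketch, not the code linked in the thread).

    Marks every outgoing request as a Playwright request when the spider
    sets the class attribute ``is_playwright_spider = True``.
    """

    def process_request(self, request, spider):
        # Only touch requests from spiders that opted in via the attribute.
        if getattr(spider, "is_playwright_spider", False):
            # setdefault keeps an explicit per-request choice
            # (e.g. meta={"playwright": False}) intact.
            request.meta.setdefault("playwright", True)
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

To try it, the class would be registered in the DOWNLOADER_MIDDLEWARES setting, for example {"myproject.middlewares.PlaywrightFlagMiddleware": 543}, where the module path and the priority are placeholders. Because it acts on every request the spider sends, it should also cover the requests generated internally by SitemapSpider or CrawlSpider.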

elacuesta commented 1 year ago

Thanks, that's simpler and already possible with the existing APIs.

elacuesta commented 1 year ago

a9061d241d0fa9f088043b9033b006f39b975c36