Closed Cj-Malone closed 1 year ago
There is no way to modify sitemap requests using the public API, from what I can see in the docs and the source. One possibility would be to request the addition of such a feature in upstream Scrapy.
I've also been thinking about enabling Playwright for all requests for spiders that define a specific attribute or setting, e.g.:
from scrapy import Spider

class PlaywrightArgumentSpider(Spider):
    name = "playwright_argument"
    playwright = True

class PlaywrightSettingSpider(Spider):
    name = "playwright_setting"
    custom_settings = {"PLAYWRIGHT_ENABLED": True}
or via CLI invocation:
scrapy crawl playwright_argument -a playwright=1
scrapy crawl playwright_setting -s PLAYWRIGHT_ENABLED=1
For the moment it's just an idea, though. In the meantime, a hacky workaround would be to override _parse_sitemap (and start_requests as well). This is not pretty: _parse_sitemap is a private method and its implementation could change at any time, so use at your own risk.
from scrapy.spiders.sitemap import SitemapSpider

class PlaywrightSitemapSpider(SitemapSpider):
    name = "playwright_sitemap"

    def start_requests(self):
        for request in super().start_requests():
            request.meta["playwright"] = True
            yield request

    def _parse_sitemap(self, response):
        for request in super()._parse_sitemap(response):
            request.meta["playwright"] = True
            yield request
> thinking about enabling Playwright for all requests for spiders
I think this is the solution; it could clean up the CrawlSpider connection and any other base spiders that exist or may exist in the future.
Thanks for your overrides. I had basically reimplemented SitemapSpider in my spider, but I'll switch to the overrides for now. Hopefully the spider-level setting comes soon.
I've written a little middleware to enable Playwright for all spiders with is_playwright_spider = True.
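For anyone wanting to try the same approach, here is a minimal sketch of what such a downloader middleware might look like. It is not the code from the commit referenced in this thread. The class name and the settings path in the comment are hypothetical; only the is_playwright_spider attribute and the "playwright" meta key come from the discussion above. It assumes the classic process_request(request, spider) middleware signature.

```python
class PlaywrightFlagMiddleware:
    """Downloader middleware sketch: mark every request of an opted-in
    spider so that it is downloaded through Playwright.

    The class name is hypothetical. It would be enabled with something like:
    DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.PlaywrightFlagMiddleware": 543}
    where the module path and priority are placeholders.
    """

    def process_request(self, request, spider):
        # Only touch requests from spiders that set is_playwright_spider = True.
        if getattr(spider, "is_playwright_spider", False):
            # setdefault keeps an explicit per-request playwright=False intact.
            request.meta.setdefault("playwright", True)
        # Returning None lets the request continue through the download chain.
        return None
```

Because the flag is set in a downloader middleware, it applies to every request the spider issues, including those generated internally by SitemapSpider or CrawlSpider, so no private methods need to be overridden.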
Thanks, that's simpler and already possible with the existing APIs.
a9061d241d0fa9f088043b9033b006f39b975c36
I know the trick to use Playwright with a CrawlSpider via process_request, but is there a way to use it with a SitemapSpider?
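For readers unfamiliar with the CrawlSpider trick mentioned here, a rough sketch follows. The function name is hypothetical, and the wiring shown in the comment assumes a Scrapy version where the Rule process_request callable receives both the request and the response.

```python
def use_playwright(request, response):
    """Rule.process_request callback: mark an extracted request for Playwright."""
    request.meta["playwright"] = True
    # Return the request so the rule keeps it; returning None would drop it.
    return request

# Hypothetical wiring inside a CrawlSpider subclass (requires Scrapy):
# rules = (Rule(LinkExtractor(), process_request=use_playwright, follow=True),)
```

This only covers requests extracted by the rules, not the spider's initial requests. SitemapSpider has no equivalent public hook, which is what the question is about.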