Open pjlsergeant opened 1 year ago
It looks like this is caused by the use of Page.route. The Playwright docs say: "Enabling routing disables http cache."
Unfortunately, Page.route is necessary for some of the functionality of this integration, as I've explained elsewhere.
Seems like this is a known limitation and a lot of people are eager to have it removed from upstream Playwright: https://github.com/microsoft/playwright/issues/7220.
I'm having trouble getting Scrapy + Playwright to respect the HTTP cache when crawling with a persistent context. I've reduced it to a minimal example, which you can see here:
https://github.com/pjlsergeant/scrapy-playwright-cache-bug
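For readers who don't want to open the repo, a persistent context in scrapy-playwright is usually configured roughly as below. This is a sketch and not copied from the linked repository: the context name "persistent" and the profile path are placeholders chosen for illustration.

```python
# settings.py (sketch; values marked "assumed" are placeholders, not from the repo)

# Route http/https downloads through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"

# Giving a context a user_data_dir makes scrapy-playwright launch it as a
# persistent context, so the browser profile (including its disk cache)
# is kept between runs.
PLAYWRIGHT_CONTEXTS = {
    "persistent": {  # assumed context name
        "user_data_dir": "/tmp/playwright-profile",  # assumed path
    },
}

# Request meta a spider would pass so a request uses that context.
PLAYWRIGHT_META = {
    "playwright": True,
    "playwright_context": "persistent",
}
```

In a spider, a request would then be issued as something like scrapy.Request(url, meta=PLAYWRIGHT_META).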
app.py is a minimal Flask app to demonstrate; if you start it (flask run) and then run the scrape (scrapy crawl crawl), you can see that the PNG at /pixel doesn't get cached, both from the Flask logs and from the final body output: <html><head></head><body>count:6</body></html>, signifying 6 hits.
Interestingly, if you then manually load up Playwright using the persistent config (something like browser_context = chromium.launch_persistent_context(userDataDir)), you'll see the image is already cached. So the image is being written to the cache during the Playwright + Scrapy run; it's just not being loaded from the cache when Playwright is being driven by Scrapy.
Any help gratefully received.
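To make the "count" check concrete, here is a rough standard-library stand-in for the kind of endpoint described above. It is not the app.py from the repo (that one uses Flask); the handler, the /count path, and the placeholder image bytes are assumptions for illustration. It serves /pixel with a cacheable header and counts hits, and the client deliberately has no cache, so every fetch reaches the server, which is the same symptom seen when routing disables the browser cache.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Placeholder bytes (PNG signature only); the real app serves an actual image.
PIXEL = b"\x89PNG\r\n\x1a\n"

hits = {"pixel": 0}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/pixel":
            # Count every request that actually reaches the server.
            hits["pixel"] += 1
            body, ctype, cache = PIXEL, "image/png", "public, max-age=3600"
        else:
            # Any other path reports the hit count and must never be cached.
            body = f"count:{hits['pixel']}".encode()
            ctype, cache = "text/html", "no-store"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Cache-Control", cache)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def run_demo(n=3):
    """Fetch /pixel n times with a cache-less client and return the count page."""
    hits["pixel"] = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    # Bypass any proxy settings from the environment for the local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        for _ in range(n):
            opener.open(base + "/pixel").read()
        return opener.open(base + "/count").read().decode()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

A client that honoured the Cache-Control header would hit /pixel once and report a count of 1; a count equal to the number of fetches means nothing was served from cache.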