scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

images don't appear to get read from the persistent context properly / cached #198

Open pjlsergeant opened 1 year ago

pjlsergeant commented 1 year ago

I'm having trouble getting Scrapy + Playwright to respect the browser cache when crawling with a persistent context. I've reduced it to a minimal example, which you can see here:
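For readers unfamiliar with the setup, a persistent context in scrapy-playwright is typically configured along the following lines. This is only a sketch of the relevant settings, not a copy of the linked repository: the context name `persistent`, the profile path, and the `PERSISTENT_META` helper name are assumptions chosen for illustration.

```python
# settings.py (sketch): route requests through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Supplying "user_data_dir" makes scrapy-playwright launch this context as a
# persistent one. The context name and the path are hypothetical.
PLAYWRIGHT_CONTEXTS = {
    "persistent": {
        "user_data_dir": "/tmp/playwright-profile",
    },
}

# Hypothetical helper: request meta that sends a request through that context,
# e.g. scrapy.Request(url, meta=PERSISTENT_META).
PERSISTENT_META = {"playwright": True, "playwright_context": "persistent"}
```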

https://github.com/pjlsergeant/scrapy-playwright-cache-bug

app.py is a minimal Flask app that demonstrates the problem. If you start it (flask run) and then run the scrape (scrapy crawl crawl), you can see that the PNG at /pixel doesn't get cached: the Flask logs show repeated requests for it, and the final body output is <html><head></head><body>count:6</body></html>, signifying 6 hits.
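The kind of hit-counting server described here can be approximated with the standard library alone. The sketch below is a hypothetical stand-in, not the app.py from the repository (which uses Flask): the `/count` route, the `Cache-Control` value, and the names `make_server`, `fetch`, and `demo` are all assumptions.

```python
import struct
import threading
import zlib
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


def _chunk(tag: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, tag, data, CRC over tag + data."""
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", zlib.crc32(tag + data))


# A valid 1x1 black RGB PNG, built by hand to avoid extra dependencies.
PIXEL = (
    b"\x89PNG\r\n\x1a\n"
    + _chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0))
    + _chunk(b"IDAT", zlib.compress(b"\x00\x00\x00\x00"))
    + _chunk(b"IEND", b"")
)


class CountingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/pixel":
            self.server.hits += 1  # every request that reaches the server counts
            body, ctype = PIXEL, "image/png"
        elif self.path == "/count":
            body, ctype = f"count:{self.server.hits}".encode(), "text/plain"
        else:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        # Assumed caching policy: a caching client should not re-request /pixel.
        self.send_header("Cache-Control", "public, max-age=3600")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def make_server(port: int = 0) -> HTTPServer:
    server = HTTPServer(("127.0.0.1", port), CountingHandler)
    server.hits = 0  # per-server counter read by the handler
    return server


def fetch(port: int, path: str) -> bytes:
    conn = HTTPConnection("127.0.0.1", port)
    try:
        conn.request("GET", path)
        return conn.getresponse().read()
    finally:
        conn.close()


def demo() -> str:
    """Hit /pixel twice with a non-caching client and report the counter."""
    server = make_server()
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    try:
        for _ in range(2):
            fetch(port, "/pixel")
        return fetch(port, "/count").decode()
    finally:
        server.shutdown()
        server.server_close()
```

Because `http.client` keeps no cache, `demo()` reports one hit per request; a browser honouring the `Cache-Control` header would be expected to produce fewer hits on repeat loads, which is the behaviour the scrape fails to show.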

Interestingly, if you then manually launch Playwright with the same persistent configuration (something like browser_context = chromium.launch_persistent_context(userDataDir)), you'll see the image is already cached. So the image is being written to the cache during the Scrapy + Playwright run; it's just not being loaded from the cache when Playwright is driven by Scrapy.
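A manual check of this kind might look roughly like the following, using Playwright's synchronous Python API. It is a sketch under assumptions: the function name, the profile path, and the URL (Flask's default port) are illustrative and not taken from the repository.

```python
def load_with_persistent_context(user_data_dir: str, url: str) -> str:
    """Open `url` in a persistent Chromium context and return the page HTML."""
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Reuse the same profile directory that the Scrapy run wrote to.
        context = p.chromium.launch_persistent_context(user_data_dir, headless=True)
        try:
            page = context.new_page()
            page.goto(url)
            return page.content()
        finally:
            context.close()


if __name__ == "__main__":
    # Hypothetical values: adjust the profile path and URL to the local setup.
    print(load_with_persistent_context("/tmp/playwright-profile", "http://127.0.0.1:5000/"))
```

Watching the server log while this runs shows whether the image was requested again or served from the profile's cache.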

Any help gratefully received.

elacuesta commented 1 year ago

It looks like this is caused by the use of Page.route. The Playwright docs say:

Enabling routing disables http cache.

Unfortunately, this is necessary for some of the functionality of this integration, as I've explained elsewhere.
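One way to observe the effect in isolation, outside Scrapy, is to load the same page twice with and without a pass-through route and compare the hit counts on the server side. The sketch below assumes Playwright's synchronous Python API; the function name, the `enable_route` flag, and the example values are hypothetical, and the handler here is a stand-in, not the one scrapy-playwright installs.

```python
def load_twice(user_data_dir: str, url: str, enable_route: bool) -> str:
    """Load `url` twice in a persistent context, optionally with routing enabled."""
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        context = p.chromium.launch_persistent_context(user_data_dir, headless=True)
        try:
            page = context.new_page()
            if enable_route:
                # A pass-through handler: every request continues unmodified,
                # but routing is now enabled for the page.
                page.route("**/*", lambda route: route.continue_())
            page.goto(url)
            page.goto(url)  # second load: cached resources may be reused
            return page.content()
        finally:
            context.close()


if __name__ == "__main__":
    # Hypothetical values; compare the server's hit count between the two runs.
    for flag in (False, True):
        print(flag, load_twice("/tmp/playwright-profile", "http://127.0.0.1:5000/", flag))
```

If the documented behaviour holds, the run with `enable_route=True` should produce more requests to the server than the run without it.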

This seems to be a known limitation, and a lot of people are eager to have it removed from upstream Playwright: https://github.com/microsoft/playwright/issues/7220.