mtatarau90 opened 4 years ago
My scenario: the website renders 3000 pages in an iframe, and I want to render that website and build a per-page PDF using puppeteer's page.pdf(). So I prepared an array of 200-page chunks, like [[1..200], [201..400], [401..600], ..., [2801..3000]], 15 chunks in total, and I pass 15 as maxConcurrency:
```js
const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_PAGE,
  maxConcurrency: 15,
  monitor: true,
  puppeteerOptions: {
    timeout: TIMEOUT,
  },
  timeout: TIMEOUT,
});
```
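For context, a minimal sketch of how such chunks might be queued with puppeteer-cluster's task/queue API; the `chunks` array and the `pageUrlFor()` helper are assumptions for illustration, not code from this thread:

```js
// Hypothetical queueing of the 15 chunks; pageUrlFor() is an assumed helper
// that maps a page number to the URL (or iframe state) to render.
await cluster.task(async ({ page, data: chunk }) => {
  for (const pageNo of chunk) {
    await page.goto(pageUrlFor(pageNo), { waitUntil: 'networkidle2' });
    await page.pdf({ path: `page-${pageNo}.pdf` });
  }
});

for (const chunk of chunks) { // chunks = [[1..200], [201..400], ...]
  cluster.queue(chunk);
}

await cluster.idle();
await cluster.close();
```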
Here I am opening the same website in 15 browser pages. How can I cache the resources, so that a resource already in the cache is served from it and only missing resources trigger a network call? When I load the website there are ~300 network calls. If I could cache some of the static resources, it would noticeably speed up processing.
As per the documentation, CONCURRENCY_PAGE should reuse all data such as localStorage, cache, and cookies between jobs. But unfortunately it doesn't do that. To verify, I opened the same website in several workers and logged whether each response was served from the cache:
```js
await page.setRequestInterception(true);
page.on('request', async interceptedRequest => {
  await interceptedRequest.continue();
});
page.on('response', async res => {
  console.log('fromcache=======', res.fromCache());
});
```
First I opened the website in only 1 worker. There were 300 `fromcache=======` logs in total, of which 20 were `true` and 280 were `false`, so 20 requests were served from the cache. Then I opened the site in 6 workers and counted again: 1800 logs in total, of which only 120 were `true` and the rest were `false`.
Since 120 is exactly 6 × 20, each worker is hitting the cache independently; `concurrency: Cluster.CONCURRENCY_PAGE` doesn't actually "share everything (cookies, localStorage, etc.) between jobs" here.
So my question is: how can I avoid re-firing network requests that one of the workers has already made? How can I cache those repeated network calls?
@rohitsg In your case, it's a puppeteer issue: request caching is disabled whenever request interception is used. I had asked about it a while back; there are some workarounds mentioned in the issue you can try, though.
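One such workaround is to build the cache yourself on top of interception: keep a process-wide map of response bodies and answer repeated requests with `request.respond()`. A minimal sketch, assuming all workers run in the same Node process (as they do with CONCURRENCY_PAGE); the cache policy here (cache every 200 response) is deliberately naive:

```js
// Shared across all pages of the cluster, since they live in one Node process.
const responseCache = new Map();

async function enableSharedCache(page) {
  await page.setRequestInterception(true);

  page.on('request', request => {
    const cached = responseCache.get(request.url());
    if (cached) {
      // Repeated request: answer from our own cache, no network call fired.
      request.respond(cached);
    } else {
      request.continue();
    }
  });

  page.on('response', async response => {
    const url = response.url();
    if (response.status() === 200 && !responseCache.has(url)) {
      try {
        const headers = { ...response.headers() };
        // response.buffer() returns the decoded body, so drop headers that
        // no longer match it.
        delete headers['content-encoding'];
        delete headers['content-length'];
        responseCache.set(url, {
          status: response.status(),
          headers,
          body: await response.buffer(),
        });
      } catch (e) {
        // response.buffer() can fail for redirects or responses discarded
        // by navigation; just skip caching those.
      }
    }
  });
}
```

For long runs you would probably want to restrict this to static assets (JS/CSS/images/fonts) and add some eviction, since the map grows without bound.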
Hi guys,
Is there a way to cache a website's resources (like JS/images/CSS)? Right now, if I try to crawl a website, all resources are reloaded on every visit.
Thanks
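If you don't need request interception, the simplest option is to let Chromium's own cache do the work. A sketch, where the `./puppeteer-cache` path and the maxConcurrency value are arbitrary assumptions:

```js
const cluster = await Cluster.launch({
  // All pages share one browser, and therefore one browser cache.
  concurrency: Cluster.CONCURRENCY_PAGE,
  maxConcurrency: 6,
  puppeteerOptions: {
    // Persist the disk cache across launches as well.
    userDataDir: './puppeteer-cache',
  },
});
// Caveat: calling page.setRequestInterception(true) in the task disables
// the browser cache entirely, as noted above.
```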