thomasdondorf / puppeteer-cluster

Puppeteer Pool, run a cluster of instances in parallel
MIT License

Cache resources #303

Open mtatarau90 opened 4 years ago

mtatarau90 commented 4 years ago

Hi guys,

Is there a way to cache a website's resources (JS, images, CSS)? Right now, if I crawl a website, all of its resources are reloaded every time.

Thanks

rohitsg commented 4 years ago

My scenario: a website renders 3000 pages in an iframe, and I want to load that website and build one PDF per page using Puppeteer's page.pdf(). I split the page numbers into chunks of 200, like [ [1..200], [201..400], [401..600], ..., [2801..3000] ], which gives 15 chunks. I set maxConcurrency to 15 to match:

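const { Cluster } = require('puppeteer-cluster');

// TIMEOUT (in milliseconds) is defined elsewhere; run this inside an async function.
const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_PAGE,
  maxConcurrency: 15, // one worker per chunk
  monitor: true,
  puppeteerOptions: {
    timeout: TIMEOUT,
  },
  timeout: TIMEOUT,
});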
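The chunking described above can be sketched as follows. This is only an illustration: `chunkPageNumbers` is a hypothetical helper name, and the poster's actual code for building the chunks is not shown in the thread.

```javascript
// Hypothetical helper: split the page numbers first..last into
// consecutive chunks of at most `size` entries each.
function chunkPageNumbers(first, last, size) {
  const chunks = [];
  for (let start = first; start <= last; start += size) {
    const end = Math.min(start + size - 1, last);
    // Build [start, start + 1, ..., end].
    chunks.push(Array.from({ length: end - start + 1 }, (_, i) => start + i));
  }
  return chunks;
}

const chunks = chunkPageNumbers(1, 3000, 200);
console.log(chunks.length); // 15
```

Each chunk could then be handed to the cluster as one job, for example with `cluster.queue(chunk)`, so that 15 chunks keep 15 workers busy.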
  1. Here I am opening the same website in 15 browser pages. How can I cache the resources so that a resource is served from the cache if it is present there, and a network call is made only otherwise? Loading the website triggers ~300 network calls. If I could cache at least the static resources, processing would be faster.

  2. According to the documentation, CONCURRENCY_PAGE should reuse all data (localStorage, cache, cookies, etc.) between jobs. Unfortunately, that is not what I see. I opened the same website in 6 workers and logged whether each response came from the cache:

await page.setRequestInterception(true);
page.on('request', async (interceptedRequest) => {
  await interceptedRequest.continue();
});

page.on('response', (res) => {
  console.log('fromcache=======', res.fromCache());
});

First I opened the website in only 1 worker. That produced 300 "fromcache=======" log lines in total: 20 were true and 280 were false, so 20 requests were somehow served from the cache. Then I opened the site in 6 workers and counted again. That produced 1800 log lines in total, of which only 120 were true and the rest were false. That is the same 20 hits per worker as before, so no worker benefited from what another worker had already loaded. It means concurrency: Cluster.CONCURRENCY_PAGE doesn't share "everything (cookies, localStorage, etc.) between jobs".

So my question is: how can I avoid repeating network calls that one of the workers has already made? How can I cache those repeated, identical requests?

ObviouslyGreen commented 4 years ago

@rohitsg In your case, it's a Puppeteer issue: request caching is disabled when request interception is used. I asked about it a while back. There are some workarounds mentioned in that issue that you can try.

https://github.com/puppeteer/puppeteer/issues/2905
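One possible workaround, sketched here under assumptions and not taken from the linked issue, is to keep your own in-memory cache in the Node.js process and answer repeated requests with `request.respond()`. The names `sharedCache`, `CACHEABLE_TYPES`, `handleRequest`, `handleResponse`, and `attachSharedCache` are hypothetical, and the choice of cacheable resource types is an assumption. The sketch ignores Cache-Control headers and never evicts entries, so it only suits static resources that are safe to reuse for the lifetime of the process.

```javascript
// Assumption: only these resource types are static enough to reuse.
const CACHEABLE_TYPES = new Set(['script', 'stylesheet', 'image', 'font']);

// One Map shared by every page (worker) in this Node.js process.
const sharedCache = new Map();

function isCacheable(request) {
  return request.method() === 'GET' && CACHEABLE_TYPES.has(request.resourceType());
}

// Answer from the shared cache when possible; otherwise go to the network.
// Returns 'cache' or 'network' so callers can count hits.
async function handleRequest(request, cache = sharedCache) {
  const hit = isCacheable(request) ? cache.get(request.url()) : undefined;
  if (hit) {
    await request.respond(hit);
    return 'cache';
  }
  await request.continue();
  return 'network';
}

// Remember the body of the first successful response for each cacheable URL.
async function handleResponse(response, cache = sharedCache) {
  const request = response.request();
  const url = request.url();
  if (!isCacheable(request) || response.status() !== 200 || cache.has(url)) return;
  try {
    const body = await response.buffer();
    // Keep only the content type; the buffered body is already decoded,
    // so the original encoding and length headers would no longer match.
    cache.set(url, {
      status: 200,
      contentType: response.headers()['content-type'],
      body,
    });
  } catch (err) {
    // The body may be unavailable (e.g. the page navigated away); skip it.
  }
}

// Wire both handlers onto a Puppeteer page.
async function attachSharedCache(page, cache = sharedCache) {
  await page.setRequestInterception(true);
  // Errors are swallowed here so a failed interception does not crash the job.
  page.on('request', (request) => handleRequest(request, cache).catch(() => {}));
  page.on('response', (response) => handleResponse(response, cache));
}
```

With puppeteer-cluster, this could be called at the start of the task function, e.g. `cluster.task(async ({ page, data }) => { await attachSharedCache(page); /* ... */ })`. Because all workers run in the same Node.js process, they would share the one `sharedCache` map. Two workers requesting the same URL at the same moment may still both hit the network, since the entry is stored only after the first response has arrived.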