peterbe / minimalcss

Extract the minimal CSS used in a set of URLs with puppeteer
https://minimalcss.app/
MIT License
353 stars, 35 forks

minimalcss is unable to extract css from many webarchive snapshots #308

Open layoutanalysis opened 5 years ago

layoutanalysis commented 5 years ago

I would like to use minimalcss to extract the used CSS from http://web.archive.org/ snapshots of a webpage (e.g. http://web.archive.org/web/20110310061818/http://www.bloomberg.com/) and compare the results over time, to find out how often the layout and appearance of a webpage have changed in the past.

Unfortunately, this is not easy with minimalcss, because it stops working whenever a stylesheet cannot be fetched (404 error). 404s are very common on web.archive.org, as many captures are incomplete. I could partially work around them using the skippable function, but it only lets me skip a request upfront; I cannot react to response errors. My preferred behaviour would be to write the used CSS to stdout and log the unretrievable stylesheet URLs to stderr.
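
A rough sketch of how the upfront skipping could be pushed a little further with the existing skippable option: probe the stylesheet URLs before calling minimalcss and skip the ones that fail. The helpers findBrokenUrls and makeSkippable are hypothetical names, not part of minimalcss. The sketch assumes the stylesheet URLs were collected separately, that a global fetch is available (Node 18+), and that the archive answers HEAD requests the same way as GET requests.

```javascript
// Hypothetical helpers; neither is part of minimalcss.

// Probe each URL and collect the ones that cannot be retrieved.
// fetchFn defaults to the global fetch (assumes Node 18+).
async function findBrokenUrls(urls, fetchFn = globalThis.fetch) {
  const broken = new Set();
  await Promise.all(
    urls.map(async url => {
      try {
        const response = await fetchFn(url, { method: 'HEAD' });
        if (!response.ok) broken.add(url); // e.g. a 404 from an incomplete capture
      } catch (err) {
        broken.add(url); // a network failure also counts as unretrievable
      }
    })
  );
  return broken;
}

// Build a predicate for the skippable option: skip known-broken URLs
// and report them on stderr so that stdout stays reserved for the CSS.
function makeSkippable(broken) {
  return request => {
    const skip = broken.has(request.url());
    if (skip) console.error(`Skipping unretrievable URL: ${request.url()}`);
    return skip;
  };
}
```

The resulting predicate could then be passed as skippable: makeSkippable(broken) in the minimize options. This still does not react to response errors inside minimalcss itself; it only moves the check to before the run.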

Another issue is the mandatory CSSO optimisation, which crashes on certain CSS property values. I could mitigate some crashes by setting cssoOptions: {restructure: false}, but it would be nicer if I could disable the optimisation altogether.

I'm aware that my use case is somewhat uncommon for minimalcss, but maybe the library can be extended to make it possible?

layoutanalysis commented 5 years ago

I also noticed that minimalcss times out on certain web.archive.org snapshots:

const minimalcss = require("minimalcss");

minimalcss
  .minimize({
    urls: ['http://web.archive.org/web/20161001001006/https://www.theguardian.com/us'],
    ignoreJSErrors: true,
    withoutjavascript: true,
    ignoreCSSErrors: true,
    loadimages: false,
    enableServiceWorkers: true,
    timeout: 90000,
    cssoOptions: { restructure: false },
    // Skip every request whose URL does not mention theguardian.com.
    skippable: request => !request.url().includes('theguardian.com'),
  })
  .then(result => {
    console.log(result.finalCss);
  })
  .catch(error => {
    console.error(`Failed the minimize CSS: ${error}`);
  });

This results in the error:

Failed the minimize CSS: TimeoutError: Navigation Timeout Exceeded: 90000ms exceeded
Tracked URLs that have not finished: http://web.archive.org/web/20161001001006/https://www.theguardian.com/us, http://web.archive.org/web/20161001001006/https://www.theguardian.com/us-news/series/politics-for-humans/rss

The same error occurred with timeout: 560000 (a timeout of about 9 minutes). Maybe it would make sense to stop all pending requests once 90% of the timeout has elapsed, and use the remaining 10% to calculate the used CSS and return it?
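
The idea could look roughly like the sketch below. It is only an illustration: softDeadline and raceWithDeadline are hypothetical helpers, not part of minimalcss or Puppeteer, and the sketch does not actually abort any pending requests. It only shows how a caller could stop waiting at 90% of the timeout and move on with whatever has loaded.

```javascript
// Hypothetical helpers illustrating the "soft deadline" idea;
// neither is part of minimalcss or Puppeteer.

// Time budget for loading: a fraction (default 90%) of the total timeout.
function softDeadline(timeoutMs, fraction = 0.9) {
  return Math.floor(timeoutMs * fraction);
}

// Wait for `work`, but give up waiting after deadlineMs.
// Resolves to { timedOut: false, value } if the work finished in time,
// or to { timedOut: true } if the deadline fired first.
function raceWithDeadline(work, deadlineMs) {
  let timer;
  const deadline = new Promise(resolve => {
    timer = setTimeout(() => resolve({ timedOut: true }), deadlineMs);
  });
  const done = Promise.resolve(work).then(value => ({ timedOut: false, value }));
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([done, deadline]).finally(() => clearTimeout(timer));
}
```

With timeout: 90000 this leaves 81000 ms for loading and the remaining 9000 ms for computing the CSS from the resources that did arrive.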

stereobooster commented 5 years ago

The timeout may be a Puppeteer bug. Related: https://github.com/peterbe/minimalcss/issues/112

peterbe commented 5 years ago

What @stereobooster said is true.

But I wonder, why do you have enableServiceWorkers: true in there?