harvard-lil / scoop

🍨 High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web.
MIT License

Allow PDF capture in headful mode. #375

Closed. rebeccacremona closed this 2 weeks ago

rebeccacremona commented 2 weeks ago

Background

Our experiments, while limited, suggest that print-to-PDF works just fine with Chromium in headful mode, even though Playwright's docs say, "Generating a pdf is currently only supported in Chromium headless", and their tests only check headless Chromium.

Digging through their codebase and git history, I was unable to uncover any information about why: there are no GitHub issues mentioning it being flaky/unreliable/buggy; there are no PRs adding "experimental support" or the like; the Chrome DevTools Protocol doesn't mention any restrictions in its current version OR in the oldest capture of that page in the Internet Archive...

I can see that the pdf method on Playwright's Page object is optional and may be undefined, but I see no evidence that the method, when it is available, is sometimes unreliable or untrusted (e.g., when running Chromium in headful mode).

And, indeed, there is a report from May 2020 where a user is complaining that page.pdf is throwing TypeError: page.pdf is not a function; they closed the issue when they heard that headful mode was not supported.
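Since the method may be undefined, a caller can feature-detect it rather than infer support from the headless setting. A minimal sketch, assuming a Playwright-like page object; the helper name supportsPdf and the stub objects are hypothetical and are not taken from Scoop or Playwright:

```javascript
// Hypothetical helper: feature-detect pdf() on a Playwright-like page object
// instead of assuming support based on headless/headful mode.
function supportsPdf (page) {
  return page != null && typeof page.pdf === 'function'
}

// Stub objects standing in for real Playwright pages (assumption for illustration).
const withPdf = { pdf: async () => Buffer.from('%PDF-') }
const withoutPdf = {}

console.log(supportsPdf(withPdf)) // true
console.log(supportsPdf(withoutPdf)) // false
```

A guard like this would turn the TypeError from the 2020 report into an explicit "not supported" branch.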

I think support for headful mode was added "by accident" due to some upstream change, and the docs and tests in the Playwright repo were never updated.

This PR

This PR removes the init check that disallows running Scoop with --pdf-snapshot and --headless false.
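For readers unfamiliar with this kind of guard, here is a rough sketch of what an init-time check of this shape could look like, and how the same options behave once it is no longer enforced. The function name, the option names (pdfSnapshot, headless), the enforceHeadlessPdf toggle, and the error message are all assumptions for illustration and are not copied from Scoop's source:

```javascript
// Hypothetical sketch of an init-time options check (names are assumptions).
// With enforceHeadlessPdf = true it mimics a guard that rejects PDF capture
// in headful mode; with false (the behaviour after this PR) options pass through.
function validateOptions (options, { enforceHeadlessPdf = false } = {}) {
  const { pdfSnapshot = false, headless = true } = options
  if (enforceHeadlessPdf && pdfSnapshot && !headless) {
    throw new Error('pdfSnapshot requires headless mode.')
  }
  return { pdfSnapshot, headless }
}

const headfulPdf = { pdfSnapshot: true, headless: false }

// Guard removed: the combination is accepted.
console.log(validateOptions(headfulPdf)) // { pdfSnapshot: true, headless: false }

// Guard enforced: the same combination is rejected at init.
try {
  validateOptions(headfulPdf, { enforceHeadlessPdf: true })
} catch (err) {
  console.log(err.message) // pdfSnapshot requires headless mode.
}
```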

While it is certainly possible there is some good reason for not running that way, and further communication with the Playwright team will uncover why, it seems low-risk to enable for now... and perhaps collect our own data / watch for errors.
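One way to "collect our own data" would be to record the outcome of each PDF attempt alongside the mode it ran in, rather than letting a failure abort the capture. A minimal sketch, assuming a Playwright-like page whose pdf() resolves to a Buffer; the wrapper name tryPdfSnapshot and the shape of the result object are hypothetical:

```javascript
// Hypothetical wrapper: attempt a PDF capture and report the outcome
// (including whether the browser was headless) instead of throwing.
async function tryPdfSnapshot (page, { headless }) {
  try {
    const pdf = await page.pdf()
    return { ok: true, headless, bytes: pdf.length }
  } catch (err) {
    // Covers both a missing method (TypeError) and a failed capture.
    return { ok: false, headless, error: String((err && err.message) || err) }
  }
}

// Usage with a stub page (assumption for illustration; not a real browser).
const stubPage = { pdf: async () => Buffer.from('%PDF-1.4') }
tryPdfSnapshot(stubPage, { headless: false }).then(result => console.log(result))
```

Results gathered this way could be logged per capture to see whether headful runs fail more often than headless ones.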

rebeccacremona commented 2 weeks ago

(The Node 18 tests got hung up, so I canceled and re-ran them... looks good on this pass!)