machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"
https://warcreate.com
MIT License

WARCs of PDF include browser's wrapper #110

Open machawk1 opened 5 years ago

machawk1 commented 5 years ago

When viewing a PDF in Chrome, the browser creates a faux DOM as a wrapper of the PDF content. For example:

<html><body style="height: 100%; width: 100%; overflow: hidden; margin: 0px; background-color: rgb(82, 86, 89);"><embed width="100%" height="100%" name="plugin" id="plugin" src="https://uri/of.pdf" type="application/pdf" internalinstanceid="10"></body></html>

In this case, the Content-Type: application/pdf header in the HTTP response does not match what was captured, which is the HTML wrapper rather than the PDF itself. Whether the wrapper should be preserved may be a philosophical question, but there also seems to be a separate problem with capturing the PDF's contents. This may be because WARCreate checks the faux DOM for embedded resources and runs into a Content-Type conflict: the response is really application/pdf, but the presentation contains content that looks like text/html.
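To make the mismatch concrete, here is a minimal sketch of how a capture could be checked after the fact. It is written in Python for illustration only and is not WARCreate's code; the function names are hypothetical. It relies on the fact that PDF files carry a `%PDF-` signature near the start, so a body that claims application/pdf but lacks that signature is probably the browser's wrapper.

```python
def declared_media_type(content_type_header: str) -> str:
    """Return the bare media type, dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def captured_wrapper_instead_of_pdf(content_type_header: str,
                                    captured_body: bytes) -> bool:
    """Return True when the response claims PDF but the captured bytes are not a PDF.

    Hypothetical helper: the names and the 1024-byte window are assumptions
    made for this sketch, not anything taken from WARCreate.
    """
    if declared_media_type(content_type_header) != "application/pdf":
        return False
    # A PDF starts with the "%PDF-" signature. Looking within the first
    # 1024 bytes tolerates files that have a little leading junk.
    return b"%PDF-" not in captured_body[:1024]
```

A check like this only flags the symptom. It does not say whether the wrapper, the PDF, or both should end up in the WARC.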

Also, it is unclear whether WARCreate is currently configured to scrape the URI in the embed element that Chrome generates.
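As a rough illustration of what scraping that URI would involve, the sketch below pulls the `src` of a PDF embed out of a serialized wrapper DOM using Python's standard `html.parser`. This is an assumption about one possible approach, not a description of what WARCreate does; `PdfEmbedFinder` and `find_pdf_embed_src` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Optional


class PdfEmbedFinder(HTMLParser):
    """Record the src of the first <embed> whose type is application/pdf."""

    def __init__(self) -> None:
        super().__init__()
        self.pdf_src: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Keep the first match only; ignore every other element.
        if tag != "embed" or self.pdf_src is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("type") or "").lower() == "application/pdf":
            self.pdf_src = attributes.get("src")


def find_pdf_embed_src(dom_html: str) -> Optional[str]:
    """Return the embedded PDF's URI, or None if the DOM has no PDF embed."""
    finder = PdfEmbedFinder()
    finder.feed(dom_html)
    finder.close()
    return finder.pdf_src
```

Given the wrapper shown above, `find_pdf_embed_src` would return `https://uri/of.pdf`, which could then be fetched and stored as its own record with the application/pdf Content-Type.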