machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"
https://warcreate.com
MIT License
205 stars 13 forks

Embedded fonts are not included in WARCs #125

Open machawk1 opened 3 years ago

machawk1 commented 3 years ago

On my own site (https://matkelly.com), I reference some fonts that are loaded and used by the CSS of the web page, e.g.,

<link rel="preload" href="/_font/IM_FELL_English_Roman.woff2" as="font" type="font/woff2" crossorigin>

The resource resolution procedure never fetches these, so the HTML representation is affected at replay. The request for the resource does not appear in the WARC.

machawk1 commented 3 years ago

A generic query selector like document.querySelectorAll('link') will return all of the link tags in the document (head), but I am still searching for a less generic way to identify fonts, in the same spirit as the current logic that uses (e.g.) document.styleSheets for CSS.
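
One possibility in that spirit, sketched here rather than taken from WARCreate's code, is to walk document.styleSheets and pull the url() values out of @font-face rules. The function names extractFontUrls and collectFontFaceUrls are hypothetical, and the sketch assumes only top-level @font-face rules matter and that cross-origin stylesheets (whose cssRules cannot be read) can be skipped.

```javascript
// Hypothetical helper: pull url(...) targets out of an @font-face `src`
// descriptor and resolve them against the stylesheet's own URL.
function extractFontUrls(src, baseUrl) {
  const pattern = /url\(\s*(["']?)(.*?)\1\s*\)/g;
  return [...src.matchAll(pattern)]
    .map((match) => match[2])
    .filter((ref) => ref && !ref.startsWith('data:')) // inline fonts need no fetch
    .map((ref) => new URL(ref, baseUrl).href);
}

// Hypothetical walker over a Document-like object. Only top-level rules are
// inspected; @font-face rules nested in @media or @supports are not visited.
function collectFontFaceUrls(doc) {
  const urls = new Set();
  for (const sheet of doc.styleSheets) {
    let rules;
    try {
      rules = sheet.cssRules; // throws for cross-origin sheets without CORS
    } catch (err) {
      continue;
    }
    for (const rule of rules) {
      if (rule.constructor.name !== 'CSSFontFaceRule') continue;
      const src = rule.style.getPropertyValue('src');
      // Inline <style> sheets have no href, so fall back to the document base.
      for (const url of extractFontUrls(src, sheet.href || doc.baseURI)) {
        urls.add(url);
      }
    }
  }
  return [...urls];
}
```

This would only find fonts declared through @font-face; a font referenced solely by a link rel="preload" element, as in the example above, would still need the link-based lookup.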

ibnesayeed commented 3 years ago

You may want to use a "not perfect but good enough" approach of matching patterns in the href, as, and/or type attribute values using the attribute selectors of CSS Selectors in your querySelectorAll call.
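
A minimal sketch of what such a call could look like follows. The selector list and the names FONT_LINK_SELECTOR and findFontLinkUrls are illustrative assumptions, not part of WARCreate, and the file extensions chosen are only a guess at common font formats.

```javascript
// Hypothetical selector: <link> elements whose `as`, `type`, or `href`
// attribute suggests a font. Not exhaustive by design.
const FONT_LINK_SELECTOR = [
  'link[as="font"]',      // e.g. rel="preload" as="font"
  'link[type^="font/"]',  // e.g. type="font/woff2"
  'link[href$=".woff2"]',
  'link[href$=".woff"]',
  'link[href$=".ttf"]',
  'link[href$=".otf"]',
].join(', ');

// Collect de-duplicated font URLs from a Document-like object.
function findFontLinkUrls(doc) {
  const urls = new Set();
  for (const link of doc.querySelectorAll(FONT_LINK_SELECTOR)) {
    // In a browser, link.href is already resolved to an absolute URL.
    if (link.href) urls.add(link.href);
  }
  return [...urls];
}
```

In a page context this would be called as findFontLinkUrls(document). The href suffix selectors miss URLs that carry a query string (for example font.woff2?v=3), which is part of the "not perfect" trade-off.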

machawk1 commented 3 years ago

@ibnesayeed That will be my first approach. I am still investigating whether there are other resources that are missing but represented in these elements. If so, the more generic approach of querying the DOM for link elements would yield additional representations to store.