Transform links in CSS content before injecting into the DOM (in jQuery mode)

Jaifroid commented 6 years ago

In line with what was discussed in #336 (and #335), we should investigate the possibility of transforming links to fonts, and possibly to other content, that are contained in the CSS. These links currently cause a cross-origin or similar error in the console:

downloadable font: download failed (font-family: "Glyphicons Halflings"
style:normal weight:normal stretch:normal src index:1): bad URI or 
cross-site access not allowed source: file:///C:/Users/XXXXX/Source/
I/static/bootstrap/fonts/glyphicons-halflings-regular.woff2

mossroy commented 6 years ago

For example, if we have a CSS file that contains a link to a font (like in bootstrap CSS, see error message above), or a link to an image (by using a background-image property with a URL), they are currently left as they are in jQuery mode. As a consequence, the browser fails to download the corresponding file.

The difficulty here would be to find a way to detect these URLs, read the corresponding files from the backend, and inject them into the DOM.
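
To make the detection step concrete, here is a minimal sketch of how the URLs might be found with a regular expression. The function name `findCssUrls` is hypothetical and not part of the existing code, and a regex is only an approximation of real CSS parsing: it assumes that `url(...)` targets contain no nested quotes or parentheses. Targets that are already embedded as `data:` URIs are skipped, since there is nothing to read from the backend for them.

```javascript
// Hypothetical helper (not from the kiwix-js code base): list the
// non-embedded url(...) targets found in a CSS string.
function findCssUrls(cssText) {
    // Matches url(...) with optional single or double quotes around the target.
    var urlPattern = /url\(\s*(['"]?)([^'")]+)\1\s*\)/gi;
    var urls = [];
    var match;
    while ((match = urlPattern.exec(cssText)) !== null) {
        var target = match[2].trim();
        // data: URIs are already embedded in the stylesheet, so skip them.
        if (/^data:/i.test(target)) {
            continue;
        }
        // Record each distinct target only once.
        if (urls.indexOf(target) === -1) {
            urls.push(target);
        }
    }
    return urls;
}
```

Each returned string would still need to be resolved against the stylesheet's own location before it could be looked up in the ZIM.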

In any case, I'm wondering if it's worth working on that. It is one of the limitations of the jQuery mode (it works well in ServiceWorker mode), and there can be a lot of other edge cases like this one that we might need to handle one by one.

Jaifroid commented 6 years ago

OK, I've looked into this, and what we have are a series of embedded fonts in the CSS with what look to be fallback URLs. This is rather specific to the Stackexchange ZIMs. There are masses and masses of embedded data, that look like this:

@font-face{font-family:'Glyphicons Halflings';src:url(data:application/vnd.ms-fontobject;base64,
n04AAEFNAAACAAIABAAAAAAABQAAAAAAAAABAJABAAAEAExQAAAAAAAAAAIAAAAAAAAAAAEAAAA
AAAAAJxJ/LAAAAAAAAAAAAAAAAAAAAAAAACgARwBMAFkAUABIAEkAQwBPAE4AUwAgAEgAYQBsAGY
...
format('embedded-opentype'),
url(../../../../I/static/bootstrap/fonts/glyphicons-halflings-regular.woff2)
format('woff2'),url(data:application/font-woff;base64,
d09GRgABAAAAAFuAAA8AAAAAsVwAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAABGRlRNAAABWAAAA
BwAAAAcbSqX3EdERUYAAAF0AAAAHwAAACABRAAET1MvMgAAAZQAAABFAAAAYGe5a4ljbWFwAAAB3
...

The error is coming from that URL, which is clearly wrong. While we could search and replace the path in the same way we do for URLs in the HTML, we don't have a static file to point the URL at. It very much looks as if the font data are embedded in any case as base64 data (a lot of data), so it would be redundant and a waste of resources to extract the font from the ZIM, given that the browser can use the embedded data that have already been extracted. There are also some embedded SVGs in the data, with URLs, and extracting these is extremely costly.

In the end, visually I cannot see anything missing on the corresponding page(s), so it looks as if the browser is able to use the embedded data when the URL fails.

I would agree that this is probably a "won't fix", because it would degrade performance and be complex for no gain.

mossroy commented 6 years ago

Very interesting. I'm surprised by this duplication of information, too. Of course, that will depend on the ZIM content.

A different example is the use of a background image in the main page of wikipedia_fr_all_2017-05.zim: there should be a Wikipedia logo below the text "Bienvenue sur Wikipédia", but it does not appear in jQuery mode.

Jaifroid commented 6 years ago

Hmm, to fix the missing image, assuming there are no embedded data for it, we would need a) to extract the stylesheets before we extract the images, and b) a simple and reliable method for jQuery to select a URL in CSS, so that we can add that URL to the image extraction loop and embed its data (or possibly a blob URL) back into the CSS. In principle it's no different from the process we currently use to correct image URLs in the HTML and then loop over them to extract the blobs.
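
As a rough illustration of the last step, embedding the extracted data back into the CSS, the sketch below swaps `url(...)` targets for replacement URIs. It assumes the files have already been read from the ZIM and that each one has a blob or data URI (for example from `URL.createObjectURL`). The function name `rewriteCssUrls` and the shape of the `replacements` map are hypothetical, and the regex makes the same simplifying assumption that targets contain no nested quotes or parentheses.

```javascript
// Hypothetical helper (not from the kiwix-js code base): replace url(...)
// targets in a CSS string with URIs that have already been extracted.
// `replacements` maps an original target to its blob: or data: URI.
function rewriteCssUrls(cssText, replacements) {
    // Matches url(...) with optional single or double quotes around the target.
    var urlPattern = /url\(\s*(['"]?)([^'")]+)\1\s*\)/gi;
    return cssText.replace(urlPattern, function (whole, quote, target) {
        var key = target.trim();
        if (Object.prototype.hasOwnProperty.call(replacements, key)) {
            return 'url("' + replacements[key] + '")';
        }
        // Leave embedded data: URIs and unknown targets untouched.
        return whole;
    });
}
```

Because the extraction itself is asynchronous, the rewrite would only run once every entry in the map has been filled in, and the resulting string is what would be injected into the DOM.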

I guess it would be possible to add an extra CSS image-extraction loop (or turn the current image-extraction process into a function), in order not to have to change the order of extraction of HTML images and CSS.

We need to investigate the issue in the example you cite. It is unusual for a logo that has no real reason to be in a stylesheet to be coded that way!

kiwix / kiwix-js

Transform links in CSS content before injecting into the DOM (in jQuery mode) #338