kiwix / kiwix-js

Fully portable & lightweight ZIM reader in Javascript
https://www.kiwix.org/
GNU General Public License v3.0
286 stars 125 forks source link

Enable injection of javascript content coming from ZIM files in jQuery mode #152

Closed mossroy closed 2 years ago

mossroy commented 8 years ago

The current Ray Charles ZIM file does not have any testable javascript content. And it was causing issues in Firefox and Firefox OS, so I disabled the injection of javascript (both in jQuery and ServiceWorker modes)

So we have to find a test ZIM file that has some, and re-enable injection of javascript

julianharty commented 7 years ago

The PhET Simulations ZIM file is both small and relies on Javascript to render and use the exercises. It might be a useful testbed to test the support of JavaScript in kiwix-html5...

Jaifroid commented 6 years ago

@mossroy Were the JavaScript issues due to the decompression engine hanging, or were they to do with the interaction of JavaScript with the loaded page or app operation? If hanging, then it was probably a similar issue to overloading the engine with SVG decompression or even CSS decompression. As plain-text files are highly compressed, the engine deals poorly with decompressing several things thrown at it at once.

I'm wondering if it will be feasible to re-enable JavaScript injection with a bit of pipeline management.

mossroy commented 6 years ago

It would probably be worth testing again : I don't remember exactly what were the symptoms, and they might have disappeared (if we're very lucky), with the improvements of recent browser engines.

In any case, it will be more difficult in jQuery mode, because, unlike images and CSS, the injection timing of javascript code can sometimes be important. For example, some javascript code is sometimes launched on the "load" event of the page, because the javascript code of the "head" section should be ready at this time. If we don't inject the javascript at the right time, these javascript calls might fail. In ServiceWorker mode, it should work naturally, as the browser engine will handle this by itself.

mossroy commented 6 years ago

After a quick test, I managed to enable javascript in ServiceWorker mode, and it does seem to solve #355. I'll try to test more later, and try to enable it in jQuery mode (which might be harder)

mossroy commented 6 years ago

In jQuery mode, I manage to inject the javascript content in the src tag, but it does not seem to be executed by the browser engine. I don't know if there's a way to ask for that. I tried to use eval() to execute it instead : it works, but seems to be executed in the global context of the document. As a consequence, the jQuery script (coming from the ZIM file) tries to load and sees that it is already loaded (because kiwix-js also uses jQuery). This eval() should be run in the context of the iframe : I don't know if it's possible.

A solution that should work is to inject the javascript content inside the html string, before putting it in the DOM. I don't like that, as we have to use regular expressions to find the src tags. But it's probably the only way to keep the execution order of javascript.

Jaifroid commented 6 years ago

Maybe we can extract the JavaScript file content from the ZIM and make it into a blob, like we do with images and then inject them into the document in the right order? I do this with stylesheets already in Kiwix JS Windows, i.e. I extract the code, store it in the cache, and then insert the code as a blob file. I haven't tried it yet with JavaScript, but this is how I do it with CSS:

var cssBlob = new Blob([content], { type: 'text/css' });
var newURL = [fileDirEntry.namespace + "/" + fileDirEntry.url, URL.createObjectURL(cssBlob)];
blobArray.push(newURL);
if (cssBlobCache) {
    cssBlobCache.set(newURL[0], newURL[1]);
} 
injectCSS();

I'm pretty sure the DOM will execute a JavaScript URL injected using DOM methods even after the document has already loaded.

I'd be happy to do some tests when I get some time, maybe just with dummy content to see if the principle works and on which systems.

Jaifroid commented 6 years ago

A consideration in working out a strategy for this:

I believe we probably must delay the extraction of JavaScript from the ZIM until after the main page has loaded. The newer ZIMs have no fewer than six scripts for each page:

<script src="../-/j/js_modules/jsConfigVars.js"></script>
<script src="../-/j/js_modules/startup.js"></script>
<script src="../-/j/js_modules/jquery.js"></script>
<script src="../-/j/js_modules/mediawiki.js"></script>
<script src="../-/j/js_modules/site.js"></script>
<script src="../-/j/js_modules/ext.cite.a11y.js"></script>

These are located just before the close body tag, so they are intended to be run after first paint. But more seriously for us, unless the scripts are stored uncompressed in the ZIM, we have little hope of extracting all of that with our current engine before we display the HTML and maintaining an acceptable user experience on lower-spec hardware. Have you been able to determine if these are as highly compressed as CSS is (which is extremely slow to extract on lower-spec hardware)? Unless we're very lucky with them, I think we should aim to extract them with or after images, and inject them into the DOM in the right order once we've gathered them all. We can probably inject them with element-creation and appending DOM methods, possibly as script blocks or possibly as blobs.

mossroy commented 6 years ago

I did not check, but it seems very probable that these files are stored compressed : in terms of ZIM file size, it would be a waste not to compress these textual files. And I agree with you that it might give the same performance issues as with CSS.

Jaifroid commented 6 years ago

Commit https://github.com/kiwix/kiwix-js/commit/0b79404089185895c64dcfa5ef146a9e819fbd8c on the javaScript-support branch now adds generic support for converting inline JavaScript, both inside <script>...</script> tags and as inline events such as onclick="...", and onmouseover="..." to attached scripts and to addEventListener events. It works for browsers and CSPs that support running scripts from blobs. Full functionality in Edge, Firefox and Internet Explorer (untested in Chrome yet). In FFOS, it degrades gracefully, with no errors. This branch is rebased on top of our current PR for opening headings, so it starts with all headings open by default.

It's experimental but working generically with the supplied mw-offliner scripts and events. It's probably a long way from merging, but hopefully can be tested, improved, etc. It's all done with DOM methods. It can be applied to jQuery mode and ServiceWorker (untested) depending on the extent to which ServiceWorker mode is affected by the CSP.

To prevent potential performance issues on large documents, I limit the number of events searched for, with a note to DEV to add more if needed. I also limit the number of tags to look inside (with querySelectorAll). The test code is running with events ["click", "mouseover"] and with tags ["h2", "p", "a"] just for testing purposes, and I'm not seeing performance issues. Adding more is just a matter of editing the signalled arrays. I realize this is not fully satisfactory, but an even more generic search would require a fair amount of Regular Expression intervention before we inject the HTML, and may not have performance advantages over simply expanding the lists sensibly and using querySelectorAll. If you can think of a better solution, I'm all ears.

The most important thing here is that, even if we end up not needing to support inline events and scripts (though PhET may well have some), the methods used for attaching and running scripts here can easily be applied to scripts taken from the ZIM, if the store CSPs support running scripts from blobs.

Jaifroid commented 6 years ago

The latest commits (https://github.com/kiwix/kiwix-js/commit/b67c5dac1d8b00dfc2cdff161cde70eef2d07647 and subsequent) radically simplify the procedure, and make it fully generic by using regexes. It actually turned out to be a lot easier than I thought. No longer any need to specify which tags or which events.

The code has a function for attaching found functions with a single Blob that contains all functions. There is also an alternative function that does the same but using the data: URI scheme. Both are provided in case one turns out to be more acceptable than another when it comes to testing CSP for extensions.

The next step is extracting JavaScript and attaching to the iframe, but it must be done after images and css are extracted, so we need to implement some form of queue. Any ideas gratefully received. The most basic method is to check the total number of images and keep count as each loads. But there must be better methods of queue control with promises...

mossroy commented 6 years ago

One minor bug I found : on wikivoyage_en_all_novid_2018-03.zim , click on Europe, then on any link, then use the button to go back to Europe : it triggers the following error in the console : 13:02:23,100 [attachInlineFunctions] The specified functions could not be found in the content window! 1 uiUtil.js:130:17 attachInlineFunctions file:///path/to/kiwix-js/www/js/lib/uiUtil.js:130:17 parseEvents/< file:///path/to/kiwix-js/www/js/app.js:899:21 createScriptBlob/newScript.onload file:///path/to/kiwix-js/www/js/lib/uiUtil.js:65:17

That's really cool, and seems to work on current wikimedia ZIM files. But it does not work in every context : it seems to work in a Chrome extension, but not in a Firefox extension, which does not seem to allow javascript as blob. See at the end of https://developer.mozilla.org/Add-ons/WebExtensions/manifest.json/content_security_policy.

Inline javascript will give us a lot of difficulties with these various CSPs. I think we should also discuss with the openzim team (mwoffliner, and the other ones), to see if it would be possible to generate the ZIM files without inline javascript at all.

Jaifroid commented 6 years ago

Ah yes, I didn't actually test it with the back button. Thanks for pointing that out. Need some code to handle going back.

On Web extensions in Firefox, the implication of that page seems to be that we would need to loosen the CSP by specifying the sript-src as "blob:" in the manifest. That page seems to imply developers have some control over that.

I agree the best solution is to work with mwoffliner etc. to ensure future ZIMs don't use inline JavaScript. This is intended as a workaround for now, but partly also as a stepping stone on the way to full support for scripts in the ZIM.

On the latter point, I was wondering how difficult it would be to run xzdec.js or the file-reader wrappers as a web worker. If I understood what @sharun-s was doing with workers, I believe he out-sourced looking up dir entries to workers, which is slightly different from running the decompression in the worker. I might have misunderstood, but that's what I got from looking through the code for his finder.js.

Using a web worker to decompress specific (groups of) files would also be the reverse of the Service Worker model. @mossroy, please correct this, but this is what I understand is happening with Service Worker: it deals with the browser's requests for files and passes messages to the main app asking it to decompress things the browser needs, which are then passed back to Service Worker via postMessage. Using a Web worker would kind of be the opposite, no? We just hive off the heavy grunt work to another thread and wait for it to post the content back. Shouldn't this be relatively easy? Might it even be possible to do it with some switch in require.js (see http://requirejs.org/docs/api.html#webworker)?

I'm just thinking practically on how we add decompression of JavaScript to an already over-taxed app....

Jaifroid commented 6 years ago

Hi @mossroy , this turns out to be a blocker bug even on our current master. I've been doing some tests on the history.back issue, which only occurs in Firefox (I haven't tested Chrome). What happens is that our iframe.onload function gets fired by Firefox after it returns the iframe to its previous state. This has some bad consequences: we re-parse all the anchors, for example. As a result, even in master, at least on Firefox Quantum, none of the internal links work after going back to a previous page, presumably because they have conflicting event listeners added. Can you confirm? If you can, I'll add an issue for this, because it's a blocker. This doesn't happen in Edge, where everything works fine (Edge actually reloads the previous page from the ZIM, so it's obviously putting the DOM back to an earlier state than Firefox is doing and is re-running app.js, whereas Firefox only seems to run the iframe.onload event).

mossroy commented 6 years ago

That's right, I see that regression too (on Firefox 52 ESR, master branch) : please create a separate issue for that (for version 2.3). It looks like we'll have to wait a bit more before releasing 2.3...

mossroy commented 6 years ago

Regarding ServiceWorker and/or WebWorker :

mossroy commented 6 years ago

NB : I think the require.js feature you mention only allows the use of require.js inside a WebWorker. It does not create automatically a WebWorker around our javascript files.

Jaifroid commented 6 years ago

I've now rebased this javaScript-support branch onto the latest fixes in #366 that inject the stripped class(es) from the html tag into the body tag. And as a result, capable clients (Edge, IE and Firefox in browser context, not tested on Chrome) can now open and close content underneath the h2 headings by clicking on them, using the routines supplied in the ZIM to do this. This is a proof of concept, and is useful in showing that the .innerHTML injection method is not incompatible with JavaScript support.

Jaifroid commented 6 years ago

Regarding inline javascript and ServiceWorkers, it might be dealt differently. First, I'll try to convince that inline javascript should be removed in future ZIM files, because of CSP (that can exist in other contexts, too). I really think that this job (of removing inline javascript) should be done by the ZIM maker, not by the client. It would be much more efficient. And it would avoid having to parse the DOM in ServiceWorker mode (so far we did not need to do it, and it was the idea to be 100% generic). If it's not possible, we'll think about which solutions we have.

Replying to above comment posted in #366 , just to note that PhET ZIMs use extensive inline scripting. See for example the extract below from the front page of PhET. To support this in extensions where the CSP prevents running inline script, but allows Blob script sources, I think we'll need to do some DOM parsing. But we can revisit that when we're closer to a complete solution.

Incidentally, PhET also uses React, as second extract below shows.

<script>
    window.importedData = {"languageMappings":{"en":"English",
"ar_SA":"العربية (السعودية)","be":"беларускі","bs":"Bosnian","zh_CN":"中文 (中国)",
"zh_TW":"中文 (台灣)","cs":"čeština","da":"Dansk","nl":"Nederlands","et":"Eesti","fi":"suomi","fr":"français",
"gl":"Gallegan","ka":"Georgian","de":"Deutsch","el":"Ελληνικά","hu":"magyar","in":"Bahasa Indonesia",
"it":"italiano","ja":"日本語","ko":"한국어","ku":"Kurdish","ku_TR":"Kurdish (Turkey)",
"mk":"македонски","mr":"Marathi","nb":"Norwegian Bokmål","nn":"Norwegian Nynorsk",
"fa":"Persian","pt":"português","pt_BR":"português (Brasil)",
"ro":"română","sr":"Српски","si":"Sinhalese","sk":"Slovenčina","es":"español",
"es_PE":"español (Perú)","th":"ไทย","tr":"Türkçe","uk":"українська","vi":"Tiếng Việt"},
"simsByLanguage":{"English":[{"categories":[[{"title":"By Device","slug":"by-device"}],
[{"title":"By Device","slug":"by-device"},{"title":"Chromebook","slug":"chromebook"}]

PhET extract with React template written in React-JSX:

    <script id='ractive-template' type='text/html'>
        <div class="language-container">
            {{#if languages.length > 1}}
                <label>
                    Selected language:
                    <select value="{{selectedLanguage}}">
                        {{#each languages}}
                            <option value='{{.}}'>{{.}}</option>
                        {{/each}}
                    </select>
                </label>
            {{/if}}
            <label>
                Categories
                <select value='{{selectedCategory}}'>
                    <option value='all'>All</option>
                    {{#each categories}}
                        <option value='{{makeCategoryId(.)}}'>{{makeCategoryName(.)}}</option>
                    {{/each}}
                </select>
            </label>
        </div>
Jaifroid commented 6 years ago

Commit https://github.com/kiwix/kiwix-js/commit/e56ecde3d252490a3551eb1ad126e7ff6f05daef now adds support for extracting JavaScript files from the ZIM and injecting them as Blobs.

It works with Firefox, Edge and IE11 on Wikimedia ZIMs such as WikiMed. The script is able to extract the jQuery (and other scripts), attach them to the document, and enables the open-close functionality without attaching our own local copy of jQuery.

Although this code uses a basic and simple queue to load JavaScript after all CSS has been loaded, it does not yet attach the scripts in order. Therefore we get some console errors of the type "JQuery is not defined", and also a somewhat curious "JQuery is not compatible with Quirks Mode" in all three browsers. Nevertheless, jQuery still works (once attached) for opening and closing the sections.

It does not work on PhET yet. I think this is due to script loading order. At the moment all scripts are being attached at the end of the iframe's document body, whereas some are probably intended to be attached to the header.

Nevertheless, I'm pleased it is working with WikiMed generically without having to attach jQuery externally.

To inspect the contents of attached scripts in the console, it is necessary to comment out this line. This is only for debugging.

The next step is to cache the scripts for performance.

Jaifroid commented 6 years ago

Commit https://github.com/kiwix/kiwix-js/commit/96580cc98045ad4624ed2d8bd774d4858fbedd77 caches the scripts (on a per-session basis) in the cssCache (which should eventually be renamed), and ensures Standards mode as per #368.

Jaifroid commented 6 years ago

The JavaScript support branch has now been rebased on top of commit https://github.com/kiwix/kiwix-js/commit/68d1a4482ac25d92543627248cd352f352f926f4.

Jaifroid commented 5 years ago

The JavaScript support branch has now been rebased on top of current master (commit https://github.com/kiwix/kiwix-js/commit/ed75f710582e7b2cdc4af6135ce369705cc45935).

Jaifroid commented 5 years ago

I've updated the JavaScript support branch to work with new-style ZIMs. It threw up an interesting issue, which is that the new-style ZIM code uses window.onload. This of course never gets executed in jQuery mode if the JavaScript URL isn't in the ZIM before it is written to the iframe. For such events to run, we'd either have to patch them (very, very ugly), or preload the JS. I think we also have to revert to iframeArticleContent.write(), as only that guarantees a window.onload event inside the iframe, if I've understood correctly, but this can be tested.

The updated branch preloads and caches the JS so that performance is not badly affected. It does it by extracting the script URLs from the html string, looking up the scripts in the ZIM and setting a BLOB URI to their contents. Currently it caches the BLOB URI only, but a longer-term solution would be to cache the script contents and store it in indexedDB (we have #415 for that), so that it doesn't need to be extracted from the ZIM each time the app is loaded.

mossroy commented 5 years ago

I'm not surprised window.onload event causes issues here. It might be hard to make that work in a generic way.

Jaifroid commented 5 years ago

@mossroy window.onload is solved in jQuery mode. It's enough to ensure a BLOB reference to the script is in the html before it's injected into the iframe, and then the browser will execute it in the frame, with window transparently interpreted as the iframe's contentWindow. The JS in the new-style ZIM (and the JS in the legacy ones) now works fine. Please try it out with any ZIM in this branch [jQuery mode either from file:// or from localhost].

PhET support in this branch is partial: the landing page still doesn't work, but any experiment loaded with the random button seems to work fine.

Jaifroid commented 5 years ago

The JavaScript support branch has now been rebased on the latest master, so it now includes support in both jQuery and SW modes.

mossroy commented 5 years ago

I've tested a bit, and it works pretty well on PhET (with some exceptions you mentioned) : congratulations! I did not have a look at the code yet, as I don't have much time for that, and try to give higher priority to the work on ServiceWorker mode.

It's a very technical task you're working on, as you're handling things a browser should natively handle. If you want more ZIMs to test, you can try gutenberg for example (at least the main page seems to use some javascript)

Jaifroid commented 5 years ago

I rebased the JavaScript Support branch to enable testing of @ISNIT0's polyfill script for browsers that don't support the details tag and can't run SW mode ( (e.g. FFOS and UWP) . Below is what it looks like in Edge in jQuery mode. Note the sections are closed by default in desktop view. Clicking to open and close them works, and so does tapping (touchscreen).

image

Jaifroid commented 5 years ago

As a separate issue, due to the way this branch loads CSS (I think), the landing page is not displayed well, and has to be resized for the images to be rendered correctly with the CSS. This needs a fix in this branch, but that's not the point of this rebase.

kelson42 commented 4 years ago

Just FYI, this will be necessary if you want to take benefit of the soon to be released first ZIM JS API.

Jaifroid commented 4 years ago

@kelson42 I'm interested in this. Enabling JS support in jQuery mode is possible, and it works with non-complex scripts. It doesn't support complex things like ReactJS currently, as it is too complicated to inject all the required script blocks in a compatible way, and this is an issue with PhET ZIMs. But it may be possible to inject known scripts where the functionality is useful and if the API is fairly generic across ZIM types. Once we have a sample, I'd like to test the possibilities.

Jaifroid commented 4 years ago

I've rebased the JavaScript support branch on master.

Jaifroid commented 3 years ago

The JavaScript Support branch has been rebased in order to test #650 (the ZIMs are ZSTD encoded).

Jaifroid commented 2 years ago

I don't think we're going to provide even any rudimentary JS support in jQuery mode any more due to #727. While I did try to rebase the JS Support branch a couple of months ago, there are now so many inline scripts in the average ZIM (see #789 for a consequence of this), that things do not work well. We are also close to making Service Worker the default, which does not need these workarounds.