kiwix / kiwix-js

Fully portable & lightweight ZIM reader in Javascript
https://www.kiwix.org/
GNU General Public License v3.0
304 stars 130 forks source link

Prevent unnecessary display of 404s before image load #272

Closed Jaifroid closed 7 years ago

Jaifroid commented 7 years ago

The current code shows broken image icons before the images have finished loading, which can be a problem for heavy articles. I noticed that @sharun-s implemented a method for hiding the images until they load. I've improved it by hiding also the fixed image height and width until the images are loaded. The code is below. I've tested on Edge (mobile and desktop) and Firefox. I'd be happy to implement it here as a PR:

EDIT: I've radically simplified the code. Please ignore this and see next comment.

@@ -787,8 +787,23 @@ define(['jquery', 'zimArchiveLoader', 'util', 'uiUtil', 'cookies','abstractFiles
         $("#articleContent").contents().scrollTop(0);

         // Display the article inside the web page.
-        $('#articleContent').contents().find('body').html(htmlArticle);
-        
+        var $body = $(htmlArticle);
+        $body.find('img').each(function(){
+            var image = $(this);
+           // Prevents unnecessary 404's being produced when iframe loads images
+            $(image).attr("data-src", $(image).attr("src"));
+            $(image).removeAttr("src");
+            $(image).attr("data-height", $(image).attr("height"));
+            $(image).removeAttr("height");
+            $(image).attr("data-width", $(image).attr("width"));
+            $(image).removeAttr("width");
+            //Restore image height and size on image load
+            $(image).on("load", function (e) {
+                this.width = $(image).attr("data-width");
+                this.height = $(image).attr("data-height");
+            });
+        });
+        $('#articleContent').contents().find('body').html($body);

         // If the ServiceWorker is not useable, we need to fallback to parse the DOM
         // to inject math images, and replace some links with javascript calls
@@ -854,7 +869,7 @@ define(['jquery', 'zimArchiveLoader', 'util', 'uiUtil', 'cookies','abstractFiles
                 var image = $(this);
                 // It's a standard image contained in the ZIM file
                 // We try to find its name (from an absolute or relative URL)
-                var imageMatch = image.attr("src").match(regexpImageUrl);
+                var imageMatch = image.attr('data-src').match(regexpImageUrl);
                 if (imageMatch) {
                     var title = decodeURIComponent(imageMatch[1]);
                     selectedArchive.getDirEntryByTitle(title).then(function(dirEntry) {
Jaifroid commented 7 years ago

I've simplified the code outlined earlier. I'm not so familiar with the ins and outs of jQuery, so didn't realize there is a simpler way of doing this with jQuery:

            // Prevents unnecessary 404's being produced when iframe loads images
             $(image).attr("data-src", $(image).attr("src"));
             $(image).removeAttr("src");
             $(image).hide(); //Hide images to prevent occupying whitespace
             //Restore hidden images on load
              $(image).on("load", function (e) {
                  $(this).show();
             });
         });
         $('#articleContent').contents().find('body').html($body);

PS As this code isn't radical, and it has tangible benefits, I'll implement it as a PR for evaluation.

sharun-s commented 7 years ago

Unfortunately the problem with javascript (esp jquery) is the code can look simple, but in the background it can seriously screw with performance. Every time a height or width is tinkered with, the browser has to recalculate layout.

The advantage of setting height and width even before image is painted to screen is when the html is loaded for the first time. The browser just has to do one layout pass. (okay 2 because it then goes and retrieves css which would cause another layout pass - issue #149)

With this show/hide code every time an image is ready to be painted the browser would also need to schedule a recalculate layout task. Not only is it going to be a battery drainer it will occupy space/time on the task queue delaying other tasks like read next image or whatever.

btw my code wasn't trying to "hide the images". When $('#articleContent').contents().find('body').html($body) is called http requests are produced for every img src="something" tag being parsed. My code blanks out the src attribute so the devtools console isn't flooded with all those 404 file not found errors. It just made finding real errors a bit easier. Which was the main reason I added it plus a bit of performance gain from not generating all those requests.

Jaifroid commented 7 years ago

Hmm, well the image placeholders are very annoying, and had been bothering me for a while, which is why I had a go with this. I'm not seeing a performance issue on testing this code, especially the simplified version, which only affects the visibility of the image. The loading experience for the user is much cleaner. On a large page, on mobile, the user sees a bunch of broken image links for up to 30 seconds. This way, they see clean text until the respective image loads. I think it's worth testing and profiling at least.

sharun-s commented 7 years ago

:) You just need to be a little patient. 30s is soon going to be old news.

Jaifroid commented 7 years ago

@sharun-s I'm looking forward to it!

Jaifroid commented 7 years ago

I've simplified this even more to avoid any jQuery use. It's now native JavaScript and very fast, using a RegEx on the raw HTML of the article:

         // Display the article inside the web page.
+        //Fast-replace img with data-img and hide image [kiwix-js #272] 
+        htmlArticle = htmlArticle.replace(/(<img\s+[\s\S]*?)src(\s*=)/ig, 
+            "$1style=\"display: none;\" onload=\"this.style.display='inline'\" data-src$2"); 
         $('#articleContent').contents().find('body').html(htmlArticle); 

This has been pushed to PR #273 . Tested on Edge and Firefox.

sharun-s commented 7 years ago

@Jaifroid if you can figure out a regex way to do the same for css content it would be awesome.

Currently

  1. The filename(s) of css link(s) tag in head of the article, has to be pulled out,
  2. Then file retrieved from zim,
  3. The returned content added into a style tag.
  4. And then the style tag is added into iframe head.
  5. Then page is auto redrawn since a new style has been applied.

So the user sees the non css page load first, followed by the styled page a few seconds later. If there are more than one css file the redraw happens again and again.

If you can use a regex to write the style tag(s) into the htmlarticle head before the first time it gets applied -> $('#articleContent').contents().find('body').html(htmlArticle); all these unnecessary redraws will be eliminated. It would close #149. It's been on my todo list for a while, but I have been preoccupied with other bugs.

Jaifroid commented 7 years ago

@sharun-s , I've actually been working on something along these lines. But code is still a bit rough and I have to call for a blob: ref to the .css file to insert it into the raw html.

On this point, how hard would it be for you to extend and abstract your fast image loading functions so that they can also load files with extension .css and .js and return a blob: to them? I know I can use uiUtil.feedNodeWithBlob , but I think you've nailed down a very fast low-level routine for extracting these blobs in parallel.

I imagine some of the speed of your functions is that images for a particular article are stored physically close to the article in the .zim. Do you know if that's the same case for .css files? I imagine that the global Wikipedia .css files are only stored once in the .zim, not once for each article. and that part of the slowdown with .css extraction is not so much the redrawing as the fact that we have to scan through a 16Gb file to find a piece of randomly located .css.

In any case, we'll see if applying blob: hrefs to the .css in the raw html will produce a speedup.

sharun-s commented 7 years ago

I have to call for a blob: ref to the .css file to insert it into the raw html.

I hope this works. Should definitely help my current experiments.

My current loader basically blocks everything until the css and first N image lookups+reads are done. The idea being to optimize load for resources required "above the fold". So if the user can only see 2 images, there is no point overloading the browser with the remaining 200+ lookups and reads until those 2 lookups+reads + CSS are done. This way I am getting images that are viewable to load within 500ms. The current css load+inject unfortunately takes about a second, I think because of the double draw I talked about above and probably lot of unrequired jquery processing. So any shaving down of that number can potentially make things look instantaneous.

Do you know if that's the same case for .css files?

I don't know about this. I hope they are stored only once. What is on my todo list, is to maintain a small cache for common css and js blobs. But a week or two of other work before I get there.

Now thinking about it, if your ref to blob approach works I wonder if there is someway to tell the browser to cache that blob.

mossroy commented 7 years ago

Avoiding the 404 errors is a good idea, and the PR seems to do that quite well. But I don't see the point of removing the image placeholders. When there are width and height attributes in the HTML source, it allows the browser to do the layout once for all, and it allows the user to see that an image will appear there. If the image is hidden, I suppose the layout needs to be done again after each loaded image, which takes some CPU. And, for the user, it makes the layout "move" too : if you scroll the page to read some text below (before everything is loaded), the lines you are reading might be moved several times until every image above is loaded.

Jaifroid commented 7 years ago

I think it's disconcerting for a user opening a page to see a placeholder either with a broken link icon, or a big red cross (I realize not all platforms show broken image links quite so obtrusively), and to be looking at them for up to 30 seconds on a slower mobile.

However, I think @sharun-s 's code speeds up the display of images so much that it's no longer an issue (images load instantaneously on Windows Mobile even on a heavy page), so hiding the placeholders is less necessary. However, we don't yet know (do we?) if that code will work in older browsers, though that may be a matter of tweaking.

I suggest we keep this issue open until the new code has been tested. In the meantime, you can either close the PR and re-open this issue, or I can use style="visibility: hidden" instead of "display: none" if you would like tot apply the PR but reserve the image space on the page before image load. Personally I think it looks neater (while waiting) without the space reserved, but I completely recognize that's subjective.

Jaifroid commented 7 years ago

@mossroy said:

In general, jQuery certainly does the parsing better (in terms of compatibility) than we might do manually, even with a regexp. I'm thinking about old browser compatibility and/or the many ways to write HTML code. For example, I'm not 100% sure what would happen if some style, onload or data-src attributes already existed in the img tag. It seems jQuery can also work on HTML strings with https://api.jquery.com/jQuery.parseHTML/

But I have to say you did it very well, as you handle the case-sensitivity and multiline aspects. Regarding speed, it's certainly much faster than a complete jQuery parsing of the String, that will be done again when the string will be inserted in the DOM with .html(htmlArticle).

Another idea would be to make jQuery parse the string with .parseHTML, change the img src attributes to data-src with jQuery, and then inject the elements in the body with .append instead of .html. So that the jQuery parsing is only done once (I did not test, though). The performance must be tested, but it might be similar, and we might use all the power of jQuery if we want to do more changes in the HTML before injecting it.

Obviously you get convenience and compatibility patched in with jQuery, but it looks like we're getting a serious performance hit on repeated DOM parsing in loops, which is what we've been trying to eliminate (and I'm working on something similar for .css, which is currently terribly slow as @sharun-s points out).

RegExes like this should be compatible even with Symbian, as they've been in JavaScript since December 1999. So long as we don't try to use some of the more exotic forms (e.g. negative look-behind), we should be fine.

Anyway, let's see what the effect of inserting the .css blobs into the raw HTML is in terms of speedup, to see if any of this is worth it.

Jaifroid commented 7 years ago

PS we can basically do data-anything, so we could use a custom data label like data-kiwixsrc. See https://www.w3schools.com/tags/att_global_data.asp

PPS I've now used data-kiwixsrc in the PR, and have commented out the code that hides the whitespace reserved for images.