kiwix / kiwix-js

Fully portable & lightweight ZIM reader in Javascript
https://www.kiwix.org/
GNU General Public License v3.0

ZIM backend : Cache the directory entry list #130

Closed mossroy closed 1 year ago

mossroy commented 8 years ago

To improve performance

sharun-s commented 7 years ago

If the cache is maintained only during asset loading and then thrown away, there is some nice article-loading speedup to be gained.

I noticed this while playing around with replacing util.readFileSlice with XHR range requests. For example, when I loaded the Wikipedia article for Paris, I was seeing ~14000 requests in the network tab of devtools. The total read time alone was 30-35s. This didn't make any sense, as there were only 281 images to be loaded, so on average there were ~50 read requests per image, and most of the images are really small (flags, icons, etc.). So where were all the reads coming from?

Binary search.

Since the ZIM article count is 1.7 million, binary search requires on average log2(1.7 million) steps, or around 20 calls to getDirEntfromtitle (which is itself a 2-step process of reading the index followed by reading the dirent). So what happens if a temporary cache (a Map of url/index -> dirent) is maintained just for the image-loading process? The 14000 requests fell to 4000, and the time fell from 30-35s to 9-11s.

I haven't confirmed this, but it is probably because most of the assets of a given page are stored close together in the ZIM, so a whole lot of binary-search steps (if cached) don't get repeated. The cache size for a large page like Paris was about 3 MB.
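Conceptually, the cache is just a Map consulted before the lookup (a minimal sketch with assumed wiring, not the actual branch code; dirEntryByUrlIndex is the real lookup function):

```js
// Minimal sketch of the per-page-load cache described above (illustrative only)
var direntCache = new Map(); // urlIndex -> dirEntry, kept only for this page load

function cachedDirEntryByUrlIndex(zimFile, urlIndex) {
    if (direntCache.has(urlIndex)) {
        // Binary-search steps over nearby assets revisit the same indices,
        // so this hit path eliminates most of the redundant reads
        return Promise.resolve(direntCache.get(urlIndex));
    }
    return zimFile.dirEntryByUrlIndex(urlIndex).then(function (dirEntry) {
        direntCache.set(urlIndex, dirEntry);
        return dirEntry;
    });
}

// After all images have been injected, the cache is simply discarded:
function discardDirentCache() { direntCache = new Map(); }
```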

I made a note of this here instead of opening another issue, so that any dirent-caching-related comments are in one place. There are further ways to optimize this; I will leave notes in #240, as these two issues are interconnected.

mossroy commented 7 years ago

Very interesting investigation. It would be great to improve performance, which is currently very poor (compared to the native libzim implementation).

@peter-x had started working on some cache implementation and submitted PR #183. Unfortunately, it did not speed things up as much as we expected, so it was not merged (caching can be efficient, but it makes the code more complicated to understand/maintain/debug).

Also keep in mind that if #116 finally works, it would replace this kind of optimization (at least for wasm-compatible browsers).

sharun-s commented 7 years ago

The code here is pretty simple because it's only used during one page load (and is further restricted to image loads, which cause the biggest number of requests). If the cache had to be maintained across page loads, dirent hit rates would drop drastically and the code would become more complex (to keep the cache size in check with the right cache-replacement strategy).

Basically, right now it's just retrieve-from-cache and add-to-cache in zimArchive.getDirEntryByTitle and zimFile.dirEntryByUrlIndex. Because the cache is quite small even for large pages, it can be maintained just for the load duration and then discarded.

Currently the branch is a mess because of other unrelated experiments, but I will clean it up and send a PR when I get some time, or if the other stuff adds to the gains. I have a feeling the current 9 seconds (with cache) spent in lookup can drop further, to around 1-2s. A lot of the code is async, but not concurrent. This is the crux of the performance issue.

That is, all the tasks being produced are queued up (in effectively random order) one behind the other on the main thread, and then there is a long wait for everything to return. Much of this can be parallelized with web workers, XHR requests, etc., so that some of those tasks are done on separate threads.

Whether #116 works out or not, I find this a nice little problem that is helping me learn some JavaScript :)

mossroy commented 7 years ago

Cool: that looks very promising!

sharun-s commented 7 years ago

@mossroy I believe I have some good news for you :-) Please take a look at a new branch I have created called workers when you get time, and use a big ZIM file like Wikipedia to test it.

This branch builds on the binary-search cache I posted about earlier. The basic idea is that all the urlindex/dirent lookups generated by binary search (during the image-loading process of a page) are sent off to a web worker. This is easy to do, as we have the list of images whose dirents we want to find. Every time an image's dirent is found, a message is sent to the main thread to initiate a file read.

So what we get is that dirent lookups (which are in the thousands, thanks to binary search and the large ZIM size) happen in the worker (taking time X), while blob reads (which are in the hundreds) happen on the main thread (taking time Y). Total image load time used to be X+Y; now, with two threads, the total time is whichever of the two is greater. Add the binary-search cache for image loads, and the value of X shrinks quite a bit.
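A rough sketch of the split (the worker file name, message shapes, and helper functions are assumptions, not the actual branch code):

```js
// main.js -- sketch, assuming a worker script named direntLookup.js
var worker = new Worker('direntLookup.js');
worker.onmessage = function (e) {
    // The worker has resolved one image's dirent; the (hundreds of) blob reads
    // stay on the main thread (time Y) while lookups continue in the worker (time X)
    readBlobAndInjectImage(e.data.imageUrl, e.data.dirEntry); // assumed helper
};
worker.postMessage({ imageUrls: listOfImageUrls }); // send all lookups at once

// direntLookup.js -- the worker runs the (cached) binary search per URL
onmessage = function (e) {
    e.data.imageUrls.forEach(function (url) {
        findDirEntryByUrl(url).then(function (dirEntry) { // assumed: binary search + cache
            postMessage({ imageUrl: url, dirEntry: dirEntry });
        });
    });
};
```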

The Paris image load time (lookup + blob reads) has dropped from ~45s to 10s. I have tested with other pages too, and it's looking good :) Firefox, to my surprise, is showing even better performance than Chrome. It also works on Edge!

There is lots more to do and try, including a whole lot of cleanup, so this is a good point to get feedback before adding all kinds of other things.

mossroy commented 7 years ago

Thanks a lot for that! I'll try to have a look over the weekend.

mossroy commented 7 years ago

That's really cool! Amazing work on performance optimization. The speedup is very clear on big articles: the whole page is ready about twice as fast. That proves there is room for serious performance gains in pure JavaScript.

As you said, the code needs a lot of cleanup (for example: remove the readXHRSlice stuff and the commented lines, avoid code duplication inside the webworker, use more meaningful variable names, etc.). As it is, I find it very complicated to get into, which is normal because, as you said, there are also unrelated experiments in it. There is also some other refactoring you made on the code (in app.js) that should eventually be proposed in separate PRs, as it is not related to optimization.

The main concern is to find a balance between speed gain and code complexity. I agree with you that the cache on dirEntries itself is quite simple, but the whole thing is not: several concurrent instances of a webworker, the code of the webworker itself is not trivial, and all this only applies to images, while other content is still read as before. Inside the webworker, why do you loop N times (with N set to 2) on startChain, with an id parameter that is not used? I suppose it's left over from a previous optimization experiment?

I'm wondering if the webworker(s) could do all the work on ZIM files: as soon as we need to read something from the ZIM file, it would be done in a separate thread (through the webworker). That would leave the main thread for the UI only, and avoid code duplication. It's just an idea; I did not look into it.

This optimization also broke a few things for now, which is not surprising; they can certainly be fixed (but it's not the priority).

Globally, I think your research gives very good hints on what can be optimized. We need to find the right way to do it efficiently while keeping the code simple.

sharun-s commented 7 years ago

:) Yup, there are a lot of issues and a lot of possibilities, which is why I posted here and didn't submit a PR. There's lots to discuss before that happens.

I think the right way to do it efficiently involves figuring out #240. A lot of things will fall into place once that process is set.

All of this is possible to do, and to talk about, I feel, only when the code-flow timing clearly reveals itself for any proposed change.

Otherwise it's not obvious which changes take us 5 steps forward and which take us 10 steps back. Once some insight is gained, it is easy to try all kinds of things. And as changes pile up on top of each other, things get complicated very fast. I have a long list of things worth trying next, but there is no use trying them until we have a common, established way to talk about performance.

Time is not straightforward to measure because of the amount of async tasks (Promises) spawning async tasks, spawning async tasks, ad infinitum. While it is possible to measure when a task started and ended, many of the spawned children, grandchildren and so on could still be being processed or sitting on the queue, so endtime-starttime hardly ever captures what is going on. This is why you will see random returns, Promise.resolves and Promise.all, the id parameter, etc. in my code. It was my goal to capture when each promise is resolved and how long they are taking in total, but not being a JS dev I don't know what the best practices are. This timing info gave me the first clues that binary search was the culprit. It further gave ideas on how to parallelize the code by putting the "right" bit into the worker.
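For illustration, one way to capture when a promise actually resolves, rather than when it was created (a sketch only; how the per-label timings get aggregated into totals is assumed):

```js
// Wraps an async task so the log fires when the promise settles,
// which may be long after the task was queued on the main thread
function timed(label, promiseFactory) {
    var start = performance.now();
    return promiseFactory().then(function (result) {
        console.log(label + ' resolved after ' +
            (performance.now() - start).toFixed(1) + ' ms');
        return result;
    });
}

// Usage sketch:
timed('dirent lookup', function () {
    return zimFile.dirEntryByUrlIndex(urlIndex);
});
```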

I need help figuring this measurement stuff out. Whether it's in code or a documented process, it will make sure anyone who touches the code in future is measuring things the same way. This is why #240 is important. The sooner we lock down how to do that, the sooner the foundations are set for all future changes to be performance-sensitive.

Now to address your other comments. On looping N times: I started explaining this, but it became quite a long story (a back-and-forth chat session or something would be more appropriate). We can do that after you get through these two links: run n promises in parallel and the js event loop.

They also helped me understand how exactly to time the $(articleContent img).each(...) main loop, which produced the clues on what needed to be done.
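For reference, the "run N promises in parallel" pattern from the first link looks roughly like this (a sketch; the branch's actual startChain may differ, and lookupAndInjectImage is an assumed helper):

```js
// N chains pull tasks from a shared queue, so at most N lookups are in flight
function startChain(id, taskQueue) {
    var task = taskQueue.shift();
    if (!task) return Promise.resolve(id); // queue drained, this chain ends
    return task().then(function () {
        return startChain(id, taskQueue);  // pull the next task
    });
}

var queue = imageUrls.map(function (url) {
    return function () { return lookupAndInjectImage(url); };
});

// N = 2 as in the branch; the id is only useful for timing/diagnostics
Promise.all([startChain(0, queue), startChain(1, queue)]).then(function () {
    console.log('all images processed');
});
```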

I know it's breaking a lot of things, and I am keeping track of them, but I am treating this as an experiment rather than code ready to be checked in. As you say, this has to be planned out, as many different things need to be worked out. Let me know when you are ready for a chat (you can just mail me). It's hard to type out everything here, as there are too many things to talk about.

mossroy commented 7 years ago

I agree with your approach, and the way to do it. It's too early to create PRs. It's probably better to try a few different optimizations (even if the code is not clean, even if it breaks things), measure the gain, and then decide what to do. I think something like nightwatch might help us, as it's able to wait for an element to be visible. We might ask it to wait for all the images to be visible on a specific big page, and measure how much time it takes for all of them to appear. It would also avoid creating timers that would depend on how the code is implemented. The existing nightwatch_runner.js file (run on each build by Travis/Sauce) might be a start.
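Something along these lines might work (a hypothetical test, not existing code; the URL, selectors, and navigation steps are assumptions):

```js
// Hypothetical Nightwatch test measuring time-to-images on a big article
module.exports = {
    'measure time until a big article is fully visible': function (browser) {
        var start;
        browser
            .url('http://localhost:8080/www/index.html')
            .perform(function (done) { start = Date.now(); done(); })
            // ... steps to select the ZIM file and open the Paris article would go here ...
            .waitForElementVisible('#articleContent img:last-of-type', 60000)
            .perform(function (done) {
                console.log('Images visible after ' + (Date.now() - start) + ' ms');
                done();
            })
            .end();
    }
};
```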

Jaifroid commented 7 years ago

Excellent work, @sharun-s! I've tried it out on Windows Mobile, and the code works very well -- the image load time is greatly reduced, though CSS is no longer loading (as noted by @mossroy above). My Bolivia "test" page (a very heavy page inside a 15GB ZIM file) displays images almost instantaneously on the phone!

Note, however, that the buffer overrun bug is still present in this branch, and it blocks loading of almost any page in a large file on WM10. Could you possibly fix it on your repo to facilitate my testing of further changes on WM10? The easiest would be for you simply to edit xzdec.js: search for "8e7" and change it to "0x8e7", so that bit of the minified file reads: var TOTAL_MEMORY=Module["TOTAL_MEMORY"]||0x8e7;. That might even fix some of the issues with the other browsers.

Thanks for your great work on this -- even in this preliminary form, it would be a great improvement for a beta release of the UWP app (obviously the CSS needs fixing).

sharun-s commented 7 years ago

:) Thanks @Jaifroid, but I wouldn't use this code right now. It will in all likelihood lead to newer, more exotic issues. Worker lifecycle/termination is not being handled at all, so there's no telling right now what issues that might produce. I will get some time this weekend to work on it, and will give you an update on when a more tested/cleaned-up/more stable version will be available. This is going to be a positive change, so I'd rather get it right than rush through it. That said, please report any other issues you see if you are playing with that branch. I have made a note to update the branch with your xzdec fix and other newer changes.

Jaifroid commented 7 years ago

Thanks, @sharun-s. Don't worry, we're not ready to release a beta yet; there are a few things that need cleaning up and debugging elsewhere first, including a problem with the app force-closing during file picking on random occasions, which may be to do with the require.js method of chaining libraries, but I have to debug it. I won't use your code yet, but I will test it in-app, as opposed to in-browser (it works in-browser on Windows 10 Mobile; I haven't yet tested it in the app environment, though it's the same rendering engine).

sharun-s commented 7 years ago

Just an FYI: I have updated my dev branch. Fingers crossed, with this change you should see some nice load times.

Before, if there were N images to load, I was just splitting the lookups between w workers, so an N/w array of tasks would be sent to each worker. This change adds another worker that just looks up images "above the fold". The theory is that the user is only going to see 2-3 images at most within the initial visible portion of the article. The goal is to take every other piece of work off the browser's plate until these images + the first CSS file get injected, so I WAIT for that to complete before initiating any further image reads/injecting (see the sketch below). I have currently set the "above the fold" number to 1 to minimize time spent as much as possible, but it can be higher for similar results, and we can decide what it should be after more testing.
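Roughly, the flow is something like this (helper names and structure are assumed for illustration, not the actual branch code):

```js
// Split the image list: a dedicated worker handles only the visible images
var ABOVE_THE_FOLD = 1; // currently 1; tune after more testing
var foldImages = imageUrls.slice(0, ABOVE_THE_FOLD);
var restImages = imageUrls.slice(ABOVE_THE_FOLD);

// Everything else WAITS until the fold images + first CSS are injected
Promise.all([
    lookupAndInjectAll(foldWorker, foldImages), // assumed helper
    cssLoaded                                   // resolves when the first CSS is injected
]).then(function () {
    // Only now split the remaining lookups between the w workers as before
    dispatchToWorkers(workers, restImages);     // assumed helper
});
```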

I have also reverted to using the iframe instead of the div, so CSS should load and scroll-to-top should work.

The code has to be cleaned up, but unless you guys spot some issues, I think it's ready to be merged.

Jaifroid commented 7 years ago

OK, that's great, thanks @sharun-s. I'll test ASAP on my platforms. Can I just check: are you still using the existing jQuery code for injecting the CSS?

sharun-s commented 7 years ago

@Jaifroid I haven't changed much in the CSS portion, other than adding a promise (cssLoaded) to track when the CSS file has finished loading. You can look up where cssLoaded is used to see how it controls when the workers get created.
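Conceptually it looks something like this (a sketch; injectCss and startImageWorkers are assumed names, not the branch's actual code):

```js
// The cssLoaded promise resolves once the first stylesheet is in place
var cssLoaded = new Promise(function (resolve) {
    injectCss(firstCssDirEntry, function () {
        resolve(); // called when the stylesheet is attached to the document
    });
});

// Worker creation is gated on it, so image work never competes with the CSS load
cssLoaded.then(function () {
    startImageWorkers();
});
```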

Jaifroid commented 7 years ago

@sharun-s, I've tested on Edge, on Edge Mobile, and as a UWP app for PC and for mobile, and it's functioning very well -- congratulations! The images load extremely fast, even on mobile with a very heavy page from a 19GB file. The CSS is still very slow to load on mobile, but a fair bit faster on PC.

FYI, the blob approach works with CSS, but it is difficult to ensure the code is inserted into the HTML before the HTML is injected into the iframe, due to the (necessary) asynchronous design. Your promise approach with the CSS might help, but there would necessarily be a bottleneck if working on the HTML string BEFORE it is injected. This will only be useful if it dramatically speeds up extraction of CSS.

I really don't understand why extracting a little bit of ASCII from a large multi-gigabyte ZIM file takes so much longer than extracting a dozen binary images from the same file, even before your speedups with the images. I can only think this is because the CSS is stored at the end of the ZIM file. If we know roughly where it is stored, maybe we can target the search. I don't understand the search algorithms well enough, but it seems you do.

Jaifroid commented 7 years ago

@sharun-s , @mossroy , I'm away in Paris this weekend from later today, back Tuesday, so I have to pause on the dev work till then. Have a nice weekend!

sharun-s commented 7 years ago

I have mostly been testing on Chrome, but I did a brief Firefox and Edge (on desktop) run and found things mysteriously much faster on Firefox and much slower on Edge. I have to look into what they are doing differently; it might throw up things that can be applied across the board.

My feeling is that the CSS loading issue might not be a retrieval (i.e. read/lookup) bottleneck. Timing-wise, I have seen CSS take about the same time as image lookups+reads, so it is more likely an injection/render issue. I need to test more, though, to verify things; my focus has been more on the image side of things so far. And I think the CSS delay is more pronounced due to the double draw/reflow issue we discussed: if the page suddenly changes style after a second or more, the delay is very noticeable to the user.

Good to hear the blob approach works. Will try it out over the weekend.

When you next get a chance, please save and send me the timing info printed out on the console. "TimeToFirstPaint" holds how much time it's taking for that first image and CSS load. It's just to get an idea of the performance differences from platform to platform. Anyway, no big rush. Enjoy your trip!

Jaifroid commented 6 years ago

@mossroy Having ported @sharun-s's DirEntry cache code to Kiwix JS Windows some time ago (it's been in since the first release of the app in the MS Store), I could very easily port it into Kiwix JS. From memory, it's virtually a couple of lines, and it dramatically increases the speed of subsequent loads of any page (it doesn't affect the speed of the first page loaded), due to caching of the entries for stylesheets and caching of the blobs (there are two possible cache optimizations, one of which I found more useful than the other). I ported this to Kiwix JS Windows in order to increase performance if a user disables the loading of cached assets.

No rush of course, as this would be for v2.3, but I thought I'd offer to do this, having had the experience of integrating it already.

mossroy commented 6 years ago

Thanks a lot for the offer, @Jaifroid. It would indeed be worth porting this to kiwix-js, as the code is not very complicated. But I can think of a few things to be careful about.

mossroy commented 6 years ago

I should also add that there had been an attempt in #183 to do similar things (but the speed improvement was not worth it).

Jaifroid commented 6 years ago

Commit https://github.com/kiwix/kiwix-js/commit/871836c036a6ae4ad86386d62031bfe85c0db7b8 implements the cssBlobCache. Of the two, dirEntryCache and cssBlobCache, it is the latter that makes the most dramatic difference, because the CSS is very slow to load, and once cached (after the first page load, sometimes after the second or third for less-used CSS) it then loads instantly from the blob cache. Thanks to @sharun-s for the underlying code.

Tested and working in the FFOS simulator. Note that I've left console log entries in so that you can see what it's doing (when the cache is hit).
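Conceptually, the cache works like this (a minimal sketch with assumed names, not the code of the actual commit; readCssFromZim is a hypothetical reader):

```js
// CSS URL -> blob URL, surviving across page loads within a session
var cssBlobCache = new Map();

function getCssBlobUrl(zimArchive, cssUrl) {
    if (cssBlobCache.has(cssUrl)) {
        // Cache hit: reuse the blob URL created on a previous page load
        return Promise.resolve(cssBlobCache.get(cssUrl));
    }
    return readCssFromZim(zimArchive, cssUrl).then(function (blob) {
        var blobUrl = URL.createObjectURL(blob);
        cssBlobCache.set(cssUrl, blobUrl);
        return blobUrl;
    });
}

// Must be called wherever a new ZIM is loaded (see the note below):
function clearCssBlobCache() {
    cssBlobCache.forEach(function (blobUrl) { URL.revokeObjectURL(blobUrl); });
    cssBlobCache.clear();
}
```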

VERY IMPORTANT: in this prototype code (for testing) I'm only clearing the cache on loading a ZIM with the File Picker here. However, it will be very important to identify where Kiwix JS running on FFOS / Ubuntu Phone / other packaged apps loads a new ZIM (without using the File Picker), and to add a copy of the cache-clearing code at that point too.

As you know the ins and outs of the FFOS app, can you take a look at where else this code should be added, @mossroy? The cache doesn't persist between sessions AFAIK (well, that might need testing on FFOS -- it doesn't persist between browser sessions). But it must be cleared on loading a new ZIM.

Jaifroid commented 6 years ago

PS the branch for testing is: https://github.com/kiwix/kiwix-js/tree/Backport-CSSCache

Jaifroid commented 6 years ago

PPS: As the comment in the code suggests, it may be sufficient to insert the cache-clearing code in setLocalArchiveFromArchiveList, but I've fiddled too much with that part of the code in Kiwix JS Windows to be confident of putting it in the right place for FFOS.

Jaifroid commented 6 years ago

Hmm, I spoke too soon about it working in FFOS. It appears to work, but it's having trouble loading CSS refs that look like this:

```html
<link href="blob:app://kiwix.org/2fe29d70-2478-4ed9-8e64-0308d9413c7c" rel="stylesheet" type="text/css">
```

I'm getting this error in the console:

Content Security Policy: The page's settings blocked the loading of a resource at self ("script-src app://kiwix.org").

No such error running this code in Firefox. Any thoughts on what the problem is in FFOS, @mossroy ?

Jaifroid commented 6 years ago

OK, I think FFOS is seeing blob: URLs as remote URLs. According to https://developer.mozilla.org/en-US/docs/Archive/B2G_OS/Firefox_OS_apps/Building_apps_for_Firefox_OS/CSP, on FFOS:

Remote styles are banned. All <link> tags must link to files in your app's package. Inline style tags and attributes (