kiwix / kiwix-js

Fully portable & lightweight ZIM reader in Javascript
https://www.kiwix.org/
GNU General Public License v3.0

Can Kiwix-JS deal with an online ZIM file available through HTTP? #595

Closed kelson42 closed 3 years ago

kelson42 commented 4 years ago

I know we have talked about that in the past... but the outcome of the discussion is unclear to me.

Jaifroid commented 4 years ago

Kiwix JS can load a ZIM file into memory over HTTP(S) and use it that way, which is not much good because you might as well be downloading it in that case. Sharun-s had developed a technique using HTTP Range requests (via XMLHttpRequest) to access part of a ZIM file over http(s). It was slow.

Jaifroid commented 4 years ago

I might add that the slowest element here would be binary search of the title list and url list, as we would be opening potentially hundreds of XMLHttpRequests to the server for very small bits of data. There is a caching module (developed by peter-x) that helps in that situation, but given #581 this could still be pretty slow on large files.

kelson42 commented 4 years ago

@Jaifroid I mean without downloading the whole file. Just dealing with it like a local filesystem file, downloading only the clusters and unpacking them?

Jaifroid commented 4 years ago

@kelson42 Sorry, did not see your further query till now. Kiwix JS cannot currently download clusters over http, but sharun-s developed a prototype of that capability in his branch. He used XMLHttpRequest range requests to read clusters from a file on localhost, but there is no reason why the exact same technique could not be used for a remote server. Range requests ask the server for only a small part of a file, so they fit what you are asking about.

The problem is not integrating sharun-s's technique (that is not particularly difficult). The problem is that it would inevitably be slow: binary search in particular is the main problem, because you would generate hundreds of requests to the server for each article. We do have block cache code that would greatly reduce the number of requests by caching dirEntries and clusters that are already requested, and that helps a lot. See #619.
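To illustrate the scale of the problem, here is a minimal sketch (not Kiwix JS code; `urlPtrPos`, `entryCount` and `readDirEntryUrl` are hypothetical placeholders for values and helpers that a real reader would derive from the ZIM header) of a binary search over a remote URL pointer list where every probe is an HTTP Range request:

```js
// Sketch: binary search of a remote ZIM url pointer list via HTTP Range requests.
// Every probe is a separate network round trip (~log2(entryCount) per lookup).

// Fetch an arbitrary byte range of the remote file (the server must support Range).
async function fetchRange(url, offset, length) {
    const resp = await fetch(url, {
        headers: { Range: `bytes=${offset}-${offset + length - 1}` }
    });
    if (resp.status !== 206) throw new Error('Server did not honour the Range request');
    return new Uint8Array(await resp.arrayBuffer());
}

// Hypothetical lookup: urlPtrPos and entryCount would come from the ZIM header,
// and readDirEntryUrl(zimUrl, dirEntryOffset) would fetch and parse one directory entry.
async function findDirEntryByUrl(zimUrl, urlPtrPos, entryCount, targetUrl, readDirEntryUrl) {
    let lo = 0, hi = entryCount - 1;
    while (lo <= hi) {
        const mid = (lo + hi) >>> 1;
        // 8-byte little-endian pointer to the directory entry for the mid-th URL
        const ptrBytes = await fetchRange(zimUrl, urlPtrPos + mid * 8, 8);
        const dirEntryOffset = Number(new DataView(ptrBytes.buffer).getBigUint64(0, true));
        const entryUrl = await readDirEntryUrl(zimUrl, dirEntryOffset);   // another request
        if (entryUrl === targetUrl) return dirEntryOffset;
        if (entryUrl < targetUrl) lo = mid + 1; else hi = mid - 1;
    }
    return null; // not found
}
```

Each lookup costs roughly log2(entryCount) iterations, and each iteration here makes two requests (pointer, then directory entry), which is why a cache that absorbs neighbouring probes makes such a difference.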

So, it may be feasible for smallish ZIM files, but probably excruciatingly slow for anything with a very large url pointer list that needs to be searched.

What's your use case?

Jaifroid commented 4 years ago

PS the server would need to accept range requests: see https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests. I think most do (those that allow resumable download, or skipping through a streaming media file).
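For illustration only (the URL is a placeholder), a quick way to test whether a given server honours Range requests is to ask for a one-byte slice and check for a 206 Partial Content response:

```js
// Sketch: detect whether a server honours HTTP Range requests.
async function supportsRangeRequests(url) {
    const resp = await fetch(url, { headers: { Range: 'bytes=0-0' } });
    // 206 Partial Content means the range was honoured; 200 means the
    // server ignored the Range header and would send the whole file.
    return resp.status === 206;
}

// Usage (placeholder URL):
// supportsRangeRequests('https://example.org/wikipedia.zim').then(ok => console.log(ok));
```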

kelson42 commented 4 years ago

@Jaifroid Thank you for your answer. My feelings about this approach are mixed, and I don't really believe in it for the future (but I might be wrong). But I wanted to get this documented properly for my own knowledge and for a few other reasons as well.

One of them is that @ikreymer, the lead developer of https://github.com/webrecorder, is interested in this approach and has built a similar feature around the WARC format. This is a WIP, but he might give a few pointers to code/a demo here. We might work with him more closely in the near future for the Zimit project, and he is interested in the ZIM format as an alternative storage format to WARC.

Another is that the folks at IPFS, in particular @lidel, are also interested in this approach, and are putting together a grant to hack Kiwix-JS to access ZIM files on IPFS that way. See https://github.com/ipfs/devgrants/pull/49.

Jaifroid commented 4 years ago

Yes, it seems quite specialized, as most users would prefer to access the original resource from which the ZIM was made. It could work as a kind of replacement for Kiwix Serve in local area networks not connected to the Internet, or as a static mirror for Wikipedia where articles can change very fast or get deleted. But it will never be as fast as Kiwix Serve, which could do the same job.

ikreymer commented 4 years ago

Apologies for the delay! I've launched a project I've been working on: https://replayweb.page

It's a fully browser-based replay system for web archives. I've focused on the WARC format initially, as that remains a core source of web archive data, but it would be great to support the ZIM format as well.

The system is designed to replay archived web data in a service worker, loading everything on-demand. (Small files, or cases where on-demand loading is not possible, are loaded fully.)

Here is an example of a 17GB web archive of a complex, interactive website, loaded entirely from static storage: https://replayweb.page/?source=https%3A%2F%2Fdh-preserve.sfo2.cdn.digitaloceanspaces.com%2Fwebarchives%2Ffilmingrev.zip#view=replay&url=https%3A%2F%2Ffilmingrevolution.supdigital.org%2F&ts=20200510030411

This page shows how it can also be embedded within another site in an iframe: https://webrecorder.net/embed-demo-2.html

The whole archive is downloaded from a single file. To make this work, there is a small binary-searchable file that's downloaded first, and it serves as an index to other index data and offsets to WARC data.

I think this could also work for ZIM, but it may be necessary to generate a similar index outside of the ZIM file and package them up together. (The key is that binary search does not happen over HTTP, but locally.) I think it really amounts to having an efficient index that can be loaded client-side.

I am working on a spec for such a ZIP-based format, which would bundle the existing archive, indices and metadata in one package, here: https://github.com/webrecorder/web-archive-collection-format

Happy to answer any questions about any of this!

Jaifroid commented 4 years ago

Fascinating @ikreymer. This looks excellent, congratulations!

For this to work with ZIM files, there are some options I can think of. You are absolutely right that binary search will be the greatest challenge, and I agree that it is necessary to download and cache the index to be searched, because by its nature binary search involves large hops within an index, which become progressively smaller. I think at the very least we would need to access the archive, download the Title Pointer List from it, and save/cache it. For full Wikipedia this would be extremely large, probably unfeasible, but for smaller themed Wikipedia ZIMs it should be fine. Kiwix JS accesses the standard ZIM indices, and it wouldn't be too hard to extend the code to get the full index, so long as the server supports XHR range requests.

There is an alternative for newer ZIMs, which is the Xapian compressed full-text index that some ZIMs contain. Again, this would be impossibly large to download for a full Wikipedia ZIM, but doable for smaller ones I think (I haven't worked with the Xapian indices because Kiwix JS doesn't support them yet).

ikreymer commented 4 years ago

I've added a quick prototype for replayweb.page to load ZIM files, packaged into the WACZ format I have been working on. Here are two examples.

Here is a small wiki (from https://download.kiwix.org/zim/wikipedia/wikipedia_en_100_2020-06.zim, 300MB)

Index

An Article

There is no full text search yet, and I didn't use the title index, but you actually can search ZIM entries by url, for example: Search for 'Em'

And here is full German Wikipedia. In this example, it's just loading the index from DigitalOcean; the .zim is not bundled, as I didn't want to download and re-upload a 37GB file, so it's loaded directly from https://download.kiwix.org/zim/wikipedia/wikipedia_de_all_maxi_2020-05.zim (via https://cors-anywhere.herokuapp.com/ since the download links do not support CORS).

It's a little bit slower, especially on initial load, but I think generally usable:

Index

An Article

Search for 'Kiwi'

Since replayweb.page uses full urls, the urls are mapped to http://zim/<namespace>/<path>, eg. http://zim/A/Kiwix

What I did was generate an index, similar in style to the one I have for WARCs, that allows going directly to the entry. The clustering in ZIM makes it slightly more complicated, as you have to decompress from the beginning of the cluster rather than from the response itself. Not sure if there's any way to avoid that. The index adds 162MB for this large 37GB archive, which is negligible anyway, but it could certainly be reduced in size further -- so far I've optimized for access only.

I used the existing xzdec.js from kiwix-js and copied the approach kiwix-js uses for decompressing the cluster, which can be found here: https://github.com/webrecorder/wabac.js/blob/zim-work/src/zimloader.js#L160

Jaifroid commented 4 years ago

@ikreymer This is hugely impressive! Congratulations!

I'm not sure how much of Kiwix JS you've used other than xzdec, but you should be aware of #630 which has recently been merged and squashes a Kiwix-wide bug that forced the reader to attempt to load the entire ZIM into memory on recent ZIMs which have moved the URL pointer list to the end of the file.

You should also be aware of an upcoming compression format change from XZ to ZSTD. See #611 , where I have begun to make a little bit of progress towards supporting a wasm/asm version of ZSTD decompression. If it's something you can help with, any contribution would be gratefully received.

I'll look into your implementation when I have a bit more time, but it looks very exciting.

Jaifroid commented 4 years ago

@ikreymer You say download.kiwix.org doesn't support CORS, but we had a specific issue on this and it was fixed some time ago. I just checked again with an online checker and I get the following:

(screenshot of the online CORS checker's result)

kelson42 commented 4 years ago

@Jaifroid @ikreymer Considering that this SW reads a ZIM file directly, it will probably be read from one of the mirrors... which might not have CORS configured properly. For this demo, I would recommend using https://mirror.download.kiwix.org/ to ensure we use a Web server we control.

ikreymer commented 4 years ago

@Jaifroid Yes, I was referring to the direct download mirror links specifically. It does not appear that any of them support CORS for the ZIM downloads, including https://mirror.download.kiwix.org/, so I was just using a proxy for now.

The xzdec decompressor is the only component of kiwix-js this is using at the moment. What it's doing is pre-computing a separate binary-search index:

A/Benutzer:The_other_Kiwix_guy/Landing {"mime": 9, "d_off": 1772485, "d_len": 17763, "offset": 4331030089, "length": 294912}

The offset and length are the ranges to fetch for the full cluster, and then the d_off and d_len are the offsets into the decompressed cluster.
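As a rough sketch of how such an index record is consumed (this is not @ikreymer's actual wabac.js code; `fetchRange` and `decompressXz` stand in for a ranged-fetch helper and the xzdec.js wrapper, and cluster header details are glossed over):

```js
// Sketch: serve one entry from a precomputed record like the example above.
async function loadEntry(zimUrl, record, fetchRange, decompressXz) {
    const { offset, length, d_off, d_len } = record;
    // One ranged request for the compressed cluster (ignoring the cluster's
    // leading compression-type byte for simplicity)...
    const compressedCluster = await fetchRange(zimUrl, offset, length);
    // ...then decompress locally and slice out just the blob we need.
    const cluster = await decompressXz(compressedCluster);
    return cluster.slice(d_off, d_off + d_len);
}
```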

By comparison, for WARCs there is only an offset and length, which is the full response to be un-gzipped, as each object is compressed individually.

I wonder if the main perf difference is in the compression itself (it would be interesting to have some benchmarks of decompression of gzip vs xz vs zstd, especially in JS), or the additional offset to the beginning of the response.

ikreymer commented 4 years ago

Since the new Wikipedia EN maxi ZIM just came out, I wanted to make a client-side update for that as well. This uses a fast-access index (~400MB) and then loads directly from https://cors-anywhere.herokuapp.com/https://mirror.download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2020-06.zim in the browser:

kelson42 commented 4 years ago

@ikreymer This is really great; it seems we are really not far from what https://github.com/ipfs/devgrants/pull/49 wants. Does something stop you from using the ZIM article index in place of https://dh-preserve.sfo2.cdn.digitaloceanspaces.com/webarchives/zim/zim-wiki-en-all.wacz? It seems that is the only thing left to be able to complete https://github.com/ipfs/devgrants/pull/49?

ikreymer commented 4 years ago

> @ikreymer This is really great; it seems we are really not far from what ipfs/devgrants#49 wants. Does something stop you from using the ZIM article index in place of https://dh-preserve.sfo2.cdn.digitaloceanspaces.com/webarchives/zim/zim-wiki-en-all.wacz? It seems that is the only thing left to be able to complete ipfs/devgrants#49?

Unfortunately, I don't think the ZIM format alone, as is, will work, because it would require a binary search over the entire index over HTTP, and too many seeks to get the content. As @Jaifroid points out, this won't really scale to this size. The approach I am using is 'clustering' the sorted index, similar to what ZIM does for data, but for the index. The index is grouped into sorted clusters of, say, every 1000 entries, and a binary search is performed over the clusters. This reduces the main index so that it's small enough to be downloaded and binary-searched locally, not over HTTP. (In this case, it's just putting it into IndexedDB and letting it handle it.) Instead of doing a binary search over 20 million entries, it's doing a binary search over ~20 thousand, and then downloading a 'cluster' of index entries. Each entry in that index is also precomputed with the offset/length of the HTTP fetch, and the offset/length into the decompressed chunk. The end result is that with this index, binary search is performed locally, and there are exactly two HTTP fetches per url: one to get the index, and one to get the data. The index stored in the .wacz adds ~400MB, which I think is negligible compared to the 89GB of data.
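A rough sketch of that two-level lookup, with illustrative names (this is not the actual wabac.js implementation, and the real version keeps the index clusters in IndexedDB rather than plain arrays):

```js
// Sketch of the "clustered index" idea: a small top-level index (one entry per
// ~1000 urls) is binary-searched locally; only the matching index cluster and
// the data range are fetched over HTTP (two requests per url).

async function lookupUrl(targetUrl, topIndex, fetchIndexCluster, fetchRange, zimUrl) {
    // 1. Local binary search over the small top-level index (e.g. ~20k entries
    //    instead of ~20 million). Each topIndex entry: { firstUrl, clusterId }.
    let lo = 0, hi = topIndex.length - 1, found = 0;
    while (lo <= hi) {
        const mid = (lo + hi) >>> 1;
        if (topIndex[mid].firstUrl <= targetUrl) { found = mid; lo = mid + 1; }
        else hi = mid - 1;
    }

    // 2. First HTTP fetch: the matching index cluster (~1000 precomputed records).
    const records = await fetchIndexCluster(topIndex[found].clusterId);
    const rec = records.find(r => r.url === targetUrl);
    if (!rec) return null;

    // 3. Second HTTP fetch: the data cluster, then slice the blob locally
    //    (as in the earlier sketch).
    const compressed = await fetchRange(zimUrl, rec.offset, rec.length);
    return { compressed, d_off: rec.d_off, d_len: rec.d_len };
}
```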

You could probably change the ZIM format to have a cluster for the index as well, but I assume that would be a fairly big change to the ZIM format? The goal would be to minimize the amount of random-access seeking that is necessary to find the offset to the blob, given a URL. (With this index, there are 2 seeks only.) One approach might be to have an extra urlPtrList that references, say, every N urls, e.g. every 1000 urls: the first offset points to the 1st url, the next to the 1001st url, then the 2001st url, etc. That could allow that section of the index to be binary-searched locally. However, that alone may not be enough, as that's only for the directory entries, and then there's the cluster and blob lookup. Looking up the cluster offset, and then finding the blob offset, would require an extra seek/read. Perhaps if the list of all cluster offsets+lengths could somehow be precomputed and stored locally, that might allow for a similar access pattern of loading the entire cluster, then finding the blob offset by blob index, though it seems like it would still add extra overhead for access.

The approach I'm using precomputes all of these lookups to minimize access time. I'm sure it's not necessarily the best indexing approach, just one that I knew would work, as it's been implemented before. There are probably other ways to do this too, and to reduce the extra index to be even smaller.

kelson42 commented 4 years ago

@ikreymer Thank you for this interesting explanation. My first thought was like yours and @Jaifroid's. Having thought about it again over the weekend, I no longer believe that the format is unsuitable for this.

Let's take it from scratch: searching the urlPtrList has O(log n) complexity. This means that for 1M entries (kind of a worst-case scenario) you will need ~20 HTTP requests. This might sound like a lot, but in fact:

I believe it could work properly only with a local IndexedDB cache built dynamically. It will probably be a bit slower (in particular for the first pages) than what you have done. If we want maximal and constant speed, we would indeed have to prefill this cache; this could be done relatively easily by shipping the first N levels of the binary search tree as JSON in the ZIM file to populate it (but to me this is just an optimisation).

@mossroy @Jaifroid @lidel What do you think about this?

Jaifroid commented 4 years ago

@kelson42 We have code for a block cache which is particularly effective with binary search (see #619). It was written by peter-x, and has been running on kiwix-js-windows for a few months now. The speedup is most notable in title search which becomes almost instant. The speed of search you see demonstrated at this video time-point is down to the binary search cache.

Whether it would be sufficient to make it usable over HTTP would be a matter of trial-and-error. There are parameters that can be tweaked, but the cache isn't currently permitted to grow to anything as large as ~400MB, as it is currently kept in memory rather than in IndexedDB. The latter is perfectly possible: although slower to access than memory, it should be "fast enough".
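For readers who haven't seen #619, the pattern is roughly the following (a simplified sketch, not peter-x's actual module; block size, capacity and the eviction policy are illustrative):

```js
// Sketch of a block cache in the spirit of the one discussed above (illustrative only).
// Reads are served in fixed-size blocks kept in a Map used as a simple FIFO cache.

class BlockCache {
    constructor(readBlock, blockSize = 4096, maxBlocks = 256) {
        this.readBlock = readBlock;       // async (blockIndex) => Uint8Array
        this.blockSize = blockSize;
        this.maxBlocks = maxBlocks;
        this.blocks = new Map();          // blockIndex -> Uint8Array
    }

    async read(offset, length) {
        const out = new Uint8Array(length);
        let written = 0;
        while (written < length) {
            const pos = offset + written;
            const blockIndex = Math.floor(pos / this.blockSize);
            let block = this.blocks.get(blockIndex);
            if (block === undefined) {
                block = await this.readBlock(blockIndex);
                if (this.blocks.size >= this.maxBlocks) {
                    // Evict the oldest entry (Map preserves insertion order).
                    this.blocks.delete(this.blocks.keys().next().value);
                }
                this.blocks.set(blockIndex, block);
            }
            const start = pos % this.blockSize;
            const n = Math.min(this.blockSize - start, length - written);
            out.set(block.subarray(start, start + n), written);
            written += n;
        }
        return out;
    }
}
```

Because successive binary-search probes land close together once the interval narrows, most of the final steps hit the same cached block, which is where the speed-up described above comes from.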

Jaifroid commented 4 years ago

@ikreymer I'll just reiterate, in case you didn't see it, that the ZIM file compression is changing this summer from XZ to Zstandard. I have compiled a zstdec.js with Emscripten, which is documented here https://github.com/kiwix/kiwix-js/issues/611#issuecomment-654374286 and available on that branch. It may need recompiling with extra functions exposed, but if you are able to help at all with integrating it into our code (given that you clearly have a quick understanding of how to wire up a decompressor outside of Kiwix JS's own code), it would be an immense help both to us and to your own project!

ikreymer commented 4 years ago

To clarify, the binary search is over the initial index, which is about 1.2M here. The total size of all index clusters is ~400MB, but they would not all be loaded unless viewing thousands of articles across the entire range. The binary search over this 1.2M secondary index could be done in memory too, and would probably be faster still -- IndexedDB is used, as storing global data in service workers is discouraged. The way to look at it is that it adds about a ~0.4% increase in storage size (400MB for 89GB) in exchange for significantly improved access time. And even the 0.4% could likely be made lower.

In addition to the binary search for directory entries, there is still the cluster lookup, the blob lookup within the cluster, and fetching the remainder of the cluster, which I think adds another 1-2 fetches (or can the size of the cluster be known so it can be prefetched? I couldn't find that).

So you would probably end up with 4-5 lookups at the least, just a guess. I am still skeptical that it would be performant enough, but there are many other factors. Maybe IPFS would be faster than HTTP fetch, but maybe not.

Luckily with Wikipedia, the page can be rendered right away and it is OK to wait for images to load as long as at least the text is there -- more generally, other pages may require CSS and JS files to load before the page can render, which can lead to unreasonably long wait times before a user sees anything. Another factor is that the browser limits the number of simultaneous connections to the same host to 6, which can end up throttling the loading as well when multiple resources are being loaded on the same page.

But I agree, it could use more testing and experimentation!

Re: Zstd, sure, I could take a look when I have a chance. I didn't see any ZIM files that use Zstd yet; are there some examples? It would be interesting to see how Zstd decompression compares to XZ. Is decompression generally faster or slower compared to XZ?

Jaifroid commented 4 years ago

Thanks @ikreymer. We currently have a small test zim in zstd that just contains numbers, I believe, as test data. I have added it to this branch and directory: https://github.com/kiwix/kiwix-js/tree/Add-zstd-decompression-support/tests .

@kelson42 Do we yet have a more complete demo ZIM file with real-world articles? E.g. Ray Charles or something small? What's the ETA on releasing the first zstd ZIMs?

kelson42 commented 4 years ago

@ikreymer @Jaifroid We will release a more complete demo ZIM file using zstd within the next two weeks, so we can really start to experiment with this.

javgh commented 3 years ago

Hey everyone, I modified Kiwix-JS to access a ZIM file of the English Wikipedia (100 GB) over HTTP and wanted to report on my experience here. You can see the result here:

https://siasky.net/_BmN2WhKvKAm0CzsfmHyiP81yUnYyQxh5To8JkVbj1RUxg/www/index.html

It's not that fast, but I find it to be reasonably usable.

In a first step I used the work done by sharun-s to simply turn all file access operations into XMLHttpRequests. I then integrated the block-level cache developed by peter-x (from this pull request: https://github.com/kiwix/kiwix-js/pull/183) and tuned it a bit to cut down on the number of requests.

This still resulted in a large number of requests, so I added yet another cache. Since a lot of binary searches are happening, I prepared a cache that includes the first few steps for all possible binary searches. In essence the cache allows the binary search to narrow down to around 100 items. To complete the search, it then needs to fetch those items and perform the rest of the binary search. The block-level cache actually helps here and will often fetch those 100 items in one go.
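A minimal sketch of that pre-computation (names and structure are mine, not taken from @javgh's branch): enumerate the dirEntry indices that the first k levels of any binary search can visit, fetch those entries once offline, and consult the resulting map before going to the network.

```js
// Sketch (illustrative only): precompute the dirEntries that the first k levels
// of any binary search can touch, keyed by their index in the url pointer list.
// A live search then only hits the network for the last few steps.

function indicesOfFirstLevels(entryCount, k) {
    // Collect the midpoints visited in the first k levels of the implicit search tree.
    const result = new Set();
    function visit(lo, hi, depth) {
        if (lo > hi || depth >= k) return;
        const mid = (lo + hi) >>> 1;
        result.add(mid);
        visit(lo, mid - 1, depth + 1);
        visit(mid + 1, hi, depth + 1);
    }
    visit(0, entryCount - 1, 0);
    return result;   // at most 2^k - 1 indices
}

// During a search, consult the precomputed cache before going to the network:
async function dirEntryAt(index, cache, readDirEntryFromNetwork) {
    return cache.has(index) ? cache.get(index) : readDirEntryFromNetwork(index);
}
```

With k = 18 and ~20 million entries, a live search only needs the network once the interval has shrunk to roughly 20,000,000 / 2^18 ≈ 76 entries, which is consistent with the "around 100 items" figure above and small enough for the block cache to satisfy in one or two reads.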

The code can be found in this branch: https://github.com/javgh/kiwix-js/tree/skynet The code that cobbles together the binary search cache is here: https://github.com/javgh/kiwix-js/tree/extract-index

This is all done in a very hack-ish manner, so not ready to be merged or anything. It also needs quite a bit of memory (> 100 MB), so might not even be desirable in this form. But should hopefully be helpful as a data point on how well this can perform and be useful to others who are interested in hosting ZIM files on the web.

kelson42 commented 3 years ago

@javgh This is really great, and it seems to work well for me. I would love to see a way to merge your effort cleanly into our Kiwix JS code base.

lidel commented 3 years ago

@ikreymer @kelson42 Potentially useful here: IPFS gateways have very liberal CORS set up by default (access-control-allow-origin: *) so if .zim is published on IPFS, it will be available for fetching via JS from most of the public gateways (+ localhost, if user runs ipfs-companion with go-ipfs daemon or ipfs-desktop app).

ikreymer commented 3 years ago

@lidel Good to know, yes, makes it easier to serve archives from IPFS, and will be needed for the replayweb.page access as well.

Nice to see this implementation! I only took a brief look so far, so apologies if this is incorrect, but it seems that this approach is not too different from what I was trying to do above (no longer working, as the June zim file has been removed), namely that a secondary index cache needs to be computed and made available separately from the ZIM file. To compute this binary search index, it would be necessary to first read the entire ZIM file.

To make this work then, the input to the system is not a ZIM file, but a ZIM file + precomputed index. I would say that this is no longer just reading from ZIM but from a new format, one that combines ZIM + a new index. This is exactly what I've been trying to do with the wacz format, which can bundle together a ZIM file + another index.

I am still not convinced that it is possible to simply take a ZIM file, as is, on its own and have this work, without doing an offline pre-computation and producing a new file format (ZIM + some type of extra index), as done here and in my experiments.

If this is an important use case, perhaps it would be possible to consider changes to the ZIM format to include such a precomputed index in the ZIM itself?

Jaifroid commented 3 years ago

@ikreymer Just to say that most ZIM files do contain a precomputed full text index. It is stored in namespace X and libzim (but not Kiwix JS yet) includes the binary that reads the format.

I haven't checked out @javgh's approach yet, but will do so this weekend! Looks exciting!

Jaifroid commented 3 years ago

PS @javgh I integrated peter-x's block-level cache into Kiwix JS Windows. It really does make a difference with slow file access and/or access over a network, precisely as you say, because it cuts binary search right down. I experimented with this on the PWA version used on Android, because the filesystem is presented to the browser engine in a very, very slow way, basically rendering Kiwix JS unusable. But with peter-x's code, it suddenly became viable to search a 90GB full English Wikipedia in the PWA app on Android (not that I would use it instead of the much faster Android app -- but it's a demonstration of how well that optimization works).

lidel commented 3 years ago

@ikreymer For a PoC, having the index in a separate file is a perfectly fine approach. I suspect that if we create a JS reader for "namespace X", we may have luck leveraging the full-text index that is already present in the ZIM file. If that is not enough, we could augment the ZIM format by adding an additional index to support lazy-loading over HTTP/IPFS.

Either way, the end goal should be that a .zim file has all the info required for operation.

Sidenote: an optimization similar to the block caching mentioned by @Jaifroid could be provided by a local IPFS node for free (without any additional development): in IPFS, data is chunked into blocks, and each block is cached in the local repo after it is fetched for the first time. Then, all future reads of the block are local, so there is no need for remote range requests.

ikreymer commented 3 years ago

@lidel @Jaifroid I was referring to a secondary index for the title/url ptr list, just the title/url mapping to block/blob offset. I don't think full-text search has been tackled yet :)

> Either way, the end goal should be that a .zim file has all the info required for operation.

Yes, my point is just that making loading over HTTP workable requires changing the ZIM format to include an extra index. This would mean changing the ZIM writer to create such an index, perhaps a binary search pre-cache, and build it into the ZIM. But without such a change to the ZIM format/ZIM writers, this extra index will need to be generated and side-loaded separately.

For example, in that prototype, it looks like the ZIM is loaded from: https://siasky.net/fANHCNNOgRNk0uc16BMFb5KoEbWrovtG9efLbS-toIYXmQ?r=0.9875405273101383

And the compressed 'secondary index' (binary search precache?) is loaded from: https://siasky.net/AAAJfZwOeKiZ0uFQikbWoSD3eO6RmhSKWPHnKX8c2_L_ng, and is only 11MB compressed!

In my experiment, the secondary index was much larger (~400MB), as it was storing every URL and reusing a system designed for WARC files, so this seems like a much better approach!

But the fact remains that a secondary index of some sort in addition to the ZIM is needed unless the ZIM format is changed.

Jaifroid commented 3 years ago

@ikreymer You could discuss needed changes on https://github.com/openzim/libzim/issues/424, where there is discussion of a new "fanout" search algorithm (would that algorithm help?). If you have ideas for a change in format that would speed up access, it might be a good place to see what is possible.

mossroy commented 3 years ago

Congratulations on all this promising work! I confirm what @Jaifroid says about changes to the OpenZIM format: they should be discussed on the https://github.com/openzim/libzim git repo. Kiwix-js is "only" a ZIM reader.

Regarding the block-level cache, there is one in the libzim C++ implementation. This library also implements the openZIM features that are currently missing in our Javascript code (like the full-text search in the Xapian index). IMHO, the way to go further is to compile this libzim (and kiwix-lib) with Emscripten, and use them instead of maintaining our current JS backend code (and instead of adding a custom cache implementation in kiwix-js). We started (several years ago) to work on that (see #116) and managed to make it read some ZIM content. There are still many issues that need to be fixed (#508, #509, #510, #511, #512, #513, #514, #515): what was missing was simply time to work on it, on my side. Hopefully someone will finish this work: either me (when I have more free time, which I hope will happen in the following months) or anyone willing to work on it.

kelson42 commented 3 years ago

Both @ikreymer and @javgh have created workable POCs to read ZIM files online in Javascript. To achieve that, they have replaced the online reading of the url/title index with a trivial solution (good for a POC but not for a viable product). The key point now, if we want to take the next step forward, is to create an online index/dirent reader which works efficiently. To me, this seems pretty doable. Changing the ZIM format or adding an additional index will be considered for this use case only if it is proven that there is no other viable path.

Jaifroid commented 3 years ago

@kelson42 I agree that this looks achievable. I also agree with @mossroy that we would hugely welcome help towards achieving #508 etc. (porting of libzim to JS with Emscripten). If any of the devs here could help with that, we'd be very happy!

However, I don't believe the libzim port would help with this issue (reading ZIM files over HTTP) because libzim reads files, not network requests, whereas Kiwix JS just decodes binary data/compressed strings, which can be obtained from anywhere.

Peter-x's block-level cache could be merged in Kiwix JS, if @mossroy agrees (it is trivial to merge and was actually finished). It resides in a separate (RequireJS) module and does not complicate the main codebase. At the time it was not judged to provide a sufficient performance increase to warrant merging. However, having used it in production in KJSW, I think that judgement was based on the then limited title search algorithm. Now that we search many many more case combinations, the block cache provides tangible benefit, and clearly would help with this issue. There are performance comments in #619.

mossroy commented 3 years ago

Mmmh, you're certainly right @Jaifroid

It might be complicated to read ZIM files through HTTP with a libzim port. But this API https://emscripten.org/docs/api_reference/fetch.html might do what we need: if it were (optionally) used in libzim, it could replace the file slice reads with XHR range requests (done by the browser).

In any case, you convinced me to consider #619: it's true that it makes a real difference. If you can make a PR, we would discuss it.

Jaifroid commented 3 years ago

Well spotted @mossroy , yes that API supports byte ranges, which could be interesting depending on whether it directly patches C++ file reads.

kelson42 commented 3 years ago

@mossroy @Jaifroid Now that it looks more or less feasible, I have to ask whether you are interested, whatever the end feature looks like, based on the POCs of @ikreymer and @javgh, in adding the capability to read online ZIM files?

ikreymer commented 3 years ago

> @mossroy @Jaifroid Now that it looks more or less feasible, I have to ask whether you are interested, whatever the end feature looks like, based on the POCs of @ikreymer and @javgh, in adding the capability to read online ZIM files?

@kelson42 My conclusion, based on my experiments and looking at @javgh's work, is that it is only possible by changing the ZIM format to include an extra cache to optimize XHR lookups over HTTP. If you're not willing to change the ZIM format, it is not really feasible with ZIM alone as it is today. I think the first discussion should be about how/what type of index to add to ZIM; maybe it's just what @javgh has generated, or something similar.

Jaifroid commented 3 years ago

@kelson42 peter-x's block-cache feature is leveraged by @javgh, and what I have proposed in #658 fixes some issues with that code and incorporates it correctly into Kiwix JS's codebase. I think it is a prerequisite for a full implementation of @javgh's or @ikreymer's ideas for remote file reading. Personally, I think it could be a USP of Kiwix JS if it could read remote ZIM archives, with or without a generated external title pointer index, and I hope @javgh and/or @ikreymer will continue to work towards a solution that could be merged into Kiwix JS.

kelson42 commented 3 years ago

@Jaifroid My question is whether we should, not whether someone will or could. If "yes", how would this feature integrate, in terms of UX, with the current software? What is USP?

Jaifroid commented 3 years ago

@kelson42 It's very hard to answer directly. Whether we "should" depends on whether there is a use case and whether there is enough developer time. I personally use Kiwix JS probably like the majority of people, i.e. for offline access to ZIM archives I have downloaded. However, I can imagine a strong use case for accessing a large ZIM or library of ZIMs over an intranet in areas where access to the Internet is restricted. You already have a solution for this, but in this configuration Kiwix JS could provide an interesting non-hardware-based, client-side alternative. Another use case could be "sampling" ZIM archives before downloading one.

In terms of UX and UI, @javgh's version uses the standard Kiwix JS interface, but dedicates the app to one remote ZIM for now. We would need a way to discover other remote ZIMs, but that wouldn't be hard. @ikreymer's approach is a very different, dedicated system that only uses the Kiwix JS back end. Currently in both cases there is a requirement for an external index, so some specialist approaches are required that only @javgh and @ikreymer could drive forward.

USP (Unique Selling Point) is jargon, sorry - I just meant "unique feature of this client".

kelson42 commented 3 years ago

@Jaifroid @mossroy I think being able to read online ZIM files directly via HTTP might open interesting new perspectives, at least the possibility of reading ZIM files directly from IPFS. It is a bit adventurous, but might be worth the effort. For the project, this is not a priority, but I feel ready to support the idea. I also believe that this feature could be integrated relatively easily into the current Kiwix JS by allowing the user to choose a URL and then offering something like an "Online Preview" checkbox. I would recommend closing this ticket, as we have the answer to the original question, and opening a new ticket to request this new feature (I volunteer to phrase it).

Jaifroid commented 3 years ago

@kelson42 I agree!

javgh commented 3 years ago

I wanted to add some more technical details regarding my POC, as it might be helpful for future design decisions. It's indeed similar to what @ikreymer did before in that it's a ZIM file + precomputed index. The Wikipedia ZIM file has around 20 million dirEntries (around 6 million actual articles). Just all the pointers for the dirEntries are already 160 MB and then a dirEntry itself is around 50 bytes in the file, so all dirEntries add up to around 1 GB. Binary searching through this will take around log_2(20 million) ~= 24 steps. My precomputed index contains the results for the first 18 steps for all possible searches. These are 2 ^ 18 = 262144 items and I'm doing it both for "search by url" and "search by title", so it's twice that. For the index I reused the toStringId() and fromStringId() functions that a dirEntry has. The precomputed index looks like this:

  "byUrlIndexCache": {
    "77": "98209341060|6|-|99717|142|s/magnify-clip-rtl.svg||false|undefined",
    "154": "98209343468|65535|A|undefined|undefined|!Atame!||true|12850053",
    "232": "98209345661|65535|A|undefined|undefined|!Ora language||true|7094622",
[...]
  "byTitleIndexCache": {
    "77": "98209341060|6|-|99717|142|s/magnify-clip-rtl.svg||false|undefined",
    "154": "98209343243|65535|A|undefined|undefined|!Alla tu!||true|14219434",
    "232": "98209345403|65535|A|undefined|undefined|!O!ung||true|11419802",
[...]

It's around 50 MB in size. I then compressed it with xz to 11 MB, as it was easy to reuse Kiwix's xz decompressor.

I tuned the block-level cache for a fairly large block size, as I wanted to ensure that the last 6 or 7 steps of any binary search would trigger a single read by the block cache and then be able to complete with just that. So my settings there are: 4096 cache slots with a size of 16384 each (so around 67 MB in total). Perhaps the block size would not need to be quite as large. The stats for a first search look something like this:

hits: 806 misses: 195 Perc: 0.8051948051948052
hits: 989 misses: 12 Perc: 0.988011988011988
hits: 999 misses: 2 Perc: 0.998001998001998
hits: 974 misses: 27 Perc: 0.973026973026973
hits: 1001 misses: 0 Perc: 1
hits: 1001 misses: 0 Perc: 1
hits: 1001 misses: 0 Perc: 1

Regarding Xapian: I hadn't looked at it before, but I just now checked it for the Wikipedia file. It has a "Xapian Fulltext Index" and "Xapian Title Index". The title index alone is already 2.5 GB in size, so I'm not sure if it would be helpful as an alternative to such a custom precomputed index. I found this paragraph on https://xapian.org/docs/scalability.html:

> One effect to be aware of when designing benchmarks is that queries will be a lot slower when nothing is cached. So the first few queries on a database which hasn't been searched recently will be unrepresentatively slow compared to the typical case. In real use, pretty much all the non-leaf blocks from the B-trees being used for the search will be cached pretty quickly, as well as many commonly used leaf blocks.

I wonder if it would be possible to have an Emscripten-compiled version of Xapian that does HTTP range requests into the Xapian database and how fast that would be.

Jaifroid commented 3 years ago

Pinging @cptolemy who is interested in the issues behind this thread.

kelson42 commented 3 years ago

@Jaifroid Why reopen this issue?

Jaifroid commented 3 years ago

@kelson42 Only because @cptolemy wrote to me on another repo about wanting to develop a very similar-sounding solution to those already extensively discussed above (and partially implemented) by @lidel, @ikreymer and @javgh. I promised to ping cptolemy in this issue, and thought it could be re-opened in case he has anything to add. I'll close it again in a few days if there is no further discussion or query.

kelson42 commented 3 years ago

OK. #659 is open and states clearly what should be done, IMO, in contrast to this ticket, which is a question that has already been answered. Closed tickets can still welcome comments.

Jaifroid commented 3 years ago

Ah, OK. This one has all the research/technical information, but as you say it can still be accessed while closed. So I'll close this one, and if @cptolemy wants to comment, it's perhaps best to do so on #659.