Hello1024 opened this issue 2 years ago (open)
Update: I have re-run the analysis with a smarter 'tree' of clusters, still never decompressing more than 2 MB in total to access any blob, and we now get a 7.1% decrease in file size. Alternatively, for a fixed file size, access is now over 4x faster (needing to decode at most 420 kB, vs the 2 MB baseline).
Do these numbers make it worth adding this complexity?
I'm interested in this approach. To me it looks similar to what we discussed a few years ago in https://github.com/openzim/libzim/issues/76.
What is unclear to me is how to choose the parent cluster for each cluster… in a way that ultimately makes decompression faster in most (if not all) scenarios.
My latest approach (which is just a pile of python to try out ideas - nowhere near implementation stage) constructs a tree with the following logic:
Put the documents in a tree, in pre-order traversal order. I added the documents in alphabetical order; nothing smart.
Each path from root to leaf is capped at 2 megabytes. Each node is limited to 4 children; the root node may have more.
The benefit of this approach is that the whole lot can be written out sequentially with no substantial RAM requirements.
When reading, you never need to decode more than 2 MB.
This approach might not be the best - just the first that came to mind.
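For concreteness, here is a minimal sketch of how such a tree could be built under those constraints. It is my own illustration rather than the script behind the numbers above: the split of the 2 MB path budget into 4 levels of 512 kB per node is an assumption, since only the path cap and the fan-out of 4 are fixed in the description.

```python
# Rough sketch (not the actual POC linked below) of one way to build such a
# cluster tree: documents are taken in the given (alphabetical) order and
# placed in pre-order, every non-root node gets at most 4 children, and the
# depth/per-node budget split (4 levels of 512 kB) is my own assumption.

MAX_PATH_BYTES = 2 * 1024 * 1024    # data decompressed to reach any blob
MAX_CHILDREN = 4                    # fan-out of every node except the root
MAX_DEPTH = 4                       # levels below the root
NODE_BUDGET = MAX_PATH_BYTES // MAX_DEPTH

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.docs = []               # (name, size) pairs stored in this cluster
        self.size = 0

def build_tree(docs):
    """docs: iterable of (name, size) pairs, already sorted alphabetically."""
    root = Node()                    # the root may have any number of children
    it = iter(docs)
    doc = next(it, None)
    while doc is not None:
        subtree = Node(parent=root)
        root.children.append(subtree)
        doc = _fill_preorder(subtree, 1, it, doc)
    return root

def _fill_preorder(node, depth, docs_iter, pending):
    """Fill `node`, then its children, in pre-order.  Returns the first
    document that did not fit in this subtree (or None when done)."""
    doc = pending
    # Fill the current node up to its share of the 2 MB path budget.  A blob
    # larger than the budget gets an otherwise-empty node of its own (the
    # cap is then exceeded for that one blob only).
    while doc is not None and (node.size + doc[1] <= NODE_BUDGET or not node.docs):
        node.docs.append(doc)
        node.size += doc[1]
        doc = next(docs_iter, None)
    # Then descend into at most MAX_CHILDREN children, if depth allows.
    while doc is not None and depth < MAX_DEPTH and len(node.children) < MAX_CHILDREN:
        child = Node(parent=node)
        node.children.append(child)
        doc = _fill_preorder(child, depth + 1, docs_iter, doc)
    return doc
```

Because nodes are filled in pre-order, the whole structure can be compressed and written out in a single sequential pass, which is where the low RAM requirement comes from.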
This is definitely interesting. Do you have some code (even a quick-and-dirty PoC) showing how you regroup the content into clusters and how it is decompressed?
I see a few potential issues with this approach:
But the numbers you are providing are worth at least a further discussion to see what we can do.
I am slowly working on a new archive format (https://framagit.org/jubako), and I will probably integrate your idea into it.
Here is my quick-and-dirty POC, which compresses and decompresses. Adjust the compression level on line 20; most benefits are seen at higher levels, but obviously that's slow for testing.
https://gist.github.com/Hello1024/9cc20300205c6aec214519cf84bbebb3
One potential disadvantage of this approach...
While random access is very fast, decompressing all the data sequentially becomes slow with the APIs available in libzstd or liblzma.
That is because, to decompress a single entry, you simply decompress each node on the path from the root down to the node you care about. However, when you then want to decompress another node, you have in principle already decompressed the root, so there should be no need to decompress it again... But with the API available in zstd, there is no way to get the decompressor back into the state it was in after decompressing the root node, short of decompressing the root node again.
Theoretically, all that's needed is an API to 'snapshot' the internal state of the decompressor. While such an API appears to exist (ZSTD_copyCCtx()), the comments in the source code show it cannot be used once data has already passed through the compressor/decompressor, making it not useful for our purposes.
Considering that ZIM mostly needs fast random reads and a high compression ratio, this tradeoff seems acceptable, especially since it is not a fundamental limitation but merely an implementation detail that could be fixed if really necessary.
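To make the replay concrete, here is a small sketch of what reading one cluster looks like with the streaming API of the python-zstandard bindings. This is my own illustration, not code from the gist above, and the layout where each cluster owns an (offset, length) range of a chained zstd stream is also an assumption:

```python
import zstandard as zstd  # python-zstandard bindings

def read_cluster(archive, ranges, chain):
    """Decode the last cluster of `chain`, a list of cluster ids going from
    the root cluster down to the one we want.  `archive` holds the compressed
    file contents and `ranges[cid]` is the (offset, length) of cluster `cid`
    inside the chained zstd stream (assumed layout, see above)."""
    dobj = zstd.ZstdDecompressor().decompressobj()
    out = b""
    for cid in chain:
        off, length = ranges[cid]
        # Every ancestor is decompressed again on every access and its output
        # thrown away: zstd offers no supported way to snapshot the
        # decompressor state after the ancestors, so that work cannot be
        # reused by the next read that shares the same parents.
        out = dobj.decompress(archive[off:off + length])
    return out  # decompressed content of the requested cluster only
```

Dumping every cluster sequentially therefore re-feeds each shared parent once per descendant, which is where the slowdown compared to plain per-cluster compression comes from.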
ZIM currently allows data to be arranged in clusters, which are then compressed. Large clusters give better compression, while small clusters allow faster access to a random article.
In a future version of the ZIM file format, this tradeoff could be substantially improved by allowing clusters to have an extra field specifying "parent cluster for compression context".
To decode a given cluster, the decoder would first have to decode the parent cluster, and then decode the cluster it needs, using the same instance of the data decompressor (so internal state is maintained, and therefore a better ratio is achieved).
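A minimal sketch of that mechanism, again using the python-zstandard bindings (my own illustration of the idea, not a proposed on-disk format): the child cluster is compressed as a continuation of the parent's stream, a block flush after each cluster keeps both individually addressable, and decoding the child means pushing the parent's compressed bytes through the same decompressor first.

```python
import zstandard as zstd  # python-zstandard bindings

def compress_pair(parent_data, child_data, level=19):
    """Compress a child cluster in the compression context of its parent."""
    cobj = zstd.ZstdCompressor(level=level).compressobj()
    # Parent cluster: starts the stream; flushing at a block boundary makes
    # its compressed bytes an addressable, decodable prefix.
    parent_blob = cobj.compress(parent_data) + cobj.flush(zstd.COMPRESSOBJ_FLUSH_BLOCK)
    # Child cluster: compressed with the state left over from the parent, so
    # it can back-reference the parent's data and compresses roughly as well
    # as if both were in one big cluster.  (A real writer would eventually
    # terminate the frame with COMPRESSOBJ_FLUSH_FINISH.)
    child_blob = cobj.compress(child_data) + cobj.flush(zstd.COMPRESSOBJ_FLUSH_BLOCK)
    return parent_blob, child_blob

def decompress_child(parent_blob, child_blob):
    """Decode only the child: the parent must go through the same
    decompressor instance first, exactly as described above."""
    dobj = zstd.ZstdDecompressor().decompressobj()
    dobj.decompress(parent_blob)        # output discarded, state retained
    return dobj.decompress(child_blob)  # the child's content
```

The same chaining extends to deeper trees: a grandchild is compressed after its parent and grandparent and decoded by replaying both. With more than one child per parent, the writer also has to re-establish the parent's compressor state separately for each child (for instance by recompressing the parent chain), since, as noted above, zstd's context-copy helper cannot be used once data has passed through the compressor.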
This change means that, for wikipedia_nb_50000_nopic_2021-05.zim: …

There is also the opportunity to cache decompressor states for commonly used parents, substantially reducing access times to multiple documents.
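For illustration only, here is a sketch of such a cache, written against a hypothetical decompressor that can clone its internal state mid-stream; as noted earlier in the thread, zstd does not expose this today, so `clone()` below is not a real API.

```python
# Illustration only: `decoder.clone()` is a HYPOTHETICAL snapshot/restore
# primitive that zstd does not currently provide; the cluster layout
# (byte ranges of a chained stream) is the same assumption as in the
# earlier sketches.
class ParentStateCache:
    def __init__(self, archive, ranges, new_decoder):
        self.archive = archive          # compressed file contents
        self.ranges = ranges            # cluster id -> (offset, length)
        self.new_decoder = new_decoder  # factory for a streaming decompressor
        self.states = {}                # parent chain (tuple of ids) -> primed decoder

    def read_cluster(self, chain):
        """chain: cluster ids from the root cluster down to the wanted one."""
        *parents, target = chain
        key = tuple(parents)
        decoder = self.states.get(key)
        if decoder is None:
            # Cache miss: replay the parents once and remember the state.
            decoder = self.new_decoder()
            for cid in parents:
                off, length = self.ranges[cid]
                decoder.decompress(self.archive[off:off + length])
            self.states[key] = decoder
        # Clone so the cached state is not consumed by this read.
        decoder = decoder.clone()       # hypothetical snapshot of decoder state
        off, length = self.ranges[target]
        return decoder.decompress(self.archive[off:off + length])
```

With such a primitive, hot parents (such as the root cluster) would be replayed once per process instead of once per access.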
Is there interest in taking this approach?