Support extended cluster offsets for ZIM major version 6

kiwix / kiwix-js

Fully portable & lightweight ZIM reader in Javascript

https://www.kiwix.org/

GNU General Public License v3.0

Support extended cluster offsets for ZIM major version 6 #716

Open Jaifroid opened 3 years ago

Jaifroid commented 3 years ago

See the new ZIM file format specification for clusters https://wiki.openzim.org/wiki/ZIM_file_format#Clusters

This would not be a difficult fix, I think, but it's not clear to me whether any ZIM files exist with extended cluster offsets. Extended offsets mean that the blob offsets in the cluster are 8 bytes wide rather than 4 bytes. This means we would need to multiply the offset index by 8 instead of by 4 when we detect an extended cluster format. We will no doubt also need to modify the way we read the offset bytes as well.
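
To make the "multiply by 8 instead of by 4" point concrete, here is a minimal sketch of reading one entry from a cluster's offset table. The function name `readBlobOffset` and its parameters are hypothetical and are not taken from the kiwix-js code base; it assumes the offset table is little-endian and is already available as a `DataView`. The 8-byte case is read as two 32-bit halves so that it does not depend on BigInt support.

```javascript
// Hypothetical helper (not part of kiwix-js): read the offset at position
// `index` in a cluster's offset table. `view` is assumed to be a DataView
// starting at the first offset; values are assumed to be little-endian.
function readBlobOffset(view, index, isExtended) {
    var entrySize = isExtended ? 8 : 4; // bytes per offset entry
    var pos = index * entrySize;
    var low = view.getUint32(pos, true);
    if (!isExtended) {
        return low;
    }
    // Extended cluster: combine two 32-bit halves into one JavaScript number.
    var high = view.getUint32(pos + 4, true);
    if (high > 0x1FFFFF) {
        // The result would exceed Number.MAX_SAFE_INTEGER (2^53 - 1)
        throw new RangeError('Blob offset is too large to represent exactly');
    }
    return high * 0x100000000 + low;
}
```

A caller would pass `isExtended` according to whatever it detected when reading the cluster header.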

kelson42 commented 1 year ago

@Jaifroid Do we still really need this (actually, I don't know what these extended cluster offsets are or would be used for), considering this is supported in the wasm libzim? @mgautierfr Would you be able to give more info on the usage of this feature?

Jaifroid commented 1 year ago

I stumbled across this specification change when developing the ZSTD support in Kiwix JS (I think), and noted it as something we might need to watch out for if the specification is ever used. I think this issue is just a marker for that, so I don't forget. The problem is that for now we're not ready to strand the userbase that may not be able to use the libzim port -- anyone with an old computer in the developing world, for example, or... a certain famous user of the Icecat extension. So, we'll still need to keep up with any format changes for a while yet. However, it's not clear to me if or when extended cluster offsets will be used.

mgautierfr commented 1 year ago

Extended clusters are used to store (individual) content bigger than 4GB (called "big content" here). As the offsets in the cluster are used to point to the data of the content, if the content is more than 4GB, we cannot use 4-byte offsets.

This can be used for any content bigger than 4GB (a big video, for example), but we faced the problem when storing Xapian indexes, which can be pretty big for a huge ZIM file.

Also note that we try to keep cluster size under a few MB. So such big content is stored alone in a cluster (the cluster contains only one blob). But this limitation on cluster size is a libzim implementation detail; the specification does not enforce it. In fact, the reader must be ready to read an extended cluster even for content smaller than 4GB.
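
Since a reader has to handle an extended cluster whatever the content size, the decision has to come from the cluster header rather than from the size of the blob. The sketch below assumes the layout described in the ZIM file format specification as I understand it: the low four bits of the cluster's first byte give the compression type, and bit `0x10` marks the cluster as extended. The function name `parseClusterInfo` is hypothetical; check the bit positions against the specification before relying on them.

```javascript
// Hypothetical helper: decode the first byte of a cluster.
// Assumed layout (verify against the ZIM specification):
//   bits 0-3 : compression type
//   bit  4   : extended flag (0x10), meaning 8-byte offsets
function parseClusterInfo(infoByte) {
    var extended = (infoByte & 0x10) !== 0;
    return {
        compression: infoByte & 0x0F,
        extended: extended,
        offsetSize: extended ? 8 : 4 // bytes per entry in the offset table
    };
}
```

The returned `offsetSize` is the only value the rest of the offset-reading code needs in order to treat normal and extended clusters uniformly.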

Jaifroid commented 1 year ago

OK, thanks @mgautierfr, it looks like I should add this to the backend, even if we are only reading Xapian indices currently with libzim (which has the capability built in). As far as I remember, it didn't look like a difficult fix.

mgautierfr commented 1 year ago

No, it should not be difficult. The reading code in libzim is here: https://github.com/openzim/libzim/blob/main/src/cluster.cpp#L89-L128. As you can see, there is very little difference between "normal" and extended parsing.
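
As a rough illustration of how small that difference can be in JavaScript, here is a sketch that locates one blob inside already-decompressed cluster data. It is not a port of the libzim code linked above. The name `getBlobRange` is hypothetical, and the sketch assumes that `data` is a `Uint8Array` starting just after the cluster information byte, that offsets are little-endian and relative to the start of `data`, and that the first offset divided by the entry size gives the number of offsets (one more than the number of blobs).

```javascript
// Hypothetical helper: return the byte range of blob `blobIndex` within
// decompressed cluster data (the bytes following the cluster info byte).
function getBlobRange(data, blobIndex, isExtended) {
    var view = new DataView(data.buffer, data.byteOffset, data.byteLength);
    var entrySize = isExtended ? 8 : 4;
    // The only branch that differs between normal and extended clusters
    function readOffset(i) {
        var pos = i * entrySize;
        var low = view.getUint32(pos, true);
        if (!isExtended) return low;
        var high = view.getUint32(pos + 4, true);
        // Exact only while the value stays below 2^53
        return high * 0x100000000 + low;
    }
    // The first offset points just past the offset table itself
    var offsetCount = readOffset(0) / entrySize;
    if (blobIndex < 0 || blobIndex >= offsetCount - 1) {
        throw new RangeError('Blob index out of range');
    }
    return { start: readOffset(blobIndex), end: readOffset(blobIndex + 1) };
}
```

With the range in hand, `data.subarray(range.start, range.end)` would give the blob's bytes without copying.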

kelson42 commented 1 year ago

@Jaifroid Today, the only cluster which can be that big is the fulltext index. We have never generated, and won't (in the near future) generate, a ZIM file with a 4GB+ cluster. Nice if this is quick to implement, but 99% of the time this won't be used.