IMayBeABitShy commented 11 months ago

Introduction

This issue proposes the removal of the requirement that the offset of the mimetypelist must be exactly 80 from the ZIM file specification. It follows a short conversation on the kiwix slack about the potential impact of such a change and is intended to open a discussion whether this could and should be done.

Motivation

Currently, the ZIM file specification requires the mimetypelist to be written at offset 80 (directly after the header). The problem with this is that the length of this list can only be known after all entries have been added to the ZIM file. Maintaining this behavior forces any library writing ZIM files to either:

reserve a lot of space (which may - in a hypothetical extreme case - not even be enough space) for the mimetypelist or
write the rest of the ZIM file content (entries, clusters, ...) somewhere else first (or worse, keep it in RAM), combining it to a full file only after all other data is ready. This involves unnecessary I/O and is IMHO quite inefficient.

Removing the requirement that the mimetypelist must be written at offset 80 would allow the the writing library to immediately begin writing the content to the ZIM file and writing the mimetypelist only once all entries have been written (and thus their mimetypes are known).

The ZIM format specification already has a header field reserved for setting the offset of the mimetypelist, which provides all the features needed to remove this requirement.

The advantage of this changes are as follows:

it allows writers to write the mimetypelist once all mimetypes are known. This, in turn, means that the writer can immediately begin writing clusters and entries without having to either reserve space, which also implies further advantages:
- writers do not have to reserve a specific amount of space for the mimetypelist. As such a reservation is unlikely to guess the required space exactly right, such reservation are either wasting space or (significantly worse) be not enough. mimetypes can (according to RFC 6838) theoretically have a size of up to 255 bytes. Adding the requirement for the zero-termination, this means that 1024 bytes (which is apparently the common reservation) would, in a worst case scenario, be limited to floor((1024-1) / (255 + 1)) = 3 mimetypes. Obviously, such a worst-case scenario will never happen, but it highlights how quickly one could run out of a reserved space.
- alternatively, writers that do not reserve disk space for the mimetypelist can avoid assembling the ZIM file only once all entries are known and directly begin writing the entries, avoiding uneccessary disk/IO and/or memory usage to buffer these entries before the ZIM file is assembled.
It makes the header field mimeListPos relevant by allowing ZIM library implementations to actually utilize the functionality and flexibility this field allows one rather than enforcing a fixed value for this field.
More importantly, it is integral for allowing ZIM file updating/patching/editing in-place. As during updating/editing/patching of a ZIM file entries with new mimetypes may be added to the library, the previously allocated space for the mimetypelist may not be enough, which would prevent such updates to happen. Of course, one could simply apply the patch to a second, seperate file or move all of the file content backwards (and thus also adjusting the offsets) to make space, but both of these solutions are rather inefficient and may not be doable on more restricted hardware like mobile phones.

The proposal

The proposal of this issue is rather simple: Remove the requirement that the mimetypelist must be at offset 80. As the current ZIM file specification already contains the mechanism to specify the offset of the mimetypelist, no further changes other than the removal of this constraint would be needed to adjust the specification itself.

Compatibility concerns

Of course, the biggest question is the impact of this change on the whole ZIM&kiwix ecosystem. The following subsections will discuss the compatibiltiy impact on various parts of the aforementioned ZIM&kiwix ecosystem to the extend I can assess it.

Backwards compatibility to older ZIM files

This change should not cause older ZIM files to become incompatible in any way.

Backward compatibility with older ZIM libraries

Some libraries, like libzim, have checks to ensure that the offset of the mimetypelist must be 80. Some libraries try to infer more informations based on the relative position of the mimetypelist to other parts of the same ZIM file. Older versions of ZIM libraries that follow the aforementioned behavior and are not adjusted to the new behavior (as well as any applications relying on such libraries) would not be capable of reading ZIM archives featuring a mimetypelist at a different offset.

However, this will only apply to ZIM files actually using a different offset. Even with the changes to the offset restrictions, any newly created ZIM files still using the an offset of 80 for the mimetypelist would remain backwards compatible to older ZIM libraries. As I doubt there will be a signficant release of ZIM files using a different offset for the mimetypelist in the foreseeable future, there should be more than enough time for the various applications and libraries to update to support the new behavior. Furthermore, AFAIK the vast majority of ZIM files are build using libzim (or a wrapper/binding/...) and I doubt it will change where it writes the mimetypelist any time soon.

Changes necessary for implementations of the ZIM standard

Generally speaking, a ZIM library would not need to implement any changes provided that:

It doesn't check that the mimetypelist is at offset 80 and refuse to read any such files
It reads the mimetypelist from the offset specified in the header
It does not try to infer any information from the position of the mimetypelist relative to other parts of a ZIM archive.

Fixing 1 and 2 should be mostly trivial for any library: one would simply have to remove any checks and change the fixed offset to the header field. Point 3 would be a bit more complicated to adjust, as - depending on the language used - this may require a different memory allocation method. If I am reading the specification correctly, this behavior would not be conform to said specification, as there does not seem to be any guaranteed order of the components. Of course, I could always simply misunderstand something.

The following is an evaluation of all (or at least most) libraries (and applications using their own library) listed on the Readers page in the openzim wiki regarding their compatibility for this change as well as the potentially required work.

libzim

To my understanding of the libzim source code, libzim already reads from the offset specified in the header, but refuses to read such files. This check (3 lines of code) would need to be removed. (Relevant code parts: [here](https://github.com/openzim/libzim/blob/main/src/fileheader.cpp#L119))

A bit more problematic is that libzimtries to infer information about the size of the mimetypelist by finding the start of the next part of the ZIM file (url ptr list, cluster poitner list, ...). Unless I am misunderstanding something in the specification, there is no guarantee in the zim specification that the mimetype list is not the very last component nor at any "weird" place like between cluster 3 and 4. To my understanding, the proper way to read the mimetype list would be to read until an empty zero-terminated string is encountered (like libzim already does), but not to make any assumptions over the size of the mimetypelist. (Relevant code parts: [here](https://github.com/openzim/libzim/blob/main/src/fileimpl.cpp#L326) and [here](https://github.com/openzim/libzim/blob/main/src/fileimpl.cpp#L309)).

My knowledge of the various C languages isn't particularly strong, but I think the library could be easily adjusted by removing the check and simply continuing to read until the file end (or a sufficiently large buffer). The while loop already checks for the proper end condition (empty zero-terminated string).

kiwix-js

kiwix-js does not seem to enforce the offset 80 for the mimetypelist. (Relevant code snippet: [here](https://github.com/kiwix/kiwix-js/blob/main/www/js/lib/zimfile.js#L472)).

Just like libzim, kiwix-js seems to assume that the URL pointer list must directly follow the mimetypelist directly and uses this to infer the size of the mimetypelist. Unlike libzim, there is a strict limit of length 1024 for the mimetypelist and no other components are checked (e.g. cluster pointer list). (Relevant code parts: [here](https://github.com/kiwix/kiwix-js/blob/main/www/js/lib/zimfile.js#L426)).

ZIMply

ZIMply would already support archives with a different offset (Relevant code snippets: [here](https://github.com/kimbauters/ZIMply/blob/master/zimply/zimply.py#L220) and [here](https://github.com/kimbauters/ZIMply/blob/master/zimply/zimply.py#L398))

zimpy.py

zimpy.py would already support this change (Relevant code snippets: [here](https://github.com/iiab/internet-in-a-box/blob/master/iiab/zimpy.py#L382) and [here](https://github.com/iiab/internet-in-a-box/blob/master/iiab/zimpy.py#L363)).

zimreader-java

zimreader-java relies on the mimetypelist directly following the header. Adding reader.seek(mHeader.mimeListPos) before line 119 would trivially adjust the library to support it. Then again, it is archived so probably not in active developement anymore?

zimHttpServer.pl

zimHttpServer.pl would already support such archives (Relevant Code section: [here]https://sourceforge.net/p/kiwix/tools/ci/master/tree/tools/scripts/zimHttpServer.pl#l24()). The same applies to the two forks listed in the wiki.

gozim

I think it would already support different offsets (Relevant code sections: [here](https://github.com/akhenakh/gozim/blob/master/zim.go#L78)), although it too seems to contain a hard-limit on the length of the mimetypelist of 2048.

ZIM.pm

ZIM.pm seems to already support such archives (Relevant code section: [here](https://github.com/Spiritdude/ZIM/blob/master/ZIM.pm#L96)).

The ZIM to fs dump tool in Rust

This tool relies on the mimetypelist directly following the header, but would otherwise correctly detect the end of the mimetypelist. A simple seek at the right position should make the library compatible (Relevant section: [here](https://github.com/dignifiedquire/zim/blob/master/src/zim.rs#L223)).

pyzim-tools

This library would already support such files (Relevant code snippets: [here](https://github.com/kymeria/pyzim-tools/blob/master/pyzim/structs.py#L75)).

Summary

This proposal listed potential advantages of removing the requirement to write mimetypelists at offset 80. It evaluated various ZIM libraries for their compatibility with this changes. Multiple libraries were found to be already capable of reading ZIM files with such changes. However, multiple libraries were also found to be incompatible with such a change as well. During the evaluation of these libraries, it was found that some libraries make some rather interesting assumptions of the mimetypelist (such as hard-capping it to length 1024), but most would require only simple adjustments. While this change may not provide enough advantages to justify breaking compatibility of newer ZIM files (at least those that would make use of this different behavior), it may be relatively trivial to implement the change alongside the next more important compatibility breaking change.

Acknowledgement

I want to use this opportunity to say my thanks to the openZIM team for maintaining an easy to understand and really helpful documentation of the file format specification on the wiki. Also many thanks to those who helped guide me on the kiwix slack.

mgautierfr commented 11 months ago

It appears that this requirement has been added a long time (the October 17th 2010 exactly) (https://wiki.openzim.org/w/index.php?title=ZIM_file_format&type=revision&diff=1094&oldid=1093) (https://github.com/openzim/libzim/commit/1c993ac4b53cd1fa3e7ecb94e74bfd443af60259) and it was to know if zim file has a checksum or not. (Probably without breaking the zim format version).

If mimetypePos is 72, we don't have a checksumPtr. If mimetypePos is 80, we have a checksumPtr. In fact, there was a implementation putting mimetypePos just after the header, and we have use that to give the header size. This became a requirement as else, we could have put mimetypePos anywhere but 72 even if the header has no checksumPtr.

This sound to me like a hack, but we had to stick with it since.

I think this requirement is indeed unnecessary:

First, as now we always have a checksum
Second, if we add a new field in the header, we will change the zim format version. We don't use the mimetypePos for that.

But if we do this:

We must first handle this correctly in libzm (as it is the reference library)
We will increase zim format major version as this is a breaking change (existing implementations may crash)

kelson42 commented 10 months ago

openzim / libzim

[ZIM specification] Proposal: remove the requirement that the mimetype list has to be located at offset 80 #822