Closed IMayBeABitShy closed 8 months ago
Here's the hexdump of the ZIM:
hexdump test.zim -C
00000000 5a 49 4d 04 06 00 01 00 95 ca ca c0 88 31 c2 4d |ZIM..........1.M|
00000010 b1 6c d9 3f 1c 44 db 40 0d 00 00 00 03 00 00 00 |.l.?.D.@........|
00000020 ac 0b 00 00 00 00 00 00 60 0b 00 00 00 00 00 00 |........`.......|
00000030 94 0b 00 00 00 00 00 00 50 00 00 00 00 00 00 00 |........P.......|
00000040 0a 00 00 00 ff ff ff ff 14 0c 00 00 00 00 00 00 |................|
00000050 61 70 70 6c 69 63 61 74 69 6f 6e 2f 6f 63 74 65 |application/octe|
00000060 74 2d 73 74 72 65 61 6d 2b 7a 69 6d 6c 69 73 74 |t-stream+zimlist|
00000070 69 6e 67 00 74 65 78 74 2f 68 74 6d 6c 00 74 65 |ing.text/html.te|
00000080 78 74 2f 70 6c 61 69 6e 00 69 6d 61 67 65 2f 70 |xt/plain.image/p|
00000090 6e 67 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |ng..............|
000000a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000850 00 00 00 58 00 00 00 00 02 00 00 00 00 00 00 00 |...X............|
00000860 6c 69 73 74 69 6e 67 2f 74 69 74 6c 65 4f 72 64 |listing/titleOrd|
00000870 65 72 65 64 2f 76 30 00 6c 69 73 74 69 6e 67 2f |ered/v0.listing/|
00000880 74 69 74 6c 65 4f 72 64 65 72 65 64 2f 76 30 00 |titleOrdered/v0.|
00000890 05 28 b5 2f fd 00 58 85 07 00 a2 ce 33 35 60 6b |.(./..X.....35`k|
000008a0 d3 18 30 35 04 dd 11 14 24 35 a0 0b 78 3d 48 62 |..05....$5..x=Hb|
000008b0 cd 83 df 2f db 37 6c 57 47 aa 00 00 3e f6 a3 c7 |.../.7lWG...>...|
000008c0 d4 e7 b8 cd 16 ed c4 0a 0d 13 16 2b 2c 52 1a 3a |...........+,R.:|
000008d0 70 bd 53 12 1b 8b 06 95 99 d6 7f c7 d6 c9 37 1c |p.S...........7.|
000008e0 78 dd c6 ff 2f a1 da ba 91 95 f8 df e3 10 b9 10 |x.../...........|
000008f0 d1 0a 65 11 fc 1e db e6 e3 07 08 80 3f fc 35 54 |..e.........?.5T|
00000900 65 4d 7e 89 80 64 a2 90 b9 28 1c b6 c9 61 a1 ca |eM~..d...(...a..|
00000910 41 ec 62 59 68 b5 07 ae b3 98 a7 76 36 a3 08 90 |A.bYh......v6...|
00000920 02 0d 28 07 13 9c 46 11 1f ad 6a 75 24 49 51 5e |..(...F...ju$IQ^|
00000930 52 c4 76 e3 2c a8 2e a8 66 54 5e 51 32 50 0b 16 |R.v.,...fT^Q2P..|
00000940 29 4a 11 63 35 2c e6 f9 68 1c d1 94 f3 42 2d 1a |)J.c5,..h....B-.|
00000950 a5 66 52 95 59 45 79 c1 d0 2a fe 9b fe 90 1f e4 |.fR.YEy..*......|
00000960 ef fe 8c 1f e3 b7 f8 2b fe 89 bf 20 0b 00 68 20 |.......+... ..h |
00000970 03 e9 d5 38 34 8a 72 cf 0a da 79 d2 36 dc d8 50 |...84.r...y.6..P|
00000980 18 9a 6a 5c a3 48 53 6c 98 51 01 00 00 43 00 00 |..j\.HSl.Q...C..|
00000990 00 00 00 00 00 00 00 00 00 00 68 6f 6d 65 2e 68 |..........home.h|
000009a0 74 6d 6c 00 57 65 6c 63 6f 6d 65 21 00 02 00 00 |tml.Welcome!....|
000009b0 4d 00 00 00 00 00 00 00 00 01 00 00 00 4e 61 6d |M............Nam|
000009c0 65 00 4e 61 6d 65 00 02 00 00 4d 00 00 00 00 00 |e.Name....M.....|
000009d0 00 00 00 02 00 00 00 54 69 74 6c 65 00 54 69 74 |.......Title.Tit|
000009e0 6c 65 00 02 00 00 4d 00 00 00 00 00 00 00 00 03 |le....M.........|
000009f0 00 00 00 43 72 65 61 74 6f 72 00 43 72 65 61 74 |...Creator.Creat|
00000a00 6f 72 00 02 00 00 4d 00 00 00 00 00 00 00 00 04 |or....M.........|
00000a10 00 00 00 50 75 62 6c 69 73 68 65 72 00 50 75 62 |...Publisher.Pub|
00000a20 6c 69 73 68 65 72 00 02 00 00 4d 00 00 00 00 00 |lisher....M.....|
00000a30 00 00 00 05 00 00 00 44 61 74 65 00 44 61 74 65 |.......Date.Date|
00000a40 00 02 00 00 4d 00 00 00 00 00 00 00 00 06 00 00 |....M...........|
00000a50 00 44 65 73 63 72 69 70 74 69 6f 6e 00 44 65 73 |.Description.Des|
00000a60 63 72 69 70 74 69 6f 6e 00 02 00 00 4d 00 00 00 |cription....M...|
00000a70 00 00 00 00 00 07 00 00 00 4c 61 6e 67 75 61 67 |.........Languag|
00000a80 65 00 4c 61 6e 67 75 61 67 65 00 03 00 00 4d 00 |e.Language....M.|
00000a90 00 00 00 00 00 00 00 08 00 00 00 49 6c 6c 75 73 |...........Illus|
00000aa0 74 72 61 74 69 6f 6e 5f 34 38 78 34 38 40 31 00 |tration_48x48@1.|
00000ab0 49 6c 6c 75 73 74 72 61 74 69 6f 6e 5f 34 38 78 |Illustration_48x|
00000ac0 34 38 40 31 00 ff ff 00 57 00 00 00 00 00 00 00 |48@1....W.......|
00000ad0 00 6d 61 69 6e 50 61 67 65 00 57 65 6c 63 6f 6d |.mainPage.Welcom|
00000ae0 65 21 00 ff ff 00 43 00 00 00 00 00 00 00 00 72 |e!....C........r|
00000af0 65 64 69 72 65 63 74 00 57 65 6c 63 6f 6d 65 21 |edirect.Welcome!|
00000b00 00 05 28 b5 2f fd 00 58 61 00 00 08 00 00 00 0c |..(./..Xa.......|
00000b10 00 00 00 00 00 00 00 00 00 00 58 00 00 00 00 01 |..........X.....|
00000b20 00 00 00 00 00 00 00 6c 69 73 74 69 6e 67 2f 74 |.......listing/t|
00000b30 69 74 6c 65 4f 72 64 65 72 65 64 2f 76 31 00 6c |itleOrdered/v1.l|
00000b40 69 73 74 69 6e 67 2f 74 69 74 6c 65 4f 72 64 65 |isting/titleOrde|
00000b50 72 65 64 2f 76 31 00 01 08 00 00 00 3c 00 00 00 |red/v1......<...|
00000b60 01 00 00 00 00 00 00 00 02 00 00 00 03 00 00 00 |................|
00000b70 04 00 00 00 05 00 00 00 06 00 00 00 07 00 00 00 |................|
00000b80 08 00 00 00 09 00 00 00 0a 00 00 00 0b 00 00 00 |................|
00000b90 0c 00 00 00 90 08 00 00 00 00 00 00 01 0b 00 00 |................|
00000ba0 00 00 00 00 57 0b 00 00 00 00 00 00 8a 09 00 00 |....W...........|
00000bb0 00 00 00 00 e3 0a 00 00 00 00 00 00 e3 09 00 00 |................|
00000bc0 00 00 00 00 27 0a 00 00 00 00 00 00 41 0a 00 00 |....'.......A...|
00000bd0 00 00 00 00 8b 0a 00 00 00 00 00 00 69 0a 00 00 |............i...|
00000be0 00 00 00 00 ad 09 00 00 00 00 00 00 03 0a 00 00 |................|
00000bf0 00 00 00 00 c7 09 00 00 00 00 00 00 c5 0a 00 00 |................|
00000c00 00 00 00 00 50 08 00 00 00 00 00 00 17 0b 00 00 |....P...........|
00000c10 00 00 00 00 75 8f 4c 66 32 67 a5 95 20 49 2a d9 |....u.Lf2g.. I*.|
00000c20 ea 55 52 8a |.UR.|
00000c24
@IMayBeABitShy Thank you for the detailed report. We are interested to understand the case and help you... but the answer doesn't seem super obvious. Give us a bit of time to come back to you.
Implementing #572 might help here maybe
Not sure why it believes this is a legacy ZIM DirListing (I think I followed the standard). Expanding the collapsed object
Regarding how the PWA detects legacy ZIM listings, it should mean that this is a ZIM that uses namespaces (like A/index.html, -/some_stylesheet.css, I/an_image.webp), as opposed to being a type 1 ZIM (one that has all user content under a C/ namespace, and doesn't distinguish amongst images, stylesheets, HTML, etc.). It doesn't mean that the ZIM is non-conformant, it just means that it adheres to an earlier openZIM spec.
Having said that, the ZIMFile object looks like it's missing the articleCount and articlePtrPosition info. This is what a typical mwOffliner-produced ZIM looks like (also uses legacy ZIM format):
It's possible the fields weren't populated at the time you took your snapshot, or that an older version of the PWA displayed the object too soon in console. Might be worth double-checking, though, in the latest PWA (2.7.2+), that those properties are populated (check after the landing page has loaded).
That is interesting, thank you for the explanation. How is this check performed? This should be a ZIM with the newer namespace behavior (minor version 1, content in the C namespace). The URLs (from the URL pointer list) are as follows:
0 -> 2442: b'Chome.html'
1 -> 2787: b'Credirect'
2 -> 2531: b'MCreator'
3 -> 2599: b'MDate'
4 -> 2625: b'MDescription'
5 -> 2699: b'MIllustration_48x48@1'
6 -> 2665: b'MLanguage'
7 -> 2477: b'MName'
8 -> 2563: b'MPublisher'
9 -> 2503: b'MTitle'
10 -> 2757: b'WmainPage'
11 -> 2128: b'Xlisting/titleOrdered/v0'
12 -> 2839: b'Xlisting/titleOrdered/v1'
So the "C" namespace is used for content, "X" for the indexes, "W" for well-known entries and "M" for metadata.
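To make the listing above easier to compare with the spec's notation, here is a small Python sketch that splits such a key into its one-byte namespace and its path. The function names are hypothetical, and it assumes the key is simply the namespace byte followed by the UTF-8 path, as in the printed output.

```python
from typing import Tuple


def split_entry_key(raw: bytes) -> Tuple[str, str]:
    """Split a key such as b'Xlisting/titleOrdered/v1' into (namespace, path)."""
    if not raw:
        raise ValueError("empty entry key")
    namespace = chr(raw[0])         # the namespace is exactly one byte
    path = raw[1:].decode("utf-8")  # everything after it is the path
    return namespace, path


def display_form(raw: bytes) -> str:
    """Render the key in the <namespace>/<path> notation used in the docs."""
    namespace, path = split_entry_key(raw)
    return f"{namespace}/{path}"


print(display_form(b"Xlisting/titleOrdered/v1"))
```

The `/` separator only appears in `display_form`; it is not part of the stored key.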
I've re-checked the output of the PWA version. The article-related fields still aren't populated and I am fairly sure it is fully loaded. Also, the "random article" button took me to a metadata field, so something seems to be wrong with the article index... This should be the entry at Xlisting/titleOrdered/v1 and behave like the v0 index, with the exception that only article titles are included, right?
Hmm, there is a byte set in the ZIM file header that indicates whether the ZIM is type 0 or type 1. It's the minorVersion field, and it is extracted here: https://github.com/kiwix/kiwix-js-windows/blob/main/www/js/lib/zimfile.js#L488 .
Your minorVersion is indeed set to type 1, but the logic that decides whether it is using legacy listing or not is here:
https://github.com/kiwix/kiwix-js-windows/blob/main/www/js/lib/zimfile.js#L366
The legacy listing is a kind of fallback, so should work for any ZIM, so it gets used if the app can't find the X/listing... By the way that should be in the X/ namespace (you wrote Xlisting above, probably just a typo).
I'm not sure if labelling the title listing as legacy is an inaccuracy in the PWA, or some problem with the ZIM format, though if it's not readable by Kiwix Serve, it would suggest the latter.
By the way, you can (in modern Firefox or Chromium) get a debuggable (unminified) version of the PWA by going to https://kiwix.github.io/kiwix-js-windows/ (ignore the Repo title, it's not Windows-specific). This should allow you to pause on those lines during ZIM loading in case it helps you to debug the format.
X/listing... By the way that should be in the X/ namespace (you wrote Xlisting above, probably just a typo).
About that: the specification always uses <namespace>/<path>, but all ZIMs I've analyzed (e.g. an askubuntu one) use <namespace><path>. Perhaps this is the problem? Should these paths start with a /? The aforementioned askubuntu ZIM also has the following URLs, so I had assumed that the / in the documentation was just for readability:
1488425 -> 2285351818: b'Cusers_page=97'
1488426 -> 2285351849: b'Cusers_page=98'
1488427 -> 2285351880: b'Cusers_page=99'
1488428 -> 2285351911: b'MCounter'
1488429 -> 2285351936: b'MCreator'
1488430 -> 2285351961: b'MDate'
1488431 -> 2285351983: b'MDescription'
1488432 -> 2285352012: b'MFaviconPath'
1488433 -> 2285352041: b'MIllustration_48x48@1'
1488434 -> 2285352079: b'MIllustration_96x96@1'
1488435 -> 2285352117: b'MLanguage'
1488436 -> 2285352143: b'MName'
1488437 -> 2285352165: b'MPublisher'
1488438 -> 2285352192: b'MTags'
1488439 -> 2285352214: b'MTitle'
1488440 -> 2285352237: b'WmainPage'
1488441 -> 2285352259: b'Xfulltext/xapian'
1488442 -> 2285352292: b'Xlisting/titleOrdered/v0'
1488443 -> 2285352333: b'Xlisting/titleOrdered/v1'
1488444 -> 2285352374: b'Xtitle/xapian'
By the way, you can (in modern Firefox or Chromium) get a debuggable (unminified) version of the PWA by going to https://kiwix.github.io/kiwix-js-windows/ (ignore the Repo title, it's not Windows-specific). This should allow you to pause on those lines during ZIM loading in case it helps you to debug the format.
Great, thank you. I'll look into it.
It looks like that might be the problem... The PWA at least assumes the title listing is in the X/ namespace, based on the OpenZIM spec, and we certainly find it there in most ZIMs. However, as we use a fallback if we don't find the X/titleOrdered/v0 or /v1 listings, this would not show up as an obvious error.
In the listing you provide above, I notice that the / seems to be systematically missing. Metadata are in the M/ namespace, so MIllustration_48x48@1 should definitely be M/Illustration_48x48@1... It may just be a display fluke, however, rather than an actual coding error in the AskUbuntu ZIM.
Sorry, I may have phrased that poorly. The entry is inside the X namespace (no /, namespaces are always 1 byte) and at path titleOrdered/v1. It seems like the whole '/' thing is just display-related. At least, kiwix js finds the entry for the v1 title index correctly as far as I can tell:
_zimfile: {…}
blob: 0
cluster: 1
mimetypeInteger: 0
namespace: "X"
offset: 2839
redirect: false
redirectTarget: undefined
title: "listing/titleOrdered/v1"
url: "listing/titleOrdered/v1"
But it seems like the next function:
).then(function (metadata) {
    // Note that we do not accept a listing if its size is 0, i.e. if it contains no data
    // (although this should not occur, we have been asked to handle it - see kiwix-js #708)
    if (metadata && metadata.size) {
        that[listing.ptrName] = metadata.ptr;
        that[listing.countName] = metadata.size / 4; // Each entry uses 4 bytes
        highestListingVersion = Math.max(~~listing.path.replace(/.+(\d)$/, '$1'), highestListingVersion);
    }
    // Get the next Listing
    return listingAccessor(listings.pop());
seems to receive null as the value of metadata. So for some reason return that.blob(dirEntry.cluster, dirEntry.blob, true); seems to return null...
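To illustrate the "each entry uses 4 bytes" step in the snippet above, here is a minimal Python sketch that turns the raw bytes of a title listing into entry indices. It assumes the listing is a flat array of little-endian 32-bit integers; `decode_title_listing` is a hypothetical name, and the sample reproduces what appear to be the 52 bytes at offset 0xb60 in the hexdump (where the header's title pointer position points).

```python
import struct
from typing import List


def decode_title_listing(blob: bytes) -> List[int]:
    """Decode a title listing as little-endian uint32 entry indices."""
    if len(blob) % 4:
        raise ValueError("listing size is not a multiple of 4 bytes")
    count = len(blob) // 4  # same arithmetic as metadata.size / 4
    return list(struct.unpack(f"<{count}I", blob))


# Sample: the 13 indices as they appear from offset 0xb60 in the dump.
SAMPLE_LISTING = b"".join(
    index.to_bytes(4, "little")
    for index in [1, 0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
)

indices = decode_title_listing(SAMPLE_LISTING)
print(len(indices), indices[:3])
```

A size of 0, or a null result as in this case, leaves nothing to decode, which is why the caller has to fall back to another listing.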
It may just be a display fluke, however, rather than an actual coding error in the AskUbuntu ZIM.
Yes, I think so too.
I think I may have found the problem. The v1 title index is in a compressed cluster. In kiwix JS, this results in a problem in the following code section:
// If only metadata were requested and the cluster is compressed, return null (this is probably a ZIM format error)
// DEV: This is because metadata are only requested for finding absolute offsets into uncompressed clusters,
// principally for finding the start and size of a title pointer listing
if (meta && compressionType[0] > 1) return null;
Where compressionType[0]=5 and meta=true. This results in the function I've mentioned above being called with metadata=null. I am still investigating whether this is also the cause of the problem with kiwix-serve or unrelated. Seems like I've missed the part where these indexes must be stored uncompressed.
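A writer can run the same check before emitting a listing. The Python sketch below assumes that the first byte of a cluster carries the compression type in its low four bits and an "extended offsets" flag in bit 0x10, as I understand the openZIM spec; the function names are hypothetical. It mirrors the `compressionType[0] > 1` test quoted above.

```python
from typing import Tuple


def cluster_info(info_byte: int) -> Tuple[int, bool]:
    """Return (compression type, extended flag) from a cluster's first byte."""
    compression = info_byte & 0x0F     # assumed: low 4 bits hold the type
    extended = bool(info_byte & 0x10)  # assumed: bit 5 marks 8-byte offsets
    return compression, extended


def is_compressed(info_byte: int) -> bool:
    """Same rule as the kiwix-js check: any type above 1 is compressed."""
    return cluster_info(info_byte)[0] > 1


def listing_cluster_ok(info_byte: int) -> bool:
    """Indexes and listings must live in an uncompressed cluster."""
    return not is_compressed(info_byte)


# In the dump, the cluster at 0xb01 starts with 05 (followed by the zstd
# magic 28 b5 2f fd), while the cluster at 0xb57 starts with 01.
print(listing_cluster_ok(0x05), listing_cluster_ok(0x01))
```

With the dump's values, the cluster starting with 05 fails the check and the one starting with 01 passes.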
Ah yes, that rings a bell! Basically, X/titleOrdered/v0 is (currently) just a wrapper around the legacy titlePtrList, so that legacy apps can still find a title list even if they don't look in the X namespace.
I'm not sure if it's permissible to compress X/titleOrdered/v1 -- maybe @mgautierfr, who is the expert here, can comment? Clearly we didn't handle that possibility in our backend (yet) because there are (or were) no ZIMs with compressed title lists. My suspicion is that the overhead of decompressing what can be potentially huge lists (even decompressing on the fly) for potentially thousands of lookups during binary search would make this quite problematic.
I'm not sure if it's permissible to compress X/titleOrdered/v1 -- maybe @mgautierfr, who is the expert here, can comment? Clearly we didn't handle that possibility in our backend (yet) because there are (or were) no ZIMs with compressed title lists.
It's unfortunately not. I've just missed the following line in the spec:
All indexes and listing items MUST be stored in uncompressed cluster.
Using xxd -r <your_dump> example.zim to recreate the zim file from your dump.
$ zimcheck --all --details example.zim
[INFO] Checking zim file example.zim
[INFO] Zimcheck version is 3.2.0
[INFO] Verifying ZIM-archive structure integrity...
mimelistPos must be 80.
[ERROR] ZIM file's low level structure is invalid
mimelistPos must be 80.
This is consistent with https://github.com/openzim/libzim/issues/822, where you asked if we could remove the constraint on mimelistPos being 80.
Are you sure you are using the right zim file (and not a patched version of zimcheck)?
I assure you, I most definitely did not do anything that involved writing or modifying C/C++ code ;)
The ZIM file above used 2 KiB of reserved space for the mimetype list at offset 80 (thus the many zero bytes). The problem here is that hexdumps created with hexdump -C apparently cannot be reversed with xxd -r. I get an invalid magic number when trying to do so, but when using xxd to create the dump and recreate the file, zimcheck passes.
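In case someone else needs to rebuild the file from the `hexdump -C` text at the top of this issue, here is a minimal Python sketch of a converter. It assumes the usual `hexdump -C` layout: an offset, up to 16 hex bytes, an ASCII column between `|` characters, a lone `*` for squeezed repeats of the previous row, and a final line holding only the total size. `parse_hexdump_c` is a name invented for this example.

```python
def parse_hexdump_c(text: str) -> bytes:
    """Rebuild binary data from `hexdump -C` output, expanding `*` lines."""
    out = bytearray()
    last_row = b""
    squeezed = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "*":  # the previous row repeats up to the next offset
            squeezed = True
            continue
        fields = line.split("|", 1)[0].split()  # drop the ASCII column
        offset = int(fields[0], 16)
        if squeezed and last_row:
            while len(out) < offset:
                out += last_row
            del out[offset:]  # guard against overshooting the offset
        squeezed = False
        row = bytes(int(field, 16) for field in fields[1:])
        if row:  # the final line carries only the size
            out += row
            last_row = row
    return bytes(out)


SAMPLE_DUMP = """\
00000000  5a 49 4d 04 06 00 01 00  95 ca ca c0 88 31 c2 4d  |ZIM..........1.M|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000040  0a 00 00 00                                       |....|
00000044
"""

data = parse_hexdump_c(SAMPLE_DUMP)
print(len(data), data[:4])
```

The `*` handling matters for this file in particular, because the reserved mimetype-list space is almost entirely squeezed zero rows.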
It's most likely the compressed title indexes, but I am still checking if that is also the problem with kiwix-serve and not only for the PWA.
Can you upload the zim file here? (You may have to rename it to zip to let GitHub accept it.)
Sure, here you go: test.zip
I confirm the PWA can't populate articleCount or articlePtrPosition, but falls back to the legacy v0 listing. This has anomalous effects, because it's treating the ZIM as some kind of hybrid, so shows a bunch of stuff in the title index which shouldn't be there. That's a side effect (probably of it being a v1 ZIM which can't read the X/titleOrdered/v1 listing). It explains why the random button shows metadata sometimes.
FYI In the PWA, you can access what it thinks is the title list by pressing a space in the search field. And you can access the full URL list, including namespaces, by typing space + / (space followed by /) as shown below:
It is missing the M/Counter entry.
It is described as not mandatory in the spec, but libkiwix expects it and fails if it cannot find it.
This probably passed under our radar because libzim itself creates it, so scrapers don't have to add it.
But the ZIM is still incorrectly formatted if X/listing/titleOrdered/v1 is in a compressed cluster, right? So, probably if it had that M/Counter entry, then libzim should fall back to using the v0 title index...
But the ZIM is still incorrectly formatted if X/listing/titleOrdered/v1 is in a compressed cluster, right?
Yes. The cluster should not be compressed.
So, probably if it had that M/Counter entry, then libzim should fall back to using the v0 title index...
No, libzim is nice here (https://github.com/openzim/libzim/blob/main/src/fileimpl.cpp#L250-L254).
So it should pass and libzim will use the information in the header to locate the titleIndex (directly from the header, without using X/listing/titleOrdered/v0).
OK, thanks for confirming -- I think libzim and the KJS backend are in accordance, then (except for the former requiring M/Counter). KJS backend falls back from v1 to v0 to titlePtrPos in the header.
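The fallback order described here can be sketched in a few lines of Python. This is only an illustration of the v1, then v0, then header order, not the actual kiwix-js or libzim code; `pick_title_listing` and the shape of its `listings` argument (path mapped to an offset, or `None` when the listing is missing or unreadable) are assumptions made for the example.

```python
from typing import Mapping, Optional, Tuple

# Paths tried in order of preference, newest listing first.
LISTING_PATHS = ("listing/titleOrdered/v1", "listing/titleOrdered/v0")


def pick_title_listing(
    listings: Mapping[str, Optional[int]], title_ptr_pos: int
) -> Tuple[str, int]:
    """Return (source, offset) for the first usable title listing."""
    for path in LISTING_PATHS:
        offset = listings.get(path)
        if offset is not None:
            return path, offset
    # Nothing usable in the X namespace: use the header pointer.
    return "header.titlePtrPos", title_ptr_pos


# Hypothetical situation matching this issue: v1 is unreadable (None)
# because it sits in a compressed cluster, v0 is readable.
source, offset = pick_title_listing(
    {"listing/titleOrdered/v1": None, "listing/titleOrdered/v0": 2912},
    title_ptr_pos=2912,
)
print(source, offset)
```

Because the fallback is silent, an unreadable v1 listing shows up only indirectly, for example as metadata entries appearing in the title index.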
Yes. The cluster should not be compressed.
Should we not check this in zimcheck?
Yes
I can confirm the issue with kiwix-serve has been fixed. Thank you all for your help.
Regarding the bug with the PWA/kiwix-js: Unfortunately, this seems to remain even when the X/ namespace is properly stored uncompressed. However, this is an unrelated issue and I should hopefully be capable of debugging that myself.
Once again, thank you for your helpful comments and the quick fix.
Hello again,
As I've mentioned on the slack channel, I've encountered a potential bug when trying to serve a ZIM created by a custom zim writer library. I had initially assumed that this was a bug in my library, but zimcheck passes and both the kiwix-desktop appimage and the PWA are able to read the ZIM file.
ZIM creation
The ZIM file has been created using this library, more specifically this file.
Zimcheck
kiwix-serve
kiwix-serve is unable to open the file but does not give a specific error. It works fine with askubuntu.com_en_all_2022-11.zim.
kiwix-desktop (app image)
The recent app image (invoked via ./kiwix-desktop_x86_64_2.3.1-4.appimage /tmp/test.zim) is capable of reading the ZIM without any problems.
kiwix-desktop (from apt)
This is where it gets interesting: using the old version installed via the package manager fails:
However, if we set $ZIM_DIRENTLOOKUPCACHE=1, it works without issues. Of course, this is the old, outdated package from the package manager and this may be entirely unrelated, but it provided the most interesting debug output so far.
kiwix PWA
As mentioned, the kiwix PWA is capable of reading the ZIM without issue. Here is the console log:
Not sure why it believes this is a legacy ZIM DirListing (I think I followed the standard). Expanding the collapsed object:
Additional info:
ZIM header and metadata:
System