kiwix / kiwix-tools

Command line Kiwix tools: kiwix-serve, kiwix-manage, ...
https://download.kiwix.org/release/kiwix-tools/
GNU General Public License v3.0

kiwix-serve is unable to add ZIM to internal library despite zimcheck passing #640

Closed IMayBeABitShy closed 8 months ago

IMayBeABitShy commented 9 months ago

Hello again,

As I've mentioned on the Slack channel, I've encountered a potential bug when trying to serve a ZIM created by a custom ZIM writer library. I had initially assumed that this was a bug in my library, but zimcheck passes and both the kiwix-desktop AppImage and the PWA are able to read the ZIM file.

ZIM creation

The ZIM file has been created using this library, more specifically this file.

Zimcheck

zimcheck --all --details /tmp/test.zim 
[INFO] Checking zim file /tmp/test.zim
[INFO] Zimcheck version is 3.1.3
[INFO] Verifying ZIM-archive structure integrity...
[INFO] Avoiding redundant checksum test (already performed by the integrity check).
[INFO] Searching for metadata entries...
[INFO] Searching for Favicon...
[INFO] Searching for main page...
[INFO] Verifying Articles' content...
[INFO] Searching for redundant articles...
  Verifying Similar Articles for redundancies...
[INFO] Checking for redirect loops...
[INFO] Overall Test Status: Pass
[INFO] Total time taken by zimcheck: <3 seconds.

echo $?
0

kiwix-serve

kiwix-serve is unable to open the file but does not give a specific error.

kiwix-serve --verbose --port 8080 /tmp/test.zim 
Unable to add the ZIM file '/tmp/test.zim' to the internal library.

kiwix-serve --version
kiwix-tools 3.5.0

libkiwix 12.1.0
+ libzim 8.2.1
+ libxapian 1.4.22
+ libcurl 7.67.0
+ libmicrohttpd 0.9.76
+ libz 1.2.12
+ libicu 58.2.0
+ libpugixml 0.12.0

libzim 8.2.1
+ libzstd 1.5.2
+ liblzma 5.2.6
+ libxapian 1.4.22
+ libicu 58.2.0

It works fine with askubuntu.com_en_all_2022-11.zim.

kiwix-desktop (app image)

The recent app image (invoked via ./kiwix-desktop_x86_64_2.3.1-4.appimage /tmp/test.zim) is capable of reading the ZIM without any problems.

kiwix-desktop (from apt)

This is where it gets interesting: using the old version installed via the package manager fails:

kiwix-desktop /tmp/test.zim 
Compiled with Qt Version  5.15.2
Runtime Qt Version  5.15.2
add widget

Assertion failed at ../src/narrowdown.h:119
 entries.empty() || pred(entries.back(), key)[0] == true[1]
/lib/x86_64-linux-gnu/libzim.so.6(_Z15_on_assert_failIbbEvPKcS1_S1_T_T0_S1_i+0x17a) [0x7f6abee938ea]
/lib/x86_64-linux-gnu/libzim.so.6(_ZN3zim12DirentLookupINS_8FileImplEEC2EPS1_j+0x2c5) [0x7f6abee96825]
/lib/x86_64-linux-gnu/libzim.so.6(_ZN3zim8FileImpl12direntLookupEv+0x5a) [0x7f6abee91e9a]
/lib/x86_64-linux-gnu/libzim.so.6(_ZN3zim8FileImpl23getNamespaceBeginOffsetEc+0x8) [0x7f6abee91ff8]
/lib/x86_64-linux-gnu/libzim.so.6(_ZNK3zim4File23getNamespaceBeginOffsetEc+0x10) [0x7f6abee8c110]
/lib/x86_64-linux-gnu/libkiwix.so.9(_ZN5kiwix6ReaderC1ENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xe2) [0x7f6ac8230002]
/lib/x86_64-linux-gnu/libkiwix.so.9(+0x26015) [0x7f6ac8209015]
kiwix-desktop(+0x33fe3) [0x555fdb668fe3]
kiwix-desktop(+0x39af0) [0x555fdb66eaf0]
/lib/x86_64-linux-gnu/libQt5Core.so.5(+0x2e45a6) [0x7f6abfaa65a6]
/lib/x86_64-linux-gnu/libQt5WebEngineWidgets.so.5(_ZN14QWebEngineView10urlChangedERK4QUrl+0x35) [0x7f6ac81cb055]
/lib/x86_64-linux-gnu/libQt5Core.so.5(+0x2e45a6) [0x7f6abfaa65a6]
/lib/x86_64-linux-gnu/libQt5WebEngineWidgets.so.5(_ZN14QWebEnginePage10urlChangedERK4QUrl+0x35) [0x7f6ac81bcc95]
/lib/x86_64-linux-gnu/libQt5WebEngineWidgets.so.5(_ZN14QWebEnginePage6setUrlERK4QUrl+0x46) [0x7f6ac81bfbc6]
kiwix-desktop(+0x2ef0c) [0x555fdb663f0c]
kiwix-desktop(+0x28ab8) [0x555fdb65dab8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7f6abf406d0a]
kiwix-desktop(+0x292ca) [0x555fdb65e2ca]
terminate called after throwing an instance of 'std::runtime_error'
  what():  
Assertion failed at ../src/narrowdown.h:119
 entries.empty() || pred(entries.back(), key)[0] == true[1]
Aborted

However, if we set $ZIM_DIRENTLOOKUPCACHE=1, it works without issues.

ZIM_DIRENTLOOKUPCACHE=1 kiwix-desktop /tmp/test.zim 
Compiled with Qt Version  5.15.2
Runtime Qt Version  5.15.2
add widget
session saved

Of course, this is the old, outdated package from the package manager and this may be entirely unrelated, but it provided the most interesting debug output so far.

kiwix PWA

As mentioned, the kiwix PWA is capable of reading the ZIM without issue. Here is the console log:

This page uses the non standard property “zoom”. Consider using calc() in the relevant property values, or using “transform” along with “transform-origin: 0 0”. index.html
lastPageLoad: failed init.js:207:9
Removing params.lastPageVisit because lastPageLoad failed! init.js:209:13
An iframe which has both allow-scripts and allow-same-origin for its sandbox attribute can remove its sandboxing. index.html
Instantiating WASM xz decoder bundle.min.js:12:157879
Instantiating WASM zstandard decoder bundle.min.js:12:276749
Active Service Worker found, no need to register bundle.min.js:34:72076
Setting storage type to cacheAPI bundle.min.js:12:313882
DEV: 'UnknownError' may be produced as part of localStorage capability detection bundle.min.js:12:314009
Archive type set to: open bundle.min.js:12:296737
ZIM DirListing version: 0 (legacy) 
Object { _files: (1) […], name: "test.zim", id: 0, majorVersion: 6, minorVersion: 1, entryCount: 13, articleCount: null, clusterCount: 3, urlPtrPos: 2988, titlePtrPos: 2912, … }
bundle.min.js:12:285978
Article count is: null bundle.min.js:12:286043
Initiating text/html load of C/home.html... bundle.min.js:34:108456
** HTML received for article home.html ** bundle.min.js:34:119040
Loading stylesheets... bundle.min.js:34:130096
An iframe which has both allow-scripts and allow-same-origin for its sandbox attribute can remove its sandboxing. index.html
The character encoding of a framed document was not declared. The document may appear different if viewed without the document framing it. home.html
Checking for updates to the PWA... bundle.min.js:34:25113

Not sure why it believes this is a legacy ZIM DirListing (I think I followed the standard). Expanding the collapsed object:

Object { _files: (1) […], name: "test.zim", id: 0, majorVersion: 6, minorVersion: 1, entryCount: 13, articleCount: null, clusterCount: 3, urlPtrPos: 2988, titlePtrPos: 2912, … }
_files: Array [ File ]
articleCount: null
articlePtrPos: null
clusterCount: 3
clusterPtrPos: 2964
entryCount: 13
fullTextIndex: null
fullTextIndexSize: null
id: 0
layoutPage: 4294967295
mainPage: 10
majorVersion: 6
mimeListPos: 80
mimeTypes: Map(4) { 0 → "application/octet-stream+zimlisting", 1 → "text/html", 2 → "text/plain", … }
minorVersion: 1
name: "test.zim"
titlePtrPos: 2912
urlPtrPos: 2988
zimType: "open"
<prototype>: Object { _readInteger: _readInteger(e, t), _readSlice: _readSlice(e, t), _readSplitSlice: _readSplitSlice(e, t), … }
bundle.min.js:12:285978

Additional info:

ZIM header and metadata:

===== HEADER =====

Magic number: 72173914
Version: 6 (major) / 1 (minor)
UUID: b6f29204-0218-4980-a69e-531f4a788a26
Content: 13 entries, 3 clusters
Offsets:
    Directory pointer list: 2988
    Cluster pointer list: 2964
    Mime type list: 80
    Checksum: 3092
Pages:
    Main: 10
    Layout: none

======== METADATA ==========

Creator :  pyzim
Date :  2023-10-05
Description :  The ZIM file from the pyzim writer example
Illustration_48x48@1 :  <binary content>
Language :  Eng
Name :  examplezim
Publisher :  pyzim
Title :  Example Zim

========== OTHER ============

Checksum (hex):  861be5edc15f83f2dcf73a1df65678ce

System

uname -a
Linux [devicename] 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux

IMayBeABitShy commented 9 months ago

Here's the hexdump of the ZIM:

hexdump test.zim -C
00000000  5a 49 4d 04 06 00 01 00  95 ca ca c0 88 31 c2 4d  |ZIM..........1.M|
00000010  b1 6c d9 3f 1c 44 db 40  0d 00 00 00 03 00 00 00  |.l.?.D.@........|
00000020  ac 0b 00 00 00 00 00 00  60 0b 00 00 00 00 00 00  |........`.......|
00000030  94 0b 00 00 00 00 00 00  50 00 00 00 00 00 00 00  |........P.......|
00000040  0a 00 00 00 ff ff ff ff  14 0c 00 00 00 00 00 00  |................|
00000050  61 70 70 6c 69 63 61 74  69 6f 6e 2f 6f 63 74 65  |application/octe|
00000060  74 2d 73 74 72 65 61 6d  2b 7a 69 6d 6c 69 73 74  |t-stream+zimlist|
00000070  69 6e 67 00 74 65 78 74  2f 68 74 6d 6c 00 74 65  |ing.text/html.te|
00000080  78 74 2f 70 6c 61 69 6e  00 69 6d 61 67 65 2f 70  |xt/plain.image/p|
00000090  6e 67 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |ng..............|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000850  00 00 00 58 00 00 00 00  02 00 00 00 00 00 00 00  |...X............|
00000860  6c 69 73 74 69 6e 67 2f  74 69 74 6c 65 4f 72 64  |listing/titleOrd|
00000870  65 72 65 64 2f 76 30 00  6c 69 73 74 69 6e 67 2f  |ered/v0.listing/|
00000880  74 69 74 6c 65 4f 72 64  65 72 65 64 2f 76 30 00  |titleOrdered/v0.|
00000890  05 28 b5 2f fd 00 58 85  07 00 a2 ce 33 35 60 6b  |.(./..X.....35`k|
000008a0  d3 18 30 35 04 dd 11 14  24 35 a0 0b 78 3d 48 62  |..05....$5..x=Hb|
000008b0  cd 83 df 2f db 37 6c 57  47 aa 00 00 3e f6 a3 c7  |.../.7lWG...>...|
000008c0  d4 e7 b8 cd 16 ed c4 0a  0d 13 16 2b 2c 52 1a 3a  |...........+,R.:|
000008d0  70 bd 53 12 1b 8b 06 95  99 d6 7f c7 d6 c9 37 1c  |p.S...........7.|
000008e0  78 dd c6 ff 2f a1 da ba  91 95 f8 df e3 10 b9 10  |x.../...........|
000008f0  d1 0a 65 11 fc 1e db e6  e3 07 08 80 3f fc 35 54  |..e.........?.5T|
00000900  65 4d 7e 89 80 64 a2 90  b9 28 1c b6 c9 61 a1 ca  |eM~..d...(...a..|
00000910  41 ec 62 59 68 b5 07 ae  b3 98 a7 76 36 a3 08 90  |A.bYh......v6...|
00000920  02 0d 28 07 13 9c 46 11  1f ad 6a 75 24 49 51 5e  |..(...F...ju$IQ^|
00000930  52 c4 76 e3 2c a8 2e a8  66 54 5e 51 32 50 0b 16  |R.v.,...fT^Q2P..|
00000940  29 4a 11 63 35 2c e6 f9  68 1c d1 94 f3 42 2d 1a  |)J.c5,..h....B-.|
00000950  a5 66 52 95 59 45 79 c1  d0 2a fe 9b fe 90 1f e4  |.fR.YEy..*......|
00000960  ef fe 8c 1f e3 b7 f8 2b  fe 89 bf 20 0b 00 68 20  |.......+... ..h |
00000970  03 e9 d5 38 34 8a 72 cf  0a da 79 d2 36 dc d8 50  |...84.r...y.6..P|
00000980  18 9a 6a 5c a3 48 53 6c  98 51 01 00 00 43 00 00  |..j\.HSl.Q...C..|
00000990  00 00 00 00 00 00 00 00  00 00 68 6f 6d 65 2e 68  |..........home.h|
000009a0  74 6d 6c 00 57 65 6c 63  6f 6d 65 21 00 02 00 00  |tml.Welcome!....|
000009b0  4d 00 00 00 00 00 00 00  00 01 00 00 00 4e 61 6d  |M............Nam|
000009c0  65 00 4e 61 6d 65 00 02  00 00 4d 00 00 00 00 00  |e.Name....M.....|
000009d0  00 00 00 02 00 00 00 54  69 74 6c 65 00 54 69 74  |.......Title.Tit|
000009e0  6c 65 00 02 00 00 4d 00  00 00 00 00 00 00 00 03  |le....M.........|
000009f0  00 00 00 43 72 65 61 74  6f 72 00 43 72 65 61 74  |...Creator.Creat|
00000a00  6f 72 00 02 00 00 4d 00  00 00 00 00 00 00 00 04  |or....M.........|
00000a10  00 00 00 50 75 62 6c 69  73 68 65 72 00 50 75 62  |...Publisher.Pub|
00000a20  6c 69 73 68 65 72 00 02  00 00 4d 00 00 00 00 00  |lisher....M.....|
00000a30  00 00 00 05 00 00 00 44  61 74 65 00 44 61 74 65  |.......Date.Date|
00000a40  00 02 00 00 4d 00 00 00  00 00 00 00 00 06 00 00  |....M...........|
00000a50  00 44 65 73 63 72 69 70  74 69 6f 6e 00 44 65 73  |.Description.Des|
00000a60  63 72 69 70 74 69 6f 6e  00 02 00 00 4d 00 00 00  |cription....M...|
00000a70  00 00 00 00 00 07 00 00  00 4c 61 6e 67 75 61 67  |.........Languag|
00000a80  65 00 4c 61 6e 67 75 61  67 65 00 03 00 00 4d 00  |e.Language....M.|
00000a90  00 00 00 00 00 00 00 08  00 00 00 49 6c 6c 75 73  |...........Illus|
00000aa0  74 72 61 74 69 6f 6e 5f  34 38 78 34 38 40 31 00  |tration_48x48@1.|
00000ab0  49 6c 6c 75 73 74 72 61  74 69 6f 6e 5f 34 38 78  |Illustration_48x|
00000ac0  34 38 40 31 00 ff ff 00  57 00 00 00 00 00 00 00  |48@1....W.......|
00000ad0  00 6d 61 69 6e 50 61 67  65 00 57 65 6c 63 6f 6d  |.mainPage.Welcom|
00000ae0  65 21 00 ff ff 00 43 00  00 00 00 00 00 00 00 72  |e!....C........r|
00000af0  65 64 69 72 65 63 74 00  57 65 6c 63 6f 6d 65 21  |edirect.Welcome!|
00000b00  00 05 28 b5 2f fd 00 58  61 00 00 08 00 00 00 0c  |..(./..Xa.......|
00000b10  00 00 00 00 00 00 00 00  00 00 58 00 00 00 00 01  |..........X.....|
00000b20  00 00 00 00 00 00 00 6c  69 73 74 69 6e 67 2f 74  |.......listing/t|
00000b30  69 74 6c 65 4f 72 64 65  72 65 64 2f 76 31 00 6c  |itleOrdered/v1.l|
00000b40  69 73 74 69 6e 67 2f 74  69 74 6c 65 4f 72 64 65  |isting/titleOrde|
00000b50  72 65 64 2f 76 31 00 01  08 00 00 00 3c 00 00 00  |red/v1......<...|
00000b60  01 00 00 00 00 00 00 00  02 00 00 00 03 00 00 00  |................|
00000b70  04 00 00 00 05 00 00 00  06 00 00 00 07 00 00 00  |................|
00000b80  08 00 00 00 09 00 00 00  0a 00 00 00 0b 00 00 00  |................|
00000b90  0c 00 00 00 90 08 00 00  00 00 00 00 01 0b 00 00  |................|
00000ba0  00 00 00 00 57 0b 00 00  00 00 00 00 8a 09 00 00  |....W...........|
00000bb0  00 00 00 00 e3 0a 00 00  00 00 00 00 e3 09 00 00  |................|
00000bc0  00 00 00 00 27 0a 00 00  00 00 00 00 41 0a 00 00  |....'.......A...|
00000bd0  00 00 00 00 8b 0a 00 00  00 00 00 00 69 0a 00 00  |............i...|
00000be0  00 00 00 00 ad 09 00 00  00 00 00 00 03 0a 00 00  |................|
00000bf0  00 00 00 00 c7 09 00 00  00 00 00 00 c5 0a 00 00  |................|
00000c00  00 00 00 00 50 08 00 00  00 00 00 00 17 0b 00 00  |....P...........|
00000c10  00 00 00 00 75 8f 4c 66  32 67 a5 95 20 49 2a d9  |....u.Lf2g.. I*.|
00000c20  ea 55 52 8a                                       |.UR.|
00000c24
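
For anyone following along, the fixed 80-byte header at the start of this dump can be decoded with a few lines of Python. This is only a minimal sketch, assuming the field order given in the openZIM file-format description; the names ZimHeader and parse_zim_header are hypothetical and not part of pyzim or libzim.

```python
import struct
from typing import NamedTuple


class ZimHeader(NamedTuple):
    magic: int
    major_version: int
    minor_version: int
    uuid: bytes
    entry_count: int
    cluster_count: int
    url_ptr_pos: int
    title_ptr_pos: int
    cluster_ptr_pos: int
    mime_list_pos: int
    main_page: int
    layout_page: int
    checksum_pos: int


# Little-endian: u32, u16, u16, 16-byte UUID, 2 x u32, 4 x u64, 2 x u32, u64.
HEADER_STRUCT = struct.Struct("<IHH16sIIQQQQIIQ")


def parse_zim_header(data: bytes) -> ZimHeader:
    """Decode the fixed-size header from the first 80 bytes of a ZIM file."""
    if len(data) < HEADER_STRUCT.size:
        raise ValueError("a ZIM header needs at least 80 bytes")
    return ZimHeader(*HEADER_STRUCT.unpack_from(data, 0))


# The first five rows (80 bytes) of the hexdump above.
raw_header = bytes.fromhex(
    "5a494d040600010095cacac08831c24d"
    "b16cd93f1c44db400d00000003000000"
    "ac0b000000000000600b000000000000"
    "940b0000000000005000000000000000"
    "0a000000ffffffff140c000000000000"
)
header = parse_zim_header(raw_header)
print(header)
```

The decoded offsets should line up with the header summary posted earlier (directory pointer list, cluster pointer list, MIME type list and checksum positions).
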

kelson42 commented 9 months ago

@IMayBeABitShy Thank you for the detailed report. We are interested in understanding the case and helping you, but the answer doesn't seem super obvious. Give us a bit of time to come back to you.

kelson42 commented 9 months ago

Implementing #572 might help here.

Jaifroid commented 9 months ago

Not sure why it believes this is a legacy ZIM DirListing (I think I followed the standard). Expanding the collapsed object

Regarding how the PWA detects legacy ZIM listings, it should mean that this is a ZIM that uses namespaces (like A/index.html, -/some_stylesheet.css, I/an_image.webp), as opposed to being a type 1 ZIM (one that has all user content under a C/ namespace, and doesn't distinguish amongst images, stylesheets, HTML, etc.). It doesn't mean that the ZIM is non-conformant, it just means that it adheres to an earlier openZIM spec.

Having said that, the ZIMFile object looks like it's missing the articleCount and articlePtrPosition info. This is what a typical mwOffliner-produced ZIM looks like (also uses legacy ZIM format):

image

It's possible the fields weren't populated at the time you took your snapshot, or that an older version of the PWA displayed the object too soon in console. Might be worth double-checking, though, in the latest PWA (2.7.2+), that those properties are populated (check after the landing page has loaded).

IMayBeABitShy commented 9 months ago

Regarding how the PWA detects legacy ZIM listings, it should mean that this is a ZIM that uses namespaces (like A/index.html, -/some_stylesheet.css, I/an_image.webp), as opposed to being a type 1 ZIM (one that has all user content under a C/ namespace, and doesn't distinguish amongst images, stylesheets, HTML, etc.). It doesn't mean that the ZIM is non-conformant, it just means that it adheres to an earlier openZIM spec.

That is interesting, thank you for the explanation. How is this check performed? This should be a ZIM with the newer namespace behavior (minor version 1, content in C namespace). The URLs (from the url pointer list) are as follows:

0 -> 2442: b'Chome.html'
1 -> 2787: b'Credirect'
2 -> 2531: b'MCreator'
3 -> 2599: b'MDate'
4 -> 2625: b'MDescription'
5 -> 2699: b'MIllustration_48x48@1'
6 -> 2665: b'MLanguage'
7 -> 2477: b'MName'
8 -> 2563: b'MPublisher'
9 -> 2503: b'MTitle'
10 -> 2757: b'WmainPage'
11 -> 2128: b'Xlisting/titleOrdered/v0'
12 -> 2839: b'Xlisting/titleOrdered/v1'

So the "C" namespace is used for content, "X" for the indexes, "W" for well-known entries and "M" for metadata.
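
To make the layout explicit: each full URL in the listing above is one namespace byte followed directly by the path. A minimal sketch of splitting the two, where split_full_url is a hypothetical helper rather than a pyzim function:

```python
from typing import Tuple


def split_full_url(full_url: bytes) -> Tuple[str, str]:
    """Split a namespace-prefixed URL into (namespace, path).

    Assumes the first byte is the one-byte namespace and the rest is the
    UTF-8 encoded path, as in the listing above.
    """
    if not full_url:
        raise ValueError("empty URL")
    namespace = chr(full_url[0])
    path = full_url[1:].decode("utf-8")
    return namespace, path


for raw in (b"Chome.html", b"MCreator", b"WmainPage", b"Xlisting/titleOrdered/v1"):
    namespace, path = split_full_url(raw)
    # Display form with a separating slash, as used in the documentation.
    print(f"{namespace}/{path}")
```
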

I've re-checked the output of the PWA version. The article-related fields still aren't populated and I am fairly sure it is fully loaded. Also, the "random article" button took me to a metadata field, so something seems to be wrong with the article index... This should be the entry at Xlisting/titleOrdered/v1 and behave like the v0 index, with the exception that only article titles are included, right?

Jaifroid commented 9 months ago

Hmm, there is a byte set in the ZIM file header that indicates whether the ZIM is type 0 or type 1. It's the minorVersion field, and it is extracted here: https://github.com/kiwix/kiwix-js-windows/blob/main/www/js/lib/zimfile.js#L488 .

Your minorVersion is indeed set to type 1, but the logic that decides whether it is using legacy listing or not is here:

https://github.com/kiwix/kiwix-js-windows/blob/main/www/js/lib/zimfile.js#L366

The legacy listing is a kind of fallback, so it should work for any ZIM; it gets used if the app can't find the X/listing... By the way, that should be in the X/ namespace (you wrote Xlisting above, probably just a typo).

I'm not sure if labelling the title listing as legacy is an inaccuracy in the PWA, or some problem with the ZIM format, though if it's not readable by Kiwix Serve, it would suggest the latter.

By the way, you can (in modern Firefox or Chromium) get a debuggable (unminified) version of the PWA by going to https://kiwix.github.io/kiwix-js-windows/ (ignore the Repo title, it's not Windows-specific). This should allow you to pause on those lines during ZIM loading in case it helps you to debug the format.

IMayBeABitShy commented 9 months ago

X/listing... By the way, that should be in the X/ namespace (you wrote Xlisting above, probably just a typo).

About that: the specification always uses <namespace>/<path>, but all ZIMs I've analyzed (e.g. an askubuntu one) use <namespace><path>. Perhaps this is the problem? Should these paths start with a /? The aforementioned askubuntu ZIM also has the following URLs, so I had assumed that the / in the documentation was just for readability:

1488425 -> 2285351818: b'Cusers_page=97'
1488426 -> 2285351849: b'Cusers_page=98'
1488427 -> 2285351880: b'Cusers_page=99'
1488428 -> 2285351911: b'MCounter'
1488429 -> 2285351936: b'MCreator'
1488430 -> 2285351961: b'MDate'
1488431 -> 2285351983: b'MDescription'
1488432 -> 2285352012: b'MFaviconPath'
1488433 -> 2285352041: b'MIllustration_48x48@1'
1488434 -> 2285352079: b'MIllustration_96x96@1'
1488435 -> 2285352117: b'MLanguage'
1488436 -> 2285352143: b'MName'
1488437 -> 2285352165: b'MPublisher'
1488438 -> 2285352192: b'MTags'
1488439 -> 2285352214: b'MTitle'
1488440 -> 2285352237: b'WmainPage'
1488441 -> 2285352259: b'Xfulltext/xapian'
1488442 -> 2285352292: b'Xlisting/titleOrdered/v0'
1488443 -> 2285352333: b'Xlisting/titleOrdered/v1'
1488444 -> 2285352374: b'Xtitle/xapian'

By the way, you can (in modern Firefox or Chromium) get a debuggable (unminified) version of the PWA by going to https://kiwix.github.io/kiwix-js-windows/ (ignore the Repo title, it's not Windows-specific). This should allow you to pause on those lines during ZIM loading in case it helps you to debug the format.

Great, thank you. I'll look into it.

Jaifroid commented 9 months ago

It looks like that might be the problem... The PWA at least assumes the title listing is in the X/ namespace, based on the OpenZIM spec, and we certainly find it there in most ZIMs. However, as we use a fallback if we don't find the X/titleOrdered/v0 or /v1 listings, this would not show up as an obvious error.

In the listing you provide above, I notice that the / seems to be systematically missing. Metadata are in the M/ namespace, so MIllustration_48x48@1 should definitely be M/Illustration_48x48@1... It may just be a display fluke, however, rather than an actual coding error in the AskUbuntu ZIM.

IMayBeABitShy commented 9 months ago

It looks like that might be the problem... The PWA at least assumes the title listing is in the X/ namespace, based on the OpenZIM spec, and we certainly find it there in most ZIMs. However, as we use a fallback if we don't find the X/titleOrdered/v0 or /v1 listings, this would not show up as an obvious error.

Sorry, I may have phrased that poorly. The entry is inside the X namespace (no /, namespaces are always 1 byte) and at the path listing/titleOrdered/v1. It seems like the whole '/' thing is just display-related. At least, kiwix-js finds the entry for the v1 title index correctly as far as I can tell:

_zimfile: {…}
blob: 0
cluster: 1
mimetypeInteger: 0
namespace: "X"
offset: 2839
redirect: false
redirectTarget: undefined
title: "listing/titleOrdered/v1"
url: "listing/titleOrdered/v1"

But it seems like the next function:

).then(function (metadata) {
            // Note that we do not accept a listing if its size is 0, i.e. if it contains no data
            // (although this should not occur, we have been asked to handle it - see kiwix-js #708)
            if (metadata && metadata.size) {
                that[listing.ptrName] = metadata.ptr;
                that[listing.countName] = metadata.size / 4; // Each entry uses 4 bytes
                highestListingVersion = Math.max(~~listing.path.replace(/.+(\d)$/, '$1'), highestListingVersion);
            }
            // Get the next Listing
            return listingAccessor(listings.pop());

seems to receive null as the value of metadata. So for some reason return that.blob(dirEntry.cluster, dirEntry.blob, true); seems to return null...
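
For reference, the "metadata.size / 4" in that snippet reflects that a title listing is a flat array of 4-byte entry indexes. A minimal sketch of decoding such a listing, assuming little-endian unsigned 32-bit values; decode_title_listing is a hypothetical name:

```python
import struct
from typing import List


def decode_title_listing(blob: bytes) -> List[int]:
    """Decode a title listing blob into a list of entry indexes."""
    if len(blob) % 4 != 0:
        raise ValueError("listing size must be a multiple of 4 bytes")
    count = len(blob) // 4  # each entry uses 4 bytes
    return list(struct.unpack(f"<{count}I", blob))


# The 52 bytes found at titlePtrPos (2912 = 0xb60) in the hexdump of test.zim.
legacy_listing = struct.pack("<13I", 1, 0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
indexes = decode_title_listing(legacy_listing)
print(len(indexes), indexes)
```
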

It may just be a display fluke, however, rather than an actual coding error in the AskUbuntu ZIM.

Yes, I think so too.

IMayBeABitShy commented 9 months ago

I think I may have found the problem. The v1 title index is in a compressed cluster. In kiwix JS, this results in a problem in the following code section:

// If only metadata were requested and the cluster is compressed, return null (this is probably a ZIM format error)
            // DEV: This is because metadata are only requested for finding absolute offsets into uncompressed clusters,
            // principally for finding the start and size of a title pointer listing
            if (meta && compressionType[0] > 1) return null;

Here compressionType[0]=5 and meta=true, which results in the function I mentioned above being called with metadata=null. I am still investigating whether this is also the cause of the problem with kiwix-serve or unrelated. It seems I missed the part where these indexes must be stored uncompressed.
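
A writer could catch this before the file is finished by inspecting the first byte of the cluster that holds a listing. The sketch below assumes the cluster information byte layout from the openZIM format description (low four bits: compression type, with 0 and 1 meaning uncompressed, 4 LZMA and 5 Zstandard; bit 0x10: extended offsets). The helper names are hypothetical.

```python
from typing import Tuple

COMPRESSION_NAMES = {0: "none", 1: "none", 4: "lzma", 5: "zstd"}


def decode_cluster_info(info_byte: int) -> Tuple[int, bool]:
    """Return (compression type, extended flag) for a cluster info byte."""
    compression = info_byte & 0x0F
    extended = bool(info_byte & 0x10)
    return compression, extended


def is_compressed(info_byte: int) -> bool:
    """Same idea as the kiwix-js check quoted above: any type above 1."""
    compression, _ = decode_cluster_info(info_byte)
    return compression > 1


# In the hexdump of test.zim, the cluster at 0xb01 starts with 0x05 and the
# cluster at 0xb57 starts with 0x01.
for offset, info in ((0xB01, 0x05), (0xB57, 0x01)):
    compression, extended = decode_cluster_info(info)
    name = COMPRESSION_NAMES.get(compression, "unknown")
    print(hex(offset), name, "extended" if extended else "not extended")
```
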

Jaifroid commented 9 months ago

Ah yes, that rings a bell! Basically, X/titleOrdered/v0 is (currently) just a wrapper around the legacy titlePtrList, so that legacy apps can still find a title list even if they don't look in the X namespace.

I'm not sure if it's permissible to compress X/titleOrdered/v1 -- maybe @mgautierfr, who is the expert here, can comment? Clearly we didn't handle that possibility in our backend (yet) because there are (or were) no ZIMs with compressed title lists. My suspicion is that the overhead of decompressing what can be potentially huge lists (even decompressing on the fly) for potentially thousands of lookups during binary search would make this quite problematic.

IMayBeABitShy commented 9 months ago

I'm not sure if it's permissible to compress X/titleOrdered/v1 -- maybe @mgautierfr, who is the expert here, can comment? Clearly we didn't handle that possibility in our backend (yet) because there are (or were) no ZIMs with compressed title lists.

It's unfortunately not. I've just missed the following line in the spec:

All indexes and listing items MUST be stored in uncompressed cluster.

mgautierfr commented 9 months ago

I used xxd -r <your_dump> example.zim to recreate the ZIM file from your dump:

$ zimcheck --all --details example.zim
[INFO] Checking zim file example.zim
[INFO] Zimcheck version is 3.2.0
[INFO] Verifying ZIM-archive structure integrity...
mimelistPos must be 80.
  [ERROR] ZIM file's low level structure is invalid
mimelistPos must be 80.

This is consistent with https://github.com/openzim/libzim/issues/822, where you asked if we could remove the constraint on mimelistPos being 80.

Are you sure you are using the right ZIM file (and not a patched version of zimcheck)?

IMayBeABitShy commented 9 months ago

I used xxd -r <your_dump> example.zim to recreate the ZIM file from your dump:

$ zimcheck --all --details example.zim
[INFO] Checking zim file example.zim
[INFO] Zimcheck version is 3.2.0
[INFO] Verifying ZIM-archive structure integrity...
mimelistPos must be 80.
  [ERROR] ZIM file's low level structure is invalid
mimelistPos must be 80.

This is consistent with openzim/libzim#822, where you asked if we could remove the constraint on mimelistPos being 80.

Are you sure you are using the right ZIM file (and not a patched version of zimcheck)?

I assure you, I most definitely did not do anything that involved writing or modifying C/C++ code ;)

The ZIM file above used 2 KiB of reserved space for the mimetypelist at offset 80 (thus the many zero bytes). The problem here is that the hexdumps created with hexdump -C can apparently not be reversed with xxd -r. I get an invalid magic number when trying to do so, but when using xxd to both create the dump and recreate the file, zimcheck passes.
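
If someone still wants to rebuild the file from the hexdump -C output, a small script can do it. This is a rough sketch under my own assumptions about the canonical format (offset column, up to 16 hex bytes, optional |ascii| column, a lone * for squeezed repeats of the previous row, and a final bare offset line); reverse_hexdump_c is a hypothetical helper, not an existing tool.

```python
def reverse_hexdump_c(dump: str) -> bytes:
    """Rebuild the original bytes from `hexdump -C` style output."""
    out = bytearray()
    last_row = b""
    squeezed = False
    for line in dump.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "*":
            # The previous row repeats until the next printed offset.
            squeezed = True
            continue
        # Drop the |ascii| column, then split offset and hex bytes.
        fields = line.split("|", 1)[0].split()
        offset = int(fields[0], 16)
        if squeezed and last_row:
            while len(out) < offset:
                out += last_row
            del out[offset:]  # trim in case the gap is not a whole row
        squeezed = False
        row = bytes(int(field, 16) for field in fields[1:])
        out += row
        if row:
            last_row = row
    return bytes(out)


sample_dump = "\n".join([
    "00000000  41 42 43 44 45 46 47 48  49 4a 4b 4c 4d 4e 4f 50  |ABCDEFGHIJKLMNOP|",
    "00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|",
    "*",
    "00000040  5a 49 4d 04                                       |ZIM.|",
    "00000044",
])
rebuilt = reverse_hexdump_c(sample_dump)
print(len(rebuilt))
```

Writing the returned bytes to a file in binary mode should give something zimcheck can read, provided the dump was complete.
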

It's most likely the compressed title indexes, but I am still checking if that is also the problem with kiwix-serve and not only for the PWA.

mgautierfr commented 9 months ago

Can you upload the ZIM file here? (You may have to rename it to .zip so that GitHub accepts it.)

IMayBeABitShy commented 9 months ago

Sure, here you go: test.zip

Jaifroid commented 9 months ago

I confirm the PWA can't populate articleCount or articlePtrPosition, but falls back to the legacy v0 listing. This has anomalous effects, because it's treating the ZIM as some kind of hybrid, so shows a bunch of stuff in the title index which shouldn't be there. That's a side effect (probably of it being a v1 ZIM which can't read the X/titleOrdered/v1 listing). It explains why the random button shows metadata sometimes.

FYI In the PWA, you can access what it thinks is the title list by pressing a space in the search field. And you can access the full URL list, including namespaces, by typing space + / (space followed by /) as shown below:

image

mgautierfr commented 9 months ago

It is missing the M/Counter entry. The spec describes it as not mandatory, but libkiwix expects it and fails if it cannot find it. This probably slipped under our radar because libzim itself creates it, so scrapers don't have to add it.
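
For a custom writer, generating that entry could look roughly like the sketch below. It assumes the Counter value is a semicolon-separated list of mimetype=count pairs (for example "text/html=2;image/png=1"); the exact escaping and ordering rules used by libzim are not covered here, and build_counter_metadata is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def build_counter_metadata(mimetypes: Iterable[str]) -> str:
    """Build a Counter metadata value from the MIME types of content entries.

    Assumption: pairs are joined as "mimetype=count" separated by ";".
    Sorting is only for reproducible output in this sketch.
    """
    counts = Counter(mimetypes)
    return ";".join(f"{mime}={count}" for mime, count in sorted(counts.items()))


value = build_counter_metadata(["text/html", "text/html", "image/png"])
print(value)
```
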

Jaifroid commented 9 months ago

It is missing the M/Counter entry.

But the ZIM is still incorrectly formatted if X/listing/titleOrdered/v1 is in a compressed cluster, right? So, probably if it had that M/Counter entry, then libzim should fall back to using the v0 title index...

mgautierfr commented 9 months ago

But the ZIM is still incorrectly formatted if X/listing/titleOrdered/v1 is in a compressed cluster, right?

Yes. The cluster should not be compressed.

So, probably if it had that M/Counter entry, then libzim should fall back to using the v0 title index...

No, libzim is nice here (https://github.com/openzim/libzim/blob/main/src/fileimpl.cpp#L250-L254). So it should pass and libzim will use the information in the header to locate the titleIndex (directly from the header, without using X/listing/titleOrdered/v0)

Jaifroid commented 9 months ago

So it should pass and libzim will use the information in the header to locate the titleIndex (directly from the header, without using X/listing/titleOrdered/v0)

OK, thanks for confirming -- I think libzim and the KJS backend are in accordance, then (except for the former requiring M/Counter). KJS backend falls back from v1 to v0 to titlePtrPos in the header.

kelson42 commented 9 months ago

Yes. The cluster should not be compressed.

Should we not check this in zimcheck?

mgautierfr commented 9 months ago

Yes

IMayBeABitShy commented 8 months ago

I can confirm the issue with kiwix-serve has been fixed. Thank you all for your help.

Regarding the bug with the PWA/kiwix-js: Unfortunately, this seems to persist even when the entries in the X namespace are properly stored uncompressed. However, this is an unrelated issue and I should hopefully be capable of debugging that myself.

Once again, thank you for your helpful comments and the quick fix.