openzim / libzim

Reference implementation of the ZIM specification
https://download.openzim.org/release/libzim/
GNU General Public License v2.0

Content offset is ignored on macOS using LibZim 9.2.0 #886

Closed BPerlakiH closed 1 month ago

BPerlakiH commented 2 months ago

In the Kiwix app for macOS, using libzim 9.2.0.

We have the following code to read the content from the zim file using offset and size:

blob = item.getData(start, fmin(item.getSize() - start, end - start + 1));

If I got this right, the corresponding C++ function is here: https://github.com/openzim/libzim/blob/main/src/item.cpp#L50C1-L50C61

My aim is to read the data in chunks, but it seems there's a bug with the offset. Reading the entire data from offset=0 to item.getSize() works as expected. Starting with offset=0 and increasing the size also works as expected.
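For reference, the size argument in the call above can be computed without going through floating point, since fmin operates on doubles and a double only represents integers exactly up to 2^53. This is a minimal sketch: chunkLength is a hypothetical helper name, not part of libzim, and it assumes that start and end are inclusive byte positions, as in the debug prints in this report.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of libzim: number of bytes to request for the
// inclusive byte range [start, end] of an item that is itemSize bytes long.
// Returns 0 when the range is empty or starts at or past the end of the item.
inline std::uint64_t chunkLength(std::uint64_t itemSize,
                                 std::uint64_t start,
                                 std::uint64_t end) {
    if (start >= itemSize || end < start) {
        return 0;
    }
    // Same arithmetic as fmin(size - start, end - start + 1), kept in integers.
    // Comparing before adding 1 avoids overflow for very large end values.
    const std::uint64_t remaining = itemSize - start;
    const std::uint64_t span = end - start;  // requested length minus one
    return span >= remaining ? remaining : span + 1;
}
```

With the 7369-byte HTML item from the logs, chunkLength(7369, 0, 99) gives 100, and a range that runs past the end, such as chunkLength(7369, 7300, 7399), is clamped to the 69 bytes that remain.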

Changing the offset does not work: it seems to be ignored, and I keep getting back the same data. I tried it with text/html and a webp image, and what I would most like to get working is webm video content.

Here are some debug prints from webm chunks:

"getURLContent: /videos/9987/video.webm 2076400 - 2076464: 1a45dfa39f4286810142f7810142f2810442f381084282847765626d4287810242858102185380670100000000f1a98a114d9b74bb4dbb8b53ab841549a96653"

"getURLContent: /videos/9987/video.webm 2076600 - 2076664: 1a45dfa39f4286810142f7810142f2810442f381084282847765626d4287810242858102185380670100000000f1a98a114d9b74bb4dbb8b53ab841549a96653"

It is the same with HTML, read in chunks of 100 bytes:

"Optional(\"text/html\") -> Optional(7369)"
"mime: text/html"

mType text/html from 0 to 99, size: 100
"getURLContent: /A/App/IntroPage 0 - 99: Optional(\"<!DOCTYPE html>\\n<html class=\\\"client-js\\\"><head>\\n  <meta charset=\\\"UTF-8\\\">\\n  <title>App/IntroPage</titl\")"
"got content @ isMain: false from: 0 - 99 with data: 100"

mType text/html from 100 to 199, size: 100
"getURLContent: /A/App/IntroPage 100 - 199: Optional(\"<!DOCTYPE html>\\n<html class=\\\"client-js\\\"><head>\\n  <meta charset=\\\"UTF-8\\\">\\n  <title>App/IntroPage</titl\")"
"got content @ isMain: false from: 100 - 199 with data: 100"

mType text/html from 200 to 299, size: 100
"getURLContent: /A/App/IntroPage 200 - 299: Optional(\"<!DOCTYPE html>\\n<html class=\\\"client-js\\\"><head>\\n  <meta charset=\\\"UTF-8\\\">\\n  <title>App/IntroPage</titl\")"
"got content @ isMain: false from: 200 - 299 with data: 100"

Note: I tested it with a couple of unrelated ZIM files, and the offset was still ignored. I even hard-coded the offset values, and that still returned the same data.

BPerlakiH commented 2 months ago

I am not sure whether this other issue is related, but it might be: https://github.com/openzim/libzim/issues/670

mgautierfr commented 1 month ago

How are you compiling libzim? Do you use https://download.openzim.org/release/libzim/libzim_macos-x86_64-9.2.1.tar.gz (or the arm64 version)?

Do you have the same issue with an older version?

rgaudin commented 1 month ago

I believe @BPerlakiH is using libkiwix 13.1.0-1 release from https://download.kiwix.org/release/libkiwix/libkiwix_xcframework-13.1.0-1.tar.gz

mgautierfr commented 1 month ago

@BPerlakiH Can you test this https://tmp.kiwix.org/ci/dev_preview/test_reader/libkiwix_xcframework-2024-05-30.tar.gz? This is a build with this branch: https://github.com/openzim/libzim/pull/889

BPerlakiH commented 1 month ago

@mgautierfr I did test this build, and it seems to have the same issue.

Here is a fragment of the logs from reading a CSS file:

From 104448 to 105471 data: 2f2a207374796c652066726f6d2068747470733a2f2f66722e6d2e77696b6970 ...
"kiwix://91BB58AE-13DF-0100-9423-D2B8617607B0/-/inserted_style.css"
From 105472 to 106495 data: 2f2a207374796c652066726f6d2068747470733a2f2f66722e6d2e77696b6970 ...
"kiwix://91BB58AE-13DF-0100-9423-D2B8617607B0/-/inserted_style.css"
From 106496 to 107519 data: 2f2a207374796c652066726f6d2068747470733a2f2f66722e6d2e77696b6970 ...
"kiwix://91BB58AE-13DF-0100-9423-D2B8617607B0/-/inserted_style.css"
From 107520 to 108543 data: 2f2a207374796c652066726f6d2068747470733a2f2f66722e6d2e77696b6970 ...
"kiwix://91BB58AE-13DF-0100-9423-D2B8617607B0/-/inserted_style.css"
From 108544 to 109567 data: 2f2a207374796c652066726f6d2068747470733a2f2f66722e6d2e77696b6970 ...
"kiwix://91BB58AE-13DF-0100-9423-D2B8617607B0/-/inserted_style.css"
From 109568 to 110591 data: 2f2a207374796c652066726f6d2068747470733a2f2f66722e6d2e77696b6970 ...
"kiwix://91BB58AE-13DF-0100-9423-D2B8617607B0/-/inserted_style.css"
From 110592 to 111615 data: 2f2a207374796c652066726f6d2068747470733a2f2f66722e6d2e77696b6970 ...
"kiwix://91BB58AE-13DF-0100-9423-D2B8617607B0/-/inserted_style.css"
From 111616 to 111685 data: 2f2a207374796c652066726f6d2068747470733a2f2f66722e6d2e77696b6970 ...

It keeps repeating the same data, regardless of the offset.

The actual output from the 1024-byte chunks is also repeating (the offset is not moving):

/*
Problematic modules: {
    "skins.minerva.base.reset": "missing",
    "skins.minerva.content.styles": "missing",
    "ext.cite.style": "missing",
    "mobile.app.pagestyles.android": "missing"
}
*/
/*
MediaWiki:Common.css
*/
.incomplete {
    background-color:#f2edb3;
    border:2px solid #ffbd00;
    padding:10px;
    line-height:35px;
}
.incomplete p:before {
    content:"!";
    color:white;
    display:block;
    float:left;
    font-size:35px;
    line-height:35px;
    padding:1px 14px;
    background-color:#ffbd00;
    border-radius:100px;
    margin-right:10px;
}
pre,.mw-code {
    background-color:#f2f2f2;
    border:1px solid #a8a8a8;
    border-radius:2px;
    display:inline-block:overflow:scroll;
    padding:5px;
    margin-bottom:10px;
        line-height:1;
}
code {
    white-space:nowrap;
}
.page-Main_Page #firstHeading,.page-Main_Page #toc {
    display:none;
}
#tagline {
    display:none;
}
h1,h2,h3,h4,h5,h6 {
    margin-top:40px;
    margin-bottom:20px;
}
video {
    height:auto!important;
}
.thumb {
    padding:5px;
    border:1px solid #bbb;
    margin-left:10px;
}

a.exte/*
Problematic modules: {
    "skins.minerva.base.reset": "missing",
    "skins.minerva.content.styles": "missing",
    "ext.cite.style": "missing",
    "mobile.app.pagestyles.android": "missing"
}
*/
/*
MediaWiki:Common.css
*/
.incomplete {
    background-color:#f2edb3;
    border:2px solid #ffbd00;
    padding:10px;
    line-height:35px;
}
.incomplete p:before {
    content:"!";
    color:white;
    display:block;
    float:left;
    font-size:35px;
    line-height:35px;
    padding:1px 14px;
    background-color:#ffbd00;
    border-radius:100px;
    margin-right:10px;
}
pre,.mw-code {
    background-color:#f2f2f2;
    border:1px solid #a8a8a8;
    border-radius:2px;
    display:inline-block:overflow:scroll;
    padding:5px;
    margin-bottom:10px;
        line-height:1;
}
code {
    white-space:nowrap;
}
.page-Main_Page #firstHeading,.page-Main_Page #toc {
    display:none;
}
#tagline {
    display:none;
}
h1,h2,h3,h4,h5,h6 {
    margin-top:40px;
    margin-bottom:20px;
}
video {
    height:auto!important;
}
.thumb {
    padding:5px;
    border:1px solid #bbb;
    margin-left:10px;
}

a.exte/*
Problematic modules: {
    "skins.minerva.base.reset": "missing",
    "skins.minerva.content.styles": "missing",
    "ext.cite.style": "missing",
    "mobile.app.pagestyles.android": "missing"
}
*/
/*
MediaWiki:Common.css
*/

mgautierfr commented 1 month ago

In this line https://github.com/kiwix/kiwix-apple/blob/main/Model/ZimFileService/ZimFileService.mm#L201, you return the data from item.getData().data(). Should it be blob.data() instead?
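To illustrate why that would produce the symptoms above, here is a small self-contained sketch. MockBlob and MockItem are hypothetical stand-ins written for this example, not the real zim::Blob and zim::Item, and the two reader functions only approximate the pattern described; the actual kiwix-apple code is not reproduced in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a blob: a pointer plus a length.
class MockBlob {
public:
    MockBlob(const char* data, std::size_t size) : data_(data), size_(size) {}
    const char* data() const { return data_; }
    std::size_t size() const { return size_; }
private:
    const char* data_;
    std::size_t size_;
};

// Hypothetical stand-in for an item backed by an in-memory string.
class MockItem {
public:
    explicit MockItem(std::string content) : content_(std::move(content)) {}
    std::size_t getSize() const { return content_.size(); }
    // Whole content, like calling getData() with no range.
    MockBlob getData() const { return MockBlob(content_.data(), content_.size()); }
    // Sub-range; the caller must keep offset + size within getSize().
    MockBlob getData(std::size_t offset, std::size_t size) const {
        return MockBlob(content_.data() + offset, size);
    }
private:
    std::string content_;
};

// Suspected pattern: the ranged blob is created, but the bytes are copied from
// a second, whole-item blob, so every chunk starts at byte 0.
inline std::string readChunkBuggy(const MockItem& item, std::size_t offset, std::size_t size) {
    MockBlob blob = item.getData(offset, size);
    return std::string(item.getData().data(), blob.size());
}

// Suggested pattern: copy from the ranged blob itself.
inline std::string readChunkFixed(const MockItem& item, std::size_t offset, std::size_t size) {
    MockBlob blob = item.getData(offset, size);
    return std::string(blob.data(), blob.size());
}
```

For an item holding "0123456789", readChunkBuggy(item, 4, 3) returns "012" whatever the offset, while readChunkFixed(item, 4, 3) returns "456". The length is right in both cases, which matches the logs: chunks of the expected size, always with the content from the start of the item.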

BPerlakiH commented 1 month ago

🤦‍♂️ ohh.. that makes sense now, it is working as expected with those changes.. closing this ticket. Thank you @mgautierfr for your help!