openzim / libzim

Reference implementation of the ZIM specification
https://download.openzim.org/release/libzim/
GNU General Public License v2.0

Zimwriterfs search results links out of sync with autogenerated "/C/" namespace #519

Closed georgejhunt closed 3 years ago

georgejhunt commented 3 years ago

I used zimdump (commit dca3a83d48a7ac5612c4f3dbfaba89c02c66e6b4, Merge: f406219 ff61a93, Author: Matthieu Gautier mgautier@kymeria.fr):

  1. I ran zimdump on "teded_en_all_2021-01.zim",
  2. which created a content tree with HTML files in an /A/ namespace.
  3. When I used zimwriterfs to recreate a ZIM from that file tree, the HTML was placed in a /C/A/ namespace -- which I can at least understand as a way of dealing with the changing namespace spec.
  4. But when I do an indexed search for "water" from the kiwix banner (top right), I get a list of links with each namespace set to /A/ and no text associated with the item. The link fails.
  5. In my browser, when I change the namespace from /A/ to /C/A/ in the URL that just failed, the URL succeeds.

kelson42 commented 3 years ago

An important point: running zimwriterfs on a zimdump output directory does not recreate the original ZIM file. We need to analyse whether this is a bug.

mgautierfr commented 3 years ago

As @kelson42 said, zimwriterfs is not the inverse operation of zimdump.

As your articles are in the A sub-directory and are placed in the C namespace, the full path is C/A/foo.html. In the most recent versions of libzim/kiwix-lib/kiwix-tools (master), the namespace is hidden, so the URL is /A/foo.html.

However, it seems that the compatibility layer fails to locate /A/foo.html. It is probably a bug. Is it possible to share the generated ZIM file somewhere?
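
The path mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not libzim's actual lookup code: the function names and the `resolve_url` fallback order are hypothetical, and the only facts taken from this thread are that files under the input directory end up in the C namespace and that readers hide that namespace in URLs.

```python
from typing import Optional, Set

CONTENT_NAMESPACE = "C"  # new scheme: user content is stored under a single namespace


def zim_path_for_file(relative_path: str) -> str:
    """Full path stored in the archive for a file found under the input directory."""
    return f"{CONTENT_NAMESPACE}/{relative_path.lstrip('/')}"


def reader_url_for_entry(zim_path: str) -> str:
    """URL shown by a reader that hides the namespace prefix."""
    prefix = CONTENT_NAMESPACE + "/"
    if not zim_path.startswith(prefix):
        raise ValueError(f"not a content entry: {zim_path}")
    return "/" + zim_path[len(prefix):]


def resolve_url(url: str, entries: Set[str]) -> Optional[str]:
    """Hypothetical lookup: try the URL as a namespace-less content path first,
    then as a full path that already carries its namespace."""
    path = url.lstrip("/")
    for candidate in (f"{CONTENT_NAMESPACE}/{path}", path):
        if candidate in entries:
            return candidate
    return None
```

With an archive containing only `C/A/foo.html`, both `resolve_url("/A/foo.html", ...)` and `resolve_url("/C/A/foo.html", ...)` return `C/A/foo.html` in this sketch. The report above suggests that only the second form resolved in the reader being used.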

georgejhunt commented 3 years ago

My original task was to reduce a 27GB zim file to about 10GB. So when I found that the 10 GB zim didn't work, I went to use zimwriterfs on the zimdump-ed 27GB unmodified tree. So it is taking a while to upload. It may take most of today, and may fail. I'll send you a link to my s3 space when it finishes uploading.

But thank you for looking into it.

kelson42 commented 3 years ago

@georgejhunt You had better create a new quality profile in the youtube scraper.

georgejhunt commented 3 years ago

I was not scraping youtube. I was using a kiwix zim file as source. And then I was using youtube "view_count" to selectively copy from input to output (and trim from -/assets/data.js), and repackage. But certainly, if I were scraping youtube, I'd need to set a profile that minimized download size.

kelson42 commented 3 years ago

@mgautierfr I have zimdumped and recreated a ZIM with:

$ zimwriterfs --favicon="I/favicon.jpg" --language="eng" --title="my title" --description="my descriptioon" --creator="ted" --publisher="kiwix" --welcome="A/home.html" out/ out.zim
WARNING: LZMA compression method is deprecated. Support for it will be dropped from libzim soon.
Unable to resolve symlink out/-/favicon: No such file or directory
Resolve redirect
set index

All these tools were built from the latest dev git master HEAD. The result is uploaded at: http://tmp.kiwix.org/teded_broken_suggestions.zim

For me, there are no suggestions at all. It seems definitely broken, but it is not clear to me whether the problem is in the ZIM file or in libkiwix.

mgautierfr commented 3 years ago

Are you sure you have the just-merged https://github.com/openzim/zim-tools/pull/212 in zim-tools?

kelson42 commented 3 years ago

@mgautierfr Will check, but I believe not.

georgejhunt commented 3 years ago

The upload of 27GB failed twice. So I downloaded a shorter ZIM: ted_en_playlist-9-trippy-ted-talks_2021-01.zim. Then I used my zimdump, and zimwriterfs to create: http://d.iiab.io/content/trippy-en-tedtalks.zim -- which also exhibits the problem.

libzim hash: commit ac2cc1fbe8d91b2da9df8c79a7469e83b7b1f30c -- Feb 24, 2021
zim-tools hash: commit f406219cd974d2a944cccbc72a0da8616d886972 -- also Feb 24, 2021

Both compiled on Ubuntu 20.04 with no apparent problems once the dependencies were present.

kelson42 commented 3 years ago

@mgautierfr I have now made sure that I had the Hints patch, but I see exactly the same symptom. I have updated http://tmp.kiwix.org/teded_broken_suggestions.zim

mgautierfr commented 3 years ago

I've just tried with your trippy-en-tedtalks.zim. I see a few bugs (already fixed) but none corresponding to what you describe:

But clicking on the article link in the search page correctly moves to the article (the link is working).

mgautierfr commented 3 years ago

@kelson42, @georgejhunt What reader are you using? kiwix-serve, kiwix-desktop? Which version?

kelson42 commented 3 years ago

@mgautierfr Latest dev kiwix-serve

georgejhunt commented 3 years ago

The kiwix-serve I was using is probably 6 months old. "kiwix-serve -V" yields 3.1.2

kelson42 commented 3 years ago

@georgejhunt Just to confirm that the bug has been identified. It is not a minor one and, even worse, it went through the CI. So, a really valuable bug report. Thx. A fix will be developed within a week.

georgejhunt commented 3 years ago

Thanks for the update, and the priority.
