openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Issues in `MDN` zimit2 ZIM: links are opening online instead of staying in the ZIM #178

Closed benoit74 closed 8 months ago

benoit74 commented 8 months ago

I built a ZIM on my local machine with the latest zimit2 image (based on warc2zim 2.0.0-dev3), from the WARC files at https://tmp.kiwix.org/ci/test-warc/developer.mozilla.org_en_2024-02-04/

warc2zim command used to test locally:

    warc2zim([
        "--name=developer.mozilla.org_en_all",
        "--tags=mozilla",
        "--favicon=https://drive.farm.openzim.org/developer.mozilla.org_en/favicon.png",
        "--verbose",
        "--publisher=openZIM",
        "--output", "/output",
        "--url", "https://developer.mozilla.org/en-US/",
        "--title", "MDN Web Docs",
        "--description", "Resources for developers, by developers",
        ".",
    ])

A ZIM is produced successfully, but there is still a significant issue: most (all?) links on the home page seem to point to the ZIM content (they are relative links), but when clicked the live online website is opened. Probably some JavaScript is doing some nasty stuff. This has been observed in kiwix-serve; I did not test other readers, but it probably doesn't make a difference.
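One rough way to double-check the "they are relative links" observation is to resolve every `href` of a saved copy of the home page against the URL it is served from, and list the ones that leave the local host. This is only a sketch using the Python standard library: the base URL below is a made-up local kiwix-serve address, and it inspects static markup only, so it says nothing about navigation triggered by JavaScript at click time.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_links(html, base_url):
    """Return the resolved http(s) links that point to another host than base_url."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlsplit(base_url).netloc
    result = []
    for href in collector.hrefs:
        resolved = urljoin(base_url, href)  # relative links inherit the base host
        parts = urlsplit(resolved)
        if parts.scheme in ("http", "https") and parts.netloc != base_host:
            result.append(resolved)
    return result


# Hypothetical local address; adjust to wherever the ZIM is actually served.
BASE = "http://localhost:8080/content/mdn/developer.mozilla.org/en-US/"
SAMPLE = (
    '<a href="docs/Web">Web</a>'
    '<a href="../blog/">Blog</a>'
    '<a href="https://developer.mozilla.org/en-US/plus">Plus</a>'
)
print(external_links(SAMPLE, BASE))
```

If this list is empty for the real home page and clicks still end up online, the navigation is most likely being rewritten by a script rather than by the markup.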

Jaifroid commented 8 months ago

> most (all?) links on the home page seem to point to the ZIM content (they are relative links) but when clicked the live online website is opened

Hmm, this "shouldn't" be possible since a sandbox was added to the Kiwix Serve iframe, unless it's been removed or you're opening the page directly outside of an iframe... Maybe you could make a small, 5-6 page ZIM including the home page for testing?

benoit74 commented 8 months ago

You are right, I pushed the file to dev.library.kiwix.org (here: https://dev.library.kiwix.org/viewer#developer.mozilla.org_en_all_2024-02) and the bug is not present there.

The bug is only present when running locally. Locally I'm using the Docker image of kiwix-tools 3.6.0. Is there any parameter that needs to be set to activate the sandbox?

Jaifroid commented 8 months ago

The sandbox is in the viewer iframe on Kiwix Serve, I believe, though it's also possible to set it via CSP headers server side.
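For illustration, here is roughly what the server-side variant could look like: the `sandbox` directive of the `Content-Security-Policy` header applies the same restrictions as the iframe `sandbox` attribute, and as long as `allow-top-navigation` is not listed, the page cannot navigate the top-level window. This is a minimal sketch with Python's built-in `http.server`, not what kiwix-serve actually does; the helper name, the handler class and the chosen allow-tokens are assumptions made for the example.

```python
from http.server import SimpleHTTPRequestHandler


def sandbox_csp_header(allowed=("allow-scripts", "allow-same-origin")):
    """Build a CSP header that sandboxes the response.

    The default tokens are illustrative. "allow-top-navigation" is left out
    on purpose, so sandboxed content cannot navigate the top-level window.
    """
    value = " ".join(("sandbox",) + tuple(allowed))
    return ("Content-Security-Policy", value)


class SandboxedHandler(SimpleHTTPRequestHandler):
    """Static file handler (hypothetical) that sandboxes every response."""

    def end_headers(self):
        # Add the CSP header just before the header block is closed.
        self.send_header(*sandbox_csp_header())
        super().end_headers()


print(sandbox_csp_header())
```

To try it locally, something like `http.server.HTTPServer(("localhost", 8000), SandboxedHandler).serve_forever()` would serve the current directory with the header attached.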

Jaifroid commented 8 months ago

Another issue with this ZIM: a number of articles (which I've verified to be in the ZIM) get redirected nearly instantly to a custom Mozilla 404 page. If you're quick, you can just see a flash of the original content before it gets redirected.

Here's an example:

Go to this page:

https://dev.library.kiwix.org/viewer#developer.mozilla.org_en_all_2024-02/developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web

and click on "JavaScript Basics" in the left-hand menu. If you're quick, you'll see the page loads, but then very quickly redirects to a custom Mozilla 404 page.

This doesn't happen in the Zimit1 version of the same page here:

https://library.kiwix.org/viewer#developer.mozilla.org_en_all_2024-01/A/developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web

NB: I can see that the content of such pages has been cached, and it is all there and correct. The DevTools Cache API page even shows a graphical view of the loaded page (see screenshot). The redirect definitely happens after DOMContentLoaded, as a result of a dynamic Fetch request (possibly one of the many POST requests). In the zimit1 version there are several Header lookups, and also some attempt to do fuzzy matching in the Service Worker after the content has loaded.

[Screenshot: DevTools Cache API view of the loaded page]

mgautierfr commented 8 months ago

The fix is in https://github.com/kiwix/libkiwix/pull/1036 (to be confirmed by @benoit74).

https://dev.library.kiwix.org runs the nightly build (and so contains the fix); the Docker image is the release build and doesn't have the fix.

Jaifroid commented 8 months ago

The redirect issue I mentioned above occurs on dev.library.kiwix.org, so it must be something else (I thought it was related to the problem of determining the final location of a ZIM URL, and hence to this issue). I'll open a separate issue for it, then.