openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0
"Results" sections of developer.mozilla.org (MDN) are not showing up #327

Open benoit74 opened 5 months ago

benoit74 commented 5 months ago

See https://github.com/webrecorder/wombat/issues/156

This is not specific to Zimit2: the problem also happens in Zimit1 and on replayweb.page.

benoit74 commented 4 months ago

Despite being solved upstream, this is still not working within warc2zim: https://github.com/webrecorder/wombat/issues/156#issuecomment-2188656905

ikreymer commented 1 week ago

The wombat issue is fixed; the remaining issue is warc2zim adding an unneeded %-encoding, converting from:

<iframe src="https://live.mdnplay.dev/en-US/docs/Web/HTML/Element/section/runner.html?id=before" ... >

to:

<iframe src="../../../../../../live.mdnplay.dev/en-US/docs/Web/HTML/Element/section/runner.html%3Fid%3Dbefore" ... >

The code on the page checks for the id query param, and is unable to find it.
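To illustrate why the parameter disappears, here is a minimal sketch using Python's standard URL parser on the two `src` values quoted above. The helper name `query_params` is made up for this example, and the page's real lookup happens in client-side JS, so this only mirrors the parsing behaviour, not the actual MDN code.

```python
from urllib.parse import parse_qs, urlsplit

# The two iframe src values quoted in this comment.
original = "https://live.mdnplay.dev/en-US/docs/Web/HTML/Element/section/runner.html?id=before"
rewritten = (
    "../../../../../../live.mdnplay.dev/en-US/docs/Web/HTML/Element/section/"
    "runner.html%3Fid%3Dbefore"
)


def query_params(url: str) -> dict[str, list[str]]:
    """Return the query parameters a URL parser sees in `url` (hypothetical helper)."""
    return parse_qs(urlsplit(url).query)


print(query_params(original))   # {'id': ['before']}
print(query_params(rewritten))  # {} because %3F is part of the path, not a separator
```

Once `?` is encoded as `%3F`, the whole string is a path and there is no query component left to read.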

benoit74 commented 1 week ago

Unfortunately, the %-encoding is not unneeded. The URL is indeed over-encoded, but this is mostly mandatory, or at least it is the best tradeoff found so far (see https://github.com/openzim/warc2zim/issues/206 for lengthy discussions on this).

I agree this is what causes the problem here, and this issue is then indeed only a warc2zim issue.

Thank you for the analysis!

I don't know yet how we are supposed to handle this kind of situation ... but this is a Kiwix team problem at least ^^

I remember that we also faced the same kind of problem somewhere else, but do not recall where.

Not that common anyway, since query parameters are meant to be interpreted by the web server more than by client-side JS.

benoit74 commented 1 week ago

One idea: add a regex of URLs for which we want to ignore the query parameter.

To be tested, but I think that in most cases (at least it is the case here for MDN), the server simply ignores the query parameter, which is only used client-side.

We can hence:

benoit74 commented 1 week ago

Here is a test ZIM demoing what I proposed above: tests_eng_mdn-page_2024-11.zim.zip (remove .zip extension added to please GitHub)

Note that the trick to rewrite the URL without escaping the ? had to be done in wombatSetup.js because the URL is built dynamically JS-side ... which adds a bit of a problem ^^
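As a rough illustration of the idea (not the actual warc2zim or wombatSetup.js code), leaving `?` unescaped for a configured set of URLs could look like the Python sketch below. The names `KEEP_QUERY_PATTERNS` and `encode_zim_path` and the example regex are hypothetical. It also assumes the matching entry is stored in the ZIM under the path without its query string, which this thread suggests but does not spell out.

```python
import re
from urllib.parse import quote

# Hypothetical configuration: URLs whose query string is only read client-side.
KEEP_QUERY_PATTERNS = [re.compile(r"^live\.mdnplay\.dev/.*/runner\.html\?")]


def encode_zim_path(path: str) -> str:
    """Percent-encode a path for use in a rewritten link (illustrative only).

    By default `?` and `=` are encoded as well, which is the over-encoding
    discussed earlier in this thread. For paths matching a configured
    pattern, the query string is kept as a real query string instead.
    """
    if any(pattern.search(path) for pattern in KEEP_QUERY_PATTERNS):
        base, _, query = path.partition("?")
        return quote(base, safe="/") + "?" + quote(query, safe="=&")
    return quote(path, safe="/")


print(encode_zim_path("live.mdnplay.dev/en-US/docs/Web/HTML/Element/section/runner.html?id=before"))
print(encode_zim_path("example.org/page?x=1"))
```

This only covers links rewritten statically. URLs built dynamically in JS would need the same pattern list at runtime, which is the awkward part mentioned above.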

At least the ZIM works well in kiwix-serve, mostly OK in kiwix-apple (there is a very different problem, see https://github.com/kiwix/kiwix-apple/issues/1027), mostly OK in kiwix-android (there is a very different problem, see https://github.com/kiwix/kiwix-android/issues/4084), OK on Kiwix PWA (on macOS Firefox and Android Firefox) and Kiwix JS (on macOS Firefox).

That being said, it is going to be pretty ugly to integrate these changes inside the codebase in a generic manner (especially since we need to pass this information to JS at runtime). All thoughts are welcome ^^ (and in the meantime, I will re-create a WARC of https://farm.openzim.org/recipes/developer.mozilla.org_en and build the ZIM manually with my hacks if that is OK with you, at least to play with in dev).