kiwix / kiwix-js

Fully portable & lightweight ZIM reader in Javascript
https://www.kiwix.org/
GNU General Public License v3.0

Legacy Wikipedia 0.8 ZIM from 2010 only works in jQuery mode #960

Open Jaifroid opened 1 year ago

Jaifroid commented 1 year ago

The ZIM in question is wikipedia_en_wp1-0.8_orig_2010-12.zim. Although this is a legacy ZIM, it is available from download.kiwix.org (archive directory) as a historical corpus, and it is linked from the online Wikipedia 0.8 home page. Maintaining the ability to read it would therefore seem important, or, at the very least, instructing the user on how to display this ZIM correctly in the reader.

Currently it only displays properly in jQuery mode on both Firefox and Chromium, and in fact it is only possible to navigate in the ZIM at all (other than by searching for an article) in jQuery mode. Screenshot below shows typical display of a page in SW mode on left (all CSS broken, all images broken), and jQuery mode on the right (all images and CSS display correctly). This is in the Firefox extension. The only "problem" in jQuery mode is that the active content warning is displayed (which should be fixed).

[Screenshot: the same page in SW mode (left) and jQuery mode (right)]

In SW mode, clicking any link in an article shows "404 Not Found".

Kiwix Desktop displays content from this ZIM correctly, and navigation functions fine.

Jaifroid commented 1 year ago

See https://github.com/openzim/mwoffliner/issues/1731 on why using a newer scrape of this ZIM is not possible for historical / corpus work (in sum, newer scrapes show the current, 2023, content for each article instead of the original content).

Jaifroid commented 1 year ago

It fails in SW mode because all of the hyperlinks in this legacy ZIM are given as root-relative URLs, i.e. in the form /A/Some_page.html or /I/Some_image.jpg. In jQuery mode, our regular expression for matching ZIM links ignores any forward slash at the start of a ZIM link that it recognizes, and this works across the board. In ServiceWorker mode, however, we are scrupulous in respecting the coding in the ZIM, so these links are interpreted as-is. This of course puts the resources outside the scope of the Service Worker, so they are not caught and processed by it.
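
To illustrate the resolution problem described above, here is a minimal sketch using the standard `URL` constructor. The host, scope and archive path are hypothetical placeholders, not the real kiwix-js values, and the `resolveAndCheck` helper exists only for this example.

```javascript
// ASSUMPTION: hypothetical URLs for illustration only; the real scope and
// article paths depend on where the app is hosted and how the ZIM is exposed.
const swScope = 'https://example.org/kiwix-js/www/';
const articleUrl = swScope + 'my_archive.zim/A/Main_page.html';

// Resolve an href against the current article the way a browser would, then
// check whether the result still sits under the scope prefix.
function resolveAndCheck(href) {
  const resolved = new URL(href, articleUrl).href;
  return { resolved, inScope: resolved.startsWith(swScope) };
}

// A root-relative link, as found in the legacy ZIM, jumps to the site root.
const rootRelative = resolveAndCheck('/A/Some_page.html');
// A relative link stays inside the archive's path.
const relative = resolveAndCheck('../A/Some_page.html');

console.log(rootRelative);
console.log(relative);
```

In this sketch the root-relative form resolves to `https://example.org/A/Some_page.html`, which no longer starts with the scope prefix, while the relative form stays under it.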

We need a safe way to recognize this situation and offer the user a possibility for reading such ZIMs. I can think of two ways:

  1. Show a banner suggesting that the user try to read the ZIM in jQuery mode;
  2. Offer to read the ZIM in SW mode by removing or ignoring the initial forward slash.

I assume, as we still have jQuery mode, and this works well with such ZIMs, that 1. would be the most acceptable solution for now. But 2. might be necessary in a pure Service-Worker-mode future.
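
As a rough idea of what option 2 could look like, here is a sketch that drops the initial forward slash and resolves the rest against the root of the archive. The regular expression, the `resolveZimLink` function and the URLs are all assumptions made for this example and are not taken from the kiwix-js code.

```javascript
// ASSUMPTION: a legacy link is a leading slash, a one-character ZIM namespace
// (such as A or I), then another slash. This pattern is hypothetical.
const LEGACY_ZIM_LINK = /^\/[-A-Za-z]\/./;

// Resolve an href found in an article. Legacy root-relative ZIM links lose
// their leading slash and are resolved against the archive root (which must
// end with a slash); everything else is resolved normally.
function resolveZimLink(href, articleUrl, zimRootUrl) {
  if (LEGACY_ZIM_LINK.test(href)) {
    return new URL(href.slice(1), zimRootUrl).href;
  }
  return new URL(href, articleUrl).href;
}

// Hypothetical locations, for illustration only.
const zimRoot = 'https://example.org/kiwix-js/www/my_archive.zim/';
const article = zimRoot + 'A/Main_page.html';

console.log(resolveZimLink('/A/Some_page.html', article, zimRoot));
console.log(resolveZimLink('/I/Some_image.jpg', article, zimRoot));
console.log(resolveZimLink('../A/Other_page.html', article, zimRoot));
```

Resolving against the archive root rather than the current article matters here: simply stripping the slash and leaving `A/Some_page.html` relative to an article already inside `A/` would produce a doubled `A/A/` path.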

kelson42 commented 1 year ago

@Jaifroid This file is really old and does not really respect the ZIM specifications anymore (because of the absolute links). It should not be a problem if it is not supported. Where exactly did you find this ZIM to download? It should not be part of the Kiwix catalogue!

Jaifroid commented 1 year ago

@kelson42 It is linked for historical record purposes (I presume) from https://en.wikipedia.org/wiki/Wikipedia:Version_0.8/downloads. I think a historian would praise your foresight in keeping these early archives, going back to Wikipedia 0.5 from 2007. I would think very carefully before removing access to them (and I hope they're backed up!).

Jaifroid commented 1 year ago

Note that the current scrape of Wikipedia 0.8 is not working as expected (I think), as I reported in https://github.com/openzim/mwoffliner/issues/1731. It is providing 2023 versions of the pages instead of the original content. That makes it all the more important that the original archives are kept (and made available) IMHO.