Project Gutenberg ZIMs are barely accessible for clients that do not support JavaScript in the ZIM

kelson42 commented 5 years ago

From @Jaifroid on November 19, 2018 21:13

I am not sure if mwoffliner is used to produce Project Gutenberg ZIMs, or some other scraping system, so feel free to move this issue if it is not related to mwoffliner.

Recent Project Gutenberg ZIMs now come with a proprietary interface that requires the ability to execute JavaScript in the ZIM to access any of the texts in a meaningful way. Although texts are still accessible by title in the ZIM index, not enough information is provided to recognize a text by title unless it is very famous ("Don Quijote" is OK...). Books should be listed in the ZIM index by author surname. An entry should look something like:

Cervantes Saavedra, Miguel de - Don Quijote

Currently all we have is Don Quijote. If the text is Novelas y cuentos, there is no way to tell who it's by unless I open it. This is the case at least for gutenberg_es_all_2018-10.zim.
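
To make the proposed entry format concrete, here is a minimal Python sketch of how such an index title could be built. The `Book` record and its field names are assumptions made for illustration; they are not taken from the scraper's actual schema, and this is not how the Gutenberg ZIMs are currently produced.

```python
from dataclasses import dataclass


@dataclass
class Book:
    # Hypothetical record: these field names are assumptions for illustration.
    title: str
    author_first_names: str
    author_last_name: str


def index_title(book: Book) -> str:
    """Build a 'Surname, First names - Title' entry for the ZIM title index."""
    last = book.author_last_name.strip()
    first = book.author_first_names.strip()
    if last and first:
        author = f"{last}, {first}"
    else:
        # Fall back to whichever part of the name is available.
        author = last or first
    # Books with no known author keep the bare title.
    return f"{author} - {book.title}" if author else book.title


quijote = Book("Don Quijote", "Miguel de", "Cervantes Saavedra")
print(index_title(quijote))
```

Because the surname comes first, sorting these entries alphabetically would also group them by author surname.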

Authors are listed in the index of this ZIM, but alphabetically by first name, which is not very useful. To find "Unamuno" I have to know his first name was "Miguel". However, there is no corresponding author page for Miguel de Unamuno in the ZIM, and the client tries to open a "page" that has to be rendered dynamically in JavaScript, which of course fails in a client that cannot run JavaScript in the ZIM.

So, is it possible to have a more meaningful and usable ZIM index for these files? Ideally, we should also have noscript versions of author pages rather than relying on dynamic construction of them.

It would be a shame to lock out users on low-end devices. Currently, no Kiwix JS version running as a browser extension (Chrome or Firefox), nor Kiwix JS UWP, can run JS in the ZIM. Kiwix JS supports JS in the ZIM only in Service Worker mode, and only for clients that can run from localhost or another server (not from the file:// protocol), so support is currently very restricted. It also looks very difficult to support JS in the ZIM with mainstream file:// protocol access in Kiwix JS: JavaScript that constructs dynamic pages would need to be patched somehow to hook into the extraction engine, and most (all mainstream) browsers do not support XMLHttpRequest when running from the file:// protocol.

Copied from original issue: openzim/mwoffliner#445

kelson42 commented 5 years ago

@Jaifroid What exactly does "ability to execute JavaScript in the ZIM" mean? You talk about the "ZIM index", but what is that exactly (the welcome page listing the books, the URL index, the title index)? I have to admit that I do not really understand your ticket. Can you please open one ticket per problem? It looks like your ticket talks about two problems (one is about JS and another one is about the URLs of the articles)?

Jaifroid commented 5 years ago

@kelson42 Sorry if it's poorly expressed, but I think there is one issue and some speculative solutions to the issue:

The issue: on some clients it is difficult to access the content of new Gutenberg ZIMs. The reason is that such clients cannot run the JavaScript that is embedded in the ZIM, i.e. the JS that provides the proprietary User Interface on the landing page and elsewhere;
One proposed solution (maybe it's not technically feasible) is to provide more information in the directory entries;
Another solution might be to provide noscript sections in the author pages - they should contain static versions of the links to titles by a given author, and not rely on JavaScript only to access those titles.

The proposed solutions are just speculation about how the problem might be worked around, but are not part of the core issue.
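
As a rough illustration of the noscript idea, the sketch below generates a static `<noscript>` fragment that lists links to an author's titles. The function name, the `(title, url)` input shape, and the example file names are all hypothetical; this is not the scraper's actual code or output.

```python
from html import escape


def noscript_author_block(author: str, books: list[tuple[str, str]]) -> str:
    """Return a <noscript> fragment with static links to an author's books.

    `books` holds (title, relative_url) pairs; both are hypothetical inputs.
    """
    items = "\n".join(
        f'    <li><a href="{escape(url, quote=True)}">{escape(title)}</a></li>'
        for title, url in sorted(books)  # sort by title for a stable listing
    )
    return (
        "<noscript>\n"
        f"  <h2>{escape(author)}</h2>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</noscript>"
    )


print(noscript_author_block(
    "Unamuno, Miguel de",
    [("Niebla", "Niebla.html"), ("Abel Sánchez", "Abel_Sanchez.html")],
))
```

A client that cannot run the JS in the ZIM would render the fragment's links directly, while JS-capable clients would ignore the `<noscript>` content and use the dynamic interface.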

kelson42 commented 5 years ago

@Jaifroid @mossroy I still do not understand how that JavaScript is different from, for example, the one in the Wikipedia ZIM files. Can you explain it technically?

Jaifroid commented 5 years ago

I'll leave technical explanations to @mossroy.

Non-technically, in a nutshell, we do not run the JS in Wikimedia articles. But it doesn't matter, as the contents are perfectly accessible without doing that (the only JS in the articles opens and closes headings).

However, in the Gutenberg ZIMs, important pages (author pages) construct their content dynamically. It makes the ZIM inaccessible if we can't do the same, for the very simple reason that the author's surname is not in the title of each book page, so there's no way to search for it. See my answer in https://github.com/openzim/mwoffliner/issues/449 for more details about the difficulty of running JS in the ZIM.

mossroy commented 5 years ago

@kelson42 I just posted some explanations on javascript support in https://github.com/openzim/mwoffliner/issues/449#issuecomment-442471173

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

Jaifroid commented 1 year ago

I guess this issue could be closed in favour of #145. It's not really the same issue, but I suppose we're now committed to dynamic User Interfaces for ZIM archives with no static fallback. While I think it would be good to have a basic, static UI for accessing ZIM content, I guess that's not realistic now. So I recommend closing as won't fix / not planned and focusing on #145 instead.

rgaudin commented 1 year ago

I share this conclusion. I'd prefer more scrapers to work without JS, but it's hardly realistic. Some of them are simply dependent on JS, and others, like gutenberg, are built around JS to bring in valuable features like author/title search. Having a static fallback would mean extra work, which can't be justified without supporting data (that we don't have).

openzim / gutenberg

Project Gutenberg ZIMs are barely accessible for clients that do not support JavaScript in the ZIM #75