openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

Incomplete module list fetched #1391

Open MananJethwani opened 3 years ago

MananJethwani commented 3 years ago

Right now we fetch the module list through api.php, using a URL built like this: ${this.apiUrl.href}action=parse&format=json&prop=${encodeURI('modules|jsconfigvars|headhtml')}&page=<articleId>

The list of modules is available in two places: in a list named modules, and in the head, inside a script, as RLPAGEMODULES (startup.js uses RLPAGEMODULES to get the list of modules to load for that page in the online version).
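For reference, a minimal sketch of how such a URL could be assembled. This is not mwoffliner's actual code: `buildParseUrl` is a made-up name, and the sketch assumes the API base URL carries no query string yet.

```javascript
// Build an action=parse URL that asks for the module list, the JS config
// variables and the head HTML of one article.
// `buildParseUrl` is a hypothetical helper, not part of mwoffliner.
function buildParseUrl(apiBase, articleId) {
  const url = new URL(apiBase);
  url.searchParams.set('action', 'parse');
  url.searchParams.set('format', 'json');
  // URLSearchParams percent-encodes the pipe separators as %7C.
  url.searchParams.set('prop', 'modules|jsconfigvars|headhtml');
  url.searchParams.set('page', articleId);
  return url.href;
}

console.log(buildParseUrl('https://en.wikipedia.org/w/api.php', 'Google'));
```

Letting `URLSearchParams` do the encoding avoids hand-escaping the `|` separators and any special characters in the article title.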

But if we take a page like https://en.wikipedia.org/wiki/Google and look at the modules in its HTML source, we see 26 modules listed: [Screenshot from 2021-01-30 17-29-38]

But the api.php response for https://en.wikipedia.org/w/api.php?action=parse&format=json&prop=modules%7Cjsconfigvars%7Cheadhtml&page=google lists only 2 modules:

[Screenshot from 2021-01-30 17-38-36]
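To compare the two lists while debugging, something along these lines could be used. It is only a diagnostic sketch: the function names are made up, the inputs are invented for illustration, and it assumes RLPAGEMODULES appears in the head as a plain JSON array literal, which is an inline script format and not a stable interface.

```javascript
// Pull the RLPAGEMODULES array out of a head HTML string.
// Assumption: the array is serialized as plain JSON (double-quoted strings).
function extractPageModules(headHtml) {
  const match = /RLPAGEMODULES\s*=\s*(\[[^\]]*\])/.exec(headHtml);
  return match ? JSON.parse(match[1]) : [];
}

// Names present in the page head but absent from the API's module list.
function missingFromApi(apiModules, pageModules) {
  const listed = new Set(apiModules);
  return pageModules.filter((name) => !listed.has(name));
}

// Hypothetical inputs, for illustration only; not real API output.
const headHtml =
  '<script>RLPAGEMODULES=["site","mediawiki.page.ready","ext.example.banner"];</script>';
const apiModules = ['site', 'mediawiki.page.ready'];

console.log(missingFromApi(apiModules, extractPageModules(headHtml)));
```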

kelson42 commented 3 years ago

@Krinkle Our offline version of ResourceLoader seems to fail (at least partly) because of this problem. Do you know of an API call that returns the complete list of ResourceLoader modules (as in RLPAGEMODULES), or should we parse the HTML directly to get that list?

Krinkle commented 3 years ago

@kelson42

I think your current method is what I'd recommend as well, and it should provide for proper rendering of the content. I'd advise against parsing the HTML for this, as it would pull in unneeded and incompatible styles and, being an unstable and unsupported method, it might break without notice or a viable alternative.

If things seem to work okay today, then there is probably nothing you need to change. Are there pages with styles that you know to be missing, or pages with incomplete functionality?

Having said that, I'm happy to explain the differences. They fall into three categories:

1) Skin for page

Some modules are added by the skin, not controlled by the wikitext parser or content handler. These come through in the API with the useskin parameter. They are for things like the sidebar, search, and logged-in functionality such as watchstar interaction, Echo notifications, etc. They only work with the exact HTML of the Vector skin, as retrieved from URLs like https://en.wikipedia.org/wiki/The_Example, and rendered without modification.

2) Skin for user on page

The skin can add a couple more modules depending on the specific user, for example gadgets you have enabled, or interactive functionality limited to a particular user right, such as a quick way to mark edits as patrolled on a diff without reloading the page.

When I view the API query with useskin=vector in my main browser, I get:

["ext.cite.ux-enhancements", "ext.scribunto.logs", "site", "mediawiki.page.ready", "mediawiki.toc", "skins.vector.legacy.js", "mediawiki.page.watch.ajax"]

When viewing the same URL in a private browsing window, I get a similar array, but without "mediawiki.page.watch.ajax" since this one is limited to logged-in users.

3) OutputPage extensions

By far most of the modules in the screenshot by @MananJethwani are actually not from the skin but from site-wide extensions, such as CentralNotice banners, reader surveys, and event/performance instrumentation. These, too, are not expected to work offline or without a skin.

If you do find a needed module or other styles to be missing on certain pages, it is likely a mistake in the server code and would be easy to fix within a day or two if reported to Phabricator! 🙂

kelson42 commented 3 years ago

@Krinkle Thank you for the extended explanation. The overall situation is that we have at least a dozen bugs related to our buggy offline emulation of ResourceLoader. This is not a new situation, and for sure 2/3 developers have lost hair over this problem in the past.

Currently @MananJethwani is trying to make progress on this, and I have recommended that he split this big problem into smaller ones. He wrote this ticket because we basically lack a reliable method to know which modules to load. I am not talking about skin/gadget things, which are non-mandatory to me. I am talking about JS/CSS resources which are important for proper rendering of the content.

This ticket was opened after trying to properly support MathJax (see https://github.com/openzim/mwoffliner/issues/1371) in the offline version of proofwiki.org. These pages need the module ext.mathJax, but this module is not listed by the API call https://proofwiki.org/w/api.php?action=parse&format=json&useskin=vector&prop=modules|jsconfigvars|headhtml&page=Definition%3AThomae%20Function. IMO it should be listed: it has nothing to do with the skin, yet even with useskin=vector it is not listed. IMO it should also work perfectly offline (at least I don't see why it would not).
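A check like the following could make such reports easy to reproduce. It is a sketch with made-up helper names; it assumes the caller has already fetched the JSON and passes in the object found under the `parse` key, where prop=modules returns the `modules`, `modulescripts` and `modulestyles` lists. The sample object is invented, not real proofwiki.org output.

```javascript
// Collect every module name from the three lists returned for prop=modules.
function listedModules(parseResult) {
  const { modules = [], modulescripts = [], modulestyles = [] } = parseResult;
  return new Set([...modules, ...modulescripts, ...modulestyles]);
}

// True if the named module appears in any of the three lists.
function isModuleListed(parseResult, name) {
  return listedModules(parseResult).has(name);
}

// Hypothetical, trimmed-down response body for illustration only.
const sample = {
  modules: ['site', 'mediawiki.page.ready'],
  modulestyles: ['skins.vector.styles'],
};

console.log(isModuleListed(sample, 'ext.mathJax'));
console.log(isModuleListed(sample, 'site'));
```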

Any idea? Should I open a Phabricator ticket?

Krinkle commented 3 years ago

@kelson42 Yes please, that seems like a genuine bug indeed. Sorry about that 🙂

MananJethwani commented 3 years ago

@Krinkle According to https://www.mediawiki.org/wiki/ResourceLoader/Developing_with_ResourceLoader#Base_environment, the base environment requires just 2 modules (mediawiki.base and jquery) plus the startup module. But when trying to run site.js, it says mw.util is not defined; mw.util seems to be defined in mediawiki.util. Since site.js is loaded on each page, shouldn't that module be included as a base module? I couldn't find it in the module list even when fetching with useskin=vector.

I also found other modules that mw code is spread across and that aren't listed in the module list for the pages using them. For example, for the page on wiki in Wiktionary, the module list is this, but the ext.gadget.defaultVisiblityToggles module it uses requires base modules like 'jquery.cookie', 'mediawiki.storage', 'mediawiki.cookie', and it also requires the module 'ext.gadget.VisibilityToggles' for the toggle button functionality. None of these are mentioned in the module list above. So my question is: should they be present on the ResourceLoader website or be available in api.php, or am I going in the wrong direction?

I also noticed that some sites which use an older version of MediaWiki or a different skin, like AOPS, still use mediawiki and jquery as base modules instead of mediawiki.base, as you can see in startup.js.

So, can I know exactly after which version these changes were introduced, or is there an automatic way for us to know which base modules to load on these pages, so that we get the correct idea of what to scrape for the site to work fine?

Also, let me know if I need to open a Phabricator ticket regarding these issues.

Krinkle commented 3 years ago

@MananJethwani

When developers write JavaScript modules for MediaWiki, the "base modules" are the modules that they cannot (and need not) declare dependencies on, because they are the base environment. This is not about what consumers of wiki content need to do, and is not something you have to worry about. The internal names of these modules have indeed changed over the years, but it is not their name that matters. What matters is the real functionality they provide as part of the default and automatic base environment, which has not changed. There is nothing you need to do to keep this in sync, as it is all handled by startup.

mw.util is indeed not part of the base environment, and any code using mw.util must declare a dependency on it so that ResourceLoader will load it at the right time. For extensions, this happens in the extensions/MyExtension/extension.json file when they register their module. For gadgets, this happens on the MediaWiki:Gadgets-definition page (docs). For user scripts and site scripts, there is a centrally defined module in MediaWiki (user, and site) and these do not have native dependencies. Instead, site scripts can use mw.loader.using(…).then(…) to load dependencies automatically when they need them.

If the ru.wikipedia.org community has a site script that fails due to undefined mw.util, it is most likely that their script file has forgotten to call mw.loader.using(). See en.wikipedia.org Common.js for an example.
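As a rough illustration of that pattern (not copied from any wiki's Common.js): `runSiteScript` is a made-up wrapper, and `mw` is passed in as a parameter only so the sketch can be exercised outside a wiki page. In a real site script the global `mw` object would be used directly.

```javascript
// Hypothetical site-script body. On a wiki, `mw` is the global object
// provided by the startup module.
function runSiteScript(mw) {
  // Declare the dependency at the point of use, and touch mw.util only
  // inside the callback, once mediawiki.util has been loaded.
  return mw.loader.using(['mediawiki.util']).then(function () {
    return mw.util.getUrl('Main Page');
  });
}
```

On a wiki page this would be invoked as `runSiteScript(mw)`. Calling `mw.util.getUrl` outside the callback is the kind of code that fails with an undefined mw.util.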

> I also found some other modules in which code for mw is distributed and they aren't listed in module list for those pages like for page on wiki in Wiktionary, the module list is this but the ext.gadget.defaultVisiblityToggles module it uses requires base modules like 'jquery.cookie', 'mediawiki.storage', 'mediawiki.cookie' and it also require module 'ext.gadget.VisibilityToggles' for toggle button functionality which are not mentioned in module list given above. so my question is should they be present on resource loader website or be available in api.php or I am going in the wrong direction?

(Clarification: jquery.cookie, mediawiki.storage, mediawiki.cookie are "dependencies", not "base modules".)

Gadgets are loaded in a very late output stage when serving the skinned response to a viewer on the canonical website. They are not associated with individual pages or with the wiki content. So, when downloading or parsing content via the API, they are not part of the metadata for that content.

However, the API query you use here has opted-in to "skinned" parsing, via the headhtml and useskin options. Which means you are accepting the responsibility that some modules may be coded specifically to the Vector skin and its exact current HTML layout around the content (e.g. header, sidebar, footer, etc.). In that special "skinned" mode, these additional stylesheets and modules should indeed also be included.

The reason that they are not included today is because the Gadgets extension (and a few others) are using an old hook BeforePageDisplay that is very specific to live HTML web responses and should not be run in the API context. Bug T161278 exists to improve this, either by changing the Gadgets extension to use a better hook if we have one today, or by adding one to MW Core for the subset of actions (such as queueing modules) that are safe for all consumers of the content (incl. API).

Note this issue is specific to Gadgets. If you encounter other missing modules, those likely have other causes or reasons. It is not a single issue, it just happens to cause similar effects for you. Sorry about that!

MananJethwani commented 3 years ago

@kelson42 Then, as per our discussion on Slack, I will start using startup.js to load the base modules, as they differ between versions of MediaWiki.

MananJethwani commented 3 years ago

Also, if possible, could you please open upstream tickets for all the modules that are not listed in the module list but should have been? These are: jquery.cookie, mediawiki.storage, mediawiki.cookie, mw.util, ext.gadget.defaultVisiblityToggles, ext.gadget.VisibilityToggles.

Krinkle commented 3 years ago

There is no problem with jquery.cookie, mediawiki.storage, mediawiki.cookie, mw.util as far as I can tell. These are indirect dependencies that will be taken care of automatically if you load something that needs them. They are never queued directly on a page, not for MW, not the API, and not for Kiwix. These will not appear in RLPAGEMODULES either.
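To illustrate what "taken care of automatically" means, here is a deliberately simplified sketch of dependency resolution. It is not ResourceLoader's actual implementation: `resolveDependencies` and the registry object are made up, and the dependency edges are only the ones mentioned earlier in this thread.

```javascript
// Return the requested modules plus everything they depend on, with each
// dependency placed before the module that needs it.
function resolveDependencies(requested, registry) {
  const ordered = [];
  const seen = new Set();
  function visit(name) {
    if (seen.has(name)) return; // already handled, also guards against cycles
    seen.add(name);
    for (const dep of registry[name] || []) visit(dep);
    ordered.push(name);
  }
  for (const name of requested) visit(name);
  return ordered;
}

// Hypothetical registry; the real one is supplied by the startup module.
const registry = {
  'ext.gadget.defaultVisiblityToggles': [
    'jquery.cookie',
    'mediawiki.storage',
    'mediawiki.cookie',
    'ext.gadget.VisibilityToggles',
  ],
};

console.log(resolveDependencies(['ext.gadget.defaultVisiblityToggles'], registry));
```

Only the gadget itself is requested; the four dependencies are pulled in by the resolver, which is why they never need to be queued directly on a page.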

There are many modules not related to any given page that the canonical website will load on some pages but that API consumers must not. These two lists of modules are not expected to be equal and not meant to be. There may be modules for things like guided tours, new-signup campaigns, A/B testing, eventlogging, interactive skin layout features for the sidebar or search suggestions, etc.

However, if a module related and essential to the page content is loaded on the canonical site and not listed in the API, that is a bug. I believe so far the only such case we've found is with gadgets. Specifically, that Wiktionary uses gadgets for essential functionality without a fallback. This will already be causing problems even for users of the Wiktionary website itself who are on older browsers or slower connections. In my opinion that is a bad gadget. But there is indeed no reason not to allow external users of the content to try to load these gadgets as well, and they are indeed not intentionally excluded. Bug T161278 tracks this issue.

> I will start using startup.js to load the base modules […] ext.gadget.defaultVisiblityToggles

I'm not sure I understand. This would only be for JavaScript, not CSS. I wasn't aware of the Kiwix viewer loading JavaScript modules as well. Especially for gadgets, that usually seems like something you'd not want. I'd expect a number of those to be quite specific to the canonical site and domain, and to either cause defects or not work when loaded into a different context. (Note these are developed and written by users of that site, not part of the software distributed via WMF.) E.g. they may be for loading additional content from an API, or to help make edits and such. It sounds to me like if you go down this path, one might as well <iframe> the canonical online website in its entirety. I realize I know very little and please be bold in adjusting my understanding!

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 commented 1 year ago

AFAIK, we are here fully impacted by https://phabricator.wikimedia.org/T161278
