openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

What is the role of `res/inserted_style.css`? #1918

Open kelson42 opened 1 year ago

kelson42 commented 1 year ago

Why do we need such a static file, which has not been updated in ages, in our code base?

Jaifroid commented 1 year ago

In the currently published Wikipedia ZIMs (or at least the 2023-10 themed versions), there is -/inserted_style.css. It certainly used to be the case that inserted_style_mobile.css injected the mobile styles needed to emulate the Wikipedia mobile style of the time, but it's not in current versions.

Of late there has been a real proliferation of CSS (and JS) files in Wikipedia ZIMs, some of them added by MWOffliner, some present in the original.

VadimKovalenkoSNF commented 1 year ago

This static file has been added in this commit. It is responsible for many CSS transformations such as aligning text content and images, paddings, list items, etc. Removing it will affect the output of any kind of Desktop renderer (WikimediaDesktop, VisualEditor, MediawikiRESTApi).

kelson42 commented 1 year ago

@VadimKovalenkoSNF Why is loading/scraping the dependencies not good enough? It should be!?

VadimKovalenkoSNF commented 1 year ago

The current dependencies miss CSS parts that are present in inserted_style.css. There is a comment in there with this link to request CSS tailored for the Minerva mobile skin. Shall I investigate further and check whether it's possible to get the exact same output from Downloader.getModuleDependencies? And what about the other files in the res folder, such as content.parsoid.css and mobile_main_page.css?

VadimKovalenkoSNF commented 1 year ago

Update: I compared the styles in inserted_style.css with those from the link in the comment there, and found that they differ. I assume that inserted_style.css is outdated; we probably don't need it at all and can adjust the CSS with custom styles in res/style.css.

Jaifroid commented 1 year ago

It would be great to simplify / amalgamate. AFAIK, there are five different skins offered by Wikipedia: Vector legacy (2010), Vector (2022), MinervaNeue, MonoBook and Timeless. I suppose, from the file -/mw/skins.minerva.base.reset|skins.minerva.content.styles|ext.cite.style|site.styles|mobile.app.pagestyles.android|mediawiki.page.gallery.styles|mediawiki.skinning.content.parsoid.css, that we're using an older Minerva mobile style.

If some of the static stylesheets could be amalgamated to a single one, it would begin to allow the possibility of easily switching between styles.

kelson42 commented 1 year ago

@VadimKovalenkoSNF Please remove it then, and do what is necessary to avoid creating regressions.

VadimKovalenkoSNF commented 1 year ago

@kelson42, I've noticed that inserted_style.css consists of thousands of CSS rules. It is not possible to quickly separate the rules that mwoffliner needs from those it does not. And does that even make sense in the end? All articles loaded by the Desktop renderer will receive the necessary modules and also have the styles from inserted_style.css applied. I mentioned that removing inserted_style.css will affect the Desktop output, but is this crucial, given that the CSS file was created long before Wikimedia Desktop (e.g. page/html)?

kelson42 commented 1 year ago

@VadimKovalenkoSNF if it affects the rendering, then I guess mwoffliner does not scrape all the CSS dependencies... right?

VadimKovalenkoSNF commented 1 year ago

@kelson42 It affects rendering because inserted_style.css is a legacy set of CSS rules that is applied to the output regardless of the other modules loaded for a WikimediaDesktop or MediawikiRESTApi (Desktop) article.

Let's refer to this doc - https://www.mediawiki.org/wiki/API:Styling_content According to it, the RESTBase content API handles modules automatically. Check the example from the doc: https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein

Open devtools and find load.php in the <link> tag. There you will see a request to ResourceLoader with the list of modules specific to this article; here it is.
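The devtools step above could be automated roughly as follows. This is a minimal sketch, not mwoffliner code: the names `expand_module_names`, `StylesheetModuleParser` and `SAMPLE_HTML` are hypothetical, the sample `<link>` tag is made up for illustration, and the expansion of ResourceLoader's packed module names (`prefix.a,b` meaning `prefix.a` and `prefix.b`) reflects my understanding of the load.php URL format.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


def expand_module_names(packed: str) -> list[str]:
    """Expand a packed load.php 'modules' value into full module names.

    Groups are separated by '|'. Inside a group, 'ext.phonos.icons,styles'
    is assumed to mean 'ext.phonos.icons' and 'ext.phonos.styles'.
    """
    names: list[str] = []
    for group in packed.split("|"):
        if not group:
            continue
        if "," not in group:
            names.append(group)
            continue
        prefix, dot, suffixes = group.rpartition(".")
        if not dot:
            # No shared prefix: the group is a plain comma-separated list.
            names.extend(group.split(","))
        else:
            names.extend(f"{prefix}.{suffix}" for suffix in suffixes.split(","))
    return names


class StylesheetModuleParser(HTMLParser):
    """Collect module names from <link rel="stylesheet" href="...load.php...">."""

    def __init__(self) -> None:
        super().__init__()
        self.modules: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        rel = (attr.get("rel") or "").split()
        if "stylesheet" not in rel or "load.php" not in href:
            return
        # parse_qs percent-decodes the value, so %7C becomes '|'.
        query = parse_qs(urlsplit(href).query)
        for packed in query.get("modules", []):
            self.modules.extend(expand_module_names(packed))


# Made-up <link> tag in the shape served by page/html (illustration only).
SAMPLE_HTML = (
    '<link rel="stylesheet" '
    'href="/w/load.php?lang=en&amp;modules=ext.cite.styles%7C'
    'ext.phonos.icons%2Cstyles%7Csite.styles&amp;only=styles&amp;skin=vector"/>'
)

parser = StylesheetModuleParser()
parser.feed(SAMPLE_HTML)
print(parser.modules)
```

Reading the list from the HTML itself keeps the module list tied to the exact markup the renderer received, which is the point being argued here.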

By contrast, mwoffliner gets the modules through a separate request (we remove any links in the abstract renderer). For the article above, this request will look like this.
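For readers without the original link, a separate module request of this kind can be sketched with the Action API's `action=parse` and `prop=modules`. This is an illustration under assumptions, not the exact request mwoffliner sends: the helper names and the trimmed `SAMPLE_RESPONSE` are hypothetical, and the parameter set is the minimum I would expect to work.

```python
from urllib.parse import urlencode


def build_parse_modules_url(api_url: str, title: str) -> str:
    """Build an action=parse URL asking only for the page's modules."""
    params = {
        "action": "parse",
        "format": "json",
        "formatversion": "2",
        "prop": "modules",
        "page": title,
    }
    return f"{api_url}?{urlencode(params)}"


def modules_from_parse_response(data: dict) -> list[str]:
    """Merge the module lists of a parse response, keeping first-seen order."""
    parse = data.get("parse", {})
    names: list[str] = []
    for key in ("modules", "modulescripts", "modulestyles"):
        for name in parse.get(key, []):
            if name not in names:
                names.append(name)
    return names


# Hypothetical, heavily trimmed response used only to exercise the helper.
SAMPLE_RESPONSE = {
    "parse": {
        "title": "Albert Einstein",
        "modules": ["ext.cite.ux-enhancements", "ext.phonos.init"],
        "modulescripts": [],
        "modulestyles": ["ext.cite.styles", "ext.phonos.styles"],
    }
}

print(build_parse_modules_url("https://en.wikipedia.org/w/api.php", "Albert_Einstein"))
print(modules_from_parse_response(SAMPLE_RESPONSE))
```

Note that this asks the wiki which modules its regular page view uses, independently of the HTML that was actually fetched for the article.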

I compared the module lists from both links and found that they are different.

List of modules from RESTBase (for a given article):

  1. ext.phonos.styles
  2. ext.phonos.icons
  3. ext.cite.parsoid.styles
  4. ext.cite.styles
  5. ext.tmh.player.styles
  6. mediawiki.skinning.content.parsoid
  7. mediawiki.skinning.interface
  8. site.styles

List of modules from Action API (for a given article):

  1. ext.cite.ux-enhancements
  2. ext.phonos.init
  3. mediawiki.page.media
  4. ext.tmh.player
  5. ext.scribunto.logs
  6. ext.cite.styles
  7. ext.phonos.styles
  8. ext.phonos.icons
  9. ext.tmh.player.styles

There is an inconsistency between them. I assume that we need to parse the modules from the link tag rather than fetch them through the Action API: the Action API retrieves the modules for the regular desktop wikis (e.g. en.wikipedia.org), not for what we get from page/html.
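To make the inconsistency concrete, the two lists above can be compared with plain set operations. A small sketch; `compare_modules` is a hypothetical helper, and the data is copied from the lists in this comment.

```python
REST_MODULES = [
    "ext.phonos.styles",
    "ext.phonos.icons",
    "ext.cite.parsoid.styles",
    "ext.cite.styles",
    "ext.tmh.player.styles",
    "mediawiki.skinning.content.parsoid",
    "mediawiki.skinning.interface",
    "site.styles",
]

ACTION_API_MODULES = [
    "ext.cite.ux-enhancements",
    "ext.phonos.init",
    "mediawiki.page.media",
    "ext.tmh.player",
    "ext.scribunto.logs",
    "ext.cite.styles",
    "ext.phonos.styles",
    "ext.phonos.icons",
    "ext.tmh.player.styles",
]


def compare_modules(first: list[str], second: list[str]) -> dict[str, list[str]]:
    """Return the shared modules and those unique to each list, sorted."""
    a, b = set(first), set(second)
    return {
        "common": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
    }


diff = compare_modules(REST_MODULES, ACTION_API_MODULES)
for label, names in diff.items():
    print(f"{label} ({len(names)}):")
    for name in names:
        print(f"  {name}")
```

For these lists, four modules are shared, four appear only in the RESTBase list (including site.styles and both mediawiki.skinning.* modules), and five appear only in the Action API list.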

As for inserted_style.css: although it contains some common styles for tags and selectors such as <body>, <h1>-<h6>, #content and @media, I don't think we need to preserve it at all if we want to achieve module consistency between renderers.

kelson42 commented 1 year ago

@VadimKovalenkoSNF please remove that file and do whatever is necessary to display articles properly. Dependencies should be scraped from upstream.

kelson42 commented 1 year ago

Module lists don't have to be consistent between endpoints... this is why module retrieval has to be done in the renderers.