openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

Consider using Wikimedia Enterprise dumps #1777

Open kelson42 opened 1 year ago

kelson42 commented 1 year ago

Wikimedia Enterprise offers full HTML dumps based on Parsoid output.

Using them might allow us to avoid a significant number of requests to the Wikimedia API, and therefore make scraping faster and more robust (see the sketch at the end of this comment).

A few problems/questions are already known:

A prototype should be made once the backend management work has been done (milestone 1.13.0).
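
A minimal sketch of what consuming such a dump could look like, assuming the Enterprise snapshot has already been downloaded and un-tarred to a local NDJSON file, and assuming each JSON line exposes `name` and `article_body.html` (field names taken from the Enterprise snapshot documentation and still to be verified; the handler is hypothetical):

```ts
// Stream an extracted Wikimedia Enterprise NDJSON snapshot line by line and
// hand each article's Parsoid HTML to a handler, instead of fetching every
// article individually over the Wikimedia API.
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

interface EnterpriseArticle {
  name: string;                      // article title (assumed field name)
  article_body?: { html?: string };  // Parsoid HTML (assumed field name)
}

async function processSnapshot(
  ndjsonPath: string,
  handleArticle: (title: string, html: string) => Promise<void>,
): Promise<void> {
  const rl = createInterface({ input: createReadStream(ndjsonPath), crlfDelay: Infinity });
  for await (const line of rl) {
    if (!line.trim()) continue;
    const article = JSON.parse(line) as EnterpriseArticle;
    const html = article.article_body?.html;
    if (html) {
      // e.g. feed the HTML into the existing rendering/ZIM-writing pipeline
      await handleArticle(article.name, html);
    }
  }
}

// Hypothetical usage:
// await processSnapshot('./enwiki_namespace_0.ndjson', async (title, html) => { /* render + add to ZIM */ });
```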

Jaifroid commented 1 year ago

Haven't they always offered full dumps? Is it the fact that it's based on Parsoid output that is new?

Transforming from desktop to mobile and including mobile CSS isn't too hard (actually easier than going the other way: KJSWL does these transforms, and I'd be happy to share). There are still bugs in the mobile output that we currently scrape, such as misplaced headers and the crude way infoboxes are forcibly shifted down a paragraph, so basing the scrape on the desktop output and transforming where necessary might give us more control over these things.
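
As an illustration of the kind of DOM transform being discussed (not the actual KJSWL code), a rough sketch that links a mobile stylesheet and shifts the lead infobox below the first paragraph; domino is used here only as an example DOM library, and the selectors and stylesheet path are assumptions:

```ts
// Illustrative desktop-to-mobile transform. Selectors ('table.infobox', 'p')
// and the CSS path are placeholders, not what mwoffliner or KJSWL actually use.
import * as domino from 'domino';

function desktopToMobile(desktopHtml: string, mobileCssHref = './s/mobile.css'): string {
  const doc = domino.createDocument(desktopHtml);

  // Attach a mobile stylesheet (path is hypothetical).
  const link = doc.createElement('link');
  link.setAttribute('rel', 'stylesheet');
  link.setAttribute('href', mobileCssHref);
  doc.head.appendChild(link);

  // Move the first infobox so it sits just after the first paragraph, i.e. the
  // kind of shift one would control instead of inheriting from the mobile output.
  const infobox = doc.querySelector('table.infobox');
  const firstPara = doc.querySelector('p');
  if (infobox && firstPara && firstPara.parentNode) {
    firstPara.parentNode.insertBefore(infobox, firstPara.nextSibling);
  }

  return doc.documentElement.outerHTML;
}
```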

Krinkle commented 1 year ago

To my knowledge, the regular dumps at https://dumps.wikimedia.org/ are limited to unparsed wikitext in terms of content. There are numerous XML and SQL snapshots with varying depth (structured metadata only, no content, current-revision content, full history, etc.), but in all of them the content is stored in source form (i.e. unparsed wikitext).

Long ago (from 2006 to 2008) there was an experiment to generate HTML dumps; a small example can be found at https://dumps.wikimedia.org/other/static_html_dumps/September_2007/fy/. The format of these dumps was full page views as served by MediaWiki (e.g. including the Monobook skin). It worked by running a CLI maintenance script that invoked the ViewAction for each page and stored the output buffer for each article. I'm not entirely sure which implementation it used, but my guess would be the DumpHTML extension (source), which was finally archived in 2021 (T280185).

[Screenshot of fywiki-2007/a/m/s/Amsterdam.html]

Jaifroid commented 1 year ago

To my knowledge, the regular dumps at https://dumps.wikimedia.org/ are limited to unparsed wikitext in terms of content.

Ah yes, of course, I'd forgotten that. If they are now going to generate HTML dumps, it could be very helpful to Kiwix: it would give a more tightly bounded temporal snapshot of the encyclopaedia, and it would clearly overcome problems caused by timeouts or network conditions (though images would still need to be scraped).
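
A small sketch of that remaining image step, assuming the same example DOM library: the dump HTML references images but does not contain the files, so their URLs still have to be collected and handed to a downloader (the protocol-relative URL handling is an assumption about Parsoid output):

```ts
// Collect image URLs from an article's dump HTML so they can be downloaded
// separately; the HTML dump itself does not ship the image files.
import * as domino from 'domino';

function collectImageUrls(articleHtml: string): string[] {
  const doc = domino.createDocument(articleHtml);
  const urls = new Set<string>();
  for (const img of Array.from(doc.querySelectorAll('img'))) {
    const src = img.getAttribute('src');
    if (!src) continue;
    // Parsoid output often uses protocol-relative URLs (assumption).
    urls.add(src.startsWith('//') ? `https:${src}` : src);
  }
  return Array.from(urls); // hand these to the existing image download queue
}
```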

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

Hshaikh-wmf commented 11 months ago

Wondering if there is any movement on this issue.