openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

Further simplify fr.wiktionary content by removing duplicates #1226

Open Popolechien opened 4 years ago

Popolechien commented 4 years ago

To get the app down to a somewhat manageable size, we could remove the many duplicates in the current http://library.kiwix.org/wiktionary_fr_app_nopic ZIM, e.g. all the past-tense or plural variations, which bring little to no value. Luckily, all such variations have pretty much the same structure: the article starts with Forme de _X, Y or Z_ (the underlying template/wikicode being {{S|type|language code|flexion}} as a level 3 section title).

[Screenshot, 2020-07-21 at 14:36:48]
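
To make the idea concrete, here is a minimal sketch of how such inflected-form sections could be spotted in raw wikitext. It assumes the level 3 title is written exactly as `=== {{S|type|language code|flexion}} ===` on its own line; `findFlexionSections` and `FlexionSection` are hypothetical names for illustration, not part of mwoffliner.

```typescript
// Hypothetical helper, not an mwoffliner API: lists the inflected-form
// sections found in the raw wikitext of a fr.wiktionary page.
interface FlexionSection {
  type: string; // word type, e.g. "verbe" or "nom"
  lang: string; // language code, e.g. "fr"
}

function findFlexionSections(wikitext: string): FlexionSection[] {
  // Assumed title shape: === {{S|type|language code|flexion}} ===
  // Level 4 titles (====) do not match, because "=" is not whitespace.
  const title =
    /^===\s*\{\{S\|([^|{}]+)\|([^|{}]+)\|flexion(?:\|[^{}]*)?\}\}\s*===\s*$/gm;
  const sections: FlexionSection[] = [];
  let match: RegExpExecArray | null;
  while ((match = title.exec(wikitext)) !== null) {
    sections.push({ type: match[1].trim(), lang: match[2].trim() });
  }
  return sections;
}
```

A page for which this returns a non-empty list would be a candidate for removal; whether to drop the whole page or only the matching section is left open here.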

Wiktionarians seem to have made it a rule to not give a definition of such words, so I guess they won't exactly be missed if we remove them.

Alternatively, all root words (of interest) start with the Étymologie level 3 section title, which should make them easier to detect when parsing.

[Screenshot, 2020-07-21 at 14:39:32]
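
The alternative check could be sketched along these lines. It assumes the Étymologie title appears in the wikitext either as `=== {{S|étymologie}} ===` or as a plain `=== Étymologie ===`; both shapes are assumptions about the markup, and `hasEtymologySection` is a hypothetical name, not an existing mwoffliner function.

```typescript
// Hypothetical helper, not an mwoffliner API: true when the raw wikitext
// contains a level 3 Étymologie section title.
function hasEtymologySection(wikitext: string): boolean {
  // Assumed shapes: === {{S|étymologie}} === or === Étymologie ===
  const title =
    /^===\s*(?:\{\{S\|[éÉ]tymologie(?:\|[^{}]*)?\}\}|[éÉ]tymologie)\s*===\s*$/m;
  return title.test(wikitext);
}
```

Under this approach, only pages for which the function returns `true` would be kept. A page could in principle contain both an Étymologie section and an inflected-form section; this sketch does not decide what to do in that case.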

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.