* mw2slob

A tool to convert MediaWiki content to dictionaries in slob format.

License: GNU General Public License v3.0

** Installation

~mw2slob~ requires Python 3.7 (or Python 3.6 plus the ~dataclasses~ backport) or newer and depends on several other Python packages.
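
A minimal installation sketch, assuming the project is pip-installable directly from its Git repository (in which case pip also pulls in the Python dependencies); see the project page for the authoritative instructions:

#+BEGIN_SRC sh
  # create an isolated environment and install mw2slob into it
  python3 -m venv env
  ./env/bin/pip install git+https://github.com/itkach/mw2slob.git
#+END_SRC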

** Usage

*** With Wikimedia Enterprise HTML Dumps

#+BEGIN_SRC sh
  # get site's metadata ("siteinfo")
  mw2slob siteinfo http://en.wiktionary.org > enwikt.si.json
  # compile dictionary
  mw2slob dump --siteinfo enwikt.si.json ./enwiktionary-NS0-20220120-ENTERPRISE-HTML.json.tar.gz -f wikt common
#+END_SRC

Note the ~-f wikt common~ argument, which specifies the content
filters to use when compiling this dictionary. A content filter is
a text file containing a list of CSS selectors (one per line); HTML
elements matching these selectors are removed during compilation.
~mw2slob~ includes several filters (see [[./mw2slob/filters]]) that
work well for most Wikipedias and Wiktionaries.
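
For illustration, here is a sketch of what such a filter file looks
like; the selectors below are hypothetical examples rather than the
contents of the bundled filters, and the file name is arbitrary.
Check ~mw2slob dump --help~ to see how filters are selected on the
command line.

#+BEGIN_SRC sh
  # hypothetical custom filter file: one CSS selector per line;
  # elements matching any of these selectors are removed during compilation
  printf '%s\n' '.mw-editsection' '.navbox' 'table.metadata' > my-filter
  cat my-filter
#+END_SRC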

Wikimedia Enterprise HTML Dumps are available only for some
[[https://www.mediawiki.org/wiki/Manual:Namespace][namespaces]]. For most Wikipedias the main namespace ~0~ (articles)
is typically the only one of interest. Wiktionaries, on the other
hand, often make heavy use of other namespaces. For example, in the
English Wiktionary many articles link to pages in the ~Wiktionary~
or ~Appendix~ namespaces, so it makes sense to include their content
in the compiled dictionary and turn these links into internal
dictionary links rather than links to the Wiktionary web site.

These namespaces are not available as HTML dumps, but their content
can be obtained from the MediaWiki API with ~mwscrape~. Let's say we
want to compile the English Wiktionary and include the following
namespaces in addition to the main articles: ~Appendix~,
~Wiktionary~, ~Rhymes~, ~Reconstruction~ and ~Thesaurus~ (sampling
random articles indicates that these namespaces are often
referenced).

First, we examine the siteinfo (saved in ~enwikt.si.json~) and find
that the ids for these namespaces are:

| Namespace      |  Id |
|----------------+-----|
| Wiktionary     |   4 |
| Appendix       | 100 |
| Rhymes         | 106 |
| Thesaurus      | 110 |
| Reconstruction | 118 |
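
One way to look these up is to pretty-print the saved siteinfo and
search it for the namespace names (a sketch using only standard
tools; the exact structure of the file may differ, so adjust the
search pattern as needed):

#+BEGIN_SRC sh
  # pretty-print the saved siteinfo and search for the namespaces of interest
  python3 -m json.tool enwikt.si.json \
      | grep -i -E 'appendix|rhymes|thesaurus|reconstruction'
#+END_SRC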

Then we download rendered articles for these namespaces with ~mwscrape~:

#+BEGIN_SRC sh
  mwscrape https://en.wiktionary.org --db enwikt-wiktionary --namespace 4
  mwscrape https://en.wiktionary.org --db enwikt-appendix --namespace 100
  mwscrape https://en.wiktionary.org --db enwikt-rhymes --namespace 106
  mwscrape https://en.wiktionary.org --db enwikt-thesaurus --namespace 110
  mwscrape https://en.wiktionary.org --db enwikt-reconstruction --namespace 118
#+END_SRC

Each scrape takes a while, but these namespaces are relatively
small, so none of them take too long.
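
To keep an eye on progress, CouchDB's database info endpoint reports
how many documents each database holds so far (assuming the scraped
data ends up in a local CouchDB on port ~5984~, as the compilation
command below expects):

#+BEGIN_SRC sh
  # doc_count in the response shows how many pages have been scraped so far
  curl -s http://localhost:5984/enwikt-appendix | python3 -m json.tool
#+END_SRC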

Finally, compile the dictionary:

#+BEGIN_SRC sh
  mw2slob dump --siteinfo enwikt.si.json \
          ./enwiktionary-NS0-20220420-ENTERPRISE-HTML.json.tar.gz \
          http://localhost:5984/enwikt-wiktionary \
          http://localhost:5984/enwikt-appendix \
          http://localhost:5984/enwikt-rhymes \
          http://localhost:5984/enwikt-thesaurus \
          http://localhost:5984/enwikt-reconstruction \
          -f wikt common --local-namespace 4 100 106 110 118
#+END_SRC

Note that ~mw2slob dump~ takes the CouchDB URLs of the databases we
created with ~mwscrape~ in addition to the dump file name.

Also note the ~--local-namespace~ parameter. It tells the compiler
to make links to pages in these namespaces internal dictionary
links, just like cross-article links; otherwise they would be
converted to web links.

See ~mw2slob dump --help~ for a complete list of options.

*** With a ~mwscrape~ database

Assuming a CouchDB server is running on localhost port ~5984~ and has a mwscrape database named /simple-wikipedia-org/ (created with ~mwscrape simple.wikipedia.org~), create a slob using the /common/ and /wiki/ content filters:

#+BEGIN_SRC sh
  mw2slob scrape http://127.0.0.1:5984/simple-wikipedia-org -f common wiki
#+END_SRC

See ~mw2slob scrape --help~ for a complete list of options.