tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Wiktextract

This is a utility and Python package for extracting data from Wiktionary.

2024-04-24: Kaikki.org raw download files with newline-separated json object data will be changed at some point in the future to use the suffix .jsonl for clarity. This will break download links, so please be aware. For more about .jsonl, please see https://jsonlines.org/

2024-06-24: The above change has now been committed, and if the kaikki.org html generation process succeeds we should see changes soon.

Please report issues on github and we'll try to address them reasonably soon.

The current extracted version is available for browsing and download at: https://kaikki.org/dictionary/. I plan to maintain an automatically updating version of the data at this location. For most people the preferred way to get the extracted Wiktionary data will be to just take it from the web site.

Note: extracting all data for all languages from the English Wiktionary may take from an hour to several days, depending on your computer. Expanding Lua modules is not cheap, but it enables superior extraction quality and maintainability! You may want to look at the pre-expanded downloads instead of running it yourself.

Overview

This is a Python package and tool for extracting information from English Wiktionary (enwiktionary) data dumps. Note that the English Wiktionary contains extensive dictionaries and inflectional information for many languages, not just English. Only its glosses and internal tagging are in English.

One thing that distinguishes this tool from any system I'm aware of is that this tool expands templates and Lua macros in Wiktionary. That enables much more accurate rendering and extraction of glosses, word senses, inflected forms, and pronunciations. It also makes the system much easier to maintain. All this results in much higher extraction quality and accuracy.

This tool extracts glosses, parts-of-speech, declension/conjugation information when available, translations for all languages when available, pronunciations (including audio file links), qualifiers including usage notes, word forms, links between words including hypernyms, hyponyms, holonyms, meronyms, related words, derived terms, compounds, alternative forms, etc. Links to Wikipedia pages, Wikidata identifiers, and other such data are also extracted when available. For many classes of words, a word sense is annotated with specific information such as what word it is a form of, what is the RGB value of the color it represents, what is the numeric value of a number, what SI unit it represents, etc.

This tool extracts information for all languages that have data in the English Wiktionary. It also extracts translingual data and information about characters (anything that has an entry in Wiktionary).

This tool reads the enwiktionary-<date>-pages-articles.xml.bz2 dump file and outputs JSON-format dictionaries containing most of the information in Wiktionary. The dump files can be downloaded from https://dumps.wikimedia.org.

This utility will be useful for many natural language processing, semantic parsing, machine translation, and language generation applications both in research and industry.

The tool can be used to extract machine translation dictionaries, language understanding dictionaries, semantically annotated dictionaries, and morphological dictionaries with declension/conjugation information (where this information is available for the target language). Dozens of languages have extensive vocabulary in enwiktionary, and several thousand languages have partial coverage.

The wiktwords script makes extracting the information for use by other tools trivial without writing a single line of code. It extracts the information specified by command options for languages specified on the command line, and writes the extracted data to a file or standard output in JSON format for processing by other tools.

While there are currently no active plans to support parsing non-English wiktionaries, I'm considering it. Now that this builds on wikitextprocessor and expands templates and Lua macros, it would be fairly straightforward to build support for other languages too - and even make the extraction configurable so that only a configuration file would need to be created for processing a Wiktionary in a new language.

As far as we know, this is the most comprehensive tool available for extracting information from Wiktionary as of December 2020.

If you find this tool and/or the pre-extracted data helpful, please give this a star on github!

Pre-extracted data

For most people, it may be easiest to just download pre-expanded data. Please see https://kaikki.org/dictionary/rawdata.html. The raw wiktextract data, extracted category tree, extracted templates and modules, as well as a bulk download of audio files for pronunciations in both .ogg and .mp3 formats are available.

There is also a download link at the bottom of every page and a button to view the JSON produced for each page. You can download all data, data for a specific language, data for just a single word, or data for a list of related words (e.g., a particular part-of-speech or words relating to a particular topic or having a particular inflectional form). All downloads are in JSON Lines format (each line is a separate JSON object). The bigger downloads are also available in compressed form.

Some people have asked for the full data as a single JSON object (instead of the current one JSON object per line format). I've decided to keep it as a JSON object per line, because loading all the data into Python requires about 120 GB of memory. It is much easier to process the data line-by-line, especially if you are only interested in a part of the information. You can easily read the files using the following code:

import json

with open("filename.json", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        ...  # parse the data in this record

If you want to collect all the data into a list, you can read the file into a list with:

import json

lst = []
with open("filename.json", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        lst.append(data)
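
When only part of the data is needed, filtering while reading line by line keeps memory use low. The following is a minimal sketch and not part of wiktextract: the helper name iter_entries and its parameters are hypothetical, and the lang_code and pos keys are the ones visible in the example entry for thrill in this document.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_entries(
    lines: Iterable[str],
    lang_code: Optional[str] = None,
    pos: Optional[str] = None,
) -> Iterator[dict]:
    """Yield decoded records matching the given language code and part of speech.

    Hypothetical helper for illustration; a filter left as None is not applied.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        data = json.loads(line)
        if lang_code is not None and data.get("lang_code") != lang_code:
            continue
        if pos is not None and data.get("pos") != pos:
            continue
        yield data


# Small in-memory stand-in for a downloaded file (made-up, shortened records).
# With a real download, pass the open file object instead of this list.
sample_lines = [
    '{"word": "thrill", "lang_code": "en", "pos": "verb"}',
    '{"word": "thrill", "lang_code": "en", "pos": "noun"}',
    '{"word": "talo", "lang_code": "fi", "pos": "noun"}',
]
english_verbs = [
    entry["word"] for entry in iter_entries(sample_lines, lang_code="en", pos="verb")
]
print(english_verbs)
```

Because iter_entries is a generator, only one record is held in memory at a time unless the caller collects the results.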

You can also easily pretty-print the data into a more human-readable form using:

print(json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False))

Here is a pretty-printed example of an extracted word entry for the word thrill as an English verb (only one part-of-speech is shown here):

{
  "categories": [
    "Emotions"
  ],
  "derived": [
    {
      "word": "enthrill"
    }
  ],
  "forms": [
    {
      "form": "thrills",
      "tags": [
        "present",
        "simple",
        "singular",
        "third-person"
      ]
    },
    {
      "form": "thrilling",
      "tags": [
        "present"
      ]
    },
    {
      "form": "thrilled",
      "tags": [
        "participle",
        "past",
        "simple"
      ]
    }
  ],
  "head_templates": [
    {
      "args": {},
      "expansion": "thrill (third-person singular simple present thrills, present participle thrilling, simple past and past participle thrilled)",
      "name": "en-verb"
    }
  ],
  "lang": "English",
  "lang_code": "en",
  "pos": "verb",
  "senses": [
    {
      "glosses": [
        "To suddenly excite someone, or to give someone great pleasure; to electrify; to experience such a sensation."
      ],
      "tags": [
        "ergative",
        "figuratively"
      ]
    },
    {
      "glosses": [
        "To (cause something to) tremble or quiver."
      ],
      "tags": [
        "ergative"
      ]
    },
    {
      "glosses": [
        "To perforate by a pointed instrument; to bore; to transfix; to drill."
      ],
      "tags": [
        "obsolete"
      ]
    },
    {
      "glosses": [
        "To hurl; to throw; to cast."
      ],
      "tags": [
        "obsolete"
      ]
    }
  ],
  "sounds": [
    {
      "ipa": "/\u03b8\u0279\u026al/"
    },
    {
      "ipa": "[\u03b8\u027e\u032a\u030a\u026a\u026b]",
      "tags": [
        "UK",
        "US"
      ]
    },
    {
      "ipa": "[\u03b8\u027e\u032a\u030a\u026al]",
      "tags": [
        "Ireland"
      ]
    },
    {
      "ipa": "[t\u032a\u027e\u032a\u030a\u026al]",
      "tags": [
        "Ireland"
      ]
    },
    {
      "rhymes": "-\u026al"
    },
    {
      "audio": "en-us-thrill.ogg",
      "mp3_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/d/db/En-us-thrill.ogg/En-us-thrill.ogg.mp3",
      "ogg_url": "https://upload.wikimedia.org/wikipedia/commons/d/db/En-us-thrill.ogg",
      "tags": [
        "US"
      ],
      "text": "Audio (US)"
    }
  ],
  "translations": [
    {
      "code": "nl",
      "lang": "Dutch",
      "sense": "suddenly excite someone, or to give someone great pleasure; to electrify",
      "word": "opwinden"
    },
    {
      "code": "fi",
      "lang": "Finnish",
      "sense": "suddenly excite someone, or to give someone great pleasure; to electrify",
      "word": "syk\u00e4hdytt\u00e4\u00e4"
    },
    {
      "code": "fi",
      "lang": "Finnish",
      "sense": "suddenly excite someone, or to give someone great pleasure; to electrify",
      "word": "riemastuttaa"
    },
...
    {
      "code": "tr",
      "lang": "Turkish",
      "sense": "slight quivering of the heart that accompanies a cardiac murmur",
      "word": "\u00e7arp\u0131nt\u0131"
    }
  ],
  "wikipedia": [
    "thrill"
  ],
  "word": "thrill"
}
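
As a rough illustration of how an entry like the one above might be consumed, the sketch below picks inflected forms by tag. The function name forms_with_tag is hypothetical; the forms, form, and tags keys are taken from the example entry, which is shortened here.

```python
from typing import List


def forms_with_tag(entry: dict, tag: str) -> List[str]:
    """Return the inflected forms of an entry whose tag list contains `tag`."""
    return [
        form["form"]
        for form in entry.get("forms", [])
        if tag in form.get("tags", [])
    ]


# Shortened copy of the example entry; only the fields used here are kept.
entry = {
    "word": "thrill",
    "pos": "verb",
    "forms": [
        {"form": "thrills", "tags": ["present", "simple", "singular", "third-person"]},
        {"form": "thrilling", "tags": ["present"]},
        {"form": "thrilled", "tags": ["participle", "past", "simple"]},
    ],
}
print(forms_with_tag(entry, "past"))     # prints ['thrilled']
print(forms_with_tag(entry, "present"))  # prints ['thrills', 'thrilling']
```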

Getting started

Installing

Use container:

$ podman run -it --rm ghcr.io/tatuylonen/wiktextract --help

Install from source:

On Linux (example from Ubuntu 20.04), you may need to first install the build-essential and python3-dev packages with apt update && apt install build-essential python3-dev python3-pip lbzip2.

git clone https://github.com/tatuylonen/wiktextract.git
cd wiktextract
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .

If you want to update both packages' code with git pull, use the pip install command's --force-reinstall and -e options to reinstall the wikitextprocessor package from source in editable mode.

Running tests

This package includes tests written using the unittest framework. The test dependencies can be installed with the command python -m pip install -e .[dev].

To run the tests, use the following command in the top-level directory:

make test

(Unfortunately the test suite for wiktextract is not yet very comprehensive. The underlying lower-level toolkit, wikitextprocessor, has much more extensive test coverage.)

Expected performance

Extracting all data for all languages from English Wiktionary takes about 1.25 hours on a 128-core dual AMD EPYC 7702 system. The performance is expected to be approximately linear with the number of processor cores, provided you have enough memory (about 10GB/core or 5GB/hyperthread recommended).

You can control the number of parallel processes to use with the --num-processes option; the default is to use the number of available cores/hyperthreads.
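
The memory recommendation above can be turned into a rough starting value for --num-processes. This is only a back-of-the-envelope sketch: the function name suggest_num_processes is hypothetical, it uses the more conservative 10 GB per process figure, and the right value for a given machine may differ.

```python
import os
from typing import Optional

GB_PER_PROCESS = 10  # the "about 10GB/core" recommendation from the text


def suggest_num_processes(
    available_memory_gb: float, cpu_count: Optional[int] = None
) -> int:
    """Suggest a process count that stays within the available memory.

    Hypothetical helper: takes the smaller of the core count and the number
    of 10 GB slices that fit into memory, but never less than 1.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    by_memory = int(available_memory_gb // GB_PER_PROCESS)
    return max(1, min(cpu_count, by_memory))


# A 16-core machine with 64 GB of memory is limited by memory, not cores.
print(suggest_num_processes(64, cpu_count=16))  # prints 6
```

The result would then be passed on the command line, for example as --num-processes 6.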

You can download the full pre-extracted data from kaikki.org. The pre-extraction is updated regularly with the latest Wiktionary dump. Using the pre-extracted data may be the easiest option unless you have special needs or want to modify the code.

Using the command-line tool

The wiktwords script is the easiest way to extract data from Wiktionary. Just download the data dump file from dumps.wikimedia.org and run the script. The correct dump file has the name enwiktionary-<date>-pages-articles.xml.bz2.

An example of a typical invocation for extracting all data would be:

wiktwords --all --all-languages --out data.json enwiktionary-20230801-pages-articles.xml.bz2

If you wish to modify the code or test processing individual pages, the following may also be useful:

  1. Pass a path to save a database file that you can use for quickly processing individual pages:
wiktwords --db-path en_20230801.db enwiktionary-20230801-pages-articles.xml.bz2
  2. To process a single page and produce a human-readable output file for debugging:
wiktwords --db-path en_20230801.db --all --all-languages --out outfile --page page_title

The following command-line options can be used to control its operation:

Calling the library

While this package has been mostly intended to be used using the wiktwords command, it is also possible to call this as a library. Underneath, this uses the wikitextprocessor module. For more usage examples please read the wiktwords.py and wiktionary.py files.

This code can be called from an application as follows:

from wiktextract import (
    WiktextractContext,
    WiktionaryConfig,
    parse_wiktionary,
)
from wikitextprocessor import Wtp

config = WiktionaryConfig(
    dump_file_lang_code="en",
    capture_language_codes=["en", "mul"],
    capture_translations=True,
    capture_pronunciation=True,
    capture_linkages=True,
    capture_compounds=True,
    capture_redirects=True,
    capture_examples=True,
    capture_etymologies=True,
    capture_descendants=True,
    capture_inflections=True,
)
wxr = WiktextractContext(Wtp(), config)

RECOGNIZED_NAMESPACE_NAMES = [
    "Main",
    "Category",
    "Appendix",
    "Project",
    "Thesaurus",
    "Module",
    "Template",
    "Reconstruction"
]

namespace_ids = {
    wxr.wtp.NAMESPACE_DATA.get(name, {}).get("id")
    for name in RECOGNIZED_NAMESPACE_NAMES
}
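# Path to the downloaded dump file (adjust to your local copy)
dump_path = "enwiktionary-20230801-pages-articles.xml.bz2"
with open("output.json", "w", encoding="utf-8") as f:
    # Positional arguments: context, dump path, num_processes (None),
    # phase1_only (False), namespace IDs, output file
    parse_wiktionary(wxr, dump_path, None, False, namespace_ids, f)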

The capture arguments default to True, so they only need to be set if some values are not to be captured (note that the wiktwords program sets them to False unless the --all or specific capture options are used).

parse_wiktionary()

def parse_wiktionary(
    wxr: WiktextractContext,
    dump_path: str,
    num_processes: Optional[int],
    phase1_only: bool,
    namespace_ids: Set[int],
    out_f: TextIO,
    human_readable: bool = False,
    override_folders: Optional[List[str]] = None,
    skip_extract_dump: bool = False,
    save_pages_path: Optional[str] = None,
) -> None:

The parse_wiktionary function will call word_cb(data) for words and redirects found in the Wiktionary dump. data is information about a single word and part-of-speech as a dictionary and may include several word senses. It may also be a redirect (indicated by the presence of a "redirect" key in the dictionary). It is in the same format as the JSON-formatted dictionaries returned by the wiktwords tool.

Its arguments are as follows:

This call gathers statistics in wxr.config. This function will automatically parallelize the extraction. page_cb will be called in the parent process, however.

parse_page()

def parse_page(
    wxr: WiktextractContext, page_title: str, page_text: str
) -> List[Dict[str, str]]:

PARTS_OF_SPEECH

This is a constant set of all part-of-speech values (pos key) that may occur in the extracted data. Note that the list is somewhat larger than what a conventional part-of-speech list would be.

class WiktextractContext(object)

The WiktextractContext object holds the wikitextprocessor-specific Wtp context object and wiktextract's WiktionaryConfig object. In the future it is expected to hold additional context that does not belong in Wtp, and WiktionaryConfig will most probably be integrated into the WiktextractContext object proper.

The constructor is called simply by supplying a Wtp and WiktionaryConfig object:

# Blank slate for testing, usually
wxr = WiktextractContext(Wtp(), WiktionaryConfig())

or

# separately initialized config with a bunch of arguments, as in the
# class WiktionaryConfig(object) section below
wxr = WiktextractContext(wtp, config)

if it is more convenient

class WiktionaryConfig(object)

The WiktionaryConfig object is used for specifying what data to collect from Wiktionary and is also used for collecting statistics during extraction. Currently, it is a field of the WiktextractContext context object.

The constructor:

def __init__(
    self,
    dump_file_lang_code="en",
    capture_language_codes=["en", "mul"],
    capture_translations=True,
    capture_pronunciation=True,
    capture_linkages=True,
    capture_compounds=True,
    capture_redirects=True,
    capture_examples=True,
    capture_etymologies=True,
    capture_inflections=True,
    capture_descendants=True,
    verbose=False,
    expand_tables=False,
):

The arguments are as follows:

Format of extracted redirects

Some pages in Wiktionary are redirects. For these, word_cb will be called with data in a special format. In this case, the dictionary will have a redirect key, which will contain the page title that the entry redirects to. The title key contains the word/term that contains the redirect. Redirect entries do not have pos or any of the other fields. Redirects also are not associated with any language, so all redirects are always returned regardless of the captured languages (if extracting redirects has been requested).
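
When reading extracted data back in, redirect records can be separated from word entries by testing for the redirect key. The sketch below assumes only the redirect and title keys described above; the function name split_redirects and the sample titles are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def split_redirects(records: Iterable[dict]) -> Tuple[Dict[str, str], List[dict]]:
    """Separate redirect records from ordinary word entries.

    Returns a mapping from redirecting title to target title, plus the list
    of remaining word entries. Hypothetical helper, not part of wiktextract.
    """
    redirects: Dict[str, str] = {}
    words: List[dict] = []
    for data in records:
        if "redirect" in data:
            redirects[data["title"]] = data["redirect"]
        else:
            words.append(data)
    return redirects, words


# Made-up records: one redirect and one (heavily shortened) word entry.
records = [
    {"title": "example source title", "redirect": "example target title"},
    {"word": "thrill", "pos": "verb", "lang_code": "en"},
]
redirects, words = split_redirects(records)
print(redirects)
print([w["word"] for w in words])
```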

Format of the extracted word entries

Information returned for each word is a dictionary. The dictionary has the following keys (others may also be present or added later):

There may also be other fields.

Note that several of the fields on the word entry level indicate information that has not been sense-disambiguated. Such information may apply to one or more of the senses. Currently only the most trivial cases are disambiguated; however, it is anticipated that more disambiguation may be performed in the future. It is also possible for the same key to be provided in a sense and in the word entry; in that case, the data in the sense has been sense-disambiguated and the data in the word entry has not (and may not apply to any particular sense, regardless of whether the sense has some related sense-disambiguated information).

Word senses

Each word entry may have multiple glosses under the senses key. Each sense is a dictionary that may contain the following keys (among others, and more may be added in the future):
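
As a small illustration of working with senses, the sketch below collects glosses while skipping senses that carry a given tag. It relies only on the glosses and tags keys visible in the example entry for thrill; the function name glosses_without_tag is hypothetical.

```python
from typing import List


def glosses_without_tag(entry: dict, excluded_tag: str) -> List[str]:
    """Collect glosses from all senses that do not carry `excluded_tag`."""
    result: List[str] = []
    for sense in entry.get("senses", []):
        if excluded_tag in sense.get("tags", []):
            continue  # skip e.g. obsolete senses
        result.extend(sense.get("glosses", []))
    return result


# Shortened copy of the senses from the example entry.
entry = {
    "word": "thrill",
    "senses": [
        {"glosses": ["To (cause something to) tremble or quiver."], "tags": ["ergative"]},
        {"glosses": ["To hurl; to throw; to cast."], "tags": ["obsolete"]},
    ],
}
print(glosses_without_tag(entry, "obsolete"))
```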

Pronunciation

Pronunciation information is stored under the sounds key. It is a list of dictionaries, each of which may contain the following keys, among others:

Note that Wiktionary audio files are available for bulk download at https://kaikki.org/dictionary/rawdata.html. Files in the download are named with the last component of the URL in ogg_url and/or mp3_url. Downloading them individually takes several days and puts unnecessary load on Wikimedia servers.
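
To match an entry against files from the bulk download, the last URL component can be derived as sketched below. The function name audio_filenames is hypothetical, and the sketch leaves any percent-encoding in the URL untouched, since how such names appear in the download is not covered here.

```python
import posixpath
from typing import List
from urllib.parse import urlsplit


def audio_filenames(entry: dict) -> List[str]:
    """Return the last path component of every ogg_url and mp3_url in an entry."""
    names: List[str] = []
    for sound in entry.get("sounds", []):
        for key in ("ogg_url", "mp3_url"):
            url = sound.get(key)
            if url:
                # Percent-encoding, if present, is deliberately left as-is.
                names.append(posixpath.basename(urlsplit(url).path))
    return names


# The audio record from the example entry for thrill, plus an IPA-only record.
entry = {
    "sounds": [
        {"ipa": "/\u03b8\u0279\u026al/"},
        {
            "audio": "en-us-thrill.ogg",
            "mp3_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/d/db/En-us-thrill.ogg/En-us-thrill.ogg.mp3",
            "ogg_url": "https://upload.wikimedia.org/wikipedia/commons/d/db/En-us-thrill.ogg",
        },
    ]
}
print(audio_filenames(entry))  # prints ['En-us-thrill.ogg', 'En-us-thrill.ogg.mp3']
```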

Translations

Translations are stored under the translations key in the word's data (if not sense-disambiguated) or in the word sense (if sense-disambiguated). They are stored in a list of dictionaries, where each dictionary has the following keys (and possibly others):
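
Because translations can appear either on the word entry or inside individual senses, code that wants all of them has to look in both places. The following sketch groups translations by language code using the code and word keys seen in the example entry for thrill; the function name translations_by_language is hypothetical, and the sense-level translation in the sample data is constructed for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def translations_by_language(entry: dict) -> Dict[str, List[str]]:
    """Group translated words by language code, from word level and sense level."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    # The entry itself holds non-disambiguated translations; each sense may
    # hold sense-disambiguated ones under the same key.
    sources = [entry] + list(entry.get("senses", []))
    for source in sources:
        for translation in source.get("translations", []):
            word = translation.get("word")
            code = translation.get("code")
            if word and word not in grouped[code]:
                grouped[code].append(word)
    return dict(grouped)


entry = {
    "word": "thrill",
    "translations": [
        {"code": "nl", "lang": "Dutch", "word": "opwinden"},
        {"code": "fi", "lang": "Finnish", "word": "sykähdyttää"},
    ],
    "senses": [
        # Constructed example of a sense-disambiguated translation.
        {"glosses": ["..."], "translations": [
            {"code": "fi", "lang": "Finnish", "word": "riemastuttaa"},
        ]},
    ],
}
print(translations_by_language(entry))
```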

Etymologies

Etymological information is stored under the etymology_text and etymology_templates keys in the word's data. When multiple parts-of-speech are listed under the same etymology, the same data is copied to each part-of-speech entry under that etymology.

The etymology_text field contains the contents of the whole etymology section cleaned into human-readable text (i.e., templates have been expanded and HTML tags removed, among other things).

The etymology_templates field contains a list of templates from the etymology section. Some common templates considered not relevant for etymological information have been removed (e.g., redlink category and isValidPageName). The list also includes nested templates referenced from templates directly used in the etymology description. Each template in the list is a dictionary with the following keys:

Descendants

If a word has a "Descendants" section, the descendants key will appear in the word's data. It contains a list of objects corresponding to each line in the section, where each object has the following keys:

descendants data will also appear for the special case of "Derived terms" and "Extensions" sections for words that are roots in reconstructed languages, as these sections have the same format.

Linkages to other words

Linkages (synonyms, antonyms, hypernyms, holonyms, meronyms, derived, related, coordinate_terms) are stored in the word's data if not sense-disambiguated, and in the word sense if sense-disambiguated. They are lists of dictionaries, where each dictionary can contain the following keys, among others:

Category tree data format

The --categories-file option extracts the Wiktionary category tree as JSON into the specified file. The data is extracted from the Wiktionary Lua modules by evaluating them.

The data written to the JSON file is a dictionary, with the top-level keys roots and nodes.

Roots is a list of top-level nodes that are not children of other nodes. Fundamental is the normal top-level node; other roots may reflect errors in the hierarchy in Wiktionary. While not a root, the category all topics contains the subhierarchy of topical categories (e.g., food and drink, nature, sciences, etc.).

Nodes is a dictionary mapping lowercased category name to a dictionary containing data about the category. For each category, the following fields may be present:

The categories are returned as they are in the original Wiktionary category data. Language-specific categories are generally not included. However, there is a category {{{langcat}}} that appears to contain a lot of the categories that have language-specific variants. Also, the category tree data does not contain language prefixes (the tree is defined in Wiktionary without prefixes and then replicated for each language).
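
A minimal sketch of reading the category tree, relying only on the roots and nodes keys and on the lowercased node names described above. The function names are hypothetical, the sample tree is made up, and the per-category fields are left empty because they are not assumed here.

```python
from typing import List, Optional


def lookup_category(tree: dict, name: str) -> Optional[dict]:
    """Look up a category by name; node keys are lowercased category names."""
    return tree.get("nodes", {}).get(name.lower())


def unexpected_roots(tree: dict) -> List[str]:
    """Return roots other than Fundamental, which may reflect hierarchy errors."""
    return [root for root in tree.get("roots", []) if root.lower() != "fundamental"]


# Made-up miniature tree; with real data, load the file written by
# --categories-file using json.load() instead.
tree = {
    "roots": ["Fundamental", "Some stray category"],
    "nodes": {
        "fundamental": {},
        "all topics": {},
        "some stray category": {},
    },
}
print(lookup_category(tree, "All topics") is not None)
print(unexpected_roots(tree))
```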

Related packages

The wikitextprocessor is a generic module for extracting data from Wiktionary, Wikipedia, and other Wikimedia dump files. wiktextract is built using this module.

When using a version of wiktextract from github, please also set up wikitextprocessor from github so that the two are at roughly the same version. The pypi versions of these packages are usually out-of-date, and mixing a newer version with an older one will lead to bugs. These packages are being developed in parallel.

The wiktfinnish package can be used to interpret Finnish noun declensions and verb conjugations and for generating Finnish inflected word forms.

Publications

If you use Wiktextract or the extracted data in academic work, please cite the following article:

Tatu Ylonen: Wiktextract: Wiktionary as Machine-Readable Structured data, Proceedings of the 13th Conference on Language Resources and Evaluation (LREC), pp. 1317-1325, Marseille, 20-25 June 2022.

Linking to https://kaikki.org or the relevant sub-pages would also be greatly appreciated.

Related tools

A few other tools also exist for parsing Wiktionaries. These include Dbnary, Wikiparse, and DKPro JWKTL.

Contributing and reporting bugs

Please report bugs and other issues on github. I also welcome suggestions for improvement.

Please email ylo at clausal.com if you wish to contribute or have patches or suggestions.

License

Copyright (c) 2018-2020 Tatu Ylonen. This package is free for both commercial and non-commercial use. It is licensed under the MIT license. See the file LICENSE for details. (Certain files have different open source licenses.)