tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

wikitextprocessor

This is a Python package for processing WikiMedia dump files for Wiktionary, Wikipedia, etc., for data extraction, error checking, offline conversion into HTML or other formats, and other uses. Key features include Wikitext parsing, template expansion, and Lua module execution.

This module is primarily intended as a building block for other packages that process Wiktionary or Wikipedia data, particularly for data extraction. You will need to write code to use this.

For pre-existing extraction modules that use this package, please see:

Getting started

Installing

Install from source:

git clone --recurse-submodules --shallow-submodules https://github.com/tatuylonen/wikitextprocessor.git
cd wikitextprocessor
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .

Running tests

This package includes tests written using the unittest framework. The test dependencies can be installed with the command python -m pip install -e .[dev].

To run the tests, use the following command in the top-level directory:

make test

To run a specific test, use the following syntax:

python -m unittest tests.test_[module].[Module]Tests.test_[name]

Python's unittest framework help and options can be accessed through:

python -m unittest -h

Obtaining WikiMedia dump files

This package is primarily intended for processing Wiktionary and Wikipedia dump files (though you can also use it for processing individual pages or other files that are in wikitext format). To download WikiMedia dump files, go to the dump download page. We recommend using the <name>-<date>-pages-articles.xml.bz2 files.
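The recommended file name pattern can be built programmatically. The following is a minimal sketch: dump_filename and dump_url are hypothetical helpers (not part of this package), and the URL layout on dumps.wikimedia.org is an assumption that should be checked against the dump download page.

```python
def dump_filename(name: str, date: str) -> str:
    """Build the <name>-<date>-pages-articles.xml.bz2 file name."""
    if len(date) != 8 or not date.isdigit():
        raise ValueError(f"expected a YYYYMMDD date, got {date!r}")
    return f"{name}-{date}-pages-articles.xml.bz2"


def dump_url(name: str, date: str) -> str:
    """Guess the download URL for a dump file."""
    # Assumed directory layout of the public dump server; verify before use.
    base = "https://dumps.wikimedia.org"
    return f"{base}/{name}/{date}/{dump_filename(name, date)}"


print(dump_filename("enwiktionary", "20230801"))
```

The name passed in is the wiki's database name (for example enwiktionary), matching the file name used in the usage example in this document.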

API documentation

Usage example:

from functools import partial
from typing import Any

from wikitextprocessor import Wtp, WikiNode, NodeKind, Page
from wikitextprocessor.dumpparser import process_dump

def page_handler(wtp: Wtp, page: Page) -> Any:
    wtp.start_page(page.title)
    # process parse tree
    tree = wtp.parse(page.body)
    # or get expanded plain text
    text = wtp.expand(page.body)

wtp = Wtp(
    db_path="en_20230801.db", lang_code="en", project="wiktionary"
)

# extract dump file then save pages to SQLite file
process_dump(
    wtp,
    "enwiktionary-20230801-pages-articles.xml.bz2",
    {0, 10, 110, 828},  # namespace id, can be found at the start of dump file
)

for _ in map(
    partial(page_handler, wtp), wtp.get_all_pages([0])
):
    pass

The basic operation is as follows:

Most of the functionality is hidden behind the Wtp object. WikiNode objects are used for representing the parse tree that is returned by the Wtp.parse() function. NodeKind is an enumeration type used to encode the type of a WikiNode.

class Wtp

def __init__(
    self,
    db_path: Optional[Union[str, Path]] = None,
    lang_code="en",
    template_override_funcs: Dict[str, Callable[[Sequence[str]], str]] = {},
    project: str = "wiktionary",
):

The initializer can usually be called without arguments, but recognizes the following arguments:

def read_by_title(
    self, title: str, namespace_id: Optional[int] = None
) -> Optional[str]:

Reads the contents of the page with the specified title from the cache file. There is usually no need to call this function explicitly, as Wtp.process() and Wtp.reprocess() normally load the page automatically. This function does not automatically call Wtp.start_page().

Arguments are:

This returns the page contents as a string, or None if the page does not exist.

def parse(
    self,
    text: str,
    pre_expand=False,
    expand_all=False,
    additional_expand=None,
    do_not_pre_expand=None,
    template_fn=None,
    post_template_fn=None,
) -> WikiNode:

Parses wikitext into a parse tree (WikiNode), optionally expanding some or all of the templates and Lua macros in the wikitext (using the definitions for the templates and macros in the cache files, as added by Wtp.process() or calls to Wtp.add_page()).

The Wtp.start_page() function must be called before this function to set the page title (which may be used by templates, Lua macros, and error messages). The Wtp.process() and Wtp.reprocess() functions will call it automatically.

This accepts the following arguments:

This returns the parse tree. See below for documentation of the WikiNode class used for representing the parse tree.

def node_to_wikitext(self, node)

Converts a part of a parse tree back to wikitext.

def expand(self, text, template_fn=None, post_template_fn=None,
           pre_expand=False, templates_to_expand=None,
           expand_parserfns=True, expand_invoke=True)

Expands the selected templates, parser functions and Lua macros in the given Wikitext. This can selectively expand some or all templates. This can also capture the arguments and/or the expansion of any template as well as substitute custom expansions instead of the default expansions.

The Wtp.start_page() function must be called before this function to set the page title (which may be used by templates and Lua macros). The Wtp.process() and Wtp.reprocess() functions will call it automatically. The page title is also used in error messages.

The arguments are as follows:

def start_page(self, title)

This function should be called before starting the processing of a new page or file. This saves the page title (which is frequently accessed by templates, parser functions, and Lua macros). The page title is also used in error messages.

The Wtp.process() and Wtp.reprocess() functions will automatically call this before calling the page handler for each page. This needs to be called manually when processing wikitext obtained from other sources.
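When wikitext comes from somewhere other than a dump, the manual call might look like the sketch below. expand_snippet is a hypothetical helper; it assumes wtp is an already constructed Wtp instance and uses only Wtp.start_page() and Wtp.expand() as described in this document.

```python
def expand_snippet(wtp, title: str, wikitext: str) -> str:
    """Expand wikitext that did not come from a dump.

    `wtp` is assumed to be a wikitextprocessor.Wtp instance.
    """
    # Set the page title first: templates, parser functions, Lua macros
    # and error messages may all refer to it.
    wtp.start_page(title)
    return wtp.expand(wikitext)
```

Keeping the two calls in one helper makes it harder to forget start_page() when handling ad hoc text.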

The arguments are as follows:

def start_section(self, title)

Sets the title of the current section on the page. This is automatically reset to None by Wtp.start_page(). The section title is only used in error, warning, and debug messages.

The arguments are:

def start_subsection(self, title)

Sets the title of the current subsection of the current section on the page. This is automatically reset to None by Wtp.start_page() and Wtp.start_section(). The subsection title is only used in error, warning, and debug messages.

The arguments are:

def add_page(self, title: str, namespace_id: int, body: Optional[str] = None,
             redirect_to: Optional[str] = None, need_pre_expand: bool = False,
             model: str = "wikitext") -> None:

This function is used to add pages, templates, and modules for processing. There is usually no need to use this if Wtp.process() is used; however, this can be used to add templates and pages for testing or other special processing needs.

The arguments are:

The Wtp.analyze_templates() function needs to be called after calling Wtp.add_page() before pages can be expanded or parsed (it should preferably only be called once after adding all pages and templates).

def analyze_templates(self)

Analyzes the template definitions in the cache file and determines which of them should be pre-expanded before parsing because they affect the document structure significantly. Some templates in, e.g., Wiktionary expand to table start tags, table end tags, or list items, and parsing results are generally much better if they are expanded before parsing. The actual expansion only happens if pre_expand or some other argument to Wtp.expand() or Wtp.parse() tells them to do so.

The analysis is heuristic and is not guaranteed to find every such template. In particular, it cannot detect templates that call Lua modules that output Wikitext control structures (there are several templates in Wiktionary that call Lua code that outputs list items, for example). Such templates may need to be identified manually and specified as additional templates to expand. Luckily, there seem to be relatively few such templates, at least in Wiktionary.

This function is automatically called by Wtp.process() at the end of phase 1. An explicit call is only necessary if Wtp.add_page() has been used by the application.
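For testing, the order of calls described above (add pages, analyze once, then start a page and expand) could be wrapped as in this sketch. expand_with_templates is a hypothetical helper, wtp is assumed to be a Wtp instance, and the "Template:" title prefix is an assumption about how template titles are stored; namespace id 10 is MediaWiki's Template namespace.

```python
TEMPLATE_NS = 10  # MediaWiki's Template namespace id


def expand_with_templates(wtp, templates: dict, title: str, wikitext: str) -> str:
    """Register ad hoc templates, then expand `wikitext` on page `title`."""
    for name, body in templates.items():
        # Assumption: template titles carry the "Template:" prefix.
        wtp.add_page(f"Template:{name}", TEMPLATE_NS, body)
    # Run the analysis once, after all pages and templates have been added.
    wtp.analyze_templates()
    wtp.start_page(title)
    return wtp.expand(wikitext)
```

A call such as expand_with_templates(wtp, {"greet": "Hello"}, "Sandbox", "{{greet}}") would then exercise a single template in isolation.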

Error handling

Various functions in this module, including Wtp.parse() and Wtp.expand(), may generate errors and warnings. Those will be displayed on stdout as well as collected in Wtp.errors, Wtp.warnings, and Wtp.debugs. These fields will contain lists of dictionaries, where each dictionary describes an error/warning/debug message. The dictionary can have the following keys (not all of them are always present):

The fields containing the error messages will be cleared by every call to Wtp.start_page() (including the implicit calls during Wtp.process() and Wtp.reprocess()). Thus, the page_handler function often returns these lists together with any information extracted from the page, and they can be collected together from the values returned by the iterators returned by these functions. The Wtp.to_return() function may be useful for this.
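One way to collect per-page messages is sketched below. It assumes each page_handler call returns a pair of (extracted data, message dictionary) and that the message dictionary uses the keys "errors", "warnings", and "debugs", mirroring the Wtp field names; merge_messages is a hypothetical helper, not part of this package.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Assumed key names, mirroring Wtp.errors, Wtp.warnings and Wtp.debugs.
MESSAGE_KEYS = ("errors", "warnings", "debugs")


def merge_messages(
    results: Iterable[Tuple[Any, Dict[str, List[dict]]]]
) -> Tuple[List[Any], Dict[str, List[dict]]]:
    """Split handler results into extracted data and merged message lists."""
    merged: Dict[str, List[dict]] = {key: [] for key in MESSAGE_KEYS}
    data: List[Any] = []
    for extracted, messages in results:
        data.append(extracted)
        for key in MESSAGE_KEYS:
            # Tolerate handlers that omit a key entirely.
            merged[key].extend(messages.get(key, []))
    return data, merged
```

Because the merge happens in the caller, it works regardless of which process produced each page's result.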

The following functions can be used for reporting errors. These can also be called by application code from within the page_handler function as well as template_fn and post_template_fn functions to report errors, warnings, and debug messages in a uniform way.

def error(self, msg, trace=None)

Reports an error message. The error will be added to Wtp.errors list and printed to stdout. The arguments are:

def warning(self, msg, trace=None)

Reports a warning message. The warning will be added to Wtp.warnings list and printed to stdout. The arguments are the same as for Wtp.error().

def debug(self, msg, trace=None)

Reports a debug message. The message will be added to Wtp.debugs list and printed to stdout. The arguments are the same as for Wtp.error().
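Application code can use the same reporting functions for its own checks. The sketch below assumes wtp is a Wtp instance; report_body_problems and the max_chars threshold are hypothetical and only illustrate calling Wtp.warning() and Wtp.debug() after Wtp.start_page().

```python
def report_body_problems(wtp, title: str, body: str, max_chars: int = 100_000) -> bool:
    """Report simple problems with a page body; return False if it is unusable."""
    wtp.start_page(title)
    if not body.strip():
        # Recorded in wtp.warnings and printed to stdout.
        wtp.warning("page body is empty")
        return False
    if len(body) > max_chars:
        # Recorded in wtp.debugs; purely informational.
        wtp.debug(f"page body is long: {len(body)} characters")
    return True
```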

def to_return(self)

Produces a dictionary containing the error, warning, and debug messages from Wtp. This would typically be called at the end of a page_handler function and the value returned along with whatever data was extracted from that page. The error lists are reset by Wtp.start_page() (including the implicit calls from Wtp.process() and Wtp.reprocess()), so they should be saved (e.g., by this call) for each page. (Given the parallelism in the processing of the pages, they cannot just be accumulated in the subprocesses.)

The returned dictionary contains the following keys:

class WikiNode

The WikiNode class represents a parse tree node and is returned by Wtp.parse(). This object can be printed or converted to a string and will display a human-readable format that is suitable for debugging purposes (at least for small parse trees).
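For larger trees, a summary can be easier to read than the printed form. The sketch below counts nodes by kind; it assumes each node exposes kind and children attributes and that plain strings appear as leaf children. Those attribute names are assumptions here, and count_node_kinds is a hypothetical helper.

```python
from collections import Counter


def count_node_kinds(node, counts=None) -> Counter:
    """Count nodes by kind in a parse tree.

    Assumes each node has `kind` and `children` attributes and that
    plain strings appear as leaf children.
    """
    if counts is None:
        counts = Counter()
    if isinstance(node, str):
        # Text leaves carry no kind; skip them.
        return counts
    counts[node.kind] += 1
    for child in node.children:
        count_node_kinds(child, counts)
    return counts
```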

The WikiNode objects have the following fields:

class NodeKind(enum.Enum)

The NodeKind type is an enumerated value for parse tree (WikiNode) node types. Currently the following values are used (typically these need to be prefixed by NodeKind., e.g., NodeKind.LEVEL2):

Expected performance

This can generally process a few Wiktionary pages per second per processor core, including expansion of all templates, Lua macros, parsing the full page, and analyzing the parse. On a multi-core machine, this can generally process a few dozen to a few hundred pages per second, depending on the speed and the number of the cores.
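These rates translate into a rough run-time estimate. The sketch below assumes throughput scales linearly with the number of cores, which is an idealization; the page count, per-core rate, and core count in the example are hypothetical inputs, not measurements.

```python
def estimated_hours(pages: int, pages_per_sec_per_core: float, cores: int) -> float:
    """Rough wall-clock estimate, assuming linear scaling across cores."""
    if pages_per_sec_per_core <= 0 or cores < 1:
        raise ValueError("rate must be positive and cores must be at least 1")
    seconds = pages / (pages_per_sec_per_core * cores)
    return seconds / 3600


# Hypothetical inputs: 1,000,000 pages, 3 pages/s per core, 8 cores.
print(f"{estimated_hours(1_000_000, 3, 8):.1f} hours")  # prints "11.6 hours"
```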

Most of the processing effort goes to expanding Lua macros. You can elect not to expand Lua macros, but they are used extensively in Wiktionary and for important information. Expanding templates and Lua macros allows much more robust and complete data extraction, but does not come cheap.

Contributing and bug reports

Please create an issue on GitHub to report bugs or to contribute!