tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

assert error at src/parse.py ln 2287 #282

Closed kylefoley76 closed 2 months ago

kylefoley76 commented 2 months ago

I'm using Python 3.11. I ran the unit tests and they passed, with the output: Ran 925 tests in 16.480s OK

I'm now using the API as shown in the README. The first few lines of that example are:

from functools import partial
from typing import Any

from wikitextprocessor import Wtp, WikiNode, NodeKind, Page
from wikitextprocessor.dumpparser import process_dump

def page_handler(wtp: Wtp, page: Page) -> Any:
    # process parse tree
    tree = wtp.parse(page.body)
    # or get expanded plain text
    text = wtp.expand(page.body)

wtp = Wtp(
    db_path="en_20230801.db", lang_code="en", project="wiktionary"
)

When my code executes the following function, the assertion throws an error:

def parse_encoded(ctx: "Wtp", text: str) -> WikiNode:
    """Parses the text, which should already have been encoded using magic
    characters (see Wtp._encode()).  Parses the encoded string and returns
    the parse tree."""
    assert ctx.title is not None  # ctx.start_page() must have been called
    node = WikiNode(NodeKind.ROOT, 0)
    node.largs = [[ctx.title]]
    ctx.beginning_of_line = True
    ctx.wsp_beginning_of_line = False
    ctx.linenum = 1
    ctx.pre_parse = False

I tried to track down the source of the error, and the following appears to be the culprit. Line 827 of core.py reads:

text = TEMPLATES_RE.sub(repl_templ, text)

The first text my code encounters begins as follows (first 100 characters):

{{also|Dictionary}}
==English==
{{was wotd|2022|December|12}}
===Etymology===
{{root|en|ine-pro|*de

and the TEMPLATES_RE code is as follows:

TEMPLATES = (
    r"\{" + MAGIC_NOWIKI_CHAR + r"?\{((?:"
    r"[^{}]{?|"  # lone possible { and also default "any"
    r"}(?=[^{}])|"  # lone `}`, (?=...) is not consumed (lookahead)
    r"-{}-|"  # GitHub issue #59 Chinese wiktionary special `-{}-`
    r"}{|"  # latex argument: "<math>\frac{1}{2}</math>"
    r")+?)\}" + MAGIC_NOWIKI_CHAR + r"?\}"
)

TEMPLATES_RE = re.compile(TEMPLATES)

The above code has me stumped; I haven't worked with code that complicated before. In any case, my output is:

􂁙
==English==
􂁚
===Etymology===
􂁛
From 􂁜, 􂁝 from 􂁞, from 􂁟, from 􂁠, from 􂁡, perfect past participle

This prevents the code from finding a title and leads to the assertion error. The full output is:
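For readers puzzling over the same regular expression, here is a minimal sketch of what the substitution step appears to do, using only the standard library. The pattern is copied from the report; the values of `MAGIC_NOWIKI_CHAR` and `PLACEHOLDER_BASE`, and the helper `toy_encode`, are stand-ins invented for illustration and are not the library's own definitions. It suggests that placeholder characters in the output are the expected result of encoding templates, while the failing assertion checks `ctx.title`.

```python
import re

# Assumption: stand-in code points; the library defines its own values.
MAGIC_NOWIKI_CHAR = "\U0010203D"
PLACEHOLDER_BASE = 0x00102040

# Pattern copied from the report above.
TEMPLATES_RE = re.compile(
    r"\{" + MAGIC_NOWIKI_CHAR + r"?\{((?:"
    r"[^{}]{?|"        # any non-brace character, optionally followed by one {
    r"}(?=[^{}])|"     # a lone } that is not followed by another brace
    r"-{}-|"           # the special sequence -{}-
    r"}{|"             # } directly followed by {, as in LaTeX arguments
    r")+?)\}" + MAGIC_NOWIKI_CHAR + r"?\}"
)


def toy_encode(text: str) -> tuple[str, list[str]]:
    """Hypothetical helper: replace each {{...}} with one placeholder character.

    Returns the encoded text and the list of saved template bodies.
    """
    saved: list[str] = []

    def repl_templ(m: re.Match) -> str:
        # Save the template body and return a single private-use character.
        saved.append(m.group(1))
        return chr(PLACEHOLDER_BASE + len(saved) - 1)

    # The pattern only matches templates with no nested {{...}} inside,
    # so repeat until nothing changes to handle nesting from the inside out.
    while True:
        new_text = TEMPLATES_RE.sub(repl_templ, text)
        if new_text == text:
            return text, saved
        text = new_text


encoded, saved = toy_encode("{{also|Dictionary}}\n==English==")
print(saved)  # the template body that was replaced by a placeholder
```

Each template in the input collapses to a single private-use character, which matches the shape of the output shown in this report.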

􂁙
==English==
􂁚
===Etymology===
􂁛
From 􂁜, 􂁝 from 􂁞, from 􂁟, from 􂁠, from 􂁡, perfect past participle of 􂁢 + 􂁣. 􂁤.
===Pronunciation===
* 􂁥 􂁦
* 􂁧
* 􂁨 􂁩, 􂁪
* 􂁫
* 􂁬
* 􂁭
===Noun===
􂁮
􂁁
# A 􂁂 with a list of 􂁃s from one or more languages, normally ordered 􂁄ly, explaining each word's 􂁅 (􂁯), and sometimes also containing information on its 􂁆, 􂁇, 􂁰, 􂁈, and 􂁉, as well as other data.
#: 􂁱
#: 􂁲
#: 􂁳
#: 􂁴
#* 􂁵
# 􂊻 A 􂁊 dictionary of a standardised language held to only contain words that are properly part of the language.
#* 􂁷
#* 􂁸
# 􂁹 Any work that has a 􂁋 of 􂁌 organized alphabetically; e.g., 􂁍al dictionary, 􂁎 dictionary.
# 􂁺 An 􂁏, a data structure where each value is referenced by a particular key, analogous to words and definitions in a dictionary (sense 1).
#: 􂁻
#* 􂁼
====Alternative forms====
* 􂁽 􂁾
* 􂁿 􂂀
====Hyponyms====
􂂁
====Derived terms====
􂂂
====Related terms====
* 􂂃
====Translations====
􂊽
====See also====
* 􂊫
* 􂊬
* 􂊭
* 􂊮
* 􂊯
===Verb===
􂊰
# 􂊱 To 􂁖 in a dictionary.
# 􂊱 To 􂁗 to a dictionary.
#* 􂊲
#* 􂊳
# 􂊴 To 􂁘 a dictionary.
#* 􂊵
===Further reading===
* 􂊶
* 􂊷
* 􂊸
===Anagrams===
* 􂊹

xxyzz commented 2 months ago

I guess you mean the assert here? https://github.com/tatuylonen/wikitextprocessor/blob/edd475d7850ed85021454d577920b0ba9914c5fc/src/wikitextprocessor/parser.py#L2285

As the comment says, Wtp.start_page() must be called first. I'll update the usage document.

The example code is updated: https://github.com/tatuylonen/wikitextprocessor/commit/60ae2cb1cb28afa2170572111ceec9495266538d
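The ordering requirement can be illustrated without installing the library. The sketch below uses a hypothetical stand-in class, `MiniWtp`, that models only the title precondition quoted in the report; it is not the library's implementation, and `parse_without_start_page` is likewise an illustrative helper.

```python
from typing import Optional


class MiniWtp:
    """Hypothetical stand-in for Wtp; models only the title precondition."""

    def __init__(self) -> None:
        self.title: Optional[str] = None

    def start_page(self, title: str) -> None:
        # Record which page is being processed.
        self.title = title

    def parse(self, text: str) -> dict:
        # Same check as the assert quoted in the report.
        assert self.title is not None  # start_page() must have been called
        return {"kind": "ROOT", "title": self.title, "text": text}


def page_handler(wtp: MiniWtp, title: str, body: str) -> dict:
    wtp.start_page(title)   # set the title first
    return wtp.parse(body)  # parsing now passes the assertion


def parse_without_start_page(body: str) -> bool:
    """Return True if parsing on a fresh context trips the assertion."""
    try:
        MiniWtp().parse(body)
    except AssertionError:
        return True
    return False


tree = page_handler(MiniWtp(), "dictionary", "==English==")
print(tree["title"])
```

With the real library, the equivalent change is presumably a call to `wtp.start_page(page.title)` at the top of `page_handler`, assuming `Page` exposes a `title` attribute; the linked commit has the authoritative version.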

kristian-clausal commented 2 months ago

This seems to be answered, so I'll close this for now.