tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Proposal: move analyze template code to wiktextract extractor code #316

Closed xxyzz closed 4 days ago

xxyzz commented 5 days ago

Currently only the en edition uses Wtp.analyze_templates() to find which templates need pre-expansion. Some non-en editions, such as zh and de, override some heading templates to mark them as needing pre-expansion, and they also have to change the page text to heading wikitext. In the nl edition, however, all sections are expanded from templates. Overriding all of them would create a long override JSON file that is difficult to maintain, and these templates also use some if functions to create category links.

I'd like to suggest moving Wtp._analyze_template() to the en edition folder of the wiktextract package, passing this function to dumpparser.process_dump(), and then passing it on to Wtp.analyze_templates(). I also want _analyze_template() to return only a bool, because I think we could get the same result by changing this line so that all templates used in a pre-expanded template are expanded:

https://github.com/tatuylonen/wikitextprocessor/blob/59b8406ffb5149720701f2f8b2aae732f731ea39/src/wikitextprocessor/core.py#L1664-L1666

t = expand_recurse(
    encoded_body, new_parent, expand_all or template_page.need_pre_expand
)
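To illustrate how that one-line change would propagate the flag, here is a minimal toy sketch. It is not the real wikitextprocessor code: `TemplatePage`, `TEMPLATE_RE`, and this simplified `expand_recurse()` signature are assumptions made for the example, and only argument-less `{{name}}` calls are handled.

```python
import re
from dataclasses import dataclass


@dataclass
class TemplatePage:
    """Hypothetical stand-in for a stored template page."""
    body: str
    need_pre_expand: bool = False


# Toy pattern: matches only argument-less calls such as {{name}}.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)\}\}")


def expand_recurse(text: str, pages: dict[str, TemplatePage], expand_all: bool) -> str:
    """Expand templates that are flagged for pre-expansion, or all of them
    once we are inside a pre-expanded template."""

    def repl(m: re.Match) -> str:
        name = m.group(1).strip()
        page = pages.get(name)
        if page is None or not (expand_all or page.need_pre_expand):
            return m.group(0)  # leave the template call untouched
        # Mirrors the proposed line: everything nested inside a
        # pre-expanded template is expanded too.
        return expand_recurse(page.body, pages, expand_all or page.need_pre_expand)

    return TEMPLATE_RE.sub(repl, text)


pages = {
    "outer": TemplatePage("== {{inner}} ==", need_pre_expand=True),
    "inner": TemplatePage("Noun"),
    "other": TemplatePage("unrelated"),
}
print(expand_recurse("{{outer}} {{other}}", pages, False))
```

In this sketch `inner` is expanded only because it is used inside `outer`, while `other` is left alone, so a single bool per template is enough.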

I think the conditions for deciding whether a template needs pre-expansion vary between editions and can't be shared without unintended results. For example, in the nl edition we only need to check whether the template name starts and ends with "=" or "-", whereas the en code checks whether the template has lists or unclosed HTML tags.
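A rough sketch of what per-edition checks passed in as a callback could look like. All names here (`AnalyzeTemplate`, `nl_needs_pre_expand`, `en_needs_pre_expand`, `find_pre_expand`) are hypothetical, and the en check is a crude approximation of "has lists or unclosed HTML tags", not the existing implementation. The nl check also assumes the leading and trailing characters do not have to be the same.

```python
import re
from typing import Callable

# Hypothetical callback type: (template name, template body) -> needs pre-expand.
AnalyzeTemplate = Callable[[str, str], bool]


def nl_needs_pre_expand(name: str, body: str) -> bool:
    """nl edition: only the template name matters, e.g. "=nld=" or "-noun-"."""
    name = name.strip()
    return len(name) >= 2 and name[0] in "=-" and name[-1] in "=-"


def en_needs_pre_expand(name: str, body: str) -> bool:
    """en edition (approximation): list markup or unbalanced HTML tags."""
    # A line starting with a wikitext list marker.
    if re.search(r"(?m)^[*#:;]", body):
        return True
    # Crude balance check for a few block-level tags (assumed subset).
    for tag in ("div", "table", "span"):
        opens = len(re.findall(rf"<{tag}\b[^>]*>", body, flags=re.IGNORECASE))
        closes = len(re.findall(rf"</{tag}\s*>", body, flags=re.IGNORECASE))
        if opens != closes:
            return True
    return False


def find_pre_expand(templates: dict[str, str], check: AnalyzeTemplate) -> set[str]:
    """Apply the edition-specific check to every template."""
    return {name for name, body in templates.items() if check(name, body)}


templates = {"=nld=": "Nederlands", "-noun-": "Zelfstandig naamwoord", "foo": "* item"}
print(sorted(find_pre_expand(templates, nl_needs_pre_expand)))
print(sorted(find_pre_expand(templates, en_needs_pre_expand)))
```

Each edition would supply its own predicate, so the shared code only needs to call it and store the resulting bool.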

kristian-clausal commented 5 days ago

Sounds good. It's unfortunate that we can't reuse the code and that the Wiktionary editions are so splintered, but we can't do much about that.