tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

`{{43e}}` not expanded #298

Closed: LeMoussel closed this issue 1 month ago

LeMoussel commented 1 month ago
from wikitextprocessor import Wtp
from wiktextract.config import WiktionaryConfig
from wiktextract.wxr_context import WiktextractContext

# Configure extraction for French content
wiki_config = WiktionaryConfig()
wiki_config.dump_file_lang_code = "fr"
wiki_config.capture_language_codes = ["fr", "mul"]

# Open the database for the French Wikipedia dump
wxr = WiktextractContext(
    wtp=Wtp(
        db_path="fr-wiki-latest.db",
        lang_code="fr",
        project="wikipedia",
    ),
    config=wiki_config,
)

# Expand the template in the context of a dummy page
wxr.wtp.start_page("Test page")
ret = wxr.wtp.expand("{{43e}}")
print(ret)

This prints [[:Modèle:43e]]

The result should be 43e.

Ref: Modèle:43e

kristian-clausal commented 1 month ago

With the same code, I am getting

<abbr class="abbr" title="Quarante-troisième">43<sup>e</sup></abbr>

on an old dump from March. Output of the form [[:Namespace:templatename]] seems to indicate that the template in question could not be found. Is your dump from the fairly recent period when Wikimedia dump files were corrupted or missing files?
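To spot this kind of failure across many expansions, the output can be matched against the [[:Namespace:templatename]] shape. This is only a heuristic sketch: the helper name `missing_template_target` is made up for illustration and is not part of wikitextprocessor, and a page could legitimately contain a bare link of this form.

```python
import re
from typing import Optional, Tuple

# Matches output that consists solely of a link like [[:Modèle:43e]].
# The pattern is an assumption based on the output shown in this issue.
_FALLBACK_LINK = re.compile(r"\[\[:([^\[\]:|]+):([^\[\]|]+)\]\]")


def missing_template_target(expanded: str) -> Optional[Tuple[str, str]]:
    """Return (namespace, template name) if the text looks like the
    fallback link for a template that was not found, else None."""
    match = _FALLBACK_LINK.fullmatch(expanded.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Calling it on the value returned by `wxr.wtp.expand(...)` gives a quick way to list templates that are probably missing from the database.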

EDIT: If you used the --pages-dir parameter when creating the dump file, you can easily check whether a $pagesdir/Modèle/43e.txt file exists.
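That check can also be scripted. A minimal sketch, assuming the pages directory really uses the `<namespace>/<title>.txt` layout suggested by the path above; the function name `template_file_exists` is hypothetical, and titles with special characters may be stored under a different file name.

```python
from pathlib import Path


def template_file_exists(pages_dir: str, namespace: str, title: str) -> bool:
    """Check for <pages_dir>/<namespace>/<title>.txt.

    The directory layout is an assumption taken from the example path
    $pagesdir/Modèle/43e.txt, not from the tool's documentation.
    """
    return (Path(pages_dir) / namespace / f"{title}.txt").is_file()
```

For the case in this issue, that would be `template_file_exists(pages_dir, "Modèle", "43e")`, with `pages_dir` set to whatever was passed to --pages-dir.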

LeMoussel commented 1 month ago

OK. I will regenerate the files from a recent Wikimedia dump.