tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Unescape "&#42;" to "*" in `mw.uri.anchorEncode()` #276

Closed xxyzz closed 5 months ago

xxyzz commented 5 months ago

I still don't know exactly how MediaWiki implements anchorEncode, but testing in Wiktionary's Lua debug console shows that this function unescapes "&#42;" to "*".

This change fixes the "The specified language Proto-Turkic is unattested, while the given word is not marked with '*' to indicate that it is reconstructed." Lua errors on "Reconstruction" pages.
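To illustrate why the unescaping matters, here is a minimal Python sketch. It is not the code from this PR and not MediaWiki's implementation: `anchor_encode_sketch` is a hypothetical helper, and the link-stripping and whitespace rules are simplified assumptions. It only shows that decoding HTML character references before the first-character check turns "&#42;" back into "*".

```python
import html
import re


def anchor_encode_sketch(text: str) -> str:
    """Hypothetical approximation of mw.uri.anchorEncode (assumption, not MediaWiki's code)."""
    # Drop wikilink brackets and keep the display text: [[target|label]] -> label.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Decode HTML character references, so "&#42;" becomes "*".
    text = html.unescape(text)
    # Replace runs of whitespace with underscores, as anchors do for spaces.
    return re.sub(r"\s+", "_", text.strip())


# Same first-character test that Module:links performs on the encoded text.
escaped_term = "&#42;foo"  # hypothetical escaped reconstructed term
is_reconstructed = anchor_encode_sketch(escaped_term)[:1] == "*"
```

Without the `html.unescape` step the encoded text would start with "&", so the check for a leading "*" would fail and the module would raise the "unattested" error quoted above.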

MediaWiki code:
https://github.com/wikimedia/mediawiki-extensions-Scribunto/blob/755b549fe66628a2891e9a61a9abade238dd0e9b/includes/Engines/LuaCommon/UriLibrary.php#L29-L33
https://github.com/wikimedia/mediawiki/blob/6592072169f1c25d43723e0956c701855aa4c6ab/includes/parser/CoreParserFunctions.php#L1058-L1062

Lua error: https://kaikki.org/dictionary/All%20languages%20combined/errors/details-The-specified-language-Proto-Turkic-is-yS4Aqcfj.html#LUA-error-in--invoke--links-templates----l_term_t----notself-1---parent---Template-l-self----1---trk-pro---2------usgay-----

More details:

Lua code: https://en.wiktionary.org/wiki/Module:links#L-365--L-375

-- Find embedded links and ensure they link to the correct section.
local function process_embedded_links(text, data, plain)
    -- Process the non-linked text.
    text = data.lang:makeDisplayText(text, data.sc[1], true)  -- "*" is escaped to "&#42;" here

    -- If the text begins with * and another character, then act as if each link begins with *. However, don't do this if the * is contained within a link at the start. E.g. `|*[[foo]]` would set all_reconstructed to true, while `|[[*foo]]` would not.
    local all_reconstructed = false
    if not plain then
        -- anchorEncode removes links etc.
        if anchorEncode(text):sub(1, 1) == "*" then   -- anchorEncode converts "&#42;" to "*"
            all_reconstructed = true