tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.
Other
93 stars 23 forks source link

Fix: mistaken implementation of mw.text.decode #244

Closed kristian-clausal closed 7 months ago

kristian-clausal commented 7 months ago

Our implementation of mw.text.decode was broken due to a misreading of https://www.mediawiki.org/wiki/Extension:Scribunto/Lua_reference_manual#mw.text.decode

Here it is said that

If boolean decodeNamedEntities is omitted or false, the only named entities recognized are &lt; (<), &gt; (>), &amp; (&), &quot; (") and &nbsp; (the non-breaking space, U+00A0). Otherwise, the list of HTML5 named entities to recognize is loaded from PHP's get_html_translation_table function.

Tatu read this as "if decodeNamedEntities is omitted or false, decode only these values", but it turns out that decodeNamedEntities is only required to decode name entities that aren't in this list; all number values are still decoded. I have no idea why this is like this, and I had the same misunderstanding when first looking at the documentation, but the Scribunto source code was clearer.

This was causing a ton of weird Lua errors in etymology templates, because there are bits of code that rely on the presence of a *asterisk, used in historical linguistics to denote a form of a word that is only reconstructed and not attested.

So when these entities were encoded into HTML entities (for some reason) and then the decode failed because our implementation was wonky, it broke the etymology template and it took me the hole day of hunt-and-peck print debugging to finally find where the fault was.

xxyzz commented 7 months ago

Should add the Scribunto code link and some tests, the description of the problem is confusing...

And please also add the Lua error message and error page so we could know what error this commit was trying to fix.

kristian-clausal commented 7 months ago

A-Fa-f is not a typo, it's for hexadecimal numerals!

This is not scribunto HTML entities, these are literal HTML-entities, so the 'X' is allowed!

Revert the changes.

kristian-clausal commented 7 months ago

The path of the error is that HTML entities for asterisks (&#xsomething) were being generated in the module code, but our decoder was not able to decode any HTML-entity numerals because it was written wrong. This lead to parts of the module code to split the string it got on the # and only use the & instead of *protoword, which broke it.

kristian-clausal commented 7 months ago

Looking at the mw.text.decode implementation's code for Scribunto, you are correct that it doesn't accept capital 'X'. The module code has some patterns that have [xX]? in this location, but that's wrong on their part.

xxyzz commented 7 months ago

I only looked the Scribunto code and didn't realize it's for hex string, I guess "a-z" also works so the Scribunto developers keep the wrong pattern.

kristian-clausal commented 7 months ago

Our pattern is just more discerning, theirs is a Lua pattern which has to be simpler.

&(lt|gt|amp|quot|nbsp|#[xX]?[0-9A-Fa-f]+);

Is allowed python regex, and in this case it is used only when the named entities are restricted, while theirs is

'(&(#?x?)([a-zA-Z0-9]+);)'

Because Lua patterns don't have pipe-separated alternatives (IIRC????), so they have to do this. It's slightly less correct, because it would accept &#xQWERTY.

xxyzz commented 7 months ago

I'll create a pr to restore your previous pattern. And could you explain how this pr fixes the error in page galar?

kristian-clausal commented 7 months ago

"The path of the error is that HTML entities for asterisks (&#xsomething) were being generated in the module code, but our decoder was not able to decode any HTML-entity numerals because it was written wrong. This lead to parts of the module code to split the string it got on the # and only use the & instead of *protoword, which broke it."

The etymology modules turn * into HTML entities on purpose, then that HTML entity failed to decode because it was encoded as a numeral, and the string was later split in the '#' leaving a string with only '&' instead of a string like '*protoword'. This is what happened. The decoding failing meant that a string was broken, which caused errors.

xxyzz commented 7 months ago

Sorry I missed that comment. IIUC, encode and decode of the * character in the "inh" template argument happens somewhere in Wiktionary 's Lua module code but not relevant to our encode and decode arguments Python code? I think I made some changes to the encode and decode template arguments sometime ago.

kristian-clausal commented 7 months ago

Our implementation of decode was broken because it would not decode numerals when it should (=always).

The argument given to decode() is about whether you only allow "nbsp" and the other listed named entities to be decoded, or also other named entities from a larger list. For example, if decodeNamedEntities is set to true, then is when decode() will decode &plusmn;; but otherwise, it will not, so when decodeNamedEntities is not set or false it wouldn't.

But the original implementation would not decode numeral entities when decodeNamedEntities was not set because of misunderstanding the documentation. Numerals should always be decoded.

xxyzz commented 7 months ago

I mean it's not our code that convert the argument like here: https://github.com/tatuylonen/wikitextprocessor/blob/5f6ddbb846f305add1052e855e2b8fd380fde2f7/src/wikitextprocessor/luaexec.py#L478, at first I though it might be code like this causes the error but I guess this is irrelevant.

kristian-clausal commented 7 months ago

Yeah, the issue was that Wiktionary-side modules converted stuff (including '*') into HTML-entities; the '*' is weird, but it's done in a function called 'risky' something or other, so maybe there's a reason.

kristian-clausal commented 7 months ago

While on my lunch-break, I realized why the pattern in Scribunto, (&(#?x?)([a-zA-Z0-9]+);) doesn't break when it has 'X' instead of 'x'... The [a-zA-Z] also includes 'X'. The x? doesn't do anything in the pattern, :smile: .