Closed kristian-clausal closed 7 months ago
Should add the Scribunto code link and some tests, the description of the problem is confusing...
And please also add the Lua error message and error page so we could know what error this commit was trying to fix.
A-Fa-f is not a typo, it's for hexadecimal numerals!
This is not scribunto HTML entities, these are literal HTML-entities, so the 'X' is allowed!
Revert the changes.
The path of the error is that HTML entities for asterisks (&#xsomething
) were being generated in the module code, but our decoder was not able to decode any HTML-entity numerals because it was written wrong. This lead to parts of the module code to split the string it got on the #
and only use the &
instead of *protoword
, which broke it.
Looking at the mw.text.decode implementation's code for Scribunto, you are correct that it doesn't accept capital 'X'. The module code has some patterns that have [xX]?
in this location, but that's wrong on their part.
I only looked the Scribunto code and didn't realize it's for hex string, I guess "a-z" also works so the Scribunto developers keep the wrong pattern.
Our pattern is just more discerning, theirs is a Lua pattern which has to be simpler.
&(lt|gt|amp|quot|nbsp|#[xX]?[0-9A-Fa-f]+);
Is allowed python regex, and in this case it is used only when the named entities are restricted, while theirs is
'(&(#?x?)([a-zA-Z0-9]+);)'
Because Lua patterns don't have pipe-separated alternatives (IIRC????), so they have to do this. It's slightly less correct, because it would accept &#xQWERTY
.
I'll create a pr to restore your previous pattern. And could you explain how this pr fixes the error in page galar?
"The path of the error is that HTML entities for asterisks (&#xsomething) were being generated in the module code, but our decoder was not able to decode any HTML-entity numerals because it was written wrong. This lead to parts of the module code to split the string it got on the # and only use the & instead of *protoword, which broke it."
The etymology modules turn *
into HTML entities on purpose, then that HTML entity failed to decode because it was encoded as a numeral, and the string was later split in the '#' leaving a string with only '&' instead of a string like '*protoword'. This is what happened. The decoding failing meant that a string was broken, which caused errors.
Sorry I missed that comment. IIUC, encode and decode of the *
character in the "inh" template argument happens somewhere in Wiktionary 's Lua module code but not relevant to our encode and decode arguments Python code? I think I made some changes to the encode and decode template arguments sometime ago.
Our implementation of decode was broken because it would not decode numerals when it should (=always).
The argument given to decode() is about whether you only allow "nbsp" and the other listed named entities to be decoded, or also other named entities from a larger list. For example, if decodeNamedEntities
is set to true, then is when decode() will decode ±
; but otherwise, it will not, so when decodeNamedEntities
is not set or false it wouldn't.
But the original implementation would not decode numeral entities when decodeNamedEntities was not set because of misunderstanding the documentation. Numerals should always be decoded.
I mean it's not our code that convert the argument like here: https://github.com/tatuylonen/wikitextprocessor/blob/5f6ddbb846f305add1052e855e2b8fd380fde2f7/src/wikitextprocessor/luaexec.py#L478, at first I though it might be code like this causes the error but I guess this is irrelevant.
Yeah, the issue was that Wiktionary-side modules converted stuff (including '*') into HTML-entities; the '*' is weird, but it's done in a function called 'risky' something or other, so maybe there's a reason.
While on my lunch-break, I realized why the pattern in Scribunto, (&(#?x?)([a-zA-Z0-9]+);)
doesn't break when it has 'X' instead of 'x'... The [a-zA-Z] also includes 'X'. The x?
doesn't do anything in the pattern, :smile: .
Our implementation of mw.text.decode was broken due to a misreading of https://www.mediawiki.org/wiki/Extension:Scribunto/Lua_reference_manual#mw.text.decode
Here it is said that
Tatu read this as "if decodeNamedEntities is omitted or false, decode only these values", but it turns out that
decodeNamedEntities
is only required to decode name entities that aren't in this list; all number values are still decoded. I have no idea why this is like this, and I had the same misunderstanding when first looking at the documentation, but the Scribunto source code was clearer.This was causing a ton of weird Lua errors in etymology templates, because there are bits of code that rely on the presence of a *asterisk, used in historical linguistics to denote a form of a word that is only reconstructed and not attested.
So when these entities were encoded into HTML entities (for some reason) and then the decode failed because our implementation was wonky, it broke the etymology template and it took me the hole day of hunt-and-peck print debugging to finally find where the fault was.