table html getting caught up in "form" listings for Azerbaijani suffixes

tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

table html getting caught up in "form" listings for Azerbaijani suffixes #146

Closed jmviz closed 1 year ago

jmviz commented 2 years ago

As of the latest raw data (2022-08-01), there are 120 entries whose listed forms each contain fragments of table HTML. These entries all seem to come from Wiktionary pages that use the template az-suffix-forms (you can see its 107 transclusions here). Here is an example of the wiktextract JSON for the forms field of "-nən" (wiktextract, wiktionary):

"forms": [{"form": "class=\"floatright\" cellpadding=\"5\" cellspacing=\"0\" style=\"background: #ffffff", "tags": ["canonical"]}, {"form": "border: 1px #aaa solid", "tags": ["canonical"]}, {"form": "border-collapse: collapse", "tags": ["canonical"]}, {"form": "margin-top: .5em", "tags": ["canonical"]}, {"form": "\" rules=\"all\" |- ! rowspan=2 style=\"\" | ! colspan=2 style=\"background:#DEDEDE", "tags": ["canonical"]}, {"form": "text-align:center\" | preceding vowel |- ! style=\"background:#DEDEDE", "tags": ["canonical"]}, {"form": "text-align:center\" | A", "tags": ["canonical"]}, {"form": "I", "tags": ["canonical"]}, {"form": "O", "tags": ["canonical"]}, {"form": "U ! style=\"background:#DEDEDE", "tags": ["canonical"]}, {"form": "text-align:center\" | E", "tags": ["canonical"]}, {"form": "Ə", "tags": ["canonical"]}, {"form": "İ", "tags": ["canonical"]}, {"form": "Ö", "tags": ["canonical"]}, {"form": "Ü |- ! style=\"background:#EFEFEF", "tags": ["canonical"]}, {"form": "text-align:center\" | postconsonantal | -nan | -nən |- ! style=\"background:#EFEFEF", "tags": ["canonical"]}, {"form": "text-align:center\" | postvocalic | -ynan | -ynən |}", "tags": ["canonical"]}]

kristian-clausal commented 2 years ago

Oh wow. At first I kind of hoped this could be fixed by moving the {{az-suffix-forms|a=nan|ə=nən|v=y}} template around, but no, it's the floating table causing this. This needs to be looked at by Tatu.

Hopefully this can be fixed on our side, I do not want to wade into trying to change the template on Wiktionary...

jmviz commented 2 years ago

Found some more information on this. I noticed a lot of DEBUG messages of the form:

DEBUG: heuristically added missing } to template arg ! at ['term', 'template']

Where template was one of three templates: az-suffix-forms, tl-conj-table, tl-conj. But tl-conj redirects to tl-conj-table, so really just two.

This debug message is emitted from wikitextprocessor repl_arg_err(), which is only called here.

Looking at the raw wikitext of these two templates, it seems that this might be getting triggered when a table is started with {{{!}} while inside an {{#if:. This happens once in tl-conj-table, and twice in az-suffix-forms (once when starting a table as the "then" parameter of the if, and once when starting a table as the "else" parameter). This matches up with the fact that one DEBUG message is emitted for each invocation of tl-conj-table, while two DEBUG messages are emitted for each invocation of az-suffix-forms.

Interestingly, I spot checked the wiktextract output for a few of the terms on which tl-conj-table is used, and there were no obvious errors like there are with az-suffix-forms. So maybe the error has to do with there being two occurrences of {{{!}} in az-suffix-forms?

kristian-clausal commented 2 years ago

We're pretty sure it's an issue with the html output: the table is actually floating on the right, and the code can't handle it properly. The heuristically added } is pretty common, so I really hope it's not an issue that we have to trawl through.

jmviz commented 2 years ago

Looked into this more. Previously I was looking at the debug log for a local partial extraction. For completeness' sake, I went through the latest wiktextract-error-data.json from the kaikki.org raw data page and looked for every exact match of heuristically added missing } to template arg !. Here is each template that causes it (i.e. the last element of the debug object's path array), together with the number of times each causes it:

request box: 17036
az-suffix-forms: 242
tl-conj: 280
ko-conj-adj: 4
ko-conj-verb: 2
tl-conj-table: 3
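For anyone who wants to reproduce a tally like this, here is a minimal sketch. It assumes the error data has already been loaded into a list of dicts that each carry a "msg" string and a "path" list; the real layout of wiktextract-error-data.json may differ, and count_templates is a made-up helper name.

```python
from collections import Counter

TARGET = "heuristically added missing } to template arg !"


def count_templates(entries, target=TARGET):
    """Count how often each template (last element of "path") emits target."""
    counts = Counter()
    for entry in entries:
        path = entry.get("path") or []
        if entry.get("msg") == target and path:
            counts[path[-1]] += 1
    return counts


# Hand-made sample records for illustration, not real data.
sample = [
    {"msg": TARGET, "path": ["-nən", "az-suffix-forms"]},
    {"msg": TARGET, "path": ["-nən", "az-suffix-forms"]},
    {"msg": TARGET, "path": ["dalhin", "tl-conj-table"]},
    {"msg": "heuristically added missing } to template arg 3",
     "path": ["some page", "is-decl-noun-n-s"]},
]

for template, n in count_templates(sample).most_common():
    print(f"{template}: {n}")
```

Only exact message matches are counted, so the other "template arg N" variants are left out of the tally.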

So there are only a few templates that ever cause this exact imputation. I looked at the source of each of these templates, and they all have a table being started with {{{!}} inside an {{#if:. So, I believe every time this imputation is being called, it's actually a false positive, and causes a valid wikitext table to be broken.

The az-suffix-forms template is generally in the PoS section, not in a child inflection section -- so the table it generates should be ignored by wiktextract, which calls clean_node on text in PoS sections, which removes tables. But because the table wikitext was broken, the regex to look for and remove tables in clean_value doesn't remove anything. So the broken table is read as normal text, and wiktextract adds a bunch of "canonical" forms of split-up html thinking it's reading a very long head line.
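A small sketch of why the table survives the cleanup. TABLE_RE below is a hypothetical stand-in, not the actual pattern in clean_value, and the sample strings are simplified. The point is only that a remover keyed on the {| opener has nothing to match once that opener is lost, which fits the json above, where the first "form" starts at class="floatright" and the last one ends in |}.

```python
import re

# Hypothetical table-stripping regex; the real one in clean_value may differ.
TABLE_RE = re.compile(r"\{\|.*?\|\}", re.DOTALL)


def strip_tables(text):
    """Remove every {| ... |} wikitext table from text."""
    return TABLE_RE.sub("", text)


# Intact table: the {| opener is present, so the whole table is removed.
good = 'head {| class="floatright"\n|-\n| -nan\n|} tail'

# Broken table: the opener is gone, only the body and the closing |} remain.
broken = 'head  class="floatright"\n|-\n| -nan\n|} tail'

print(repr(strip_tables(good)))
print(repr(strip_tables(broken)))
```

The second call returns its input unchanged, so everything after "head" would be read as ordinary head-line text.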

If you comment out the wikitextprocessor code that does this } imputation, the az-suffix-forms table now gets correctly parsed and ignored. If you then change a page so that this template is in an ====Inflection==== section, the table now gets correctly parsed and forms are added.

I also looked at dalhin which uses tl-conj-table in its Conjugation section. If you go to edit this template, and add a } to the end of the one {{{!}}, then preview this change on page dalhin, you can see that the Ability/involuntary (maka-/ma-) verb forms table gets broken. I checked dalhin in wiktextract's latest raw data, and all the forms that are listed in the broken table are missing from its forms in the json. But running wiktextract locally, with that one imputation in the wikitextprocessor code commented out, the table is correctly parsed and none of those forms are missing.

jmviz commented 2 years ago

I also checked to see when else that } imputation gets called:

  17567       "msg": "heuristically added missing } to template arg !",
  11536       "msg": "heuristically added missing } to template arg 3",
    896       "msg": "heuristically added missing } to template arg PAGENAME",
     48       "msg": "heuristically added missing } to template arg 2",
     22       "msg": "heuristically added missing } to template arg 6",
      8       "msg": "heuristically added missing } to template arg erg_pl_cu",
      8       "msg": "heuristically added missing } to template arg book",
      7       "msg": "heuristically added missing } to template arg 5",
      3       "msg": "heuristically added missing } to template arg document",
      2       "msg": "heuristically added missing } to template arg chapter",
      2       "msg": "heuristically added missing } to template arg 1",
      1       "msg": "heuristically added missing } to template arg pageref",
      1       "msg": "heuristically added missing } to template arg ll",

So it gets called elsewhere, and it may very well be acting correctly in the other cases; I haven't looked.

tatuylonen commented 2 years ago

Generally the "heuristically added missing } to template arg"... message means there is a syntax error in the template (or we couldn't parse it). It seems that Mediawiki heuristically accepts some errors (they are so common that I needed to implement some kludges to work reasonably with them).

However, in this case it is clearly a bug. I made some changes to handling of {{!}}, and the az-suffix-forms issue now looks fixed. (Basically, it no longer applies the heuristic to template arguments if there is a ! inside the braces.)
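Roughly, the idea can be sketched like this. It is an illustration only: the regex and the add_missing_brace helper are invented for this example and are not the wikitextprocessor code, which works differently internally.

```python
import re

# Matches "{{{name}}" that is not followed by a third closing brace.
# Illustrative pattern only, not the one wikitextprocessor uses.
MISSING_BRACE_RE = re.compile(r"\{\{\{([^{}|]*)\}\}(?!\})")


def add_missing_brace(text, skip_bang=True):
    """Append the missing } to {{{name}}, optionally leaving {{{!}} alone."""
    def repl(match):
        name = match.group(1)
        if skip_bang and "!" in name:
            # "{{{!}}" is really "{" followed by "{{!}}", so do not touch it.
            return match.group(0)
        return match.group(0) + "}"
    return MISSING_BRACE_RE.sub(repl, text)


table_start = '{{#if:{{{1|}}}|{{{!}} class="floatright"'
print(add_missing_brace(table_start, skip_bang=False))  # old behaviour
print(add_missing_brace(table_start))                   # with the ! exception
print(add_missing_brace("{{{3}} is a real typo"))
```

With the exception in place the table opener is left untouched, while a genuinely unbalanced argument such as {{{3}} still gets repaired.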

However, the other heuristically added } issues are not the same. Looking at is-decl-noun-n-s (for the {{{3}} case), it actually seems to contain an error and I think the heuristic was correct there. Likewise, kl-suffix (for {{{PAGENAME}}) seems to have a bug in the template. The bugs in the templates should be fixed by editing Wiktionary.

I currently have a website regeneration running without this change (the first attempt failed after merging the non-English edition changes from @xxyzz), but it should get included in the next run after that. Hopefully by Monday we'll see these included on https://kaikki.org.

tatuylonen commented 2 years ago

I added the "fix on wiktionary" label because several of these issues need to be fixed on Wiktionary, as far as I know. Could someone please review the list posted by @jmviz, look at which templates the errors occur in, and fix the templates where applicable? The {{!}} cases were probably due to the bug I just (hopefully) fixed, but the others are probably something that needs to be fixed manually in Wiktionary. The diagnostic page https://kaikki.org/dictionary/errors/debug.html will be useful in doing this.

kristian-clausal commented 2 years ago

EDIT: Ignore this post, this is basically what jmviz has already written... Somehow I was able to miss that post.

I think I've figured it out. The issue is the heuristically added }.

In the Azerbaijani table in az-suffix-forms, we have this bit of template code (twice, because the table changes depending on the template parameters): {{#if:{{{1|}}}|{{{!}} class="floatright" cellpadding="5" cellspacing="0" style="background: #ffffff; border: 1px #aaa solid; border-collapse: collapse; margin-top: .5em;" rules="all"

The issue here is that {{{!}} looks naively like a broken template argument, when it isn't. The first { in {{{!}} is part of a larger block that is closed by a paired } later on.
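To make the intended reading concrete, here is a tiny sketch that treats {{!}} as the magic word for a literal pipe. expand_pipe_magic is a made-up helper that does a plain string replacement; a real parser does much more than this.

```python
def expand_pipe_magic(text):
    """Replace the {{!}} magic word with a literal pipe character."""
    return text.replace("{{!}}", "|")


snippet = '{{#if:{{{1|}}}|{{{!}} class="floatright"'

# The leftover "{" joins the "|" to form the table opener "{|".
print(expand_pipe_magic(snippet))
```

Read this way, the three opening braces are one stray { plus a complete {{!}}, and nothing is missing.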

To test this, I disabled the repl_arg_err heuristic in wikitextprocessor/core.py:493-499 by commenting out these lines, and when I ran wiktwords on -nən again, the garbage data was gone.

If you're testing this yourself, note that the table is not parsed because it is found inside the Suffix section of the page and is ignored. If you move the az-suffix-forms template call into its own Conjugation section (or whatever, any section that is parsed for tables), the table will be parsed and you will get a bunch of debug messages relating to badly formatted headers and some table data.

This looks to be a pain to fix, if much of our code relies on this to fix issues with broken arguments. But technically our heuristics are in the wrong here.

kristian-clausal commented 1 year ago

I've worked on a fix for this, and I think I've got it to work. Not going to commit it until next week, I don't want to get "Wiktextract failed" e-mails on the weekend, and need to see how the issue with the escape % stuff in Lua worked out.

Quick summary: If the template breaks the section it is put in, we need to yank it from there before it does anything. This is currently done for all floating divs by just forcefully regexing them out of existence in clean_node or thereabouts, but that means we can't parse that template's data.

As a precursor to get this to work, I needed to create some parameters for Wtp.process(), the main function that starts the whole "process this whole wikitext project". It does the preprocessing, including pre-expanding some templates that are heuristically determined to change how things would parse downstream... And the table in az-suffix-forms is included in that. So what happens? After the pre-expansion happens, the data contains a nameless table that has the contents of the az-suffix-forms template... But not by the template's name, just HTML. It can't be rolled back after that point, it's baked in and impossible to extract by name; it's in wikt-cache and can't be overridden so that the template's name is kept (I think, tried that pretty early-on). Which is annoying, because to test if it worked I had to recreate the cache.

So the new parameters to Wtp.process() let us specify, by name, templates that should and should not be pre-expanded. These are passed down until we hit the part where Wtp.need_pre_expand (ctx.need_pre_expand) is generated, which is the set of templates that the heuristics have decided should be pre-expanded. We just remove anything in DO_NOT_PRE_EXPAND from that, and this way we can stop az-suffix-forms and other named templates from expanding.
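In spirit this is just set arithmetic. The sketch below is not the wikitextprocessor code: filter_pre_expand and its parameter names are invented for illustration, and only need_pre_expand and DO_NOT_PRE_EXPAND come from the description above.

```python
# Hypothetical contents; the real DO_NOT_PRE_EXPAND may list other templates.
DO_NOT_PRE_EXPAND = {"az-suffix-forms"}


def filter_pre_expand(heuristic_set, also_pre_expand=(), do_not_pre_expand=()):
    """Return the final pre-expand set; explicit exclusions always win."""
    wanted = set(heuristic_set) | set(also_pre_expand)
    return wanted - set(do_not_pre_expand)


# Pretend the heuristics picked these two templates for pre-expansion.
need_pre_expand = filter_pre_expand(
    {"az-suffix-forms", "tl-conj-table"},
    do_not_pre_expand=DO_NOT_PRE_EXPAND,
)
print(sorted(need_pre_expand))
```

Applying the exclusions last means a template named in both lists stays unexpanded, which is the safer default when the goal is to keep the template visible by name in the parse tree.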

@xxyzz I also moved some code specific for Chinese wiktionary (the langhd stuff) into wiktextract, so that wikitextprocessor is as agnostic as possible. We should probably do that to most or all "xy-wiktionary"-specific code in wikitextprocessor. Should probably create something like language_specific, except for different wiktionaries, but that seems like a lot of work.

Anyhow, now that az-suffix-forms is not being expanded, we can extract it from the parse-tree before it is parsed in parse_part_of_speech, add it to a temporary LEVEL5 node and throw it to parse_inflection()! It actually worked!!

I'm just posting this so that I remember to commit on Monday.