tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.
Other
94 stars 23 forks source link

Section node in template parameter shouldn't break parsing template node #310

Closed xxyzz closed 1 month ago

xxyzz commented 1 month ago

There are some TEMPLATE not properly closed debug messages in fr edition: https://kaikki.org/frwiktionary/errors/subpage6/details-TEMPLATE-not-properly-closed.html

Here is part of wikitext from page https://fr.wiktionary.org/wiki/Conjugaison:français/bayer

{{Onglets conjugaison
| onglet1  =/ba.je/ avec « y » (je baye)
| contenu1 ={{fr-conj-1-yer-ye|ba|ba}}
| onglet2  =/ba.je/ avec « i » (je baie)
| contenu2 ={{fr-conj-1-yer-ie|ba|b|a|ɛ|ɛ}}
=== {{S|références}} ===
* Frères Bescherelle (sans spécifier la prononciation des finales en ''-aie'')
| onglet3  =/be.je/ avec « y » (je baye)
}}

I think the code at here: https://github.com/tatuylonen/wikitextprocessor/blob/66545a619dac146e47ea81e9cc13872b890ad97e/src/wikitextprocessor/parser.py#L997-L1003

shouldn't remove nodes in parse stack if the node needs a close token. In other words, section heading node start token shouldn't break parsing other nodes if it's inside a node.

Maybe the code should be:

    while any(x.kind in KIND_TO_LEVEL for x in ctx.parser_stack):
        node = ctx.parser_stack[-1]
        if KIND_TO_LEVEL.get(node.kind, 99) < level:
            break
        if node.kind in MUST_CLOSE_KINDS:
            break
        _parser_pop(ctx, True)

I have tested the code and all tests in both packages passed.

xxyzz commented 1 month ago

There is another problem of parsing section node in template arg if it's directly after the arg's =:

from page: https://fr.wiktionary.org/wiki/Conjugaison:italien/ammorbidire

{{Onglets conjugaison
|onglet1=Forme active
|contenu1=== Conjugaison de ammorbidire avec l’auxiliaire avere ==
{{it-avere-3isc|ammorbid|am.mɔr|bi|d}}
}}

but it currently doesn't affect fr extractor code thus has low priority.

kristian-clausal commented 1 month ago

How is it possible that people keep on coming up with new ways to break our parser... Obviously it should be corrected on our side, and if your solution doesn't break tests it's good to go imo.

xxyzz commented 1 month ago

I find the not in ("span",) part was added from https://github.com/tatuylonen/wikitextprocessor/commit/fd53f00085322de6498eb840762aae95acbae3e5

and the commit message says it was for fixing an error in page https://en.wiktionary.org/wiki/seis#Scots, I could only guess at that time there was a dangling "span" tag mistakenly added a section node as its child but this kind of error should be fixed in wikitext.

I'll create a pr.

Update: I tested the code on en page "seis", there was no error and extracted data seem fine. I think this change should be safe.

kristian-clausal commented 1 month ago

Template arguments might be different in wikitext from HTML nodes. We should use the sandbox to check if a span is terminated by a section... And indeed it does:

<span class="test">
==English==

test span</span>

Results in:

<p><span class="test">
</span></p>
<div class="mw-heading mw-heading2 ext-discussiontools-init-section"><h2 id="English" data-mw-thread-id="h-English"><span data-mw-comment-start="" id="h-English"></span>English<span data-mw-comment-end="h-English"></span></h2></div>
<p>test span
</p>

The span element is cut and terminated, orphaning the </span> .

xxyzz commented 1 month ago

Err... for extra caution I'll add another if branch and leave the old code alone.

But this could happen to any tag, it's still "broken"(or unintended) wikitext and should be fixed on Wiktionary.

xxyzz commented 1 month ago

This only fix part of the problem...

    def test_section_in_template(self):
        # https://fr.wiktionary.org/wiki/Conjugaison:français/bayer
        # GH issue #310
        self.ctx.start_page("")
        root = self.ctx.parse("""{{Onglets conjugaison
| contenu1 = 1
| contenu2 = 2
=== section ===
text
| contenu3 = 3
}}""")
        self.assertEqual(len(root.children), 1)
        template = root.children[0]
        second_arg_list = template.template_parameters.get("contenu2")
        self.assertEqual(len(second_arg_list), 2)
        heading_node = second_arg_list[1]
        self.assertIsInstance(heading_node, WikiNode)
        self.assertEqual(heading_node.kind, NodeKind.LEVEL3)
        self.assertEqual(template.template_parameters.get("contenu3"), ["3"])

"| contenu3 = 3" is parsed as the text child of level 3 node.