Closed xxyzz closed 1 month ago
There is another problem of parsing section node in template arg if it's directly after the arg's =
:
from page: https://fr.wiktionary.org/wiki/Conjugaison:italien/ammorbidire
{{Onglets conjugaison
|onglet1=Forme active
|contenu1=== Conjugaison de ammorbidire avec l’auxiliaire avere ==
{{it-avere-3isc|ammorbid|am.mɔr|bi|d}}
}}
but it currently doesn't affect fr extractor code thus has low priority.
How is it possible that people keep on coming up with new ways to break our parser... Obviously it should be corrected on our side, and if your solution doesn't break tests it's good to go imo.
I find the not in ("span",)
part was added from https://github.com/tatuylonen/wikitextprocessor/commit/fd53f00085322de6498eb840762aae95acbae3e5
and the commit message says it was for fixing an error in page https://en.wiktionary.org/wiki/seis#Scots, I could only guess at that time there was a dangling "span" tag mistakenly added a section node as its child but this kind of error should be fixed in wikitext.
I'll create a pr.
Update: I tested the code on en page "seis", there was no error and extracted data seem fine. I think this change should be safe.
Template arguments might be different in wikitext from HTML nodes. We should use the sandbox to check if a span is terminated by a section... And indeed it does:
<span class="test">
==English==
test span</span>
Results in:
<p><span class="test">
</span></p>
<div class="mw-heading mw-heading2 ext-discussiontools-init-section"><h2 id="English" data-mw-thread-id="h-English"><span data-mw-comment-start="" id="h-English"></span>English<span data-mw-comment-end="h-English"></span></h2></div>
<p>test span
</p>
The span element is cut and terminated, orphaning the </span>
.
Err... for extra caution I'll add another if
branch and leave the old code alone.
But this could happen to any tag, it's still "broken"(or unintended) wikitext and should be fixed on Wiktionary.
This only fix part of the problem...
def test_section_in_template(self):
# https://fr.wiktionary.org/wiki/Conjugaison:français/bayer
# GH issue #310
self.ctx.start_page("")
root = self.ctx.parse("""{{Onglets conjugaison
| contenu1 = 1
| contenu2 = 2
=== section ===
text
| contenu3 = 3
}}""")
self.assertEqual(len(root.children), 1)
template = root.children[0]
second_arg_list = template.template_parameters.get("contenu2")
self.assertEqual(len(second_arg_list), 2)
heading_node = second_arg_list[1]
self.assertIsInstance(heading_node, WikiNode)
self.assertEqual(heading_node.kind, NodeKind.LEVEL3)
self.assertEqual(template.template_parameters.get("contenu3"), ["3"])
"| contenu3 = 3" is parsed as the text child of level 3 node.
There are some
TEMPLATE not properly closed
debug messages in fr edition: https://kaikki.org/frwiktionary/errors/subpage6/details-TEMPLATE-not-properly-closed.htmlHere is part of wikitext from page https://fr.wiktionary.org/wiki/Conjugaison:français/bayer
I think the code at here: https://github.com/tatuylonen/wikitextprocessor/blob/66545a619dac146e47ea81e9cc13872b890ad97e/src/wikitextprocessor/parser.py#L997-L1003
shouldn't remove nodes in parse stack if the node needs a close token. In other words, section heading node start token shouldn't break parsing other nodes if it's inside a node.
Maybe the code should be:
I have tested the code and all tests in both packages passed.