tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Expansion goes into an infinite loop with a certain template #313

Closed kristian-clausal closed 1 month ago

kristian-clausal commented 1 month ago

On certain Simple English Wiktionary pages I noticed that the extracted glosses had extra newlines after templates, while the same entries on Simple English Wiktionary itself did not, so this is obviously our bug. It turned out to be an artifact generated only by specific templates, like Simple English Wiktionary's Template:ti verb, which looks like:

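(''[[transitive|<span style="color:green">transitive</span>]] & [[intransitive|<span style="color:green">intransitive</span>]]'')<includeonly>[[Category:Transitive verbs]]
[[Category:Intransitive verbs]]</includeonly><noinclude>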
{{documentation}}
<!-- Categories, interwikis and TemplateData goes to the /doc subpage -->
</noinclude>

At first I thought this was an issue with newlines or whitespace at the ends of templates that we weren't removing properly (all wikitext source files seem to have a terminal newline by default, and templates get trailing whitespace stripped), but this doesn't seem to be the case.

I decided to write a test:

import unittest
from unittest.mock import patch

from wikitextprocessor import Page, Wtp
from wikitextprocessor.parser import print_tree

from wiktextract.clean import clean_value
from wiktextract.config import WiktionaryConfig
from wiktextract.page import parse_page
from wiktextract.thesaurus import close_thesaurus_db
from wiktextract.wxr_context import WiktextractContext

class Temp(unittest.TestCase):
    def setUp(self) -> None:
        self.wxr = WiktextractContext(Wtp(lang_code="simple"), WiktionaryConfig())

    def tearDown(self) -> None:
        self.wxr.wtp.close_db_conn()
        close_thesaurus_db(
            self.wxr.thesaurus_db_path,
            self.wxr.thesaurus_db_conn,  # type:ignore[arg-type]
        )

    @patch(
        "wikitextprocessor.Wtp.get_page",
        return_value=Page(title="Template:ti verb", namespace_id=10,
        body="""(''[[transitive|<span style="color:green">transitive</span>]] & [[intransitive|<span style="color:green">intransitive</span>]]'')<includeonly>[[Category:Transitive verbs]]
[[Category:Intransitive verbs]]</includeonly><noinclude>
{{documentation}}
<!-- Categories, interwikis and TemplateData goes to the /doc subpage -->
</noinclude>
"""),
    )
    def test_temp1(self, mock_page) -> None:
        self.wxr.wtp.start_page("excrete")
        data = parse_page(
            self.wxr,
            "excrete",
            """== English ==
=== Verb ===
# {{ti verb}} Foooo.
""",
        )
        print("\n\n\n/////")
        print(data)
        self.fail()

and the template expansion, or something similar, seems to go into an infinite loop:

excrete: ERROR: too deep recursion during template expansion at ['excrete', 'ti verb', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation', 'documentation' [...]

/////
[{'word': 'excrete', 'lang_code': 'en', 'lang': 'English', 'pos': 'verb', 'senses': [{'glosses': ['(transitive & intransitive)\n(transitive & intransitive)\n(transitive & intransitive)\n(transitive & intransitive)\n(transitive & intransitive)\n(transitive & intransitive)\n(transitive & intransitive)\n(transitive & intransitive)\n(transitive & intransitive)\n(transitive & intransitive)\n(transitive & intransitive)\n(transitive & intransitive)\n(transitive & intransitive)\n[...]
\n Foooo.'], 'categories': ['Intransitive verbs', 'Transitive verbs']}]}]

Using the Simple English extractor leads to a shorter log, but using the English extractor gives the same kind of output, only longer (because it tries to do more and thus expands, or tries to expand, things several times, I think).

But in production these repetitions of "(transitive & intransitive)" don't happen; there's just the newline.

kristian-clausal commented 1 month ago

Ah, the issue was with my test, never mind. I forgot to remove the {{documentation}} template. I'm still trying to figure out why certain templates have the extra newline, though.
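
For anyone hitting the same thing: the recursion in the log is consistent with how `patch(..., return_value=...)` behaves, since the mock hands back the same page for every title, so a lookup for `{{documentation}}` would also get the `ti verb` body. Below is a minimal sketch of the difference between `return_value` and `side_effect`, using only `unittest.mock`. The `PAGES` dict and `fake_get_page` are hypothetical stand-ins for illustration, not wikitextprocessor API, and the template body is shortened.

```python
from unittest.mock import Mock

# Hypothetical stand-in for a template store; not part of wikitextprocessor.
PAGES = {
    "Template:ti verb": "(''transitive & intransitive'')<noinclude>{{documentation}}</noinclude>",
}


def fake_get_page(title, *args, **kwargs):
    """Return the body for known titles only, and None for anything else."""
    return PAGES.get(title)


# return_value: every lookup gets the same body, whatever title is requested.
always_same = Mock(return_value=PAGES["Template:ti verb"])

# side_effect: the result depends on the requested title.
by_title = Mock(side_effect=fake_get_page)

same_for_doc = always_same("Template:documentation")  # the ti verb body again
missing_for_doc = by_title("Template:documentation")  # None: no such page
found_for_ti = by_title("Template:ti verb")           # the ti verb body
```

With a title-aware `side_effect`, an unknown template simply is not found instead of expanding into the mocked page again. Removing `{{documentation}}` from the mocked body, as I ended up doing, avoids the problem as well.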

kristian-clausal commented 1 month ago

It's the newline inside the includeonly, between the two category links. I'll make a new issue.
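
A rough illustration of how that newline can survive, assuming category links are removed from the expanded text while the text between them is left alone. The regex is a simplified stand-in for illustration only, not how wiktextract actually handles categories.

```python
import re

# Expanded "{{ti verb}} Foooo." with the markup of the visible part simplified.
expanded = (
    "(transitive & intransitive)"
    "[[Category:Transitive verbs]]\n"
    "[[Category:Intransitive verbs]]"
    " Foooo."
)

# Simplified stand-in: drop category links, leave everything else untouched.
without_categories = re.sub(r"\[\[Category:[^\]]*\]\]", "", expanded)

print(repr(without_categories))
```

The newline that separated the two category links is still there after they are removed, which matches the stray `\n` before ` Foooo.` in the gloss above.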