lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.64k stars 397 forks source link

CPython 3.11.7 breaks `regex` module compatible pattern width calculations #1376

Closed MichaelSquires closed 7 months ago

MichaelSquires commented 7 months ago

Describe the bug

This CPython change recently got released in CPython 3.11.7: https://github.com/python/cpython/pull/109859. You'll see in Lib/re/_parser.py that MAXWIDTH (1 << 64) is now the maximum width of a regex pattern, not MAXREPEAT (2**32 - 1). Lark uses MAXREPEAT (utils.py#L154) to assume the maximum width when it can't use the re module to calculate the width. This was all fine until CPython 3.11.7 was released earlier today and it started breaking our custom grammar.

Our specific issue is that terminal ordering changed because the lengths of the regex patterns are used to determine priority. Since 3.11.7, regex patterns that are only compatible with the regex module have a much shorter length than re module compatible patterns.

This issue (or very similar to it) was discussed just a few weeks ago on github: https://github.com/mrabarnett/mrab-regex/issues/513.

To Reproduce

Here's a minimized test. In this example, NUM uses a named capture group without the "P" ((?P<NUMVAL>\d)) which is supported by the regex module only. Because it's not supported by the builtin re module, it's length is assumed to be (min, max) of (1, MAXREPEAT) where MAXREPEAT = 2^^32 -1. For STR, it is re module compatible so the get_width() call succeeds and its length is calculated as (1, MAXWIDTH) where MAXWIDTH = 1 << 64.

    def test_terminal_widths(self):
        g = r"""
        start: expr
        expr: NUM | STR
        NUM: /(?<NUMVAL>\d)+/
        STR: /\w+/
        """
        p = Lark(g, regex=True, parser='lalr')
        term0 = p.terminals[0]
        term1 = p.terminals[1]
        self.assertEqual(term0.priority, term1.priority)
        self.assertEqual(term0.pattern.min_width, term1.pattern.min_width)
        self.assertEqual(term0.pattern.max_width, term1.pattern.max_width)

Please let me know if you have any questions about this.

Thanks!

MegaIng commented 7 months ago

I made a PR with a fix, although I would suggest not relying on automatically generated priories. Just adding an explicit priority would also fix this for your grammar.

MichaelSquires commented 7 months ago

Wow, thanks for the quick turnaround!

We'll look into setting explicit priorities too, thanks for the suggestion 😄 .

MichaelSquires commented 7 months ago

This issue is resolved with #1377 being merged in, I'll go ahead and close it. Thank you for your help/support!

@erezsh @MegaIng Any estimates on when we might see a release with this fix?

erezsh commented 7 months ago

@MichaelSquires I'm currently on vacation, so no promises, but I'll try to do it next week.

MichaelSquires commented 7 months ago

@MichaelSquires I'm currently on vacation, so no promises, but I'll try to do it next week.

Thanks for letting me know. Enjoy your vacation!

MichaelSquires commented 6 months ago

Hi @erezsh, I hope you enjoyed the holidays! Wondering if you think we'll see a new release with this bugfix soon?

erezsh commented 6 months ago

@MichaelSquires Thanks for the reminder. I'll try to do it this Monday.

erezsh commented 6 months ago

Released version 1.1.9

Sorry for the delay!

MichaelSquires commented 6 months ago

Excellent news! Thank you for your support!