Incorrect escaping of regexes specified for @@whitespace

6r1d commented 9 months ago

Hello.

I'm starting to use TatSu and need clarification about how it handles square brackets. For some reason, TatSu sometimes ignores them and sometimes recognizes them.

I aim to render a subset of Markdown, but I'll start with a simplified grammar to discuss the issue. (I'm using the unit separator symbol (␟) as the whitespace character because it rarely appears in text, and the other ways to disable whitespace handling were more confusing. If there's a more straightforward and reliable way to tell TatSu to treat whitespace as part of the text, that would be useful to know, too.)

@@grammar::Markdown

@@whitespace :: /[␟]/

start = pieces $ ;

text = text:/[a-z]+/ ;

piece = text;

pieces = {piece}*
    ;

This is the test code. With it, TatSu silently ignores the [] instead of failing with an error.

import tatsu

with open("./grammar.txt", "r") as grammar_file:
    grammar = grammar_file.read()

class MarkdownSemantics:

    def pieces(self, ast):
        return ''.join(ast)

parser = tatsu.compile(grammar)

markdown_str = "[]"
ast = parser.parse(markdown_str, semantics=MarkdownSemantics())
print(ast)

If I set the markdown_str as something else, like () or {}, TatSu will fail. Individual square brackets, [ or ], won't lead to an exception.

Traceback (most recent call last):
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 241, in parse
    return rule()
           ^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 840, in parse
    return self._parse_rhs(ctx, self.exp)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 852, in _parse_rhs
    return ctx._call(ruleinfo)
           ^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 609, in _call
    result = self._recursive_call(ruleinfo)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 640, in _recursive_call
    return self._invoke_rule(ruleinfo, key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 688, in _invoke_rule
    ruleinfo.impl(self)
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 418, in parse
    ctx.last_node = [s.parse(ctx) for s in self.sequence]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 418, in <listcomp>
    ctx.last_node = [s.parse(ctx) for s in self.sequence]
                     ^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 220, in parse
    ctx._check_eof()
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 771, in _check_eof
    self._error(
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 545, in _error
    raise self._make_exception(item, exclass=exclass)
tatsu.exceptions.FailedExpectingEndOfText: (1:1) Expecting end of text :
{}
^
start

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "~/parsers/tatsu/parser.py", line 14, in <module>
    ast = parser.parse(markdown_str, semantics=MarkdownSemantics())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 1065, in parse
    return ctx.parse(text, config=config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 247, in parse
    raise self._furthest_exception from e
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 794, in _option
    yield
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 872, in _repeat
    self._isolate(block)
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 851, in _isolate
    block()
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 544, in <lambda>
    return ctx._closure(lambda: self.exp.parse(ctx))
                                ^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 785, in parse
    return rule()
           ^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 840, in parse
    return self._parse_rhs(ctx, self.exp)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 852, in _parse_rhs
    return ctx._call(ruleinfo)
           ^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 609, in _call
    result = self._recursive_call(ruleinfo)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 640, in _recursive_call
    return self._invoke_rule(ruleinfo, key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 679, in _invoke_rule
    raise memo
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 794, in _option
    yield
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 814, in _optional
    yield
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 883, in _closure
    block()
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 544, in <lambda>
    return ctx._closure(lambda: self.exp.parse(ctx))
                                ^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 785, in parse
    return rule()
           ^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 840, in parse
    return self._parse_rhs(ctx, self.exp)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 852, in _parse_rhs
    return ctx._call(ruleinfo)
           ^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 609, in _call
    result = self._recursive_call(ruleinfo)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 640, in _recursive_call
    return self._invoke_rule(ruleinfo, key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 688, in _invoke_rule
    ruleinfo.impl(self)
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 785, in parse
    return rule()
           ^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 840, in parse
    return self._parse_rhs(ctx, self.exp)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/grammars.py", line 852, in _parse_rhs
    return ctx._call(ruleinfo)
           ^^^^^^^^^^^^^^^^^^^
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 620, in _call
    self._error('Expecting <%s>' % ruleinfo.name)
  File "~/parsers/lib/python3.11/site-packages/tatsu/contexts.py", line 545, in _error
    raise self._make_exception(item, exclass=exclass)
tatsu.exceptions.FailedParse: (1:1) Expecting <text> :
{}
^
text
piece
pieces
start

Such parser behavior causes issues with recognising the Markdown URLs for me, so any help is welcome.

6r1d commented 9 months ago

For context, this is the full grammar I am experimenting with.

@@grammar::Markdown

@@whitespace :: /[␟]+/

start = pieces $ ;

newline = '\n';

text = text:/[a-zA-Z\d \-\_#:]+/ ;

raw_link_prefix = 'http://' | 'https://' ;

raw_link = protocol:raw_link_prefix url:link_string ;
internal_link = protocol:raw_link_prefix url:link_string ;

link_string = /[a-zA-Z\d\$\-\_\.\+\!\*\'\/\&\?\=\%]+/ ;

piece
    =
    | newline
    | raw_link
    | link
    | code_inline
    | bold
    | italic
    | text
    ;

pieces = {piece}*
    ;

link = '[' content:pieces '](' url:internal_link ')';

code_inline
    =
    mode:'`' content:text '`'
    ;

italic = mode:'*' content:pieces '*'
    ;

bold = mode:'**' content:pieces '**'
    ;

My goal is to also be able to parse Markdown like this:

[`issue_a#no`](https://example.com)
[`issue_b#no`](https://example.org)

Instead, I also hit an error:

tatsu.exceptions.FailedToken: (1:1) expecting '\n' :
()
^
newline
piece
pieces
start

apalala commented 9 months ago

Sorry, but questions about learning to use TatSu, PEG, and parsing in general must go to Stack Overflow.

appetrosyan commented 9 months ago

It doesn't look like a problem with learning either PEG, parsing or TatSu, but a genuine bug.

6r1d commented 9 months ago

@apalala, if you think my EBNF above should ignore the [ and ], please explain how it can happen in the StackOverflow question I wrote to show my point. I am inclined to believe it's actually a bug in TatSu and I provided additional data for testing.

dnicolodi commented 9 months ago

I'm not sure I would call it a bug, but there is a disconnect between how the @@whitespace directive is documented and how it works. It is documented to take a regular expression, but its value is interpreted as a list of characters to skip over, which is then translated into a regular expression here: https://github.com/neogeny/TatSu/blob/0437dddb21417f724d150c5a9bfe74731d51fe1b/tatsu/buffering.py#L75-L87
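To make the effect concrete, here is a minimal sketch that uses only Python's standard `re` module, not TatSu itself. It assumes the translation escapes the directive's value and wraps it in a character class, which is my reading of the description above; the helper name `chars_to_whitespace_re` is hypothetical and is not TatSu's API.

```python
import re


def chars_to_whitespace_re(chars: str) -> re.Pattern:
    """Hypothetical helper: treat `chars` as a set of characters to skip.

    Assumption: the value is escaped and wrapped in a character class,
    so regex metacharacters in it become literal members of the set.
    """
    return re.compile(f"[{re.escape(chars)}]+")


# What the grammar author meant: a regex matching only the unit separator symbol.
intended = re.compile("[␟]")

# What results if the same string is read as a list of characters:
# the brackets themselves become skippable "whitespace".
actual = chars_to_whitespace_re("[␟]")

for sample in ("[]", "␟", "{}", "()"):
    print(
        repr(sample),
        "intended:", bool(intended.fullmatch(sample)),
        "as character list:", bool(actual.fullmatch(sample)),
    )
```

Under that assumption, "[]" is consumed entirely as whitespace, so the parser reaches the end of the text without error, while "{}" and "()" are not skipped and fail as reported.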

appetrosyan commented 9 months ago

> I'm not sure I would call it a bug,

This is at least a bug in the documentation. I have been using parser generators for a while now, and I would say that, if not a bug, it is at least (shall we say) bad API design: the documentation says the directive accepts a regular expression, which turns out to be a red herring.

If I were you, I'd re-open this issue and call it something like "better document @@whitespace", and make everyone happy.

Blaming your users is not a good look, especially when the docs lie.

apalala commented 9 months ago

REOPENED

See the discussion on SO.

https://stackoverflow.com/q/77548440/545637

apalala commented 9 months ago

I apologize for not having paid closer attention to this report.

6r1d commented 9 months ago

> I apologize for not having paid closer attention to this report.

I'm glad TatSu got better in the end, and that's what matters :-)

apalala commented 9 months ago

> I'm glad TatSu got better in the end, and that's what matters :-)

That matters :-)

But, unlike many others, the report contained a unit test. I should have just run it :-\

https://github.com/neogeny/TatSu/blob/master/test/grammar/directive_test.py#L42-L52

appetrosyan commented 9 months ago

> That matters :-)

I'm glad we all parted amicably.

A word of somewhat solicited advice from one maintainer to another.

If in doubt, keep the issue open. Even if the user is being actively hostile, they'll warm up to you, as soon as you show that you're on their side, and that you want to fix their problem, as much as they do.

Incorrect escaping of regexes specified for @@whitespace #330