neogeny / TatSu

竜 TatSu generates Python parsers from grammars in a variation of EBNF
https://tatsu.readthedocs.io/

Generated parser still parsing whitespaces although they are disabled #337

Closed Nafaryus27 closed 4 months ago

Nafaryus27 commented 4 months ago

I have a grammar in which I needed to disable whitespace skipping, so I added @@whitespace::None. When I use the parser with:

parser = tatsu.compile(grammar)
ast = parser.parse(text)

it behaves as expected: it does not skip whitespace and lets the rules I defined do their job.

Next, I tried generating parser code: first running python -m tatsu grammar.ebnf --outfile parser.py, then importing parser.py and using the generated MyParser class:

from parser import MyParser

parser = MyParser()
ast = parser.parse(text)

However, this raised an error because a rule failed to match (something like rule = " some string "; that is, a plain string rule containing whitespace).

I noticed that in the generated parser code, the ParserConfig (in both the Buffer and Parser classes) is created like this:

config = ParserConfig.new(
    config,
    owner=self,
    whitespace=None,
    nameguard=None,
    ignorecase=False,
    namechars='',
    parseinfo=False,
    comments_re=None,
    eol_comments_re=None,
    keywords=KEYWORDS,
    start='start',
)

Changing whitespace=None to whitespace='' (in both classes) fixes the issue: the parser then behaves as expected and does not skip whitespace.

However, I don't want to edit the generated code each time I regenerate the parser, so I searched a bit and found that the tatsu command line accepts a --whitespace parameter. Even when I pass --whitespace '', the generated code still contains None.

I also found that specifying the whitespace='' argument when using the parser also solves the issue:

from parser import MyParser

parser = MyParser(whitespace='')
ast = parser.parse(text)

However, I consider this only a temporary workaround, as I would prefer to keep everything about the parser configuration and rules in the grammar. Also, since the grammar already has the @@whitespace directive, one would expect it to work no matter how the parser is used.

apalala commented 4 months ago

Could you please post a minimal grammar to test this?

Please also state the version of TatSu you're using.

Also, have you tried using this in the grammar?

@@whitespace :: ''

This problem is probably caused by the configuration protocol (ParserConfig) treating None as an absent value, and not as the desired value.

A possible solution may be to make ParserConfig.whitespace = '', so no whitespace processing is done by default. It may also be useful to disallow @@whitespace :: None to avoid confusion.
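
The suspected failure mode can be illustrated with a toy model. This is not TatSu's actual ParserConfig; the Config class, its merged method, and the default value are hypothetical names chosen for the sketch. It only assumes what is described above: a None override is treated as "not specified", while '' is a real value.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Config:
    # Hypothetical default: skip runs of whitespace between tokens.
    whitespace: Optional[str] = r"\s+"

    def merged(self, **overrides):
        # None is read as "absent", so it never replaces the current value.
        given = {k: v for k, v in overrides.items() if v is not None}
        return replace(self, **given)


base = Config()
with_none = base.merged(whitespace=None)   # default survives
with_empty = base.merged(whitespace="")    # default is overridden

print(repr(with_none.whitespace))
print(repr(with_empty.whitespace))
```

In a scheme like this, None cannot express "disable whitespace skipping", which is why an explicit empty string is needed.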

Nafaryus27 commented 4 months ago

I'm using version 5.12

I tried @@whitespace::'' but it does not work because it's not a regexp, and // (an empty regexp) did not work either.

You can use this as an example. example.txt:

This is a test

grammar.ebnf:

@@whitespace::None

start = "This is a" test;
test = " test";

When using:

import tatsu

with open("example.txt", "r") as f:
    text = f.read()

with open("grammar.ebnf", "r") as f:
    grammar = f.read()

parser = tatsu.compile(grammar)
ast = parser.parse(text)

print(ast)

It gives the correct result:

('This is a', ' test')

However, with the generated parser (python3 -m tatsu grammar.ebnf --outfile parser.py), running python parser.py example.txt gives this error (full error on pastebin):

tatsu.exceptions.FailedToken: example.txt(1:11) expecting ' test' :
This is a test
          ^
test
start

This clearly shows that the parser skipped over the whitespace before "test", although it was not supposed to.

This behavior may already be known, since parser_semantics.py contains:

def grammar(self, ast, *args):
    directives = {d.name: d.value for d in flatten(ast.directives)}
    keywords = list(flatten(ast.keywords)) or []

    if directives.get('whitespace') in {'None', 'False'}:
        # NOTE: use '' because None will _not_ override defaults in configuration
        directives['whitespace'] = ''

I guess this is why there is no issue when using the parser with tatsu.compile(...).

So one fix would be to apply the same normalization in the parser code generator, or to allow '' as a possible value for @@whitespace::

Another workaround would be to set @@whitespace:: to an unused character (like '␟' or some other unusual Unicode character), but that's not very elegant...
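
A minimal sketch of what such a normalization could look like on the code-generation side. The function names normalize_whitespace and render_whitespace_kwarg are hypothetical and do not exist in TatSu; the sketch only mirrors the check from parser_semantics.py and shows the keyword argument a generator might emit.

```python
def normalize_whitespace(value):
    """Map the grammar-level spellings of 'no whitespace' to ''."""
    # Same check as in parser_semantics.py: the directive value arrives
    # as the string 'None' or 'False', not as the Python object None.
    if value in {"None", "False"}:
        return ""
    return value


def render_whitespace_kwarg(value):
    """Render the whitespace keyword argument for the generated config."""
    return f"whitespace={normalize_whitespace(value)!r},"


print(render_whitespace_kwarg("None"))
print(render_whitespace_kwarg(None))
```

With this, a grammar using @@whitespace::None would produce whitespace='' in the generated file, while a grammar without the directive would still produce whitespace=None and keep the defaults.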

apalala commented 4 months ago

I'll solve this on my next pass over TatSu.

If there's a pull request (that includes a unit test) before that, I'll merge it.

Nafaryus27 commented 4 months ago

Thanks!