neogeny / TatSu

竜 TatSu generates Python parsers from grammars in a variation of EBNF
https://tatsu.readthedocs.io/

Cannot change default comment regexp #293

Closed RobertBaruch closed 1 year ago

RobertBaruch commented 1 year ago

See also #249. Summary: Either #249 was not actually fixed, or the documentation on how to specify comment regexps (docs/syntax.rst) is incorrect. Note also that the workaround in #249 still fixes this issue.

Tested using pip install tatsu (v 5.8.3) and Python 3.10.6.

Test grammar (comments.peg):

file::File = lines:{line}+ $ ;

line::Line = comment:comment | comment2:comment2 | blank:blank ;

comment::Comment = content:COMMENT ;
comment2::Comment2 = content:COMMENT2 ;

blank::Blank = content:NEWLINE ;

NEWLINE = '\n' ;
COMMENT = /#[^\n]*\n/ ;
COMMENT2 = /%[^\n]*\n/ ;

Main:

import tatsu
from tatsu.model import ModelBuilderSemantics
import json

def main():
    with open('comments.peg') as f:
        txt = f.read()
    parser = tatsu.compile(txt, semantics=ModelBuilderSemantics(), comments_re=None, eol_comments_re=None)

    with open('test.peg') as f:
        txt = f.read()
    model = parser.parse(txt, whitespace='', comments_re=None, eol_comments_re=None)

    print(json.dumps(model.asjson(), indent=4))

if __name__ == "__main__":
    main()

Test file (test.peg):

# comment here

% different comment

# another comment

Resulting output:

{
    "__class__": "File",
    "lines": [
        {
            "__class__": "Line",
            "blank": {
                "__class__": "Blank",
                "content": "\n"
            }
        },
        {
            "__class__": "Line",
            "blank": {
                "__class__": "Blank",
                "content": "\n"
            }
        },
        {
            "__class__": "Line",
            "comment2": {
                "__class__": "Comment2",
                "content": "% different comment\n"
            }
        },
        {
            "__class__": "Line",
            "blank": {
                "__class__": "Blank",
                "content": "\n"
            }
        },
        {
            "__class__": "Line",
            "blank": {
                "__class__": "Blank",
                "content": "\n"
            }
        }
    ]
}

Expected output:


{
    "__class__": "File",
    "lines": [
        {
            "__class__": "Line",
            "comment": {                           <<<<<------------
                "__class__": "Comment",
                "content": "# comment here\n"
            }
        },
        {
            "__class__": "Line",
            "blank": {
                "__class__": "Blank",
                "content": "\n"
            }
        },
        {
            "__class__": "Line",
            "comment2": {
                "__class__": "Comment2",
                "content": "% different comment\n"
            }
        },
        {
            "__class__": "Line",
            "blank": {
                "__class__": "Blank",
                "content": "\n"
            }
        },
        {
            "__class__": "Line",
            "comment": {                           <<<<<------------
                "__class__": "Comment",
                "content": "# another comment\n"
            }
        }
    ]
}

apalala commented 1 year ago

In the original grammar you posted you're taking care of parsing comments, and not using TatSu facilities. It's likely that the regular expressions used in the grammar are not correct.

These are the kinds of queries that should be posted on StackOverflow under the tatsu tag.
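
The difference between the resulting and expected outputs in the report can be reproduced without TatSu. The following is a minimal sketch using only Python's `re` module, not TatSu's actual implementation: `classify` and `ASSUMED_SKIP` are hypothetical names, and the skip pattern `#[^\n]*` is an assumption standing in for whatever eol-comment pattern might still be active in the parser. It suggests that the three grammar patterns match the test lines on their own, and that the reported output is what one would expect if `#` comments were still being skipped before each token.

```python
import re

# Token patterns copied from the grammar in the report, tried in rule order.
RULES = [
    ("comment", re.compile(r"#[^\n]*\n")),
    ("comment2", re.compile(r"%[^\n]*\n")),
    ("blank", re.compile(r"\n")),
]

# ASSUMPTION: a stand-in for an eol-comment pattern that is still applied
# before each token. This is not taken from TatSu's source.
ASSUMED_SKIP = re.compile(r"#[^\n]*")

# Contents of the test file from the report.
TEST_INPUT = "# comment here\n\n% different comment\n\n# another comment\n"


def classify(text, skip=None):
    """Return the name of the rule matched by each successive token."""
    names = []
    pos = 0
    while pos < len(text):
        # Optionally discard a comment before trying the grammar rules.
        if skip is not None:
            skipped = skip.match(text, pos)
            if skipped:
                pos = skipped.end()
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            if match:
                names.append(name)
                pos = match.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
    return names


print(classify(TEST_INPUT))
# ['comment', 'blank', 'comment2', 'blank', 'comment']
print(classify(TEST_INPUT, skip=ASSUMED_SKIP))
# ['blank', 'blank', 'comment2', 'blank', 'blank']
```

Under this model the first call mirrors the expected output and the second mirrors the reported one, which would point away from the grammar's regular expressions and toward comment skipping. Confirming that would still require inspecting the configuration the TatSu parser actually uses.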