Indenter doesn't work intuitively with ignored tokens

ColonelThirtyTwo commented 3 years ago

Describe the bug

Indenter is cited as the way to parse whitespace-sensitive languages, but it has unintiutive and obtrusive behavior when there is an ignored token (ex a comment) in the middle of a newline sequence.

To Reproduce

Example:

GRAMMAR = r"""
    %import common.WS_INLINE
    %import common(C_COMMENT, CPP_COMMENT, SIGNED_NUMBER)

    %declare _INDENT _DEDENT
    %ignore WS_INLINE
    %ignore CPP_COMMENT
    %ignore C_COMMENT
    _NL: /(\r?\n[\t ]*)+/

    VARNAME: /[_a-zA-Z][_a-zA-Z0-9\/]*/

    start: [_NL] stmt*
    stmt: VARNAME _NL [_INDENT stmt+ _DEDENT]
"""

from lark import Lark
from lark.indenter import Indenter

class TreeIndenter(Indenter):
    NL_type = '_NL'
    OPEN_PAREN_types = []
    CLOSE_PAREN_types = []
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
    tab_len = 4

parser = Lark(GRAMMAR, postlex=TreeIndenter(), parser="lalr", start="start")
data = """
foo
bar
    /* */
    baz
"""
for tok in parser.lex(data):
    print(repr(tok))
print(parser.parse(data).pretty())

Remove the /* */ from the data variable and it parses ok.

Happens because the ignored comment splits up the _NL tokens, and the indenter does not coalesce them.

erezsh commented 3 years ago

The recommended solution is to capture comments in the newline.

Something like:

_NL: (NEWLINE | COMMENT)+

If you have a better way, we'll be happy to hear it.

ColonelThirtyTwo commented 3 years ago

@erezsh Thanks for the quick response. Looks like that works.

Might be worth putting that in the docs somewhere - seems like a gotcha. I think that the Indenter could be altered to coalesce adjacent _NL tokens too, so that the alteration isn't needed.

erezsh commented 3 years ago

I think you're right, it might be possible to do so. It would prevent languages in which a blank space is a dedent, but perhaps those aren't very common.

Anyway, in terms of performance, the current solution works best.

We can keep this issue open, while I consider the best option. If the code remains the same, I agree we should mention this somewhere in the docs.

MegaIng commented 3 years ago

@erezsh It might in general be worth considering creating a FAQs/best practices/common misunderstanding page.

erezsh commented 3 years ago

@MegaIng What would you put there? It doesn't seem like there are a lot of repetitive issues, since most of them are solved in the code.

MegaIng commented 3 years ago

Not necessarily repetitive (right now), but tricks that can't really be fixed and should be documented somewhere. Of course, we can just keep them in issues, but I think adding a page in the docs for them is worth it. #857 (which is actually a 'duplicate' of #517), #841, this issue, #838, #833, etc. (+ stuff from gitter, which is even less searchable than github issues). Most of them were just answered with a short text, explaining what is going on and how to fix the grammar. These could all just be formulated into a "FAQ" page. If you don't want this it's fine, but I think it is worth it.

erezsh commented 3 years ago

@MegaIng I have no objection. I you want to write such a page I'll add it.

Maybe some of those can fit in https://lark-parser.readthedocs.io/en/latest/how_to_use.html or https://lark-parser.readthedocs.io/en/latest/recipes.html

julie777 commented 2 years ago

@erezsh

The recommended solution is to capture comments in the newline.

Something like:
_NL: (NEWLINE | COMMENT)+

That doesn't seem to work in my case where the comments are SH_COMMENTS because the comment eats the \n before NEWLINE sees it.

Any suggestions for a grammar that

uses SH or Python style comments
uses indentation for structure and therefore uses Indenter

erezsh commented 2 years ago

@julie777 That's what the official Python grammar does.. https://github.com/lark-parser/lark/blob/master/lark/grammars/python.lark

lark-parser / lark

Indenter doesn't work intuitively with ignored tokens #863