Is it possible for the post-lexer to consume two tokens and then yield?

lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

Is it possible for the post-lexer to consume two tokens and then yield? #375

Closed elektito closed 5 years ago

elektito commented 5 years ago

I want to have a post-lexer that essentially looks like this (with more bits of code inserted in between):

class PostLexer:
    def __init__(self):
        self.always_accept = ()

    def process(self, stream):
        for tok in stream:
            tok2 = next(stream)
            yield tok
            yield tok2

As written, this code breaks the parser: I get "no terminal defined for..." errors. The errors go away if I swap the tok2 = next(stream) line with the yield tok line, so that the first token is yielded before the second one is fetched.

erezsh commented 5 years ago

I assume you're using LALR.

LALR by default uses the contextual lexer, which depends on the state of the parser to tokenize. Changing the order means the lexer tries to determine the next terminal before the parser has advanced to the next state.
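To make the explanation above concrete, here is a toy model of a state-dependent lexer. It is only a sketch and not Lark's implementation: ToyParser, ALLOWED, contextual_lex and the two post-lexer functions are all hypothetical names invented for illustration. The point it shows is that the lexer reads the parser state at the moment a token is requested, so requesting a second token early makes it lex with a stale state.

```python
# Toy model only: NOT Lark's code. All names here are hypothetical.

class ToyParser:
    """Alternates between expecting a name and expecting a number."""

    def __init__(self):
        self.state = "expect_name"

    def feed(self, tok):
        # Advance to the other state after consuming a token.
        if self.state == "expect_name":
            self.state = "expect_number"
        else:
            self.state = "expect_name"


# Which words are acceptable in each parser state.
ALLOWED = {
    "expect_name": str.isalpha,
    "expect_number": str.isdigit,
}


def contextual_lex(words, parser):
    for word in words:
        state = parser.state  # read lazily, when the token is requested
        if not ALLOWED[state](word):
            raise ValueError(f"No terminal defined for {word!r} in state {state}")
        yield (state, word)


def in_order(stream):
    # Post-lexer that yields each token before asking for the next one.
    for tok in stream:
        yield tok


def lookahead_first(stream):
    # Post-lexer that fetches a second token before yielding the first.
    for tok in stream:
        tok2 = next(stream, None)  # lexed while the parser state is stale
        yield tok
        if tok2 is not None:
            yield tok2


def run(postlex, words):
    parser = ToyParser()
    out = []
    for tok in postlex(contextual_lex(words, parser)):
        parser.feed(tok)
        out.append(tok)
    return out
```

In this model, run(in_order, ["x", "1", "y", "2"]) succeeds, while run(lookahead_first, ["x", "1", "y", "2"]) raises ValueError on "1", because "1" is lexed while the parser still expects a name.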

If you have to do it this way, for whatever reason, you can try to use lexer="standard". It will revert to the traditional YACC/PLY lexer, which doesn't care about the parser state. However, that means that you're losing a bit of parsing power, and might experience more collisions.

elektito commented 5 years ago

Thanks. Yes, switching to the standard lexer does seem to fix this issue, although I have no idea whether it might cost me parsing power later. The real reason I'm doing this is that the END statement conflicts with the likes of "END IF". I posted a question about the grammar here on Stack Overflow, and apparently this is not something that can be easily and cleanly fixed in an LALR parser.

The plan was to detect the standalone END command and convert it to something else in the post-lexer, which I find cleaner than converting "END IF" to "ENDIF" and so on. In order to do that, however, I have to look ahead.
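A minimal sketch of such a lookahead post-lexer might look like the following. It is an illustration under assumptions, not code from this thread: Tok is a stand-in for Lark's token class so the example runs on its own, END_KW and IF_KW are the terminal names used in the grammar discussed here, and END_CMD, SUB_KW and the BLOCK_KWS set are hypothetical. Because it always holds one token back, it only makes sense with a lexer that does not depend on the parser state.

```python
from collections import namedtuple

# Stand-in for Lark's token class so the sketch runs without Lark installed.
Tok = namedtuple("Tok", ["type", "value"])


class EndLookahead:
    # Hypothetical: terminals that turn a preceding END into a block terminator.
    BLOCK_KWS = frozenset({"IF_KW", "SUB_KW"})
    always_accept = ()

    def process(self, stream):
        stream = iter(stream)
        prev = next(stream, None)
        if prev is None:
            return  # empty input
        for tok in stream:
            # prev is emitted only once its follower is known.
            yield self._retag(prev, tok)
            prev = tok
        # The last token has no follower.
        yield self._retag(prev, None)

    def _retag(self, tok, following):
        standalone = following is None or following.type not in self.BLOCK_KWS
        if tok.type == "END_KW" and standalone:
            return Tok("END_CMD", tok.value)  # END_CMD is a hypothetical name
        return tok
```

With this shape, an END_KW followed by IF_KW passes through unchanged, while an END_KW followed by anything else (or by end of input) is re-tagged as END_CMD, which the grammar could then treat as its own statement.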

Also, Earley could be a solution here, but Lark's implementation seems to resolve ambiguities in a non-deterministic manner, something I am mortally afraid of when it comes to programming. So I decided to switch to LALR, and all hell broke loose!

BTW, all that said, this is a really cool library and the best I've found so far, so thank you!

erezsh commented 5 years ago

There is a reply there from sepp2k that describes the problem correctly and also offers the right solution. I suggest you try it and, if it works (as it should), accept it as the answer.

> so thank you!

You're welcome :)

erezsh commented 5 years ago

I'll copy the solution part of his answer:

You can fix this, somewhat hackishly, by turning end if into a single token like this:

ENDIF_KW: /end[ \t\f]+if/i

And then using ENDIF_KW instead of END_KW IF_KW.
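As a quick sanity check of what that pattern accepts, the same regular expression can be tried with Python's re module, which is what Lark uses for regex terminals. This is only a sketch: the is_endif helper is a hypothetical name, and the trailing /i flag is expressed here as re.IGNORECASE.

```python
import re

# Same pattern as ENDIF_KW; the /i flag becomes re.IGNORECASE.
ENDIF = re.compile(r"end[ \t\f]+if", re.IGNORECASE)


def is_endif(text):
    """Return True if the whole text is an 'end if' keyword pair."""
    return ENDIF.fullmatch(text) is not None


print(is_endif("END IF"))     # spaces between the words are accepted
print(is_endif("end \t if"))  # so are tabs, in any mix
print(is_endif("END"))        # a standalone END does not match
print(is_endif("end\nif"))    # a newline is not in the character class
```

Since a bare END never matches ENDIF_KW, it stays available as END_KW for the standalone END statement, and the parser no longer has to decide between the two after seeing only END.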

elektito commented 5 years ago

Yes. I guess that's what I'm going to do then. Thanks.