lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.86k stars 413 forks source link

Disallow :| #1077

Open j6k4m8 opened 2 years ago

j6k4m8 commented 2 years ago

Some terms to make this more searchable:

Disallow the colon and pipe operators next to each other. definition operator and vertical bar adjacent :|

(Please forgive me if this has already been reported; it is very challenging to search GitHub issues for single-character non-letter operators!)

Describe the bug

I just spent a good deal of time debugging what turned out to be a rogue definition line like this:

...
my_rule: | part_1 | part_2
...

in other words, there was a rule definition, immediately followed by a vertical pipe:

:|

My suggestion is to disallow this syntax, because the result was a completely unpredictable parse, since sometimes my tree would parse correctly and sometimes my_rule would be empty. Perhaps it's possible there are times when this is desireable(?) I can't imagine when that would be... but perhaps it's worth warning the user at least, so it's clear that such a "weird" rule exists! (I imagine this is not uncommon, especially when users refactor their grammars.)

(HUGE kudos for the v_args decorator, which made this much easier to track down!)

erezsh commented 2 years ago

I agree there is some potential for confusion there.

I've already suggested before that we switch to an explicit empty rule, like (), instead of allowing implicit ones. And of course the same can be achieved by doing my_rule: part_1? | part_2 or other variations on this theme.

Unfortunately, I don't know how to bring about such a change right now, because it might break a lot of existing grammars. Perhaps I should have used v1.0 to do it, but what's done is done.

I'm not sure if adding a warning is the right reaction to this. But I'll try to keep an open mind.

Perhaps it's possible there are times when this is desireable

I think it's quite common, although usually it would be written as part_1 | part_2 | which is perhaps clearer?

HUGE kudos for the v_args decorator

Thanks. Mind if I ask how you used it to track it down?

j6k4m8 commented 2 years ago

Understood, and thank you for the quick response @erezsh! Maybe nothing actionable here then ā€” feel free to close! (Either way, I hope the issue might be googleable for others now!)

I've only used Lark in a small handful of projects so far now, and I'm not sure I've ever really managed to write what I'd consider a bullet-proof Transformer anyway, so it's very likely my opinions here are moot :)

v_args helped me track down the problem to a matter of the incorrect number of arguments being passed to my rule in the Transformer: I knew then that I was making it all the way to the correct rule, but NOT passing the correct subtree (and was, in fact, passing an empty subtree). I had experienced the same issue previously when I had something like this:

rule        : part_1
            | part_1 optional_part_2 

(simplified example just for explanation purposes, I realize one can use ? for this here)


@v_args(inline=True)
class MyTransformer(Transformer):

    def rule(self, part_1, optional_part_2):
        ...

Which would throw a wrong-number-of-args exception if it matched part_1 alone. (A "duh" moment, but super helpful to track down!)


While I'm kudos'ing, I also want to say a BIG thanks for the online IDE, which shortened my development cycle considerably!!

erezsh commented 2 years ago

We can leave it open for a while, and give other users a chance to respond. We can always close it later.

Maybe it's actionable in the sense of adding an explicit_empty_rule flag to Lark. But since it will have to be off by default for now, it probably wouldn't have prevented you from tripping on this.

Glad you liked the online IDE! It can still be improved in many ways, but I'm glad it's already helping as it is.

evtn commented 2 years ago

I'm not sure if adding a warning is the right reaction to this

I think it would be okay to issue a warning if implicit empty rule is used, so it won't break old grammars, but will prevent from doing it accidentally. Adding an explicit empty rule to avoid warnings (and I think not ()) would be nice in that case.

erezsh commented 2 years ago

What would you use to signify empty? I thought () is pretty elegant, and it's what Python uses for empty tuple.

j6k4m8 commented 2 years ago

I do like (), for exactly that Pythonic "emptiness" mnemonic reason. // might also be good ā€” "the regular expression that matches the empty string" ā€” which I think is disallowed in grammars right now, so that might be a good way to ensure backwards compat?

Or of course, something more explicit:

%import common.EMPTY
erezsh commented 2 years ago

These are actually 3 different constructs

The difference between the 2nd and 3rd is that an empty rule will be "parsed" and create an empty tree node (like Tree('my_empty', [])), while the empty derivation gets optimized away during preprocessing, and doesn't even get sent to the parser.

j6k4m8 commented 2 years ago

šŸ¤¦ right right, comments! silly mistake :)

Although,

rule: opt1 | opt2 | // 

magically already works ;)

Of the two options, I think it makes sense to create a tree node so that it can be handled by the developer, rather than be totally transparent?

erezsh commented 2 years ago

Funny!

I think it makes sense to create a tree node

For that you can use the 2nd form. But there is a performance penalty for sending it to the parser, instead of optimizing it away in the preprocess. And another penalty for creating a tree node. It's small, but it can accumulate, especially since empty derivations are more common than it might seem at first. (not to mention the inconvenience)

evtn commented 2 years ago

What would you use to signify empty? I thought () is pretty elegant, and it's what Python uses for empty tuple.

I guess I don't have a better proposal for an empty rule. Although my grammar generator is perfectly compatible with () right now.

On second thought, I think it would be the best way to do that.