lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.77k stars 404 forks source link

Localized ignore (Grammar Composition) #1272

Open erezsh opened 1 year ago

erezsh commented 1 year ago

Suggestion

The global ignore mechanism is the only thing stopping us from having perfect grammar composition. (in theory, at least)

If we allow each rule to have different ignore terminals, we would be able to import any rule from any grammar.

It can replace the global ignore mechanism, if we rewrite the global %ignore directive as applying to each rule individually.

Optional:

We can also add new syntax for a localize ignore, for example:

conf_file: conf_statement+ %ignore CONF_COMMENT

We could keep the global ignore mechanism as optional, but I don't see any reason to do so.

MegaIng commented 1 year ago

Didn't we previously try something into this direction? Wasn't the problem that is ambiguous what should be ignored on the borders between rules when they have different ignore sets?

I.e. does the %ignore apply between the rules in the rule, or also before and after the rule?

erezsh commented 1 year ago

That's a good point, but I think it's reasonable to say that each rule is surrounded by its own ignore terminals on both sides. And since ignoring something twice is the same as doing so once, having two rules with the same ignore set means you only need to match once. So you end up with:

i1 S1 i1  i2 S2 i2     when i1 != i2
i1 S1 i1 S2 i1         when i1 == i2

and of course the logic chains.

erezsh commented 1 year ago

I think the main challenge in implementing this would be that in LR parsers it's not clear which rule we are parsing until we reach a reduce. So if the two rules being considered have different ignore sets, there is no trivial way to resolve this at the rule level.

But I think it should work if we do it at the symbol level.

One option is to have each symbol aware of its ignore set. That can get a little tricky when trying to support things like i1 i2 S2, but maybe we can just concat the ignores into one regex, and get i1,2 S2.

Second option is to concat the regexes of the ignores to each terminal directly . The re engine should be able to handle it, and we actually get the added bonus of higher performance, because we're doing less re calls. (I did some experiments with the json parser, and concating the ignore makes it ~20% faster.)

However, it might have some unintended consequences, for example it might complicate the "unless" mechanism we use for keywords, or make it slower since we'll have to match the ignored twice.

Edit: actually for both options, probably the main challenge would be making sure that the "ignore_before" and "ignore_after" sets are minimal, and we're not matching the same regex twice on the borders of rules. But sounds like it should be possible.