lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.63k stars 397 forks source link

Unexpected behavior of rule priority. #502

Open marcofavorito opened 4 years ago

marcofavorito commented 4 years ago

Hi,

I've noticed an unexpected behaviour of rule priority in the default parser "earley".

I'm going to describe how to reproduce the issue step-by-step, by implementing a simple propositional calculus grammar.

l = Lark('''start: formula formula: or_formula | and_formula | not_formula | atom_formula or_formula: formula "|" formula and_formula: formula "&" formula not_formula: "!" formula atom_formula: WORD

%import common.WORD // imports from terminal library %ignore " " // Disregard spaces in text''', parser="earley") p = l.parse("!a & b | c") print(p.pretty())

This example will work as expected by printing the following tree:

start formula or_formula formula and_formula formula not_formula formula atom_formula a formula atom_formula b formula atom_formula c

The priorities of ambiguous rules are fulfilled: first `not`, then `and`, then `or`.
Incidentally, the order of the rules reflects the order of priority of the rule, although I didn't find any reference to that order (I looked at this section: [https://lark-parser.readthedocs.io/en/latest/grammar/#rules](https://lark-parser.readthedocs.io/en/latest/grammar/#rules). BTW I haven't tried so hard in finding the answer in the docs, I apologize if this information was already there)

- After shuffling the rules, the behaviour breaks, as expected:

```python
from lark import Lark

l = Lark('''start: formula
formula:   not_formula | and_formula | or_formula | atom_formula
or_formula: formula "|" formula
and_formula: formula "&" formula
not_formula: "!" formula
atom_formula: WORD

%import common.WORD   // imports from terminal library
%ignore " "           // Disregard spaces in text''', parser="earley")
p = l.parse("!a & b | c")
print(p.pretty())

this gives:

start
  formula
    not_formula
      formula
        and_formula
          formula
            atom_formula    a
          formula
            or_formula
              formula
                atom_formula    b
              formula
                atom_formula    c

Rules can be assigned priority only when using Earley (future versions may support LALR as well). Priority can be either positive or negative. In not specified for a terminal, it's assumed to be 1 (i.e. the default).

So I tried the following:

from lark import Lark

l = Lark('''start: formula
formula:   not_formula | and_formula | or_formula | atom_formula
or_formula.3: formula "|" formula
and_formula.4: formula "&" formula
not_formula.5: "!" formula
atom_formula.6: WORD

%import common.WORD   // imports from terminal library
%ignore " "           // Disregard spaces in text''', parser="earley")
p = l.parse("!a & b | c")
print(p.pretty())

I was expecting the correct behaviour. But I've got the same output as the previous step:

start
  formula
    not_formula
      formula
        and_formula
          formula
            atom_formula    a
          formula
            or_formula
              formula
                atom_formula    b
              formula
                atom_formula    c

So it seems the priorities over rules are ignored by the parser.

I also tried interpreting the priority in the opposite way (the lower the value, the highest the priority), but I've got the same result. (I also suggest to make this bit clearer in the linked docs section).

Is this the expected behaviour given the state of the software? Did I apply the rule priorities wrongly? Am I missing something?

erezsh commented 4 years ago

Thank your for this bug report!

I will reply in three parts: An excuse, an explanation, and a solution.

So, first of all, the excuse :)

The SPPF implementation of Earley is a new and external contribution to Lark. It passes all the tests that I have, and it's technically very reliable (there's only a few known edge-cases). However, not as much care has been placed on making it accessible to users, which manifests as a lack in documentation, and in correct behaviors that are somewhat counter-intuitive.

Well, I pledge to improve the documentation, so now let's jump to the explanation. Earley resolves ambiguity (and hence, order of operations) using a sum priority. That means, it sums up the priority of all the subtrees, and chooses the derivation with the highest priority. If all priorities are equal (or None), it chooses according the the order of the rule, something which you have already noticed.

Unfortunately, when all the rules involved recurse into each other, as in your grammar, a funny-but-logical effect occurs: All the branches have the same sum priority. That is because you are always summing up the same elements, and as we know, the order doesn't matter.

Hopefully that was clear enough, which leads me to my proposed solution. I added a new option to Lark: priority = "decay", where instead of summing up all the priorities, it keeps halving the priority as it descends higher into the tree (or lower, depending on your perspective). That creates a formula like this pseudo-code:

node.sum_priority = sum( n.priority * (1**-n.depth) for n in bfs(node.children) )

(previously, the formula was the same as if depth==1 always)

This implementation is available now in the branch priority_decay, and it seems to work for your problem. Use it like this:

l = Lark(..., parser="earley", priority='decay')

I haven't added it to the main branch yet, because I think this needs a bit of discussion. Is this always the desired behavior? After all, it's less predictable, and can create other funny behavior (a very important node isn't chosen just because it's too far down the tree).

Also, if decay is given as an option, it should probably make sense to also control the degree of decay. Maybe users would like it to be 0.99 rather than 0.5, etc.

But this is edging on the strange and unpredictable territory of arithmetic disambiguation, and I don't have enough experience with it to have an intuition of how it will affect users, casual and expert alike.

So, any thoughts?

@night199uk I'm especially interested in your opinion. Is it a good solution? Should this be the default? After all, sum is just decay=1. Also, perhaps invert can simply be represented by negative decay? (but that's not so obvious)

JeppeKlitgaard commented 3 years ago

@erezsh

I've run into a similar issue, which is fixed by the feature you added in the priority_decay branch. Thank you so much for doing that by the way! It is fantastic to see a project with such an active and involved author!

Is there any chance that this could be merged into master?

I think adding a decay rate is also a very good idea, though for clarity I think this is best done as a separate argument that is then only used when priority='decay'. While it would be neat to wrap it all into a single argument, it might confuse some users.

On the other hand, if that argument is then very well documented in the documentation, it might actually benefit users in understanding how the priority system works for Earley, which is currently non-obvious.