lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.77k stars 404 forks source link

Ambiguities and Priorities in 1.2.1 earley #1443

Open MaxOstrowski opened 1 month ago

MaxOstrowski commented 1 month ago

Since lark version 1.2.1, there seem to be (more?) ambiguities produced for our syntax. While the syntax is ambiguous, we used prioritiies to fix this before.

%import common.INT
%import common.FLOAT
?start: (condition"."?)?

...

unsigned_integer.1: INT
unsigned_float    : FLOAT 

So whenever a condition ends with a number, like: x = 5. there are 2 interpretations, either the dot comes from the start rule, and 5 is an INTEGER, or the DOT comes from a float 5. and the dot from the start rule is omitted. The priority unsigned_integer.1: INT fixed this before.

  1. Is it normal that we still get 2 different parse trees (earley parser)
  2. We implemented the following rule in the transformer to simply select the preferred ast:
    def _ambig(self, args: list[Any]) -> Any:
        """Take most preferred version."""
        return args[0]

    but it feels like a big waste of performance that all possible ambiguities are produced. Is there a way to enforce only producing the most preferred ast?

Might be related to #1441.

erezsh commented 1 month ago

Hi @MaxOstrowski ,

What arguments are you giving to Lark? Specifically as the "ambiguity" parameter?

MaxOstrowski commented 1 month ago

We tried all of them, none of them made any difference, so we thought that we might have used it wrong? Lark.open(file, ambiguity="resolve")

erezsh commented 1 month ago

Thank you for reporting this bug.

Looks like the issue is caused by a bugfix I included in v1.2.1.

I just created a PR with a fix: https://github.com/lark-parser/lark/pull/1444

If you can test it, let me know if it fixed the issue.

I will probably make a new release later today.

erezsh commented 1 month ago

Okay, new version released - https://github.com/lark-parser/lark/releases/tag/1.2.2

It should be now fixed. Let me know otherwise.

Sorry for the inconvenience!

erezsh commented 1 month ago

Actually I will re-open. Perhaps your problem isn't entirely solved.

P.S. regarding the efficiency concerns - Earley must be aware of all the ambiguities in order to parse the text correctly. Lark stores the ambiguities in the SPPF structure, which is fairly efficient.

I think the final step can be rewritten a bit more efficiently, because we currently produce the trees even for the root solutions that we don't return. It just takes a bit of reshuffling, I'll try and do it tomorrow.

MaxOstrowski commented 1 month ago

Regarding our problem, with the newest version (1.2.2) we do not get any ambiguous solutions anymore. Thanks a lot!