lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.77k stars 404 forks source link

Dynamic Earley does not support weights on terminals since V0.7.0 #383

Closed servoz closed 5 years ago

servoz commented 5 years ago

We use lark in the populse_db project. Since version 0.7.0, our project has been crashing with the exception ValueError "Dynamic Earley does not support weights on terminals". I confess that I am not an expert in this field and I would like to know what I can change in the use of lark to avoid this exception. We currently remain below 0.7.0 until we can bypass this exception ...

erezsh commented 5 years ago

Actually, Earley never supported weights on terminals, it just didn't throw an error. It's possible that just removing the weights will solve your issue without affecting the parse result.

servoz commented 5 years ago

I'm gonna sound stupid, but what's weights on terminals ?

erezsh commented 5 years ago

It basically just changes the priority on which terminal to choose in case of collision. That's only relevant for the standard and contextual lexers, and not the default Earley lexer which tries all combinations.

You can still have priority in rules, so having them on terminals isn't necessary and complicates the implementation (which is already non-trivial).

servoz commented 5 years ago

To make sure I understood correctly. In this part of grammar: ... quoted_field_name.1 : QUOTED_FIELD_NAME QUOTED_FIELD_NAME.1 : "{" /[^}]/ "}" ... To remove the priority, I change to: ... quoted_field_name : QUOTED_FIELD_NAME QUOTED_FIELD_NAME : "{" /[^}]/ "}" ...

In fact, I remove all .n from TERM.n, everywhere in the grammar. Is that correct?

erezsh commented 5 years ago

I would keep quoted_field_name.1, as it might be meaningful, and only change to QUOTED_FIELD_NAME

servoz commented 5 years ago

Ok, I start to be less confused, thanks to you !

May be a last question, please:

I must remove all n=2, because I see in the lark/parser_frontends.py module:

            if t.priority != 1:
                raise ValueError("Dynamic Earley doesn't support weights on terminals", t, t.priority)

So, if I am not wrong, I think I can keep (or remove) all n = 1 but I MUSTremove all n=2 (with Dynamic Earley) ?. For example in the following part of a grammar:

...
?literal.2 : KEYWORD_LITERAL -> keyword_literal
         | ESCAPED_STRING   -> string
         | SIGNED_NUMBER    -> number
         | DATE             -> date
         | TIME             -> time
         | DATETIME         -> datetime
         | list

KEYWORD_LITERAL.2 : "NULL"i
                  | "TRUE"i
                  | "FALSE"i
...

Must become:

...
?literal : KEYWORD_LITERAL -> keyword_literal
         | ESCAPED_STRING   -> string
         | SIGNED_NUMBER    -> number
         | DATE             -> date
         | TIME             -> time
         | DATETIME         -> datetime
         | list

KEYWORD_LITERAL: "NULL"i
                  | "TRUE"i
                  | "FALSE"i
...

I am right ?

erezsh commented 5 years ago

Yes, it's a quirk really, but that should solve it. Again, literal.2 is fine.

servoz commented 5 years ago

Great! Thank you very much for your support! I will report here the result (old grammar, new grammar), in case it may help someone else like me who is a little confused with these notions! Right now I'm so busy, but ASAP ....

erezsh commented 5 years ago

Great. Let me know if you need more help

servoz commented 5 years ago

I removed the weights on the terminals and, as expected, there is no longer any exception raised. However ... the execution time is now very long! Execution times for exactly the same operation:

filter_grammar = '''
?start : filter

?filter : "ALL"i                         -> all
        | conditions
        | negation
        | "(" filter ")"
        | filter BOOLEAN_OPERATOR filter -> conditions

?conditions : condition (BOOLEAN_OPERATOR condition)*

negation : "NOT"i condition
         | "NOT"i "(" filter ")"

BOOLEAN_OPERATOR : "AND"i
                 | "OR"i

CONDITION_OPERATOR : "=="i
                   | "!="i
                   | "<="i
                   | ">="i
                   | ">"i
                   | "<"i
                   | "IN"i
                   | "ILIKE"i
                   | "LIKE"i

condition : operand CONDITION_OPERATOR operand

?operand : literal
         | field_name

field_name.1 : FIELD_NAME
             | quoted_field_name

quoted_field_name.1 : QUOTED_FIELD_NAME

?literal.2 : KEYWORD_LITERAL -> keyword_literal
         | ESCAPED_STRING   -> string
         | SIGNED_NUMBER    -> number
         | DATE             -> date
         | TIME             -> time
         | DATETIME         -> datetime
         | list

DATE : INT "-" INT "-" INT
TIME : INT ":" INT (":" INT ("." INT)?)?
DATETIME : DATE "T" TIME

KEYWORD_LITERAL.2 : "NULL"i
                  | "TRUE"i
                  | "FALSE"i

FIELD_NAME.1 : ("_"|LETTER) ("_"|LETTER|DIGIT)*
QUOTED_FIELD_NAME.1 : "{" /[^}]/* "}"

list : "[" [literal ("," literal)*] "]"

%import common.INT
%import common.ESCAPED_STRING
%import common.SIGNED_NUMBER
%import common.LETTER
%import common.DIGIT

%import common.WS
%ignore WS
'''
filter_grammar = '''
?start : filter

?filter : "ALL"i                         -> all
        | conditions
        | negation
        | "(" filter ")"
        | filter BOOLEAN_OPERATOR filter -> conditions

?conditions : condition (BOOLEAN_OPERATOR condition)*

negation : "NOT"i condition
         | "NOT"i "(" filter ")"

BOOLEAN_OPERATOR : "AND"i
                 | "OR"i

CONDITION_OPERATOR : "=="i
                   | "!="i
                   | "<="i
                   | ">="i
                   | ">"i
                   | "<"i
                   | "IN"i
                   | "ILIKE"i
                   | "LIKE"i

condition : operand CONDITION_OPERATOR operand

?operand : literal
         | field_name

field_name.1 : FIELD_NAME
             | quoted_field_name

quoted_field_name.1 : QUOTED_FIELD_NAME

?literal.2 : KEYWORD_LITERAL -> keyword_literal
         | ESCAPED_STRING   -> string
         | SIGNED_NUMBER    -> number
         | DATE             -> date
         | TIME             -> time
         | DATETIME         -> datetime
         | list

DATE : INT "-" INT "-" INT
TIME : INT ":" INT (":" INT ("." INT)?)?
DATETIME : DATE "T" TIME

KEYWORD_LITERAL : "NULL"i
                | "TRUE"i
                | "FALSE"i

FIELD_NAME : ("_"|LETTER) ("_"|LETTER|DIGIT)*
QUOTED_FIELD_NAME : "{" /[^}]/* "}"

list : "[" [literal ("," literal)*] "]"

%import common.INT
%import common.ESCAPED_STRING
%import common.SIGNED_NUMBER
%import common.LETTER
%import common.DIGIT

%import common.WS
%ignore WS
'''

37 times longer...

erezsh commented 5 years ago

That is very strange! The new version is supposed to be much faster...

If you could get your grammar working with LALR, speed isn't going to be a concern.

If you can get it into a smaller grammar (a minimal example) that is still much slower with the new version, I could try and look into it.

servoz commented 5 years ago

First tests with the LALR parser and the our current grammar seems to give good results. So, for now, we will stay with LALR ! Thanks again for your help !