lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.88k stars 414 forks source link

Customizing error messages #410

Closed hniksic closed 5 years ago

hniksic commented 5 years ago

I would like to use lark to build a user-facing parser with nice error messages. (Not fancy error recovery, just readable error messages.) Consider, for example, this grammar and script:

import lark, sys

_grammar = r'''
start: toplevel_statement+
?toplevel_statement: empty | hello
empty: ENDLINE
hello: "hello" ESCAPED_STRING ENDLINE
ENDLINE: LF | ";"
%import common.ESCAPED_STRING
%import common.WS
%import common.LF
%ignore /[ \t\f\v\r]/
'''

_parser = lark.Lark(_grammar, parser='lalr')
print _parser.parse(sys.stdin.read()).pretty()

When fed the invalid string hello followed by newline, the parser emits the following error message:

lark.exceptions.UnexpectedCharacters: No terminal defined for '
' at line 1 col 6

hello
     ^

Expecting: set([u'ESCAPED_STRING', u'__IGNORE_0'])

The Python set syntax seems out of place in the error message. More importantly, the end user has no way to interpret what to make of u'__IGNORE_0'.

Likewise, when fed random text like bla, the error says:

No terminal defined for 'b' at line 1 col 1

bla
^

Expecting: set([u'ENDLINE', u'HELLO', u'__IGNORE_0'])

Is it possible to customize the errors so they don't include names of lexer terminals like ENDLINE or __IGNORE_0?

Note that I am perfectly fine with writing additional code to convert lark exceptions to more user-friendly exceptions, but currently the attributes of a lark.exceptions.UnexpectedCharacters don't seem to provide enough details to craft a user-friendly exception.

erezsh commented 5 years ago

currently the attributes of a lark.exceptions.UnexpectedCharacters don't seem to provide enough details to craft a user-friendly exception

Why not? Which information is missing?

Here are the attributes currently existing:

hniksic commented 5 years ago

For example, in case of bla, the allowed attribute contains set([u'ENDLINE', u'HELLO', u'__IGNORE_0']). The first two will be meaningful to the end user who is not familiar with the formal grammar of the DSL, and the last one appears auto-generated.

erezsh commented 5 years ago

The name is auto-generated. If you want it to have a better name, you can give it one in the grammar.

You can also create your own error message, instead of using Lark's, which is intended at developers.

hniksic commented 5 years ago

I'll do that, thanks. I guess the auto-generated name comes from the %ignore directive.

If you want to keep this issue, the idea would be for Lark to provide a user-facing error message without referring to the names from the grammar, something like "Expected one of hello, \n or ;". If this is not feasible or desirable, feel free to close it.

erezsh commented 5 years ago

It would be better for the users of the tools, but not for those who write the grammars.

But thanks for the suggestion.

hniksic commented 5 years ago

This could be resolved by the message beginning with the "user" part and then following up with additional information for the grammar developer. The latter would be easy to cut off when providing the error message to the end user.

(My use case is a large desktop application where the end user is not aware that the program is written using Lark, or even Python.)

erezsh commented 5 years ago

I don't want to encourage lazy programming.

The correct way to do it is create your own error message using the exception's attributes. There is even a get_context method to get the same formatting Lark.

hniksic commented 5 years ago

There is a related problem with the UnexpectedToken exception (and a more complex grammar). This exception as an expected attribute, but I'm not sure how to interpret it. The message looks like this:

Unexpected token Token($END, '') at line 1, column 1.
Expected one of: 
    * FILE
    * HEADER
    * __ANON_2
    * COLUMNS
    * SEPARATOR
    * ENDLINE

The expected attribute contains [u'FILE', u'HEADER', '__ANON_2', u'COLUMNS', u'SEPARATOR', 'ENDLINE'].

It seems that the grammar rules are uppercased, which mixes them with lexical terminators. Is that intentional?

More importantly, I'm not sure how to find out what __ANON_2 refers to. Is it a case of me forgetting to name a terminator?

erezsh commented 5 years ago

They are called terminals.

expected only contains terminals. Why do you think it's wrong?

Anonymous tokens are created by using terminals in rules, without defining a name for them. for example:

dict: "{" items "}"

The curly braces are anonymous. If you want to make sure they'll have a name, you have to define:

LCURLY: "{" 

etc.

hniksic commented 5 years ago

Maybe I am misunderstanding something. In the above exception expected seems to contain a mixture of rules (lower-case in the grammar, but upper-cased in expected) and terminals (upper-case in both the grammar and expected). The grammar doesn't contain an entry called FILE (all-caps), for example, but it's listed. Is that intentional?

Also, the expected list contains a mixture of byte and unicode strings - not a problem, just mentioning in case it's relevant.

erezsh commented 5 years ago

seems to contain a mixture of rules (lower-case in the grammar, but upper-cased in expected)

If you can show this to be true, it would be considered a bug.

You should probably take into consideration that Lark automatically names anonymous terminals when it makes sense. So a string of "file" will be automatically named FILE if there is no collision.

Re mixed byte & unicode, it's no big deal, but definitely not the intended behavior. If you provide a small example showing this, then I'll fix it.

hniksic commented 5 years ago

Thanks, that clears the confusion. The issue was that I have a rule called file which happens to refer to an anonymous terminal file, like this:

file: "file" file_pattern ENDLINE

As noted, the string "file" was automatically named FILE, which is why such string ended up in the error message.

I'll report the byte/unicode thing in a separate issue as soon as I reduce it to a minimal example.