vlasovskikh / funcparserlib

Recursive descent parsing library for Python based on functional combinators
https://funcparserlib.pirx.ru
MIT License
338 stars 38 forks source link

Parse errors should say what tokens are legal at the stopped position #52

Closed Kodiologist closed 3 years ago

Kodiologist commented 6 years ago

Consider the code

from funcparserlib.parser import a
p = a('a') + (a('b') | a('c'))
p.parse("ad")

The parse failure produces a stack of exceptions that ends with "funcparserlib.parser.NoParseError: got unexpected token: d". This message shows what was unexpected, but not what was expected. A better message would be something like "got unexpected token d; expected b or c". Such messages could be built using the name attribute of parsers.

vlasovskikh commented 6 years ago

@Kodiologist funcparserlib currently traces the most deeply parsed token sequence and reports the last token that the parser managed to reach as the wrong input. The maximum token position reached and the current token position are stored in the funcparserlib.parser.State object.

I guess the error reporting mechanism could be enhanced so the State stores the alternatives available at the maximum token position.

The only concern I have is that with the current API we cannot always determine the expected terminal tokens (defined via the a combinator) since it's allowed to use a generic some combinator that just tests a token against a custom predicate function, so we don't really know what's expected.

It's also not that straightforward it seems if the alternatives are non-terminal nodes, like (a('b') | many(a('c')). What do you expect in this case? Do you expect: "expected b or c" or will you be OK with something like "expected b or (many c)" or even with "expected (a 'b') | { (a 'c') }"? This is how the name attribute is now defined for parsers.

I'm open to suggestions and especially to pull requests.

Kodiologist commented 6 years ago

I expect that in order to get pretty error messages in practice, the author of a parser will have to set names manually, so that, e.g., a some parser can be described in terms of the sort of tokens it matches.

What do you expect in this case? Do you expect: "expected b or c" or will you be OK with something like "expected b or (many c)" or even with "expected (a 'b') | { (a 'c') }"?

Any of those would make sense, except that "expected b or c" should be "expected b, c, or end of input" (since many can match zero times), and that (a FOO) is better just written FOO. In general I would prefer descriptive English rather than symbolic notation, so the user doesn't have to learn meta-syntax while they're struggling to write code that's parsed with a parser written by somebody else.

vlasovskikh commented 6 years ago

The user can define their parser names via the Parser.named(name) method to make it readable. I agree that end-user friendly names are better, not sure though how user-friendly they can be for all the combinators.

@Kodiologist Are you interested in contributing better error messages? I'm not sure if it's the first thing I want to do for funcparserlib by myself.

I'm more concerned with updating the library for the up-to-date versions of Python, cleaning stuff up, dropping Python 2.4-2.6 at least, and probably continuing working on the https://github.com/vlasovskikh/funcparserlib/tree/declarative-api branch that basically changes the API a bit in order to allow for huge performance improvements via Cython in the future. I have a test implementation of these improvements if anyone cares.

Kodiologist commented 6 years ago

Are you interested in contributing better error messages?

Maybe at some point, but Hy has more pressing issues at the moment.

erlenstar commented 4 years ago

@vlasovskikh is there a plan to merge the your branch that updates funcparserlib to more modern versions? I had an issue with the old exception syntax in parser.py while running Hy in Pythonista on an iPad (which works!). I just updated the file in place for now, but happy to submit a pull request unless your modernization work is still in progress.

vlasovskikh commented 3 years ago

I might consider implementing showing expected tokens on errors, probably based on my https://github.com/vlasovskikh/funcparserlib/tree/declarative-api branch since it makes it a lot more approachable. But I wanted to release version 1.0.0 first to freeze the public API of 0.3.6, fix a few bugs.

The declarative-api branch, better parsing performance, better error messages, Python 3 only might be the features of version 2.0.0 with some backwards incompatible changes required for the declarative-api branch. In particular, it is unlikely that some(pred) will stay in the public API: we don't know anything about its predicate to reason about next tokens, cache things, etc.

See also:

vlasovskikh commented 3 years ago

See also funcparserlib.parser.first() in declarative-api which does almost what you need for showing possible tokens.

vlasovskikh commented 3 years ago

@Kodiologist @erlenstar I've implemented better error messages that mention expected tokens and grammar rules. See example error messages in 93257833acca90fd687fcdb01fb5ca9f070a415d. I plan to release it as a part of 1.0.0, here's the changelog: https://funcparserlib.pirx.ru/changes/. I'll release several alpha versions on PyPI as I mentioned in #65.

Kodiologist commented 3 years ago

The new error message for my example, funcparserlib.parser.NoParseError: got unexpected token: 'd', expected: 'b' or 'c', is quite good, but it appears as the last of a series of three exceptions, instead of alone as I would expect. Is this intentional?

vlasovskikh commented 2 years ago

@Kodiologist Yes, it was intentional. I wanted to preserve the original exception traceback while adding the information about expected tokens. It looks like that in a simple example:

>>> from funcparserlib.parser import *
>>> p = a("x") + a("y")
>>> p.parse("xy")
('x', 'y')
>>> p.parse("xz")
Traceback (most recent call last):
  File "/Users/vlan/src/funcparserlib/funcparserlib/parser.py", line 217, in parse
    (tree, _) = self.run(tokens, State(0, 0, None))
  File "/Users/vlan/src/funcparserlib/funcparserlib/parser.py", line 301, in _add
    (v2, s3) = other.run(tokens, s2)
  File "/Users/vlan/src/funcparserlib/funcparserlib/parser.py", line 612, in _some
    raise NoParseError("got unexpected token", s2)
funcparserlib.parser.NoParseError: got unexpected token

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/vlan/src/funcparserlib/funcparserlib/parser.py", line 239, in parse
    raise NoParseError(msg, e.state)
funcparserlib.parser.NoParseError: got unexpected token: 'z', expected: 'y'

I guess another option would be to integrate expected tokens in all the places I raise the original NoParseError. Would it make it more readable from your point of view?

Kodiologist commented 2 years ago

I was reacting more to the occurrence of several nested exceptions than the fact that only one of those exceptions showed the expected tokens. At least in Hy, the goal is just to show the user (that is, the programmer of Hy code) his mistake, and a detailed traceback of the parts of Hy or funcparserlib that led to the error would be a distraction. The user is trying to debug his own program, not Hy or its dependencies. So if the status quo here is maintained, we'll probably filter out the earlier exceptions… if that's not already done by Hy's preexisting, probably needlessly complicated, exception-filtering code. I have to admit that exception filtering is an area where it can take considerable work and thought to get small benefits.

vlasovskikh commented 2 years ago

@Kodiologist I've combined nested exceptions into a single one:

>>> from funcparserlib.parser import *
>>> p = a("x") + a("y")
>>> p.parse("xy")
('x', 'y')
>>> p.parse("xz")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Users/vlan/src/funcparserlib/funcparserlib/parser.py", line 217, in parse
    (tree, _) = self.run(tokens, State(0, 0, None))
  File "/Users/vlan/src/funcparserlib/funcparserlib/parser.py", line 302, in _add
    (v2, s3) = other.run(tokens, s2)
  File "/Users/vlan/src/funcparserlib/funcparserlib/parser.py", line 614, in _some
    raise NoParseError("got unexpected token", s2)
funcparserlib.parser.NoParseError: got unexpected token: 'z', expected: 'y'
vlasovskikh commented 2 years ago

@Kodiologist Regarding your comment about how to display errors to users, I'd recommend using str(e) for NoParseError exceptions, or you can access the message via e.msg. I'm not sure about showing the traceback to end-users of Hy, I guess I would like to see it for debugging purposes if I were an author of a Hy macro.

Kodiologist commented 2 years ago

Neat, thanks.