lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.86k stars 413 forks source link

IndexError in indenter for Python 3 Parser #321

Closed DRMacIver closed 5 years ago

DRMacIver commented 5 years ago

If you add the following line to main in python_parser.py:

python_parser3.parse("#")

Then you get the following error:

Traceback (most recent call last):
  File "python_parser.py", line 79, in <module>
    python_parser3.parse("#")
  File "/home/david/scratch/lark/lark/lark.py", line 253, in parse
    return self.parser.parse(text)
  File "/home/david/scratch/lark/lark/parser_frontends.py", line 37, in parse
    return self.parser.parse(token_stream, *[sps] if sps is not NotImplemented else [])
  File "/home/david/scratch/lark/lark/parsers/lalr_parser.py", line 68, in parse
    for token in stream:
  File "/home/david/scratch/lark/lark/indenter.py", line 34, in _process
    for t in self.handle_NL(token):
  File "/home/david/scratch/lark/lark/indenter.py", line 18, in handle_NL
    indent_str = token.rsplit('\n', 1)[1] # Tabs and spaces
IndexError: list index out of range

(Tested on d1fea12aa59dab6f152cf4c48c3bbf90d1c85ac1 on Python 3.6.6).

Presumably the problem is that it's expecting the line to be terminated by a new line rather than an EOF. I'm pretty sure not doing so is valid Python, but either way the IndexError seems suboptimal.

erezsh commented 5 years ago

You may notice the python parser example always appends a newline:

https://github.com/lark-parser/lark/blob/master/examples/python_parser.py#L57

That is the proper usage. It has no downsides, as far as I'm aware, except for slight inelegance.

DRMacIver commented 5 years ago

You may notice the python parser example always appends a newline:

Ah, so it does, my apologies!

That is the proper usage. It has no downsides, as far as I'm aware, except for slight inelegance.

Well the downside that caused me to run into this is that I'm trying to get Hypothesis to understand more of Lark's grammar functionality automatically, and having rules that aren't encoded in the grammar or parser means that it can't do that. I understand that might not be a priority for you though!

erezsh commented 5 years ago

I'm trying to get Hypothesis to understand more of Lark's grammar functionality automatically

Interesting. What's your approach?

For example, the indenter is not part of the grammar either. How do you plan to handle that?

DRMacIver commented 5 years ago

Interesting. What's your approach?

We've already got the basics in place as of a few versions ago, it just currently doesn't handle any custom lexing at all.

I've currently got a refactor which is much more explicit about state management which I hope to land in the next day or two, at which point it's just a case of maintaining a parallel version of whatever state the parser needs and running the logic in reverse when it comes time to turn terminals into strings.

For example, the indenter is not part of the grammar either. How do you plan to handle that?

Well as long as it's part of the parser we have access to it - we take the whole parser object, not just a grammar definition.

The current plan is to handle it just by noting that the postlex is an Indent class and handling the indentation levels correctly ourselves when we get _INDENT and _DEDENT terminals emitted. It would be nice if we could handle arbitrary postlexers, but I don't currently think there's a way to do that beyond providing an extensible mechanism to hook in similar logic for other postlexer classes.

Exact details are still TBD - I ran into enough problems getting this to work (nothing intrinsic or major I think - just a mix of my not yet understanding Lark as well as I'd like and getting two complex projects with different assumptions about the world to see eye to eye) that I ended up having to deprioritise it for the next few weeks.

DRMacIver commented 5 years ago

BTW in case you're interested once I fixed the test I had to always add a newline, '#\ud800\n' is an example of code parsed by the python 3 Lark example parser but rejected by ast.parse (with a UnicodeEncodeError weirdly, which I'd almost think was a bug in ast.parse although it's arguable) - the default source encoding without an encoding comment is utf-8, and this is a bare surrogate which can't be represented as such.

This is a bit of a weird edge case. I'm happy to file a separate issue about it if you care, but I'm honestly not sure if it even counts as a bug and mostly mention it for your possible amusement.

erezsh commented 5 years ago

not yet understanding Lark as well as I'd like

If I can make anything about the code clearer, feel free to ask me. I don't mind explaining it. Also, regarding the information which you intend to access, perhaps we can agree on an interface for it, to spare you dealing with the internals. That way I won't accidentally break it as the versions progress.

but I'm honestly not sure if it even counts as a bug

I'm not sure it is either. After all, I wrote the Python grammar for parsing, not validation. But I'm open to be convinced otherwise.

DRMacIver commented 5 years ago

If I can make anything about the code clearer, feel free to ask me. I don't mind explaining it.

Thanks! So far it's more a matter of time rather than anything confusing me (this was my first time looking into it - @Zac-HD wrote the original lark integration, I just picked it up because I had a hand-rolled Python 3 generator that I was hoping I'd be able to replace. Alas, not yet), but I'm sure I'll have questions later.

Also, regarding the information which you intend to access, perhaps we can agree on an interface for it, to spare you dealing with the internals. That way I won't accidentally break it as the versions progress.

Sounds good to me. Although don't worry about breaking it too much - we're used to needing details of libraries that are a bit internal and we're pretty good at staying on top of changes thanks to our release automation. It might be worth deferring this discussion until our usage is a bit more mature and/or we first run into a problem?

FWIW the main things we're using right now are the ability to turn terminals into regular expressions (so we can lean on the existing regular expression support in Hypothesis) and map non-terminals to the list of possible expansions, plus we need to look up the starting symbol for the grammar and have access to the Terminal and NonTerminal classes and their name property.

I'm not sure it is either. After all, I wrote the Python grammar for parsing, not validation. But I'm open to be convinced otherwise.

From our point of view grammars are a lot more useful if everything they describe is a valid instance of the desired format, but it doesn't matter that much - if we end up shipping anything based on the Python 3 grammar we can always tweak it for our needs.

erezsh commented 5 years ago

Okay, these are unlikely to change, I suppose. We can always revisit this subject if it becomes an issue.

if we end up shipping anything based on the Python 3 grammar we can always tweak it for our needs

Okay. I'll be happy to link to it, or even keep a copy in Lark when it's complete.

Anyway, let me know if there's anything I can do to aid your progress.