Closed DRMacIver closed 5 years ago
You may notice the python parser example always appends a newline:
https://github.com/lark-parser/lark/blob/master/examples/python_parser.py#L57
That is the proper usage. It has no downsides, as far as I'm aware, except for slight inelegance.
You may notice the python parser example always appends a newline:
Ah, so it does, my apologies!
That is the proper usage. It has no downsides, as far as I'm aware, except for slight inelegance.
Well the downside that caused me to run into this is that I'm trying to get Hypothesis
to understand more of Lark's grammar functionality automatically, and having rules that aren't encoded in the grammar or parser means that it can't do that. I understand that might not be a priority for you though!
I'm trying to get
Hypothesis
to understand more of Lark's grammar functionality automatically
Interesting. What's your approach?
For example, the indenter is not part of the grammar either. How do you plan to handle that?
Interesting. What's your approach?
We've already got the basics in place as of a few versions ago, it just currently doesn't handle any custom lexing at all.
I've currently got a refactor which is much more explicit about state management which I hope to land in the next day or two, at which point it's just a case of maintaining a parallel version of whatever state the parser needs and running the logic in reverse when it comes time to turn terminals into strings.
For example, the indenter is not part of the grammar either. How do you plan to handle that?
Well as long as it's part of the parser we have access to it - we take the whole parser object, not just a grammar definition.
The current plan is to handle it just by noting that the postlex is an Indent
class and handling the indentation levels correctly ourselves when we get _INDENT
and _DEDENT
terminals emitted. It would be nice if we could handle arbitrary postlexers, but I don't currently think there's a way to do that beyond providing an extensible mechanism to hook in similar logic for other postlexer classes.
Exact details are still TBD - I ran into enough problems getting this to work (nothing intrinsic or major I think - just a mix of my not yet understanding Lark as well as I'd like and getting two complex projects with different assumptions about the world to see eye to eye) that I ended up having to deprioritise it for the next few weeks.
BTW in case you're interested once I fixed the test I had to always add a newline, '#\ud800\n'
is an example of code parsed by the python 3 Lark example parser but rejected by ast.parse
(with a UnicodeEncodeError
weirdly, which I'd almost think was a bug in ast.parse
although it's arguable) - the default source encoding without an encoding comment is utf-8, and this is a bare surrogate which can't be represented as such.
This is a bit of a weird edge case. I'm happy to file a separate issue about it if you care, but I'm honestly not sure if it even counts as a bug and mostly mention it for your possible amusement.
not yet understanding Lark as well as I'd like
If I can make anything about the code clearer, feel free to ask me. I don't mind explaining it. Also, regarding the information which you intend to access, perhaps we can agree on an interface for it, to spare you dealing with the internals. That way I won't accidentally break it as the versions progress.
but I'm honestly not sure if it even counts as a bug
I'm not sure it is either. After all, I wrote the Python grammar for parsing, not validation. But I'm open to be convinced otherwise.
If I can make anything about the code clearer, feel free to ask me. I don't mind explaining it.
Thanks! So far it's more a matter of time rather than anything confusing me (this was my first time looking into it - @Zac-HD wrote the original lark integration, I just picked it up because I had a hand-rolled Python 3 generator that I was hoping I'd be able to replace. Alas, not yet), but I'm sure I'll have questions later.
Also, regarding the information which you intend to access, perhaps we can agree on an interface for it, to spare you dealing with the internals. That way I won't accidentally break it as the versions progress.
Sounds good to me. Although don't worry about breaking it too much - we're used to needing details of libraries that are a bit internal and we're pretty good at staying on top of changes thanks to our release automation. It might be worth deferring this discussion until our usage is a bit more mature and/or we first run into a problem?
FWIW the main things we're using right now are the ability to turn terminals into regular expressions (so we can lean on the existing regular expression support in Hypothesis) and map non-terminals to the list of possible expansions, plus we need to look up the starting symbol for the grammar and have access to the Terminal
and NonTerminal
classes and their name property.
I'm not sure it is either. After all, I wrote the Python grammar for parsing, not validation. But I'm open to be convinced otherwise.
From our point of view grammars are a lot more useful if everything they describe is a valid instance of the desired format, but it doesn't matter that much - if we end up shipping anything based on the Python 3 grammar we can always tweak it for our needs.
Okay, these are unlikely to change, I suppose. We can always revisit this subject if it becomes an issue.
if we end up shipping anything based on the Python 3 grammar we can always tweak it for our needs
Okay. I'll be happy to link to it, or even keep a copy in Lark when it's complete.
Anyway, let me know if there's anything I can do to aid your progress.
If you add the following line to main in
python_parser.py
:Then you get the following error:
(Tested on d1fea12aa59dab6f152cf4c48c3bbf90d1c85ac1 on Python 3.6.6).
Presumably the problem is that it's expecting the line to be terminated by a new line rather than an EOF. I'm pretty sure not doing so is valid Python, but either way the
IndexError
seems suboptimal.