Open RossPatterson opened 5 months ago
My general opinion is that `lark.lark` should reflect the syntax as defined by the BNF inside of `load_grammar.py`, and not the semantic checks implemented on top of that (and therefore most of this PR shouldn't be done in its current form). But if @erezsh is of a different mind, I am not going to argue too much. I also wouldn't be opposed to making the syntax specified in `load_grammar.py` more restrictive, and therefore removing semantic checks in other parts of the code. My "dream" would be that the BNF hard-coded in `load_grammar.py` could be generated from `lark.lark`, so we have only one source of truth. Further diverging the two grammars is IMO not a good idea.
> My general opinion is that `lark.lark` should reflect the syntax as defined by the BNF inside of `load_grammar.py` and not the semantic checks implemented on top of that (and therefore most of this PR shouldn't be done in its current form).
I see your point, but I think for anyone who isn't writing code in the Lark codebase, `lark.lark` is a much more accessible definition of the Lark grammar than the implementation in `load_grammar.py`. Some of the things I noted (e.g., `literal_range` being `STRING`s) sound more to me like syntax than semantics, but that may just be bike-shedding.
My "dream" would be that the BNF hard coded in
load_grammar.py
could be generated fromlark.lark
and we only have one source of truth.
Mine as well. I was actually quite surprised at how different they were. It was your comments in Issue #1355 that pointed me down this path.
> Further diverging the two grammars is IMO not a good idea.
I guess I see it more as specifying the same grammar in two forms, not two grammars. But you're right about semantics - there are some things (e.g., my comments about `literal_range`) that Lark can't easily express today, and that require post-processing.
I missed two lines that I should have deleted in the test files, which I used for before/after testing (`... 'grammars/lark.lark-ORIG' ...`). I'll remove them.
> I missed two lines that I should have deleted in the test files, which I used for before/after testing (`... 'grammars/lark.lark-ORIG' ...`). I'll remove them.

Fixed in 9493f81e9eab6dbe42096938bbcfc1f0cf5a2135.
@MegaIng @RossPatterson I can see both points of view, and I think the determining factor should be practicality above all else.
In an ideal world, load_grammar would only preload a minimal parser, that would parse lark.lark and use it to create the Lark parser that supports the full syntax. That would be the cleanest approach imho. But it would also be slower, probably in a very noticeable way. (did we ever do this experiment?)
As for the BNF in load_grammar, it is non-restrictive on purpose. By moving the validation from the parser to the post-parsing code, we can perform more sophisticated validation, and provide custom error messages that are more helpful and friendly.
As for what to do with `lark.lark`, in my mind it's still an open question, and depends heavily on what its main uses are. Is it to parse and manipulate Lark grammars programmatically? Is it to validate correct grammars? Is it to be used in a Lark editor? etc.
I think the options that we are facing are:
1) Leave lark.lark as a non-validating grammar, that is still capable of parsing every correct lark grammar
2) Same as 1, but pair it with an "official" visitor that performs extra validation with better error messages
3) Make lark.lark as restrictive as possible, to match load_grammar
I'm leaning towards (2), because I can't really think of good use-cases for 3, other than it being a "reference grammar" for human eyes. But I don't know if that's a good enough reason, especially since that can be accomplished using better docs, or with comments within `lark.lark`.
What do you guys think?
(needless to say, regardless of what we choose, I'm all for improving the documentation regarding the grammar)
> As for what to do with `lark.lark`, in my mind it's still an open question, and depends heavily on what its main uses are. Is it to parse and manipulate Lark grammars programmatically?

That's part of what drove me down this path. I've got a use case where I'm playing with @MegaIng's lark2railroad, combined with my own recent contributions to railroad-diagrams, to generate documentation for small programs that use Lark to parse their input.
> Is it to validate correct grammars?
That's another part of my motivation. As noted in Issue #1355, I attempted to use aliases on rule-alternates inside a rule-group, and received a weird result (from `load_grammar.py`, which, as you implied, has great contextual error messages). It turns out, although neither `lark.lark` nor the documentation says so, that's invalid syntax. Now, I can simply avoid doing that again, but I try to contribute incremental improvements to open source projects that I use, so here we are :-)
> I think the options that we are facing are: ... What do you guys think?
I think having an inaccurate reference grammar for human purposes is almost worse than not having one at all. I think having a partial grammar for parsing purposes is somewhat useful, but will be surprising to anyone who tries to use it reliably. And I think better documentation might obviate the need for a perfect reference grammar, except for the cases like lark2railroad that are actually trying to parse Lark grammars from random authors.
I'm not going to be offended by however the decision turns out. If it's something I can help with, fine, if not, also fine. This isn't my project, I'm just a (so far very happy) user who has the skills to contribute a bit.
> I think having an inaccurate reference grammar for human purposes
@RossPatterson Can you explain which option this refers to? I don't think I advocated at any point for an inaccurate grammar, only a non-validating one
> which as you implied, has great contextual error messages
The load_grammar error messages are definitely better than the generic lark error messages. If I overlooked or missed something, it's not proof that the approach is bad
Do you have any objections to option 2 that I suggested? Do you think it doesn't serve your efforts as well as option 3? Do you think it's more confusing?
Frankly, I don't accept or reject PR based on who will be offended. I'm just trying to make Lark as good as it can be. (with the limited time that I can spare for it)
> I think having an inaccurate reference grammar for human purposes
>
> @RossPatterson Can you explain which option this refers to? I don't think I advocated at any point for an inaccurate grammar, only a non-validating one
Sorry, I misunderstood.
> The load_grammar error messages are definitely better than the generic lark error messages. If I overlooked or missed something, it's not proof that the approach is bad
No, in fact, like I said, it's great. The `load_grammar.py` code is hard to understand, but the error messages are great.
> Do you have any objections to option 2 that I suggested? Do you think it doesn't serve your efforts as well as option 3? Do you think it's more confusing?
No, I have no objections to option 2. My efforts will be served as long as `lark.lark` can parse legitimate Lark grammars, and as long as I know the things I shouldn't do.
> Frankly, I don't accept or reject PRs based on who will be offended. I'm just trying to make Lark as good as it can be. (with the limited time that I can spare for it)
I understand completely. I was trying to say that I'm OK with whatever you decide to do.
@MegaIng Any objections to option 2? i.e. keep lark.lark non-validating and pair it with a visitor that performs the same extra validation as load_grammar?
Currently no, I think that is the ideal solution. Preferably the same visitor is used both internally and externally, and is the only source of `GrammarError`s in `load_grammar`, to make sure that everything is in sync.
> Preferably the same visitor is used both internally and externally and is the only source of `GrammarError`s
That seems a bit optimistic, but I agree it would be nice if we could get there.
@RossPatterson Would you be okay adapting your PR to solution 2?
That means getting lark.lark as close as possible to load_grammar, and adding a validating visitor for the rest (or somehow using the one in load_grammar)
If you'd like, we can split this PR to two, one for the docs, which I think we all agree can be merged already as is (or with tiny changes), and a separate one for lark.lark. But it's up to you.
> That means getting lark.lark as close as possible to load_grammar
To note: `lark.lark` should probably still use EBNF and not restrict itself to BNF like the hardcoded grammar does, and IMO it's OK to change the grammar in `load_grammar` a bit if it makes things easier to sync up (if it doesn't have a performance hit and of course doesn't break anything).
> @RossPatterson Would you be okay adapting your PR to solution 2?
>
> That means getting lark.lark as close as possible to load_grammar, and adding a validating visitor for the rest (or somehow using the one in load_grammar)
Yup, I can do that. I've already had to do the analysis to find the differences, and to fix them. So it looks like I'll just be backing out some of the fixes, and creating their equivalent in a visitor.
> If you'd like, we can split this PR to two, one for the docs, which I think we all agree can be merged already as is (or with tiny changes), and a separate one for lark.lark. But it's up to you.
I'm fine with keeping them together.
> To note: lark.lark should probably still use ebnf and not restrict itself to bnf like the hardcoded grammar does,
I'm personally biased in favor of BNF, because it's so well defined, whereas "EBNF" means any of a variety of possibilities. But I agree, `lark.lark` should continue to be in the Lark variant of EBNF.
> IMO it's ok to change the grammar in load_grammar a bit if it makes stuff easier to sync up (if it doesn't have a performance hit and of course doesn't break anything)
I don't anticipate that being necessary. But I've been wrong before :-)
I sat down tonight to start implementing option 2, and quickly found myself categorizing most of my changes (at the top of this PR) as "_making `lark.lark` accept the same syntax as `load_grammar`_", and therefore I was planning not to revert them. In fact, when I was done, the only changes I felt weren't either adding omitted directives or removing illegal syntax were three of the last four (e.g., accepting `rule1: !?rule2`, and the `?!` modifier, in addition to the existing `!`, `?`, and `!?`). I somehow doubt this is what either of you expect.
I think the changes that correct the `%extend` and `%override` syntax (4, 5, and 8) should obviously be retained. The changes that correct the token rule syntax (1 and 6) seem appropriate to keep as well. The rule aliases change (2), the `%ignore` syntax (3), and tokens-can't-include-rules (7) are on the fence; my personal feeling is that 2 and 3 should be kept, and 7 should be reverted and moved to the new visitor. I'm conflicted on rule references (9) - it seems like a syntax issue, but it's only implied by the `RULE` terminal definition; it isn't explicitly stated in any rule. So I'm inclined to revert it and move it to the visitor. I think the last one, the query-bang rule modifier (10), is a real nit, and ought to be in the visitor instead.
So, before I go off and start messing things up, can you tell me if I'm reading things right or wrong, and if wrong, how so?
@RossPatterson I think you misunderstood what we (at least I) meant with option 2. At the top of `load_grammar.py`, there is a BNF + terminals description of the exact grammar (the syntax) that is parsed by the file. The idea is that `lark.lark` should exactly match that, ignoring every other post-processing step (the semantics) done in `load_grammar.py`. Then we also provide an extra visitor somewhere that does all the other validation (i.e. the semantic checks). If you think it's reasonable to move stuff from semantics to syntax, then you can do that, but please also make that move in the BNF grammar defined in `load_grammar.py`.
I reverted all my changes to `lark.lark`, and did what @MegaIng suggested - made it do exactly what the BNF in `load_grammar.py` does:

- `rule`: split out the modifiers to `rule_modifiers`.
- `token`: removed `token_params`.
- `statement`: added `%extend`.
- `statement`: added `%override token`.
- `alias`: the `RULE` usage now does not accept `!` and `?` modifiers (see `RULE_MODIFIERS` below).
- `value` > `template_usage`: changed `name` to `RULE`, to prevent accepting `TOKEN`s.
- `RULE`: split out the rule modifiers to `RULE_MODIFIERS`.

I built the validating visitor, `lark_validator_vistor.py::LarkValidatorVisitor`. It is invoked by a class-level `validate(tree)` method. It checks:

- `token`s don't have aliases.
- `%ignore` has only one token specified. Note that `load_grammar` doesn't do this; you get unexpected results from `%ignore A B`.
- `rule`s don't have `alias`es on "inner expansions". Note that `load_grammar` doesn't do this; you get unexpected results instead.

There wasn't any existing doc on using the `lark.lark` grammar, so I put some instructions for the validator at the top of the grammar.
I also moved the doc for templates in `grammar.md` from the "Terminals" section to the "Rules" section, since you can't use templates with terminals, and left behind the "Templates are not allowed with terminals." part.
Some notes:

- `lark.lark` inlines several rules that `load_grammar` does not. I haven't changed this, because it will change trees produced by `lark.lark`, but I think I should.
  - `expansions`
  - `expansion`
  - `value`
- `load_grammar` inlines one rule that `lark.lark` does not. I haven't changed this, because it will change trees produced by `lark.lark`, but I think I should.
  - `name`
- `load_grammar` defines `OP` as `[+*]|[?](?![a-z_])`, while `lark.lark` omits the underscore in the not-followed-by set. I assume `load_grammar` does this to help in recognizing the difference between `atom?` and `?rulename:`, but it's a little too arcane for me to be sure.
- `load_grammar` defines `RULE_MODIFIERS` as `(!|![?]?|[?]!?)(?=[_a-z])`. Why does it use both `!` and `![?]?`? Why not just `![?]?`?
- `lark.lark` imports `ESCAPED_STRING` from `common.lark` as `_STRING`, while `load_grammar`, of course, defines it itself. They're expressed differently, but I believe they accept the same set of values. I have left them both as they were originally.
- `load_grammar` defines `STRING` as `"(\\"|\\\\|[^"\n])*?"i?`. Why use `*?`? Isn't that the same as `*`?
- `lark.lark` imports `SIGNED_INT` from `common.lark` as `NUMBER`, and `WS_INLINE` from `common.lark`, while `load_grammar`, of course, defines them itself. They're expressed differently, but they accept the same set of values. I have left them both as they were originally.
- `lark.lark` and `load_grammar` define `COMMENT` slightly differently, but the resulting regexp is identical. I have left them both as they were originally.

Very nice, this is about what I imagined :-) Thank you for going through this effort.
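(An aside on the two regex questions in the notes above: both can be probed empirically with Python's `re` module. The patterns are copied from the comment; the sample inputs are mine, so this is a spot check, not a proof.)

```python
import re

# RULE_MODIFIERS: is the leading `!` branch redundant next to `![?]?`?
with_bang = re.compile(r'(!|![?]?|[?]!?)(?=[_a-z])')
without_bang = re.compile(r'(![?]?|[?]!?)(?=[_a-z])')
for text in ["!rule", "!?rule", "?!rule", "?rule", "!?"]:
    a = with_bang.match(text)
    b = without_bang.match(text)
    # both patterns accept/reject the same prefixes on these samples
    assert (a is None) == (b is None)
    if a:
        assert a.group(0) == b.group(0)

# STRING: does the lazy `*?` behave like a greedy `*` here?
lazy = re.compile(r'"(\\"|\\\\|[^"\n])*?"i?')
greedy = re.compile(r'"(\\"|\\\\|[^"\n])*"i?')
for text in ['"abc" "def"', r'"a\"b" tail', r'"\\" x', '"hi"i rest']:
    # the group can only cross a `"` via the `\\"` branch, so the
    # closing-quote anchor forces both to the same match here
    assert lazy.match(text).group(0) == greedy.match(text).group(0)
```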
> I haven't changed this, because it will change trees produced by `lark.lark`, but I think I should.
I think that's fine. I don't think there are too many people out there relying on the details of that grammar, and I think we changed it before without warning.
> `lark.lark` inlines several rules that load_grammar does not.
`alias` and `expr` are also never inlined by `lark.lark` despite having a `?`, because the `[]` makes it so that we always have a `None` as a second child. I think these should both be changed to use `()?` instead and be inlined. IMO that is more in line with what would be expected from the parsing, matches `load_grammar` better, and is probably? easier to work with. But it would require changes in the validator you wrote.
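For illustration, the two shapes being discussed look roughly like this (rule names borrowed from `lark.lark`; the exact definitions there may differ):

```lark
// Current shape: with maybe_placeholders enabled (the default in
// recent Lark versions), [] yields a None child when the alias is
// absent, so ?alias always has two children and is never inlined:
?alias: expansion ["->" RULE]

// Proposed shape: with ()? the absent case leaves a single child,
// so ?alias gets inlined into its parent:
?alias: expansion ("->" RULE)?
```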
It might make sense to write an `Interpreter` instead of a `Visitor`, which allows slightly better control over what happens when.
> There wasn't any existing doc on using the lark.lark grammar, so I put some instructions for the validator at the top of the grammar.
We should have a doc page with a reference list for our "stdlib" grammars, but that can be a different PR. An example using the validator might be nice.
> I haven't changed this, because it will change trees produced by `lark.lark`, but I think I should.
>
> I think that's fine. I don't think there are too many people out there relying on the details of that grammar, and I think we changed it before without warning.
OK, I removed inlining from `expansions`, `expansion`, and `value` in `lark.lark`.
Fixed by 2ec5ef3dd5a0eeeea479fc7f4c8e76e2fb22e0ff.
> It might make sense to write an `Interpreter` instead of a `Visitor`, which allows slightly better control over what happens when.
The `Visitor` model seems to fit well for the validator. By its nature, it finds all the `token`s, `alias`es, and `ignore`s, which I'd have to write code to hunt down if I switched to `Interpreter`. Is there a subtlety I'm missing here?
> There wasn't any existing doc on using the lark.lark grammar, so I put some instructions for the validator at the top of the grammar.
>
> We should have a doc page with a reference list for our "stdlib" grammars, but that can be a different PR. An example using the validator might be nice.
If someone wants to write that page, I'll certainly write an example for the validator.
@RossPatterson Nice work. Let me know once you've cleaned it up a bit, and I'll give it another review.
I'll see if I find the time to write that doc page. I imagine it should be pretty short and straight-forward.
> `alias` and `expr` are also never inlined by `lark.lark` despite having a `?`, because the `[]` makes it so that we always have a `None` as a second child. I think these should both be changed to use `()?` instead and be inlined. IMO that is more in line with what would be expected from the parsing, matches `load_grammar` better, and is probably? easier to work with. But it would require changes in the validator you wrote.
Fixing the validator was pretty easy. Fixed in e9c026eecdc69a0c030a10184bfd61f1d9e7e04c.
With regard to tests, also see that we are already testing `lark.lark`. It might make sense to extend the tests there to include `LarkValidatorVisitor` and make sure that both also raise the same error messages, although it might be hard to check that we don't expect grammar analysis errors. I don't think `earley` should be raising any? But I might be wrong with that.
I'm working on blending my `test_lark_lark.py` and `test_grammar_formal.py` into `test_parser.py` and `test_grammar.py`. I'm using the technique used in `test_parser.py` for running a set of tests against multiple parsers.
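The technique in question is, roughly, generating one `unittest.TestCase` subclass per parser backend. A generic sketch of that pattern (the class and variable names here are mine, not the actual ones in `test_parser.py`):

```python
import unittest

# Hypothetical backend list; the real tests key their generated
# classes off Lark's actual parser names.
PARSERS = ["earley", "lalr"]

class ParserTestBase:
    """Tests written once; each run sees one backend via self.parser_name."""

    def test_backend_name_is_known(self):
        self.assertIn(self.parser_name, PARSERS)

def make_test_case(parser_name):
    # Build a concrete TestCase subclass bound to one backend.
    name = "TestParser_" + parser_name
    return type(name, (ParserTestBase, unittest.TestCase),
                {"parser_name": parser_name})

# Register one generated class per backend so unittest discovery finds them.
for _backend in PARSERS:
    _cls = make_test_case(_backend)
    globals()[_cls.__name__] = _cls
```

Each shared test then runs once per backend, and a failure report names the generated class, which identifies the parser.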
> I'm working on blending my `test_lark_lark.py` and `test_grammar_formal.py` into `test_parser.py` and `test_grammar.py`. I'm using the technique used in `test_parser.py` for running a set of tests against multiple parsers.
While doing that, I found that neither the original `lark.lark` nor my version passes `test_grammar.py` testcase `test_line_breaks`. Apparently `load_grammar.py` accepts escaping of newlines (i.e., backslash-newline) to continue a statement. If I'm following things correctly, that's because `load_grammar` includes `BACKSLASH` in its ignore-set, while `lark.lark` does not.
Fixed in 7f02bd130b40b46fcd2709986e9e2a855c82cc9d.
I'll be offline for the next few weeks, but I'll wrap this up after I return in mid-March.
While testing and verifying my changes, I realized there are a few more differences between `load_grammar.py` and `lark.lark`, which result in differently-shaped parse trees. Specifically:

1. The `load_grammar.py` version of `rule` pushes the optionality of `rule_modifiers` and `priority` down into `rule_modifiers` and `priority`, while `lark.lark` does not. That means the `load_grammar.py` tree always contains nodes for `rule_modifiers` and `priority`, while the `lark.lark` tree may not. I'm going to change `lark.lark` to do what `load_grammar.py` does, because the validator is affected by this difference. FYI, `term` does not do the same thing with `priority` as `rule` does: both `load_grammar.py` and `lark.lark` agree that the `priority` is optional in `term`.
2. `load_grammar.py` calls token names `TERMINAL`, and calls token definitions `term`, while `lark.lark` uses `TOKEN` and `token`. I'm not going to change this, because it seems like a big change.
3. The `load_grammar.py` version of `value` splits the recognition of `RULE` and `TERMINAL` into `nonterminal` and `terminal`, while `lark.lark` uses the combined `name` (which exists in both grammars). I'm not going to change this, because leaving 2 above intact means this will be different anyway.
4. In `load_grammar.py`, `name` is inlined, while in `lark.lark`, it is not. I don't know what to do about this, if anything.
> The `load_grammar.py` version of `rule` pushes the optionality of `rule_modifiers` and `priority` down into `rule_modifiers` and `priority`, while `lark.lark` does not.

Fixed by 4f7a5ebacaf3afe679ba552aba76a7a6a722b68a.
@erezsh @MegaIng Can one of you take a look at `test_grammar.TestGrammar.test_undefined_term`? It expects a `GrammarError`, but the error that is actually raised seems wrong. It says `Rule 'A' used but not defined (in rule start)`, but the entire grammar is just `start: A`. Shouldn't it raise "Terminal 'A' ..." instead?
> @erezsh @MegaIng Can one of you take a look at `test_grammar.TestGrammar.test_undefined_term`? It expects a `GrammarError`, but the error that is actually raised seems wrong. It says `Rule 'A' used but not defined (in rule start)`, but the entire grammar is just `start: A`. Shouldn't it raise "Terminal 'A' ..." instead?
That's a bug that was introduced in #1018 that we both missed; specifically, see here:

That shouldn't be `d.is_term` - that is referring to whether we are in the definition of a symbol/rule, but `_grammar_error` uses it to refer to the name itself.
> That's a bug that was introduced in #1018 that we both missed, specifically see here: ... That shouldn't be `d.is_term`, that is referring to whether we are in the definition of a symbol/rule, but `_grammar_error` uses it to refer to the name itself.
Fixed in 40576d2383f57b58a25d303b599bfcc54cbe23dc.
As of daac65d84b1d61cf4ff3e5eb00ce99f1b5a5ae02, everything I intended to do is done. @erezsh @MegaIng, I think this meets your expectations, and is ready for code review.
Is this work still of interest? I think it's ready for final review, and probably for merging.
Make `lark.lark` parse the same grammar as `load_grammar.py`, and make `grammar.md` document it more fully. Fixes #391, #1382, #1384.

All scenarios have tests in new file `tests/test_lark_lark.py`. Each test passed `load_grammar.py` and failed `lark.lark` before these changes, and passed both after. All existing un-skipped unit tests continue to pass after these changes. The skipped Earley tests have not been checked.

Updates to the formal Lark grammar (`lark/grammars/lark.lark`) to make it match the actual parser (`lark/load_grammar.py`):

- Token aliases (`NAME_1 -> NAME_2`) are no longer allowed. `load_grammar.py` appears to accept them, but forces the new name to be a rule-name, and `tests/test_grammar.py.test_alias_in_terminal` says they should raise a `GrammarError`. This required splitting the `expansions` rule, and all the rules it calls, into `rule_expansions` and `token_expansions` (and similarly named rules for them to call).
- Rule aliases (`name_1 -> name_2`) are now only valid for alternates at the top level of a rule.
- The `%ignore` directive must now be a single token. `load_grammar.py` accepts multiple tokens, but only processes the first.
- The `%extend` directive is now defined for rules.
- The `%extend` directive is now defined for tokens.
- The `%override` directive now supports overriding token definitions.
- The query-bang rule modifier (`?!`) is now accepted (e.g. `rule1: !?rule2`), in addition to the existing `!`, `?`, and `!?`.

Several tests in `tests/test_grammar.py` have been copied to new file `tests/test_grammar_formal.py` and updated to parse using `lark.lark` instead of `load_grammar.py`. Only the tests that verify how the Lark grammar is parsed could be implemented. Those that verify how a Lark-generated parser operates could not, because there's no (easy?) way to create a parser from the parse tree.

Oddities not changed:

- Literal ranges are defined in both `lark.lark` and `load_grammar.py` as `STRING`s, but must actually be single characters. `load_grammar.py` initially accepts multi-character strings, but later rejects them. Replacing `STRING ".." STRING` in `lark.lark` with `/./ ".." /./` failed, as Lark accepts Unicode escapes as strings (e.g. `test_unicode_literal_range_escape2` uses `"\U0000FFFF".."\U00010002"`). Replacing `STRING ".." STRING` in `lark.lark` with `/.|"\\U[0-9a-fA-F]+"/ ".." /.|"\\U[0-9a-fA-F]+"/` also failed, causing lots of unrelated tests to fail.
- `load_grammar.py` does not allow `STRING` tokens to be empty (i.e., `""`). There is no (easy?) way to prevent that in `lark.lark`. The obvious answer, adding `/(?<!"")/`, broke `tests/test_parser.py._TestParser.test_backslash2`.
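For illustration, the lookbehind idea can be seen in isolation on a simplified string pattern (this is my reduction, not the actual change that was tried, and it says nothing about why the real change broke `test_backslash2`):

```python
import re

# A simplified quoted-string token, then the same pattern with a
# trailing negative lookbehind that rejects the empty string "".
plain = re.compile(r'"[^"\n]*"')
nonempty = re.compile(r'"[^"\n]*"(?<!"")')

assert plain.fullmatch('""')             # plain accepts the empty string
assert nonempty.fullmatch('""') is None  # the lookbehind rejects it
assert nonempty.fullmatch('"abc"')       # non-empty strings still match
```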