Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully.

RossPatterson commented 5 months ago

Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully.

Fixes #391, #1382, #1384.

All scenarios have tests in new file tests/test_lark_lark.py. Each test passed load_grammar.py and failed lark.lark before these changes, and passed both after. All existing un-skipped unit tests continue to pass after these changes. The skipped Earley tests have not been checked.
Updates to the formal Lark grammar (lark/grammars/lark.lark) to make it match the actual parser (lark/load_grammar.py)
1. Token aliases (e.g., NAME_1 -> NAME_2) are no longer allowed.
  - load_grammar.py appears to accept them, but forces the new name to be a rule-name.
  - tests/test_grammar.py.test_alias_in_terminal says they should raise a GrammarError.
  - Making this change necessitated splitting the shared expansions rule, and all the rules it calls, into rule_expansions and token_expansions (and similarly named rules for them to call).
2. Rule aliases (name_1 -> name_2) are now only valid for alternates at the top-level of a rule.
  - load_grammar.py accepts them at all levels, but as Issue #1355 points out, they are not supported below the top level.
3. The item in a %ignore directive must now be a single token.
  - load_grammar.py accepts multiple tokens, but only processes the first.
4. The %extend directive is now defined for rules.
5. The %extend directive is now defined for tokens.
6. Token templates are no longer allowed (at least until Issue #555 is resolved).
7. Tokens can no longer include rules.
8. The %override directive now supports overriding token definitions.
9. Rule modifiers are no longer allowed in rule references (e.g., rule1: !?rule2)
10. Rule modifiers are now allowed to be ?!, in addition to the existing !, ?, and !?.
Several tests in tests/tests/test_grammar.py have been copied to new file tests/test_grammar_formal.py and updated to parse using lark.lark instead of load_grammar.py. Only the tests that verify how the Lark grammar is parsed could be implemented. Those that verify how a Lark-generated parser operates could not, because there's no (easy?) way to create a parser from the parse tree.
Oddities not changed:
- Literal ranges are defined in both lark.lark and load_grammar.py as STRINGs, but must actually be single characters.
- load_grammar.py initially accepts multi-character strings, but later rejects them.
- Attempting to replace STRING ".." STRING in lark.lark with /./ ".." /./ failed, as Lark accepts Unicode escapes as strings (e.g. test_unicode_literal_range_escape2 uses "\U0000FFFF".."\U00010002").
- Attempting to replace STRING ".." STRING in lark.lark with /.|"\\U[0-9a-fA-F]+"/ ".." /.|"\\U[0-9a-fA-F]+"/ failed, causing lots of unrelated tests to fail.
- load_grammar.py does not allow STRING tokens to be empty (i.e., ""). There is no (easy?) way to prevent that in lark.lark. The obvious answer, adding /(?<!"")/, broke tests/test_parser.py._TestParser.test_backslash2.

MegaIng commented 5 months ago

My general opinion is that lark.lark should reflect the syntax as defined by the BNF inside of load_grammar.py and not the semantic checks implemented on top of that (and therefore most of this PR shouldn't be done in it's current form). But if @erezsh is of a different mind, I am not going to argue too much. I also wouldn't be opposed to making the syntax specfified in load_grammar.py more restrictive and therefore removing semantic checks in other parts of the code. My "dream" would be that the BNF hard coded in load_grammar.py could be generated from lark.lark and we only have one source of truth. Further diverging the two grammars is IMO not a good idea.

RossPatterson commented 5 months ago

My general opinion is that lark.lark should reflect the syntax as defined by the BNF inside of load_grammar.py and not the semantic checks implemented on top of that (and therefore most of this PR shouldn't be done in it's current form).

I see your point, but I think for anyone who isn't writing code in the Lark codebase, lark.lark is a much more accessible definition of the Lark grammar than the implementation in load_grammar.py. Some of the things I noted (e.g., literal_range being STRINGs) sounds more to me like syntax than semantics, but that may just be bike-shedding.

My "dream" would be that the BNF hard coded in load_grammar.py could be generated from lark.lark and we only have one source of truth.

Mine as well. I was actually quite surprised at how different they were. It was your comments in Issue #1355 that pointed me down this path.

Further diverging the two grammars is IMO not a good idea.

I guess I see it more as specifying the same grammar in two forms, not two grammars. But you're right about semantics - there are some things (e.g., my comments about literal_range) that Lark can't easily express today, and that require post-processing.

RossPatterson commented 5 months ago

I missed two lines that I should have deleted in the test files, which I used for before/after testing (... 'grammars/lark.lark-ORIG' ...). I'll remove them.

RossPatterson commented 5 months ago

I missed two lines that I should have deleted in the test files, which I used for before/after testing (... 'grammars/lark.lark-ORIG' ...). I'll remove them.

Fixed in 9493f81e9eab6dbe42096938bbcfc1f0cf5a2135.

erezsh commented 5 months ago

@MegaIng @RossPatterson I can see both points of view, and I think the determining factor should be practicality above all else.

In an ideal world, load_grammar would only preload a minimal parser, that would parse lark.lark and use it to create the Lark parser that supports the full syntax. That would be the cleanest approach imho. But it would also be slower, probably in a very noticeable way. (did we ever do this experiment?)

As for the BNF in load_grammar, it is non-restrictive on purpose. By moving the validation from the parser to the post-parsing code, we can perform more sophisticated validation, and provide custom error messages that are more helpful and friendly.

As for what to do with lark.lark, in my mind it's still an open question, and depends heavily on what are its main uses. Is it to parse and manipulate lark grammars programatically? Is it to validate correct grammars? Is it to be used in a lark editor? etc.

I think the options that we are facing are:

1) Leave lark.lark as a non-validating grammar, that is still capable of parsing every correct lark grammar

2) Same as 1, but pair it with an "official" visitor that performs extra validation with better error messages

3) Make lark.lark as restrictive as possible, to match load_grammar

I'm leaning towards (2), because I'm can't really think of good use-cases for 3, except than it being a "reference grammar" for human eyes. But I don't know if that's a good enough reason, especially since that can be accomplished using better docs, or with comments within lark.lark

What do you guys think?

(needless to say, regardless of what we choose, I'm all for improving the documentation regarding the grammar)

RossPatterson commented 5 months ago

As for what to do with lark.lark, in my mind it's still an open question, and depends heavily on what are its main uses. Is it to parse and manipulate lark grammars programatically?

That's part of what drove me down this path. I've got a use case where I'm playing with @MegaIng's lark2railroad, combined with my own recent contributions to railroad-diagrams, to generate documentation for small programs that use Lark to parse their input.

Is it to validate correct grammars?

That's another part of my motivation. As noted in Issue #1355, I attempted to use aliases on rule-alternates inside a rule-group, and received a weird result (from load_grammar.py, which as you implied, has great contextual error messages). It turns out, although neither lark.lark nor the documetation says so, that's invalid syntax. Now, I can simply avoid doing that again, but I try to contribute incremental improvements to open source projects that I use, so here we are :-)

I think the options that we are facing are: ... What do you guys think?

I think having an inaccurate reference grammar for human purposes is almost worse than not having one at all. I think having a partial grammar for parsing purposes is somewhat useful, but will be surprising to anyone who tries to use it reliably. And I think better documentation might obviate the need for a perfect reference grammar, except for the cases like lark2railroad that are actually trying to parse Lark grammars from random authors.

I'm not going to be offended by however the decision turns out. If it's something I can help with, fine, if not, also fine. This isn't my project, I'm just a (so far very happy) user who has the skills to contribute a bit.

erezsh commented 5 months ago

I think having an inaccurate reference grammar for human purposes

@RossPatterson Can you explain which option this refers to? I don't think I advocated at any point for an inaccurate grammar, only a non-validating one

which as you implied, has great contextual error messages

The load_grammar error messages are definitely better than the generic lark error messages. If I overlooked or missed something, it's not proof that the approach is bad

Do you have any objections to option 2 that I suggested? Do you think it doesn't serve your efforts as well as option 3? Do you think it's more confusing?

Frankly, I don't accept or reject PR based on who will be offended. I'm just trying to make Lark as good as it can be. (with the limited time that I can spare for it)

RossPatterson commented 5 months ago

I think having an inaccurate reference grammar for human purposes

@RossPatterson Can you explain which option this refers to? I don't think I advocated at any point for an inaccurate grammar, only a non-validating one

Sorry, I misunderstood.

The load_grammar error messages are definitely better than the generic lark error messages. If I overlooked or missed something, it's not proof that the approach is bad

No, in fact, like I said, it's great. The load_grammar.py code is hard to understand, but the error messages are great.

Do you have any objections to option 2 that I suggested? Do you think it doesn't serve your efforts as well as option 3? Do you think it's more confusing?

No, I have no objections to option 2. My efforts will be served as long as lark.lark can parse legitimate Lark grammars, and as long as I know the things I shouldn't do.

Frankly, I don't accept or reject PR based on who will be offended. I'm just trying to make Lark as good as it can be. (with the limited time that I can spare for it)

I understand completely. I was trying to say that I'm OK with whatever you decide to do.

erezsh commented 5 months ago

@MegaIng Any objections to option 2? i.e. keep lark.lark non-validating and pair it with a visitor that performs the same extra validation as load_grammar?

MegaIng commented 5 months ago

Currently no, I think that is the ideal solution. Preferably the same visitor is used both internally and externally and is the only source of GrammerErrors in load_grammer to make sure that everything is in sync.

erezsh commented 5 months ago

Preferably the same visitor is used both internally and externally and is the only source of GrammerErrors

That seems a bit optimistic, but I agree it would be nice if we could get there.

@RossPatterson Would you be okay adapting your PR to solution 2?

That means getting lark.lark as close as possible to load_grammar, and adding a validating visitor for the rest (or somehow using the one in load_grammar)

If you'd like, we can split this PR to two, one for the docs, which I think we all agree can be merged already as is (or with tiny changes), and a separate one for lark.lark. But it's up to you.

MegaIng commented 5 months ago

That means getting lark.lark as close as possible to load_grammar

To note: lark.lark should probably still use ebnf and not restrict itself to bnf like the hardcoded grammar does, and IMO it's ok to change the grammar in load_grammar a bit if it makes stuff easier to sync up (if it doesn't have a performance hit and ofcourse doesn't break anything)

RossPatterson commented 5 months ago

@RossPatterson Would you be okay adapting your PR to solution 2?

That means getting lark.lark as close as possible to load_grammar, and adding a validating visitor for the rest (or somehow using the one in load_grammar)

Yup, I can do that. I've already had to do the analysis to find the differences, and to fix them. So it looks like I'll just be backing out some of the fixes, and creating their equivalent in a visitor.

If you'd like, we can split this PR to two, one for the docs, which I think we all agree can be merged already as is (or with tiny changes), and a separate one for lark.lark. But it's up to you.

I'm fine with keeping them together.

RossPatterson commented 5 months ago

To note: lark.lark should probably still use ebnf and not restrict itself to bnf like the hardcoded grammar does,

I'm personally biased in favor of BNF, because it's so well defined, whereas "EBNF" means any of a variety of possibilities. But I agree, lark.lark should continue to be in the Lark variant of EBNF.

IMO it's ok to change the grammar in load_grammar a bit if it makes stuff easier to sync up (if it doesn't have a performance hit and ofcourse doesn't break anything)

I don't anticipate that being necessary. But I've been wrong before :-)

RossPatterson commented 5 months ago

I sat down tonight to start implementing option 2, and quickly found myself categorizing most of my changes (at the top of this PR) as "_making lark.lark accept the same syntax as load_grammar_", and therefore I was planning not to revert them. In fact, when I was done, the only changes I felt weren't either adding omitted directives or removing illegal syntax were three of the last four:

7 Tokens can no longer include rules.
9 Rule modifiers are no longer allowed in rule references (e.g., rule1: !?rule2)
10 Rule modifiers are now allowed to be ?!, in addition to the existing !, ?, and !?.

I somehow doubt this is what either of you expect.

I think the changes that correct the %extend and %override syntax (4, 5, and 8) should obviously be retained. The changes that correct the token rule syntax (1 and 6) seem appropriate to keep as well. The rule aliases change (2), %ignore syntax (3) and tokens-can't-include-rules (7) are on the fence, and my personal feeling is that 2 and 3 should be kept and 7 should be reverted and moved to the new visitor. I'm conflicted on rule references (9) - it seems like it's a syntax issue, but it's only implied by the RULE terminal definition, it isn't explicitly stated in any rule. So I'm inclined to revert it an move it to the visitor. I think the last one, the query-bang rule modifier (10), is a real nit, and ought to be in the visitor instead.

So, before I go off and start messing things up, can you tell me if I'm reading things right or wrong, and if wrong, how so?

MegaIng commented 5 months ago

@RossPatterson I think you misunderstood what we (at least I) meant with option 2. At the top of load_grammar.py, there is a BNF + terminals description of the exact grammar (the syntax) that is parsed by the file. The idea is that lark.lark should exactly match that, ignoring every other post processing step (the semantics) done in load_grammar.py. Then we also provide an extra visitor somewhere that does all the other validation (i.e. the semantic checks). If you think it's reasonable to move stuff from semantics to syntax, then you can do that, but please that also do the move in the BNF grammar defined in load_grammar.py.

RossPatterson commented 5 months ago

I reverted all my changes to lark.lark, and did what @MegaIng suggested - made it do exactly what the BNF in load_grammar.py does.

In rule, split out the modifiers to rule_modifiers.
In token, removed token_params.
In statement, added %extend.
In statement, added %override token.
In alias, the RULE usage now does not accept ! and ? modifiers (see RULE_MODIFIERS below).
In value > template_usage, changed name to RULE, to prevent accepting TOKENs.
In RULE, split out the rule modifiers to RULE_MODIFIERS.

I built the validating visitor, lark_validator_vistor.py::LarkValidatorVisitor. It is invoked by a class-level validate(tree) method. It checks:

That tokens don't have aliases.
That %ignore has only one token specified. Note that load_grammar doesn't do this, you get unexpected results from %ignore A B.
That rules don't have aliass on "inner expansions". Note that load_grammar doesn't do this, you get unexpected results instead.

There wasn't any existing doc on using the lark.lark grammar, so I put some instructions for the validator at the top of the grammar.

I also moved the doc for templates in grammar.md from the "Terminals" section to the "Rules" section, since you can't use templates with terminals, and left behind the "Templates are not allowed with terminals." part

Some notes:

lark.lark inlines several rules that load_grammar does not. I haven't changed this, because it will change trees produced by lark.lark, but I think I should.
- expansions
- expansion
- value
load_grammar inlines one rule that lark.lark does not. I haven't changed this, because it will change trees produced by lark.lark, but I think I should.
- name
load_grammar defines OP as [+*]|[?](?![a-z_]), while lark.lark omits the underscore in the not-followed-by set. I assume load_grammar does this to help in recognizing the difference between atom? and ?rulename:, but it's a little too arcane for me to be sure.
load_grammar defines RULE_MODIFIERS as (!|![?]?|[?]!?)(?=[_a-z]). Why does it use both ! and ![?]?? Why not just ![?]??
lark.lark imports ESCAPED_STRING from common.lark as _STRING, while load_grammar, of course, defines it itself. They're expressed differently, but I believe they accept the same set of values. I have left them both as they were originally.
load_grammar defines STRING as "(\\"|\\\\|[^"\n])*?"i?. Why use *?? Isn't that the same as *?
lark.lark imports SIGNED_INT from common.lark as NUMBER, and WS_INLINE from common.lark, while load_grammar, of course, defines them itself. They're expressed differently, but they accept the same set of values. I have left them both as they were originally.
lark.lark and load_grammar define COMMENT slightly differently, but the resulting regexp is identical. I have left them both as they were originally.

MegaIng commented 5 months ago

Very nice, this is about what I imagined :-) Thank you for going through this effort.

I haven't changed this, because it will change trees produced by lark.lark, but I think I should.

I think that's fine. I don't think there are too many people out there relying on the details of that grammar, and I think we changed it before without warning.

lark.lark inlines several rules that load_grammar does not.

alias and expr are also never inlined by lark.lark despite having a ? because the [] makes it so that we always have a None as a second child. I think this should both be changed to use ()? instead and be inlined. IMO that is more inline with that would be expected from the parsing, matches load_grammar better and is probably? easier to work with. But it would require changes in the validator you wrote.

It might make sense to write an Interpreter instead of a Visitor which allows slighter better control about what happens when.

There wasn't any existing doc on using the lark.lark grammar, so I put some instructions for the validator at the top of the grammar.

We should have a doc page with a reference list for our "stdlib" grammars, but that can be a different PR. An example using the validator might be nice.

RossPatterson commented 4 months ago

I haven't changed this, because it will change trees produced by lark.lark, but I think I should.

I think that's fine. I don't think there are too many people out there relying on the details of that grammar, and I think we changed it before without warning.

OK, I removed inlining from expansions, expansion, and value in lark.lark.

Fixed by 2ec5ef3dd5a0eeeea479fc7f4c8e76e2fb22e0ff.

RossPatterson commented 4 months ago

It might make sense to write an Interpreter instead of a Visitor which allows slighter better control about what happens when.

The Visitor model seems to fit well for the validator. By its nature, it finds all the tokens, aliases, and ignores, which I'd have to write code to hunt down if I switched to Interpreter. Is there a subtlty I'm missing here?

RossPatterson commented 4 months ago

There wasn't any existing doc on using the lark.lark grammar, so I put some instructions for the validator at the top of the grammar.

We should have a doc page with a reference list for our "stdlib" grammars, but that can be a different PR. An example using the validator might be nice.

If someone wants to write that page, I'll certainly write of an example for the validator.

erezsh commented 4 months ago

@RossPatterson Nice work. Let me know once you've cleaned it up a bit, and I'll give it another review.

I'll see if I find the time to write that doc page. I imagine it should be pretty short and straight-forward.

RossPatterson commented 4 months ago

alias and expr are also never inlined by lark.lark despite having a ? because the [] makes it so that we always have a None as a second child. I think this should both be changed to use ()? instead and be inlined. IMO that is more inline with that would be expected from the parsing, matches load_grammar better and is probably? easier to work with. But it would require changes in the validator you wrote.

Fixing the validator was pretty easy. Fixed in e9c026eecdc69a0c030a10184bfd61f1d9e7e04c.

RossPatterson commented 4 months ago

With regard to tests, also see that we are already testing lark.lark:

https://github.com/lark-parser/lark/blob/7646fb31609c6c83e0998b121eeffb908f5ec5ba/tests/test_parser.py#L1047

It might make sense to extend the tests there to include LarkValidatorVisitor and make sure that both also raise the same error messages although it might be hard to check that we don't expect grammar analysis errors. I don't think earley should be raising any? But I might be wrong with that.

I'm working on blending my test_lark_lark.py and test_grammar_formal.py into test_parser.py and test_grammar.py. I'm using the technique used in test_parser.py for running a set of tests against multiple parsers.

RossPatterson commented 4 months ago

I'm working on blending my test_lark_lark.py and test_grammar_formal.py into test_parser.py and test_grammar.py. I'm using the technique used in test_parser.py for running a set of tests against multiple parsers.

While doing that, I found that neither the original lark.lark nor my version pass test_grammar.py testcase test_line_breaks. Apparently load_grammar.py accepts escaping of newlines (i.e., backslash-newline) to continue a statement. If I'm following things correctly, that's because load_grammar includes BACKSLASH in its ignore-set, while lark.lark does not.

Fixed in 7f02bd130b40b46fcd2709986e9e2a855c82cc9d.

RossPatterson commented 4 months ago

I'll be offline for the next few weeks, but I'll wrap this up after I return in mid-March.

RossPatterson commented 3 months ago

While testing and verifying my changes, I realized there are a few more differences between load_grammar.py and lark.lark, which result in differently-shaped parse trees. Specifically:

The load_grammar.py version of rule pushes the optionality of rule_modifiers and prioritydown into rule_modifiers and priority, while lark.lark does not. That means the load_grammar.py tree always contains nodes for rule_modifiers and priority node, while the lark.lark tree may not. I'm going to change lark.lark to do what load_grammar.py does, because the validator is affected by this difference. FYI, term does not do the same thing with priority as rule does. Both load_grammar.py and lark.lark agree that the priority is optional in term.
load_grammar.py calls token names TERMINAL, and calls token definitions term, while lark.lark uses TOKEN and token. I'm not going to change this, because it seems like a big change.
The load_grammar.py version of value splits the recognition of RULE and TERMINAL into nonterminal and terminal, while lark.lark uses the combined name (which exists in both grammars). I'm not going to change this, because leaving 2 above intact means this will be different anyway.
In load_grammar.py, name is inlined, while in lark.lark, it is not. I don't know what to do about this, if anything.

RossPatterson commented 3 months ago

The load_grammar.py version of rule pushes the optionality of rule_modifiers and prioritydown into rule_modifiers and priority, while lark.lark does not.

FIxed by 4f7a5ebacaf3afe679ba552aba76a7a6a722b68a.

RossPatterson commented 3 months ago

@erezsh @MegaIng Can one of you take a look at test_grammar.TestGrammar.test_undefined_term? It expects a GrammarError, but the error that is actually raised seems wrong. It says Rule 'A' used but not defined (in rule start)", but the entire grammar is just start: A. Shouldn't it raise "Terminal 'A' ..." instead?

MegaIng commented 3 months ago

@erezsh @MegaIng Can one of you take a look at test_grammar.TestGrammar.test_undefined_term? It expects a GrammarError, but the error that is actually raised seems wrong. It says Rule 'A' used but not defined (in rule start)", but the entire grammar is just start: A. Shouldn't it raise "Terminal 'A' ..." instead?

That's a bug that was introducded in #1018 that we both missed, specfically see here:

https://github.com/lark-parser/lark/blob/706190849ee4529cfc852bc1adb86f1aab11c560/lark/load_grammar.py#L1367-L1369

That shouldn't be d.is_term, that is referring to whether we are in the definition of a symbol/rule, but _grammar_error uses it to refer to the name itself.

RossPatterson commented 3 months ago

That's a bug that was introducded in #1018 that we both missed, specfically see here: ... That shouldn't be d.is_term, that is referring to whether we are in the definition of a symbol/rule, but _grammar_error uses it to refer to the name itself.

Fixed in 40576d2383f57b58a25d303b599bfcc54cbe23dc.

RossPatterson commented 3 months ago

As of daac65d84b1d61cf4ff3e5eb00ce99f1b5a5ae02, everything I intended to do is done. @erezsh @MegaIng, I think this meets your expectations, and is ready for code review.

RossPatterson commented 2 months ago

Is this work still of interest? I think it's ready for final review, and probably for merging.

lark-parser / lark

Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully. #1388