lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Cleaner EBNF grammar #155

Open brupelo opened 6 years ago

brupelo commented 6 years ago

Would it be possible to modify the current EBNF grammar syntax so that, instead of being forced to put the colon and the first alternative on the same line as the rule name:

grammar = '''
    rule : rule1
        | rule2
'''

you could have clean, indented blocks like this (the format used in the GLSL specs):

grammar = '''
    rule : 
        rule1
        rule2
'''

or maybe (not a very clean one):

grammar = '''
    rule : 
         | rule1
         | rule2
'''

or:

grammar = '''
    rule 
         : rule1
         | rule2
'''

or (inspired by ANTLR4):

grammar = '''
    rule 
         : rule1
         | rule2
         ;
'''

Rationale: the EBNF grammar becomes much more readable, and you can also fold long grammars easily in your favourite text editor because the grammar now has proper indentation, example here.

I guess it's a matter of tweaking this file a bit. Even if you don't like the idea, could you explain how you'd do it?

Thanks.

erezsh commented 6 years ago

I like these versions:

1.

        rule: 
            rule1
            rule2

2.

        rule 
             : rule1
             | rule2

The first one is cleaner and the positions of the alternatives are easy to rearrange. But I'm worried that it's not clear enough that each line is a different option.

The second one is a little bit weird, but it might work.

I'll have to give it some thought, to make sure it won't collide with other concepts.

brupelo commented 6 years ago

Yeah, I do agree, the first one is the "optimal" one from the whole set of EBNF grammars (as it doesn't contain redundant/verbose elements, it's the most pythonic one :))

Also, did you look at this repo? It contains a lot of EBNF grammars ready to go... it'd be cool if they could be used out of the box with lark, or maybe converted automatically to lark... I guess adding more arguments to the Lark constructor to specify which type of EBNF you're dealing with would be ugly.

I mean, I guess this is some sort of tradeoff... usually you want your functions/constructor/ui/gui to be as minimal as possible so they become clear as water for users, quoting:

The ideal numbers of arguments for a function is zero (niladic). Next comes one (monadic), followed closely by two (dyadic). Three arguments (triadic) should be avoided where possible. More than three (polyadic) requires very special justification ‐ and then shouldn't be used anyway.

Anyway, just give it some thoughts, I like to bring to the table new use-cases or improvements about usability ;)

B.

NS: When I said "converting them automatically" I meant maybe creating some sort of script (without modifying the lark core, so the code doesn't become more complex without any real reason)

erezsh commented 6 years ago

it'd be cool if they could be used out of the box with lark, or maybe converting them automatically to lark

Yes, that's a good idea. I'd say converting them is the way to go. However, not all of them can be converted; many of them require code that resides in the grammar to work correctly.

The conversion script can be added to lark.tools.
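A rough sketch of what such a conversion script might look like, covering only the indented-block style proposed above. The name `indented_to_lark` is made up for illustration and is not part of lark.tools. It assumes a rule header is an unindented line ending in a colon and that every indented line below it is exactly one alternative; anything else is passed through untouched.

```python
import textwrap


def indented_to_lark(text):
    """Rewrite 'rule :' followed by one alternative per indented line
    into the 'rule: a | b' form that Lark accepts today."""
    text = textwrap.dedent(text)  # grammars usually sit indented inside ''' ... '''
    out = []
    name, alts = None, []  # rule being collected, and its alternatives so far

    def flush():
        # Emit the collected rule, joining its alternatives with '|'.
        if name is not None:
            out.append(f"{name}: " + "\n    | ".join(alts))

    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        indented = line[0].isspace()
        if not indented and stripped.endswith(":"):
            # Header with nothing after the colon: start a new block.
            flush()
            name, alts = stripped[:-1].strip(), []
        elif indented and name is not None:
            alts.append(stripped)
        else:
            # Already in the current Lark form: keep the line as it is.
            flush()
            name, alts = None, []
            out.append(line.rstrip())
    flush()
    return "\n".join(out)


grammar = '''
    rule :
        rule1
        rule2
'''
print(indented_to_lark(grammar))
```

A limitation worth noting: an alternative that wraps onto a second line would be split into two alternatives, which is the same ambiguity raised earlier about each line being a different option.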

whitten commented 6 years ago

I'm concerned that the

nonterminal : | terminal | nonterminal ;

could be understood as an "empty" right-hand side, but I think we should have some way of being explicit that the right-hand side matches "nothing", i.e. no input. Since "[someterminal]" means that someterminal is optional, perhaps we could use the syntax [] to mean an optional "empty".

brupelo commented 6 years ago

Guys, just for the sake of making my point clearer (sometimes the best way to prove something is by presenting visual samples), I want you to take a look at the comparison below between the syntax used here and the lark one:

GLSLangSpec.4.60.original_ebnf

variable_identifier : IDENTIFIER primary_expression : variable_identifier INTCONSTANT UINTCONSTANT FLOATCONSTANT BOOLCONSTANT DOUBLECONSTANT LEFT_PAREN expression RIGHT_PAREN postfix_expression : primary_expression postfix_expression LEFT_BRACKET integer_expression RIGHT_BRACKET function_call postfix_expression DOT FIELD_SELECTION postfix_expression INC_OP postfix_expression DEC_OP integer_expression : expression function_call : function_call_or_method function_call_or_method : function_call_generic function_call_generic : function_call_header_with_parameters RIGHT_PAREN function_call_header_no_parameters RIGHT_PAREN function_call_header_no_parameters : function_call_header VOID function_call_header function_call_header_with_parameters : function_call_header assignment_expression function_call_header_with_parameters COMMA assignment_expression function_call_header : function_identifier LEFT_PAREN function_identifier : type_specifier postfix_expression unary_expression : postfix_expression INC_OP unary_expression DEC_OP unary_expression unary_operator unary_expression unary_operator : PLUS DASH BANG TILDE multiplicative_expression : unary_expression multiplicative_expression STAR unary_expression multiplicative_expression SLASH unary_expression multiplicative_expression PERCENT unary_expression additive_expression : multiplicative_expression additive_expression PLUS multiplicative_expression additive_expression DASH multiplicative_expression shift_expression : additive_expression shift_expression LEFT_OP additive_expression shift_expression RIGHT_OP additive_expression relational_expression : shift_expression relational_expression LEFT_ANGLE shift_expression relational_expression RIGHT_ANGLE shift_expression relational_expression LE_OP shift_expression relational_expression GE_OP shift_expression equality_expression : relational_expression equality_expression EQ_OP relational_expression equality_expression NE_OP relational_expression and_expression : 
equality_expression and_expression AMPERSAND equality_expression exclusive_or_expression : and_expression exclusive_or_expression CARET and_expression inclusive_or_expression : exclusive_or_expression inclusive_or_expression VERTICAL_BAR exclusive_or_expression logical_and_expression : inclusive_or_expression logical_and_expression AND_OP inclusive_or_expression logical_xor_expression : logical_and_expression logical_xor_expression XOR_OP logical_and_expression logical_or_expression : logical_xor_expression logical_or_expression OR_OP logical_xor_expression conditional_expression : logical_or_expression logical_or_expression QUESTION expression COLON assignment_expression assignment_expression : conditional_expression unary_expression assignment_operator assignment_expression assignment_operator : EQUAL MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN ADD_ASSIGN SUB_ASSIGN LEFT_ASSIGN RIGHT_ASSIGN AND_ASSIGN XOR_ASSIGN OR_ASSIGN expression : assignment_expression expression COMMA assignment_expression constant_expression : conditional_expression declaration : function_prototype SEMICOLON init_declarator_list SEMICOLON PRECISION precision_qualifier type_specifier SEMICOLON type_qualifier IDENTIFIER LEFT_BRACE struct_declaration_list RIGHT_BRACE SEMICOLON type_qualifier IDENTIFIER LEFT_BRACE struct_declaration_list RIGHT_BRACE IDENTIFIER SEMICOLON type_qualifier IDENTIFIER LEFT_BRACE struct_declaration_list RIGHT_BRACE IDENTIFIER array_specifier SEMICOLON type_qualifier SEMICOLON type_qualifier IDENTIFIER SEMICOLON type_qualifier IDENTIFIER identifier_list SEMICOLON identifier_list : COMMA IDENTIFIER identifier_list COMMA IDENTIFIER function_prototype : function_declarator RIGHT_PAREN function_declarator : function_header function_header_with_parameters function_header_with_parameters : function_header parameter_declaration function_header_with_parameters COMMA parameter_declaration function_header : fully_specified_type IDENTIFIER LEFT_PAREN parameter_declarator : type_specifier 
IDENTIFIER type_specifier IDENTIFIER array_specifier parameter_declaration : type_qualifier parameter_declarator parameter_declarator type_qualifier parameter_type_specifier parameter_type_specifier parameter_type_specifier : type_specifier init_declarator_list : single_declaration init_declarator_list COMMA IDENTIFIER init_declarator_list COMMA IDENTIFIER array_specifier init_declarator_list COMMA IDENTIFIER array_specifier EQUAL initializer init_declarator_list COMMA IDENTIFIER EQUAL initializer single_declaration : fully_specified_type fully_specified_type IDENTIFIER fully_specified_type IDENTIFIER array_specifier fully_specified_type IDENTIFIER array_specifier EQUAL initializer fully_specified_type IDENTIFIER EQUAL initializer fully_specified_type : type_specifier type_qualifier type_specifier invariant_qualifier : INVARIANT interpolation_qualifier : SMOOTH FLAT NOPERSPECTIVE layout_qualifier : LAYOUT LEFT_PAREN layout_qualifier_id_list RIGHT_PAREN layout_qualifier_id_list : layout_qualifier_id layout_qualifier_id_list COMMA layout_qualifier_id layout_qualifier_id : IDENTIFIER IDENTIFIER EQUAL constant_expression SHARED precise_qualifier : PRECISE type_qualifier : single_type_qualifier type_qualifier single_type_qualifier single_type_qualifier : storage_qualifier layout_qualifier precision_qualifier interpolation_qualifier invariant_qualifier precise_qualifier storage_qualifier : CONST IN OUT INOUT CENTROID PATCH SAMPLE UNIFORM BUFFER SHARED COHERENT VOLATILE RESTRICT READONLY WRITEONLY SUBROUTINE SUBROUTINE LEFT_PAREN type_name_list RIGHT_PAREN type_name_list : TYPE_NAME type_name_list COMMA TYPE_NAME type_specifier : type_specifier_nonarray type_specifier_nonarray array_specifier array_specifier : LEFT_BRACKET RIGHT_BRACKET LEFT_BRACKET constant_expression RIGHT_BRACKET array_specifier LEFT_BRACKET RIGHT_BRACKET array_specifier LEFT_BRACKET constant_expression RIGHT_BRACKET type_specifier_nonarray : VOID FLOAT DOUBLE INT UINT BOOL VEC2 VEC3 VEC4 DVEC2 DVEC3 
DVEC4 BVEC2 BVEC3 BVEC4 IVEC2 IVEC3 IVEC4 UVEC2 UVEC3 UVEC4 MAT2 MAT3 MAT4 MAT2X2 MAT2X3 MAT2X4 MAT3X2 MAT3X3 MAT3X4 MAT4X2 MAT4X3 MAT4X4 DMAT2 DMAT3 DMAT4 DMAT2X2 DMAT2X3 DMAT2X4 DMAT3X2 DMAT3X3 DMAT3X4 DMAT4X2 DMAT4X3 DMAT4X4 ATOMIC_UINT SAMPLER2D SAMPLER3D SAMPLERCUBE SAMPLER2DSHADOW SAMPLERCUBESHADOW SAMPLER2DARRAY SAMPLER2DARRAYSHADOW SAMPLERCUBEARRAY SAMPLERCUBEARRAYSHADOW ISAMPLER2D ISAMPLER3D ISAMPLERCUBE ISAMPLER2DARRAY ISAMPLERCUBEARRAY USAMPLER2D USAMPLER3D USAMPLERCUBE USAMPLER2DARRAY USAMPLERCUBEARRAY SAMPLER1D SAMPLER1DSHADOW SAMPLER1DARRAY SAMPLER1DARRAYSHADOW ISAMPLER1D ISAMPLER1DARRAY USAMPLER1D USAMPLER1DARRAY SAMPLER2DRECT SAMPLER2DRECTSHADOW ISAMPLER2DRECT USAMPLER2DRECT SAMPLERBUFFER ISAMPLERBUFFER USAMPLERBUFFER SAMPLER2DMS ISAMPLER2DMS USAMPLER2DMS SAMPLER2DMSARRAY ISAMPLER2DMSARRAY USAMPLER2DMSARRAY IMAGE2D IIMAGE2D UIMAGE2D IMAGE3D IIMAGE3D UIMAGE3D IMAGECUBE IIMAGECUBE UIMAGECUBE IMAGEBUFFER IIMAGEBUFFER UIMAGEBUFFER IMAGE1D IIMAGE1D UIMAGE1D IMAGE1DARRAY IIMAGE1DARRAY UIMAGE1DARRAY IMAGE2DRECT IIMAGE2DRECT UIMAGE2DRECT IMAGE2DARRAY IIMAGE2DARRAY UIMAGE2DARRAY IMAGECUBEARRAY IIMAGECUBEARRAY UIMAGECUBEARRAY IMAGE2DMS IIMAGE2DMS UIMAGE2DMS IMAGE2DMSARRAY IIMAGE2DMSARRAY UIMAGE2DMSARRAY struct_specifier TYPE_NAME precision_qualifier : HIGH_PRECISION MEDIUM_PRECISION LOW_PRECISION struct_specifier : STRUCT IDENTIFIER LEFT_BRACE struct_declaration_list RIGHT_BRACE STRUCT LEFT_BRACE struct_declaration_list RIGHT_BRACE struct_declaration_list : struct_declaration struct_declaration_list struct_declaration struct_declaration : type_specifier struct_declarator_list SEMICOLON type_qualifier type_specifier struct_declarator_list SEMICOLON struct_declarator_list : struct_declarator struct_declarator_list COMMA struct_declarator struct_declarator : IDENTIFIER IDENTIFIER array_specifier initializer : assignment_expression LEFT_BRACE initializer_list RIGHT_BRACE LEFT_BRACE initializer_list COMMA RIGHT_BRACE initializer_list : initializer initializer_list 
COMMA initializer declaration_statement : declaration statement : compound_statement simple_statement simple_statement : declaration_statement expression_statement selection_statement switch_statement case_label iteration_statement jump_statement compound_statement : LEFT_BRACE RIGHT_BRACE LEFT_BRACE statement_list RIGHT_BRACE statement_no_new_scope : compound_statement_no_new_scope simple_statement compound_statement_no_new_scope : LEFT_BRACE RIGHT_BRACE LEFT_BRACE statement_list RIGHT_BRACE statement_list : statement statement_list statement expression_statement : SEMICOLON expression SEMICOLON selection_statement : IF LEFT_PAREN expression RIGHT_PAREN selection_rest_statement selection_rest_statement : statement ELSE statement statement condition : expression fully_specified_type IDENTIFIER EQUAL initializer switch_statement : SWITCH LEFT_PAREN expression RIGHT_PAREN LEFT_BRACE switch_statement_list RIGHT_BRACE switch_statement_list : /* nothing */ statement_list case_label : CASE expression COLON DEFAULT COLON iteration_statement : WHILE LEFT_PAREN condition RIGHT_PAREN statement_no_new_scope DO statement WHILE LEFT_PAREN expression RIGHT_PAREN SEMICOLON FOR LEFT_PAREN for_init_statement for_rest_statement RIGHT_PAREN statement_no_new_scope for_init_statement : expression_statement declaration_statement conditionopt : condition /* empty */ for_rest_statement : conditionopt SEMICOLON conditionopt SEMICOLON expression jump_statement : CONTINUE SEMICOLON BREAK SEMICOLON RETURN SEMICOLON RETURN expression SEMICOLON DISCARD SEMICOLON // Fragment shader only translation_unit : external_declaration translation_unit external_declaration external_declaration : function_definition declaration SEMICOLON function_definition : function_prototype compound_statement_no_new_scope

GLSLangSpec.4.60.lark_ebnf

variable_identifier : IDENTIFIER primary_expression : variable_identifier | INTCONSTANT | UINTCONSTANT | FLOATCONSTANT | BOOLCONSTANT | DOUBLECONSTANT | LEFT_PAREN expression RIGHT_PAREN postfix_expression : primary_expression | postfix_expression LEFT_BRACKET integer_expression RIGHT_BRACKET | function_call | postfix_expression DOT FIELD_SELECTION | postfix_expression INC_OP | postfix_expression DEC_OP integer_expression : expression function_call : function_call_or_method function_call_or_method : function_call_generic function_call_generic : function_call_header_with_parameters RIGHT_PAREN | function_call_header_no_parameters RIGHT_PAREN function_call_header_no_parameters : function_call_header VOID | function_call_header function_call_header_with_parameters : function_call_header assignment_expression | function_call_header_with_parameters COMMA assignment_expression function_call_header : function_identifier LEFT_PAREN function_identifier : type_specifier | postfix_expression unary_expression : postfix_expression | INC_OP unary_expression | DEC_OP unary_expression | unary_operator unary_expression unary_operator : PLUS | DASH | BANG | TILDE multiplicative_expression : unary_expression | multiplicative_expression STAR unary_expression | multiplicative_expression SLASH unary_expression | multiplicative_expression PERCENT unary_expression additive_expression : multiplicative_expression | additive_expression PLUS multiplicative_expression | additive_expression DASH multiplicative_expression shift_expression : additive_expression | shift_expression LEFT_OP additive_expression | shift_expression RIGHT_OP additive_expression relational_expression : shift_expression | relational_expression LEFT_ANGLE shift_expression | relational_expression RIGHT_ANGLE shift_expression | relational_expression LE_OP shift_expression | relational_expression GE_OP shift_expression equality_expression : relational_expression | equality_expression EQ_OP relational_expression | 
equality_expression NE_OP relational_expression and_expression : equality_expression | and_expression AMPERSAND equality_expression exclusive_or_expression : and_expression | exclusive_or_expression CARET and_expression inclusive_or_expression : exclusive_or_expression | inclusive_or_expression VERTICAL_BAR exclusive_or_expression logical_and_expression : inclusive_or_expression | logical_and_expression AND_OP inclusive_or_expression logical_xor_expression : logical_and_expression | logical_xor_expression XOR_OP logical_and_expression logical_or_expression : logical_xor_expression | logical_or_expression OR_OP logical_xor_expression conditional_expression : logical_or_expression | logical_or_expression QUESTION expression COLON assignment_expression assignment_expression : conditional_expression | unary_expression assignment_operator assignment_expression assignment_operator : EQUAL | MUL_ASSIGN | DIV_ASSIGN | MOD_ASSIGN | ADD_ASSIGN | SUB_ASSIGN | LEFT_ASSIGN | RIGHT_ASSIGN | AND_ASSIGN | XOR_ASSIGN | OR_ASSIGN expression : assignment_expression | expression COMMA assignment_expression constant_expression : conditional_expression declaration : function_prototype SEMICOLON | init_declarator_list SEMICOLON | PRECISION precision_qualifier type_specifier SEMICOLON | type_qualifier IDENTIFIER LEFT_BRACE struct_declaration_list RIGHT_BRACE SEMICOLON | type_qualifier IDENTIFIER LEFT_BRACE struct_declaration_list RIGHT_BRACE IDENTIFIER | SEMICOLON | type_qualifier IDENTIFIER LEFT_BRACE struct_declaration_list RIGHT_BRACE IDENTIFIER | array_specifier SEMICOLON | type_qualifier SEMICOLON | type_qualifier IDENTIFIER SEMICOLON | type_qualifier IDENTIFIER identifier_list SEMICOLON identifier_list : COMMA IDENTIFIER | identifier_list COMMA IDENTIFIER function_prototype : function_declarator RIGHT_PAREN function_declarator : function_header | function_header_with_parameters function_header_with_parameters : function_header parameter_declaration | function_header_with_parameters 
COMMA parameter_declaration function_header : fully_specified_type IDENTIFIER LEFT_PAREN parameter_declarator : type_specifier IDENTIFIER | type_specifier IDENTIFIER array_specifier parameter_declaration : type_qualifier parameter_declarator | parameter_declarator | type_qualifier parameter_type_specifier | parameter_type_specifier parameter_type_specifier : type_specifier init_declarator_list : single_declaration | init_declarator_list COMMA IDENTIFIER | init_declarator_list COMMA IDENTIFIER array_specifier | init_declarator_list COMMA IDENTIFIER array_specifier EQUAL initializer | init_declarator_list COMMA IDENTIFIER EQUAL initializer single_declaration : fully_specified_type | fully_specified_type IDENTIFIER | fully_specified_type IDENTIFIER array_specifier | fully_specified_type IDENTIFIER array_specifier EQUAL initializer | fully_specified_type IDENTIFIER EQUAL initializer fully_specified_type : type_specifier | type_qualifier type_specifier invariant_qualifier : INVARIANT interpolation_qualifier : SMOOTH | FLAT | NOPERSPECTIVE layout_qualifier : LAYOUT LEFT_PAREN layout_qualifier_id_list RIGHT_PAREN layout_qualifier_id_list : layout_qualifier_id | layout_qualifier_id_list COMMA layout_qualifier_id layout_qualifier_id : IDENTIFIER | IDENTIFIER EQUAL constant_expression | SHARED precise_qualifier : PRECISE type_qualifier : single_type_qualifier | type_qualifier single_type_qualifier single_type_qualifier : storage_qualifier | layout_qualifier | precision_qualifier | interpolation_qualifier | invariant_qualifier | precise_qualifier storage_qualifier : CONST | IN | OUT | INOUT | CENTROID | PATCH | SAMPLE | UNIFORM | BUFFER | SHARED | COHERENT | VOLATILE | RESTRICT | READONLY | WRITEONLY | SUBROUTINE | SUBROUTINE LEFT_PAREN type_name_list RIGHT_PAREN type_name_list : TYPE_NAME | type_name_list COMMA TYPE_NAME type_specifier : type_specifier_nonarray | type_specifier_nonarray array_specifier array_specifier : LEFT_BRACKET RIGHT_BRACKET | LEFT_BRACKET 
constant_expression RIGHT_BRACKET | array_specifier LEFT_BRACKET RIGHT_BRACKET | array_specifier LEFT_BRACKET constant_expression RIGHT_BRACKET type_specifier_nonarray : VOID | FLOAT | DOUBLE | INT | UINT | BOOL | VEC2 | VEC3 | VEC4 | DVEC2 | DVEC3 | DVEC4 | BVEC2 | BVEC3 | BVEC4 | IVEC2 | IVEC3 | IVEC4 | UVEC2 | UVEC3 | UVEC4 | MAT2 | MAT3 | MAT4 | MAT2X2 | MAT2X3 | MAT2X4 | MAT3X2 | MAT3X3 | MAT3X4 | MAT4X2 | MAT4X3 | MAT4X4 | DMAT2 | DMAT3 | DMAT4 | DMAT2X2 | DMAT2X3 | DMAT2X4 | DMAT3X2 | DMAT3X3 | DMAT3X4 | DMAT4X2 | DMAT4X3 | DMAT4X4 | ATOMIC_UINT | SAMPLER2D | SAMPLER3D | SAMPLERCUBE | SAMPLER2DSHADOW | SAMPLERCUBESHADOW | SAMPLER2DARRAY | SAMPLER2DARRAYSHADOW | SAMPLERCUBEARRAY | SAMPLERCUBEARRAYSHADOW | ISAMPLER2D | ISAMPLER3D | ISAMPLERCUBE | ISAMPLER2DARRAY | ISAMPLERCUBEARRAY | USAMPLER2D | USAMPLER3D | USAMPLERCUBE | USAMPLER2DARRAY | USAMPLERCUBEARRAY | SAMPLER1D | SAMPLER1DSHADOW | SAMPLER1DARRAY | SAMPLER1DARRAYSHADOW | ISAMPLER1D | ISAMPLER1DARRAY | USAMPLER1D | USAMPLER1DARRAY | SAMPLER2DRECT | SAMPLER2DRECTSHADOW | ISAMPLER2DRECT | USAMPLER2DRECT | SAMPLERBUFFER | ISAMPLERBUFFER | USAMPLERBUFFER | SAMPLER2DMS | ISAMPLER2DMS | USAMPLER2DMS | SAMPLER2DMSARRAY | ISAMPLER2DMSARRAY | USAMPLER2DMSARRAY | IMAGE2D | IIMAGE2D | UIMAGE2D | IMAGE3D | IIMAGE3D | UIMAGE3D | IMAGECUBE | IIMAGECUBE | UIMAGECUBE | IMAGEBUFFER | IIMAGEBUFFER | UIMAGEBUFFER | IMAGE1D | IIMAGE1D | UIMAGE1D | IMAGE1DARRAY | IIMAGE1DARRAY | UIMAGE1DARRAY | IMAGE2DRECT | IIMAGE2DRECT | UIMAGE2DRECT | IMAGE2DARRAY | IIMAGE2DARRAY | UIMAGE2DARRAY | IMAGECUBEARRAY | IIMAGECUBEARRAY | UIMAGECUBEARRAY | IMAGE2DMS | IIMAGE2DMS | UIMAGE2DMS | IMAGE2DMSARRAY | IIMAGE2DMSARRAY | UIMAGE2DMSARRAY | struct_specifier | TYPE_NAME precision_qualifier : HIGH_PRECISION | MEDIUM_PRECISION | LOW_PRECISION struct_specifier : STRUCT IDENTIFIER LEFT_BRACE struct_declaration_list RIGHT_BRACE | STRUCT LEFT_BRACE struct_declaration_list RIGHT_BRACE struct_declaration_list : struct_declaration | 
struct_declaration_list struct_declaration struct_declaration : type_specifier struct_declarator_list SEMICOLON | type_qualifier type_specifier struct_declarator_list SEMICOLON struct_declarator_list : struct_declarator | struct_declarator_list COMMA struct_declarator struct_declarator : IDENTIFIER | IDENTIFIER array_specifier initializer : assignment_expression | LEFT_BRACE initializer_list RIGHT_BRACE | LEFT_BRACE initializer_list COMMA RIGHT_BRACE initializer_list : initializer | initializer_list COMMA initializer declaration_statement : declaration statement : compound_statement | simple_statement simple_statement : declaration_statement | expression_statement | selection_statement | switch_statement | case_label | iteration_statement | jump_statement compound_statement : LEFT_BRACE RIGHT_BRACE | LEFT_BRACE statement_list RIGHT_BRACE statement_no_new_scope : compound_statement_no_new_scope | simple_statement compound_statement_no_new_scope : LEFT_BRACE RIGHT_BRACE | LEFT_BRACE statement_list RIGHT_BRACE statement_list : statement | statement_list statement expression_statement : SEMICOLON | expression SEMICOLON selection_statement : IF LEFT_PAREN expression RIGHT_PAREN selection_rest_statement selection_rest_statement : statement ELSE statement | statement condition : expression | fully_specified_type IDENTIFIER EQUAL initializer switch_statement : SWITCH LEFT_PAREN expression RIGHT_PAREN LEFT_BRACE switch_statement_list | RIGHT_BRACE switch_statement_list : /* nothing */ | statement_list case_label : CASE expression COLON | DEFAULT COLON iteration_statement : WHILE LEFT_PAREN condition RIGHT_PAREN statement_no_new_scope | DO statement WHILE LEFT_PAREN expression RIGHT_PAREN SEMICOLON | FOR LEFT_PAREN for_init_statement for_rest_statement RIGHT_PAREN statement_no_new_scope for_init_statement : expression_statement | declaration_statement conditionopt : condition | /* empty */ for_rest_statement : conditionopt SEMICOLON | conditionopt SEMICOLON expression 
jump_statement : CONTINUE SEMICOLON | BREAK SEMICOLON | RETURN SEMICOLON | RETURN expression SEMICOLON | DISCARD SEMICOLON // Fragment shader only translation_unit : external_declaration | translation_unit external_declaration external_declaration : function_definition | declaration | SEMICOLON function_definition : function_prototype compound_statement_no_new_scope

Now, you tell me, which one is easier to understand and work with (improving, tweaking, ...)?

You can also see a ST side by side comparison here

But I'm worried that it's not clear enough that each line is a different option.

@erezsh : Well, my background is many years of coding in C/C++, and when I started using Python many years ago I thought I wouldn't survive without braces... nowadays it's the other way around: each time I see redundant tokens like braces I become sad ;)

@whitten : Just for the record, I think that particular syntax is the worst of the whole set of proposals (worst = least elegant and least clean)

NS: I haven't checked correctness (just a fast adaptation), so not sure if you'd transform this:

conditionopt :
    condition
    /* empty */

like:

conditionopt : condition
    | /* empty */

nor this:

switch_statement_list :
    /* nothing */
    statement_list

like:

switch_statement_list : /* nothing */ 
   | statement_list

guess that's just wrong syntax...
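One possible reading of those two cases, sketched under assumptions: since "[x]" marks x as optional (as whitten noted above), a list of alternatives that contains an "empty" marker could be rewritten as a bracketed optional instead of a commented-out branch. The helper `fold_empty` and the set of marker strings are invented for this example; nothing here is Lark API.

```python
# Spellings of "no input" that appear in the GLSL excerpt above (assumed list).
EMPTY_MARKERS = {"/* empty */", "/* nothing */"}


def fold_empty(name, alternatives):
    """Build a rule line, folding an 'empty' alternative into [ ... ]."""
    cleaned = [a.strip() for a in alternatives]
    real = [a for a in cleaned if a not in EMPTY_MARKERS]
    has_empty = len(real) != len(cleaned)
    body = " | ".join(real)
    if has_empty:
        body = f"[{body}]"  # optional: matches one of the alternatives or nothing
    return f"{name}: {body}"


print(fold_empty("conditionopt", ["condition", "/* empty */"]))
print(fold_empty("switch_statement_list", ["/* nothing */", "statement_list"]))
```

This keeps the "matches nothing" case explicit in the output without relying on a comment to carry meaning.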

brupelo commented 6 years ago

Ideally, you'd be able to express the same rule in different ways, for instance:

flow_stmt1: break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt

flow_stmt2: 
    break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt

flow_stmt3: 
    break_stmt
    continue_stmt
    return_stmt
    raise_stmt
    yield_stmt

flow_stmt1==flow_stmt2==flow_stmt3.
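To make that equivalence concrete, here is a small sketch (plain Python, no Lark involved) in which a line break and a `|` are both treated as separators between alternatives. The function name `alternatives` is hypothetical, and the splitting is naive: it would break on a `|` inside a string literal and cannot express an empty alternative.

```python
def alternatives(body):
    """Split a rule body into alternatives, treating '|' and line breaks alike."""
    parts = []
    for line in body.splitlines():
        parts.extend(piece.strip() for piece in line.split("|"))
    return [p for p in parts if p]  # drop blanks left by empty lines or a leading '|'


body1 = "break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt"
body2 = """
    break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt
"""
body3 = """
    break_stmt
    continue_stmt
    return_stmt
    raise_stmt
    yield_stmt
"""
print(alternatives(body1) == alternatives(body2) == alternatives(body3))
```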

The idea here is to give users the freedom to write small, hacky, compact grammars if they want to, or cleaner ones (even if a little more verbose).

erezsh commented 6 years ago

Yep, that's what I was planning.

marxsk commented 4 years ago

I suggest one more way to write multiple right-hand sides of a rule.

foo: bar1 | bar2

foo: bar1
foo: bar2

The main reason for such a format (currently it is rejected because of the duplicate left-hand side) is that you can easily add a comment with an explanation to every line. I already have a (naive) preprocessor for that.
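For readers wondering what such a preprocessor could look like, here is a minimal sketch. It is not marxsk's actual code; the function name `merge_duplicate_rules` is invented, and it assumes one `name: body` definition per line with optional trailing `//` comments.

```python
def merge_duplicate_rules(text):
    """Merge repeated 'name: body' lines into a single 'name: a | b' rule,
    keeping rules in first-seen order."""
    bodies = {}  # rule name -> list of right-hand sides (dicts keep insertion order)
    for line in text.splitlines():
        # Drop a trailing comment. Naive: a '//' inside a string literal would be cut too.
        line = line.split("//", 1)[0].strip()
        if not line:
            continue
        name, sep, body = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'name: body', got {line!r}")
        bodies.setdefault(name.strip(), []).append(body.strip())
    return "\n".join(f"{name}: " + " | ".join(alts) for name, alts in bodies.items())


source = """
foo: bar1   // explanation for the first case
foo: bar2   // explanation for the second case
"""
print(merge_duplicate_rules(source))
```

The comments stay in the source file for human readers, and only the merged output would be handed to the parser.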

erezsh commented 4 years ago

@marxsk Line comments are already possible


>>> from lark import Lark
>>> p=Lark("""
...             // Comment
...     start: "a"
...             // Another comment
...          | "b"
...             // And C
...          | "c"
... """)
>>>

gideongrinberg commented 3 years ago

Could the lark grammar be reimplemented in Lark?

MegaIng commented 3 years ago

@gideongrinberg It kinda is. We have a mirror that should match exactly what the actual parser accepts: lark.lark, but that is not what is being used internally. That is still being parsed with the lalr parser, but the rules are encoded here

erezsh commented 3 years ago

I just had a crazy thought. What if Lark accepted a grammar_grammar argument, which would describe the syntax for the grammar? (it should work as long as the structure is the same as lark.lark)

MegaIng commented 3 years ago

@erezsh That actually sounds like a good idea. I would suggest that the argument takes a Lark instance (i.e. something that has a .parse(str) -> Tree method). This would allow the grammar to use a Transformer to fix things that don't exactly match between EBNF syntaxes. This system would allow us to easily use EBNF grammars in a different syntax. (The current built-in grammar would have to stay, of course.)

erezsh commented 3 years ago

@MegaIng It would be interesting to try! Though it seems like all it needs is a single function, parse(str) -> Tree.

So passing grammar_parser = Lark.open('lark.lark', ...).parse will have no effect, other than extra processing.

It will require a bit of work, because currently there is a deviation of structure between lark.lark and the native grammar loader.

ThatXliner commented 3 years ago

I just had a crazy thought. What if Lark accepted a grammar_grammar argument, which would describe the syntax for the grammar? (it should work as long as the structure is the same as lark.lark)

We need standards. That's probably a bad idea in case I want to read other people's code.

BUT it could be a good idea because it allows people who don't know Lark's grammar, but do know something like ANTLR, to write a grammar. We could have a library of possible default grammar-grammar-grammars.

erezsh commented 3 years ago

@ThatXliner Well, if they produce a Tree that corresponds to lark.lark, I imagine the reconstructor should be able to automatically generate a working lark grammar. In theory, at least.

But I agree that it might become confusing if suddenly everyone used their own syntax.

julie777 commented 2 years ago

I'm concerned that the nonterminal : | terminal | nonterminal ;

Since "[someterminal]" means that the someterminal is optional, perhaps if we use the syntax [] to mean an optional "empty".

I also think that at times it is easier to write the grammar with rules where one of the alternatives is "empty". However, I would much prefer that common.lark declare an EMPTY terminal. It could even be managed as a special case when processing a grammar. (I am pretty sure that EMPTY: "" won't work.)

julie777 commented 2 years ago

@brupelo Just my 2 cents about the format of the alternatives that I haven't seen mentioned explicitly.

rule: SOME_TERMINAL some_rule some_other_rule ANOTHER_TERMINAL
        | TERMINAL3

is actually implying parentheses by using the line break. The above is actually:

rule: (SOME_TERMINAL some_rule some_other_rule ANOTHER_TERMINAL)     | TERMINAL3

I can understand the thinking that following the rule declaration having the rest of the line empty means that the definition of the rule is an indented block. This is very YAML-like.

Note: I really don't like the ANTLR : | ; notation, and I used ANTLR before it was converted to Java.

Note: about notation, regardless of the multiline format used to define a rule I think it is imperative that the colon always be used to indicate the preceding string is either a rule or terminal.

If it is true that currently having nothing on the line after a rule declaration is an error then allowing the

grammar = '''
    rule : 
        rule1
        rule2
'''

format could be okay. I think that being explicit about what the alternatives are

   rule:
            | rule1
            | rule2

matches the YAML list construct, where each list item starts with "- "; in this case a rule alternative would start with "| ". To me that has a pretty clear meaning: each alternative is listed on a single line (which implies parentheses) and has a preceding indent followed by the alternative tag "| ". It would be one more extension to EBNF that Lark adds.

If the above was added along with a predefined EMPTY terminal and documentation reflected the use of EMPTY when that is what you mean then having nothing after the colon on the line shouldn't be confusing.

Now if Lark supported inline rule definitions (even more like YAML) I would not be happy.

// this would be very bad   :-)
rule:
    | rule2: TERMINAL1 TERMINAL2
    | rule3: TERMINAL3 rule2