let-def / lrgrep

Menhir polishing toolbox, for experienced druids

Trying to understand the reduction/`!` pattern #4

Open SquidDev opened 1 year ago

SquidDev commented 1 year ago

I realise everything in this issue is quite involved, and I'm not sure how "production ready" this repository is, so I'm happy if you'd rather just close this :).


I'm currently fiddling with using lrgrep for a Lua parser I'm working on, in an attempt to provide better error messages for some common errors. I've put my current progress on this fork.

One issue I see a lot is people forgetting to put function parenthesis on zero-argument function calls, and so I'd like to provide an error for this case:

print -- should be print()
print("Hello")

Here parsing will fail on line 2, with an unexpected identifier. At that point, the interpreter trace for this parse looks like the following:

- line 1:0-5 IDENT
   [var: IDENT .]
 ↱ var
   [name: var .]
 ↱ simple_expr
   [name: simple_expr . OSQUARE expr CSQUARE]
   [name: simple_expr . DOT IDENT]
   [call: simple_expr . COLON IDENT call_args]
   [call: simple_expr . call_args]
 ↱ sep_list1(COMMA,name)
   [basic_stmt: sep_list1(COMMA,name) . EQUALS sep_list1(COMMA,expr)]
 ↱ name
   [simple_expr: name .]
   [sep_list1(COMMA,name): name . COMMA sep_list1(COMMA,name)]
   [sep_list1(COMMA,name): name .]
- entrypoint program

I'm currently matching this case with the following rule:

| [basic_stmt: sep_list1(COMMA,name) . EQUALS]; !
  partial { (* ... *) }

However, what I really want is to determine whether the next token is on the same line (e.g. x 42, in which case it's possible the user meant to write x(42) or x = 42) or on the next line (e.g. print\nprint(), in which case it's probably just a function call). I tried doing this by putting a capture around the LR item, so that I could compare positions:

| ([basic_stmt: sep_list1(COMMA,name) . EQUALS] as names); !
  partial { (* ... *) }

However, while the pattern still matches, names is always None. This is where my understanding begins to break down a little: the bytecode executed doesn't contain any Stores, so I assume nothing is read from Menhir's stack?

Is this something you'd expect to work or, if not, is there an alternative way I could look at implementing such a rule?
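For context, the check I'm hoping to write once positions are available is just a line-number comparison. A sketch of it, where same_line is a hypothetical helper of mine (not part of lrgrep or the generated parser):

```ocaml
(* Sketch: decide whether the offending token starts on the same line as
   the end of the captured name. Assumes both Lexing.position values are
   somehow available; [same_line] is a hypothetical helper. *)
let same_line (name_end : Lexing.position) (tok_start : Lexing.position) =
  name_end.Lexing.pos_lnum = tok_start.Lexing.pos_lnum

(* Pick a message: same line suggests a botched call/assignment, a new
   line suggests a missing () on the previous statement. *)
let message name_end tok_start =
  if same_line name_end tok_start then
    "Unexpected expression after name; did you mean to call it or assign to it?"
  else
    "Unexpected identifier; maybe the previous function call is missing ()?"
```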

Again, apologies for such a verbose question!

let-def commented 1 year ago

Hi @SquidDev,

Thanks for trying out LRGREP. It is indeed a research project not yet meant for production, and I am not currently working on it but I plan to resume work early in January.

In your example, there is no store because the engine does not support capture inside reduction yet. The reason is that the reductions are simulated, they are not computed, so there is no semantic value to capture. I see two possible improvements over this situation:

The second case is simpler to implement and should be sufficient to solve your problem. (Alternatively, if you have an explicit token for newlines, you could also just match against it in the action -- this is helpful for grammars such as JavaScript, where a semicolon can be inserted if necessary when the lookahead is a newline.)

--

More generally, LRGREP is still being designed, and this is a good opportunity for me to learn how to handle features from other programming languages that I hadn't thought about, so thank you very much for this report.

SquidDev commented 1 year ago

In your example, there is no store because the engine does not support capture inside reduction yet. The reason is that the reductions are simulated, they are not computed, so there is no semantic value to capture.

Right, that makes sense! I think I was expecting something to have been read from the Menhir stack, and then rejected inside execute_error_message due to the match against Parser_raw.MenhirInterpreter.incoming_symbol st failing, but it makes sense that lrgrep wouldn't capture that at all.

this is a good opportunity for me to learn how to handle features from other programming languages that I haven't think about

I think this is the only snag I've hit so far, otherwise things have worked really nicely!

Thank you so much!

let-def commented 1 year ago

Good to know. I will add the location capture feature in January then.

(In the meantime, merry christmas, happy new year, etc 🥳 )

let-def commented 1 year ago

I gave a bit more thought to the reduction operator. I will probably significantly change the implementation soon: there are at least three things I would like to revisit.

1) Greedy and lazy !

First, when capturing values, we can see there are two semantics for !: longest/latest match and shortest/first match. The two are relevant in different situations, so I will probably add syntax for both variants. (This is like the greedy and lazy quantifier semantics of POSIX regexes; indeed, ! is a bit like a Kleene-*.)

Say !! for longest match and ! for shortest one.
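Using your earlier rule, the two variants would read something like this (a sketch only, since the syntax is not settled):

| [basic_stmt: sep_list1(COMMA,name) . EQUALS]; !!  (* greedy: simulate as many reductions as possible *)
  partial { (* ... *) }
| [basic_stmt: sep_list1(COMMA,name) . EQUALS]; !   (* lazy: stop at the first matching reduction *)
  partial { (* ... *) }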

2) Specifying the scope of the reduction (and capturing the locations)

Right now we can specify when the reduction starts, but not where it stops. This is problematic because we don't know whether some symbols matched on the real stack or only inside a simulated reduction. For real symbols we can capture the semantic value; inside the reduction, we can only get the location of the whole reduction.

I am thinking of changing the definition of ! to take a sub-regex and only match inside it. Captures would not be allowed inside !, and a capture of the ! itself would yield the locations.

For instance, your example could be written as:

| ([basic_stmt: sep_list1(COMMA,name) . EQUALS])! as names
  partial { (* ... *) }

where names is bound to a value of type Lexing.position * Lexing.position.
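In the action, the pair could then be used directly. A sketch, assuming names has that type (column_of and describe_names are hypothetical helpers, not part of lrgrep):

```ocaml
(* Sketch: formatting a diagnostic from the captured location pair,
   assuming [names : Lexing.position * Lexing.position] as described. *)
let column_of (p : Lexing.position) = p.Lexing.pos_cnum - p.Lexing.pos_bol

let describe_names ((start_p, end_p) : Lexing.position * Lexing.position) =
  Printf.sprintf "names span line %d, columns %d-%d"
    start_p.Lexing.pos_lnum (column_of start_p) (column_of end_p)
```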

3) Characterize what is inside the reduction

I said that ! would apply to a "sub-regex" without captures, but it is not clear (to me) that this is the right language. What's left on the stack after a sequence of reductions is always one or more non-terminals.

Maybe the language of "reductions" could be restricted to that. For instance, your example could be written (slightly changing the semantics of items, but that's a topic for another day):

 | basic_stmt: sep_list1(COMMA,name)! as names . EQUALS

That can be read as follows:

That way we blur the distinction between items and other atoms. This is both a pro and a con, though I think it leads to a simplification/unification of concepts that is beneficial overall. Your issue in #6 should be covered by that too: the table_entry: would have to be put at the right position, but the right prefix would always be matched:

| table_entry: expr! partial { ... }