SquidDev opened this issue 1 year ago
Hi @SquidDev,
Thanks for trying out LRGREP. It is indeed a research project not yet meant for production, and I am not currently working on it, but I plan to resume work early in January.
In your example, there is no store because the engine does not support capture inside reduction yet. The reason is that the reductions are simulated, they are not computed, so there is no semantic value to capture. I see two possible improvements over this situation:
The second case is simpler to implement and should be sufficient to solve your problem. (Alternatively, if you have an explicit token for newlines, you could also just match against it in the action -- this is helpful for certain grammars such as JavaScript, where a semicolon can be inserted when looking ahead at a newline.)
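To make the "explicit newline token" alternative concrete, here is a minimal sketch using a stand-in token type; in a real grammar this would be the generated `Parser_raw.token`, and the `NEWLINE` constructor is an assumption about that grammar:

```ocaml
(* Stand-in token type; NEWLINE is a hypothetical explicit-newline token. *)
type token = NEWLINE | EQUALS | IDENT of string

(* If the lookahead is an explicit newline token, the following tokens start
   a new statement rather than continuing the current one. *)
let starts_new_statement = function
  | NEWLINE -> true
  | _ -> false
```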
--
More generally, LRGREP is still being designed and this is a good opportunity for me to learn how to handle features from other programming languages that I haven't thought about, so thank you very much for this report.
> In your example, there is no store because the engine does not support capture inside reduction yet. The reason is that the reductions are simulated, they are not computed, so there is no semantic value to capture.
Right, that makes sense! I think I was expecting something to have been read from the Menhir stack, and then rejected inside `execute_error_message` due to the match against `Parser_raw.MenhirInterpreter.incoming_symbol st` failing, but it makes sense that lrgrep wouldn't capture that at all.
> this is a good opportunity for me to learn how to handle features from other programming languages that I haven't thought about
I think this is the only snag I've hit so far, otherwise things have worked really nicely!
Thank you so much!
Good to know. I will add the location capture feature in January then.
(In the meantime, Merry Christmas, Happy New Year, etc. 🥳)
I gave a bit more thought to the reduction operator. I will probably significantly change the implementation soon: there are at least three things I would like to revisit.
First, when capturing values, we can see there are two semantics for `!`: longest/latest match and shortest/first match. The two are relevant in different situations, so I will probably add syntax for the two variants. (This is like the greedy and lazy quantifier semantics for POSIX regexes, and indeed, `!` is a bit like a Kleene `*`.) Say `!!` for the longest match and `!` for the shortest one.
Right now we can specify when the reduction starts, but not where it stops. This is problematic because we don't know whether some symbols were matched on the real stack or only inside a reduction. For real symbols we can capture the semantic value; inside the reduction, we can only get the location of the whole reduction.
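The asymmetry can be modelled like this (my own illustration, not lrgrep's actual API): a symbol matched on the real stack carries a semantic value, while a symbol matched inside a simulated reduction only carries a location.

```ocaml
(* Hypothetical modelling of what a match can observe per matched symbol. *)
type 'v matched_symbol =
  | On_stack of 'v * (Lexing.position * Lexing.position)  (* real frame *)
  | Simulated of (Lexing.position * Lexing.position)      (* simulated reduction *)

(* A location is available in both cases... *)
let location_of = function
  | On_stack (_, loc) -> loc
  | Simulated loc -> loc

(* ...but a semantic value only for real stack frames. *)
let value_of = function
  | On_stack (v, _) -> Some v
  | Simulated _ -> None
```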
I am thinking of changing the definition of `!` to take a sub-regex and only match inside it. Captures would not be allowed inside `!`, and a capture of the `!` itself would be the locations.
For instance, your example could be written as:

```
| ([basic_stmt: sep_list1(COMMA,name) . EQUALS])! as names
  partial { (* ... *) }
```

where `names` is bound to a value of type `Lexing.position * Lexing.position`.
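For illustration, a capture of that type could be used in an action like this (a sketch of my own, assuming `names` carries the start and end positions of the reduced region; `describe` is a hypothetical helper):

```ocaml
(* Report where the reduction happened, using the two captured positions. *)
let describe ((start_p, end_p) : Lexing.position * Lexing.position) =
  Printf.sprintf "reduction spans lines %d-%d"
    start_p.Lexing.pos_lnum end_p.Lexing.pos_lnum
```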
I said that `!` would apply to a "sub-regex" without captures, but it is not clear (to me) that this is the right language.
What's left on the stack after a sequence of reductions is always one or more non-terminals.
Maybe the language of "reductions" could be restricted to that. For instance, your example could be written (slightly changing the semantics of items, but that's a topic for another day):

```
| basic_stmt: sep_list1(COMMA,name)! as names . EQUALS
```

That can be read as follows:

- `basic_stmt:` filters matches to those that can reduce to `basic_stmt` (operationally, this doesn't consume any element from the stack, but only rejects stacks that would not permit the reduction);
- `sep_list1(COMMA,name)! as names` matches something that can reduce to `sep_list1(COMMA,name)`, and binds the location to `names`;
- `. EQUALS` filters matches to those with an item with an `EQUALS` on the rhs.

That way we blur the distinction between items and other atoms. This is both a pro and a con, though I think it leads to a simplification/unification of concepts that is beneficial overall.
Your issue in #6 should be covered by that too: the `table_entry:` would have to be put at the right position, but the right prefix would always be matched:

```
| table_entry: expr! partial { ... }
```
I realise everything in this issue is quite involved, and I'm not sure how "production ready" this repository is, so I'm happy if you'd rather just close this :).
I'm currently fiddling with using lrgrep for a Lua parser I'm working on, in an attempt to provide better error messages for some common errors. I've put my current progress on this fork.
One issue I see a lot is people forgetting to put function parentheses on zero-argument function calls, and so I'd like to provide an error for this case:
Here parsing will fail on line 2, with an unexpected identifier. At that point, the interpreter trace for this parse looks like the following:
I'm currently matching this case with the following rule:
However, what I really want to be able to do is determine whether the next token is on the same line (i.e. `x 42`, in which case it's possible the user wanted to write `x(42)` or `x = 42`) or on the next line (i.e. `print\nprint()`, in which case it's probably just a function call). I tried doing this by putting a capture around the LR item, so I could compare positions:

However, while the pattern still matches, `names` is always `None`. This is where my understanding begins to break down a little bit - the bytecode executed doesn't contain any `Store`s, and so I assume nothing is read from Menhir's stack?

Is this something you'd expect to work or, if not, is there an alternative way I could look at implementing such a rule?
Again, apologies for such a verbose question!