let-def / lrgrep

Menhir polishing toolbox, for experienced druids
25 stars 2 forks source link

Lrgrep alpha v3 #8

Closed let-def closed 1 year ago

let-def commented 1 year ago

This is the third redesign of the approach. There are a few fundamental changes here, most notably in the pattern language.

A summary of the syntax, in pseudo-BNF:

expr ::=
  | symbol               -- match a symbol
  | var '=' symbol       -- match a symbol and bind its semantic value and location to a variable
  | '(' expr ')'         -- parenthesis to disambiguate precedence :)
  | expr_1 ';' expr_2    -- match if expr_1 matches then expr_2 matches
  | expr_1 '|' expr_2    -- disjunction, match if either expr_1 or expr_2 match
  | expr '*'             -- kleene star, match expr zero or more times
  | '[' expr ']'         -- simulate reductions
  | var = '[' expr ']'   -- simulate reductions, bind its location to a variable
  | expr '/' filter      -- restrict matches to certain items

  -- greedy variants
  | expr '**'            -- greedy kleene star
  | '[[' expr ']]'       -- greedy reductions

symbol ::=
  | TERMINAL             -- a terminal from the grammar
  | nonterminal          -- a nonterminal from the grammar
  | '_'                  -- wildcard, any symbol

filter ::=
  | (lhs:nonterminal ':')? prefix:symbol* '.' suffix:symbol*
     -- a filter is a list of states that will restrict the matches
     -- a state is accepted if its itemset contains at least one item fitting
     -- the description. The description is interpreted "loosely":
     -- lhs is optional, prefix and suffix are anchored only around the 'dot'

clause ::=
  expr                              -- pattern that should match for the rule to trigger
  ('@' terminal ('|' 'terminal')*)  -- if provided, restrict matches to those lookahead symbols
  'partial'?                        -- if provided, the semantic action should return an option
                                    -- returning None causes the matcher to fall
                                    -- back to the next matching rule
  '{' ocaml code '}'                -- semantic action to execute when the rule match

rule ::=
  'rule' name names* '=' 'parse' 'error' -- declare a named rule, with a list of arguments
  ('|' clause) *                         -- the ordered list of clauses that should be matched

file ::=
  '{' ocaml code '} rule* '{' ocaml code '}'
  -- a file is composed of a list of rules with header and footer codes

The main differences with the previous version are that the reductions are now delimited, we distinguish between greedy and non-greedy variants of recursive constructions, and filters allow to restrict a set of matches without consuming anything by themselves.

[...] was used to introduce items and has been repurposed to trigger reductions!

Reductions

In the previous version, reductions were introduced with ! which was trying to reduce its left-handside as far as possible. Now the target of the reduction should be entirely specified within square brackets [ ... ]. For instance, ...foo... expr ! was matching anything that reduce to an expr and possibly more, if other matches were found in the prefix. The problem was that we could not tell exactly where the reduction stopped. Since we cannot capture semantic values inside a reduction, it was not possible to tell what could be captured from the syntax. This example would now be writter ...foo... [expr], and it is clear we only reduce expr. Captures are allowed anywhere in ...foo..., and loc=[expr] captures the location of the symbols being reduced.

The error rules for OCaml have been ported to: https://github.com/let-def/lrgrep/blob/lrgrep-v3/ocaml/parse_errors.mlyl

Greediness

Greedy vs non-greedy allow to control the span of the capture: if there are multiple matches, do we want the shortest one (with []) or the longest one (with [[]])? For instance, when working with the arithmetic grammar, and the input a + b + c:

Items

In the previous versions, there were two kinds of atoms:

Now symbols (with wildcards) are the only kind of atom. Items can now be expressed with the / filter construction, that doesn't consume anything on stack but restricts the possible matches.

For instance, the old rule x=[label_declaration: mutable_flag LIDENT . COLON] can now be written x=LIDENT / label_declaration: mutable_flag LIDENT . COLON. On the one hand it is a bit redundant (LIDENT is written twice), but on the other hand, it is clear that only LIDENT is consumed from the stack (and will be bound to x). Furthermore, the filtering semantics compose better with other constructions (one can filter a reduction, a star, etc.).

Misc

Following #7, the DFA is now minimized using Valmari.

let-def commented 1 year ago

I just added support for $startloc(x) and $endloc(x) "keywords" to refer to the location of captured variables. token is also bound to the current lookahead token, located at $startloc(token) and $endloc(token).

In the previous version the lookahead token was not treated specially -- if one cared about it, it had to be passed manually as an extra argument. It is now mandatory to pass the lookahead token when invoking the matcher. This allows to implement a "lookahead constraint" on clauses, see https://github.com/let-def/lrgrep/blob/lrgrep-v3/ocaml/parse_errors.mlyl#L87 for instance. No need to make these actions partial and to match on the token manually anymore.

In the future this will allow to implement a precise coverage checker.

let-def commented 1 year ago

I just rebased and merged into master. Main development will continue there.