microsoft / Trieste

A term rewriting system for experimental programming language development.
MIT License
38 stars 20 forks source link

Semantics of predicate patterns #37

Open EliasC opened 1 year ago

EliasC commented 1 year ago

The pattern matching semantics has some surprising corner cases related to predicate patterns, i.e., patterns that do not consume tokens. I take predicate patterns to mean lookahead (++p and --p) as well as the "positional" patterns In(ty), Start and End.

For example, let's look at what happens when negating predicates with the ! pattern:

!(Start) * T(Foo) >>
  [](Match&) { return NodeDef::create(Bar); }

Running this on foo foo bar results in bar bar, since !(Start) matches on the first foo and consumes it together with the subsequent foo. I would say the expected result here is foo bar bar. If we instead match on T(Bar) * !(Start), nothing happens since ! only matches if there is a term to consume in that position.

More generally, I would expect the negation of a zero-width pattern to consume zero terms. The semantics for negation of patterns longer than one is also unintuitive:

!(T(Foo) * T(Bar)) * T(Bar) >>
  [](Match&) { return NodeDef::create(Baz); }

Running this on foo bar bar currently results in foo baz, whereas running it on foo foo bar does nothing. In the first case, the negation pattern is successfully matched against bar bar and then the first bar is consumed, after which T(Bar) matches on the second bar. In the second case, although foo foo matches the negation pattern, only the first foo is consumed, so the second foo does not match T(Bar). After advancing the cursor foo bar fails to match the negation pattern.

Another corner case is pattern matching the children of a zero-width sequence:

Start >> T(Foo)[Foo] >>
  [](Match& _) { return _(Foo); }

This example currently causes a segfault, since there are no children in the sequence matched by Start. More reasonable would be to fail matching, or possibly to (ideally statically) disallow the pattern since it will never succeed.

A related question is how to interpret matching on the the children of a sequence of more than one term:

(T(Foo) * T(Bar)) << (Any[Any]) >>
  [](Match& _) { return _(Any); }

Currently, Any binds to the first child of whatever comes first in the sequence (here Foo) which might be what we want. Other possible semantics include failing, or letting Any bind to the first child of every node in the parent pattern. If we go for requiring well formedness, we could disallow any <<-pattern where the parent pattern does not match exactly one term.