hackworthltd / primer

A pedagogical functional programming language.
GNU Affero General Public License v3.0
14 stars 1 forks source link

FR: Pattern matching wildcards and actions #996

Closed brprice closed 10 months ago

brprice commented 1 year ago

Description

Currently pattern matches/case statements are

We would like to relax these restrictions. In particular the "no primitives" one: we definitely want to be able to pattern match on primitives (rather than have to build an if-tree with a primitive eqInt etc). The "maximal" one would be nice (and necessary) to relax, at least to allow a final "catch all" pattern.

We do not at this time propose to allow deep matching. This is both to keep this FR small and because it is not yet clear whether it would be pedagogically better to allow the more expressive but more complex construct. However, we do talk about deep matching a bit, to contrast it and to illuminate some corner cases.

Note that we will (or at least, may want to) lose the property of "order independent" if we allow wildcards (especially with deep matching): case x of 1 -> True; _ -> False is not the same as case x of _ -> True; 1 -> False (in Haskell -- see below for options here).

Dependencies

None

Spec

Note that for clarity in this spec I will talk about ADTs, and not primitives, everything applies with minor modifications to primitives also.

Simple wildcards

(Abstract) Syntax

The only part of the ast (i.e. the Expr' and Type' definitions) that will change is the auxillary CaseBranch. Recall (eliding metadata), Expr = ... | Case Expr [CaseBranch], data CaseBranch = Branch ValConName [LVarName] Expr, where Case e [Branch C1 [x,y] t1, Branch C2 [] t2] means roughly the same as the Haskell case e of C1 x y -> t1; C2 -> t2. The variables x, y listed in a branch are binders, and scope over the expression t1 in that branch.

The first part of this proposal is to add an optional fallback branch (binding no variables), essentially Expr = ... | Case Expr [CaseBranch] (Maybe Expr). This extra branch Case e brs (Just r) is a "wildcard" match, similar to case e of ... ; _ -> r.

Typing rules

The rule for a case without the extra branch is as before:

e ∈ S a  (any number of parameters)
S is a type constructor
[Ci,...] is a list of all the constructors of the type `S`,
For each constructor S p = ... | Ci Aij{p} we have
  xij : Aij{a}, ... ⊢ T ∋ ti
---
T ∋ Case e [Branch Ci [xij...] ti, ...] Nothing

Here curly braces denote (meta-level/syntactic) dependency/substitution/scoping: if we had data T p = C (A p) B p, then C0 = C, A00 = A p, A01 = B and A02 = p -- note that the Aij depend on p (A01 does so trivially); and A00{X} would mean A X. For consistency, our implementation requires that the constructors/branches appear in the same order in both the type declaration and the branches. This is not necessary though. (For simplicity we have not formally stated two additional requirements: firstly that each branch has the correct number of binders (i.e. xij and Aij have the same range of j); secondly that all binders in one branch {xij : j ∈...} are distinct, i.e. we only allow linear patterns).

The new rule is when the extra branch exists:

e ∈ S a
S is a type constructor
[Ci,...] is a strict subset of the constructors of the type `S`,
For each constructor S p = ... | Ci Aij{p} we have
  xij : Aij{a}, ... ⊢ T ∋ ti
T ∋ t*
---
T ∋ Case e [Branch Ci [xij...] ti, ...] (Just t*)

i.e. it is treated as another branch which does not bind any variables. Note that in this case we do not need to give a (named/old-style) branch for every constructor, but only a subset of them. In fact, we require a strict subset, so that there are some terms that will hit the fallback branch.

Notice however, that these rules only cover the case when the scrutinee synthesises a type which is headed by a constructor. Matching against something of arrow type (for instance) is ill-typed, and "smartholes" will wrap the scrutinee in a hole. Thus the only other interesting case is where we synthesise a hole. Since this will never reduce, there is no need for any branches, and since there are no matching constructors we should in fact enforce being no constructors. Thus the rule is

e ∈ ?
---
T ∋ Case e [] Nothing

Evaluation rules

The rule for case-of-known-constructor is modified so that if the constructor does not match a branch, then the new, fallback, branch is taken.

case (Ci s... : T A) of ..., C y... -> r{y...}, ...   ~>   r{s...}
case (C s... : T A) of ..., Just r   ~>   r     (if the constructor C does not appear in any named branch)

Note that we only reduce to the fallback branch if the scrutinee is in weak head normal form and the constructor is apart from each other branch. In particular we do not reduce with a stuck scrutinee (e.g. hole, or variable if we are evaluating under a binder): otherwise having more information to "unstick" the scrutinee could lead to a different reduct.

If there is only a fallback branch, and no named branches, then there is a choice whether to enforce the scrutinee being in WHNF. This actually changes the laziness of the language: consider case (letrec x : A = x in x) of _ -> True (note the scrutinee is non-terminating). Should this be allowed to evaluate to True? Some options (EDIT: we chose the "maximally lazy" version):

Note that for a normal-order reduction, we evaluate the scrutinee as little as possible before reducing the case to one of the branch RHSs. We never evaluate inside a RHS if we could first evaluate either the scrutinee or case.

Actions

At a low level, we need some way to choose what branches we wish to have. We already have a "make a Case" action, and we have a free choice what branches that brings into being: the two sensible ones are "all the branches, no wildcard" or "no branches except wildcard". The particular choice does not matter, as this low-level api is intended for internal use (and for clients wishing to see low-level details), but not intended to be ergonomic for a human. However, for infinitely-branching primitives, creating all branches is impossible (unless we have something like a "not-pattern", which I do not propose adding); thus for consistency we should only offer the "only wildcard" action. (EDIT: we actually chose that ADTs get all branches, primitives and holes get just a fallback) Whilst this deals with most scrutinees, it is somewhat lacking for those with no constructors: empty types and hole types (which we can deal with identically). These are required to have no branches, not even a wildcard. Thus the above rule needs to be slightly refined: "no branches except perhaps a wildcard".

We also need to be able to add and delete (or, merge into a wildcard) branches. Thus two actions: "add a branch for constructor C" and "delete the branch for constructor C". The first has the obvious interpretation: if such a branch exists, do nothing/error out; if no such exists, insert one in the appropriate place (assuming branches are ordered the same as datatype, otherwise at the end, just before the wildcard) with correct number of binders (auto-generated names) and a RHS of an empty hole (EDIT: we actually chose to copy the RHS of the wildcard branch). This could be offered either on the Case as a whole, on as an action on the wildcard branch "split a constructor out". One subtlety is when the wildcard only covers one alternative (this is a valid state, and may be intentional, as future-proofing for adding more constructors to a type). In this case, when adding an explicit branch for the last constructor we must delete the wildcard branch (is it is now unreachable). This will also delete its RHS: note that this is a low-level action, and we will make a different choice at high-level.

The second is slightly more tricky: what should happen to the RHS? However, this is a low-level action, so for simplicity we will just delete it. Note that if the case was exhaustive it now will not be, so a wildcard branch must be created in this case; the new (wildcard) RHS will be a copy of the deleted branch's RHS, with holes for the now-out-of-scope variables (i.e. those bound by the deleted branch's pattern).

At a high level, the basic actions will be similar, but we may want to tweak the details for the sake of ergonomics/pedagogy. The choices we need to make are "what branches should exist on a new case", what is on the RHS of a new branch, and "what happens to RHSs of removed branches". There seem to be merits in both "start with all branches and wittle down" (except this does not work for primitives) and "start with no branches and select them". There is also an argument for "ask what branches to start with" (e.g. exposed to a human as check boxes when initially creating a case, or maybe just offering both "all branches" and "no branches" options). For the sake of simplicity (and a more concrete proposal) I suggest only offering one action, "case with wildcard only". (EDIT: we decided to create all branches) For creating a new branch, we will copy the RHS of the wildcard branch, since it is much easier to delete unwanted code than to recreate it. Note that this means when creating the final branch the effect will be to change the wildcard pattern into a constructor pattern (as we add a new branch, delete the redundant wildcard, and also preserve the RHS of the now-deleted wildcard branch). For deleting branches, in the case of no wildcard, one could do

case ... of ...; C x -> r{x}   ~> case ... of ...; _ -> t{?}

i.e. create a wildcard branch with the same RHS, with bound variables replaced by holes, or

case ... of ...; C x -> r{x}   ~> case ... of ...; _ -> ?

i.e. delete the RHS entirely. In the case where a wildcard already exists, case ... of ...; C x -> r{x}; _ -> s then one must "merge" r and s. There are a few cases to consider here: what if neither/one/both of r and s are holes, what if they are the same, what if they have the same few outer layers but then differ, what if they are completely different. I think the only choices simple enough to be sensible to expose to a student are "always delete r{?} and keep s" or "if s is a hole, keep r{?}, otherwise keep s" (EDIT: we chose to keep the existing fallback RHS)

case ... of ...; C x -> r{x}; ...; _ -> s   ~> case ... of ...; _ -> MERGE(r{?},s)

where

MERGE(r,s) = s

or

MERGE(r,?) = r
MERGE(_,s) = s

Actions

At a low level, we need some way to choose what branches we wish to have. We already have a "make a Case" action, and we have a free choice what branches that brings into being: the two sensible ones are "all the branches, no wildcard" or "no branches except wildcard". The particular choice does not matter, as this low-level api is intended for internal use (and for clients wishing to see low-level details), but not intended to be ergonomic for a human. However, for infinitely-branching primitives, creating all branches is impossible (unless we have something like a "not-pattern", which I do not propose adding); thus for consistency we should only offer the "only wildcard" action. Whilst this deals with most scrutinees, it is somewhat lacking for those with no constructors: empty types and hole types (which we can deal with identically). These are required to have no branches, not even a wildcard. Thus the above rule needs to be slightly refined: "no branches except perhaps a wildcard".

We also need to be able to add and delete (or, merge into a wildcard) branches. Thus two actions: "add a branch for constructor C" and "delete the branch for constructor C". The first has the obvious interpretation: if such a branch exists, do nothing/error out; if no such exists, insert one in the appropriate place (assuming branches are ordered the same as datatype, otherwise at the end, just before the wildcard) with correct number of binders (auto-generated names) and a RHS of an empty hole. This could be offered either on the Case as a whole, on as an action on the wildcard branch "split a constructor out". One subtlety is when the wildcard only covers one alternative (this is a valid state, and may be intentional, as future-proofing for adding more constructors to a type). In this case, when adding an explicit branch for the last constructor we must delete the wildcard branch (is it is now unreachable). This will also delete its RHS: note that this is a low-level action, and we will make a different choice at high-level.

The second is slightly more tricky: what should happen to the RHS? However, this is a low-level action, so for simplicity we will just delete it.

At a high level, the basic actions will be similar, but we may want to tweak the details for the sake of ergonomics/pedagogy. The choices we need to make are "what branches should exist on a new case", what is on the RHS of a new branch, and "what happens to RHSs of removed branches". There seem to be merits in both "start with all branches and wittle down" (except this does not work for primitives) and "start with no branches and select them". There is also an argument for "ask what branches to start with" (e.g. exposed to a human as check boxes when initially creating a case, or maybe just offering both "all branches" and "no branches" options). For the sake of simplicity (and a more concrete proposal) I suggest only offering one action, "case with wildcard only". For creating a new branch, we will copy the RHS of the wildcard branch, since it is much easier to delete unwanted code than to recreate it. Note that this means when creating the final branch the effect will be to change the wildcard pattern into a constructor pattern (as we add a new branch, delete the redundant wildcard, and also preserve the RHS of the now-deleted wildcard branch). For deleting branches, in the case of no wildcard, one could do

case ... of ...; C x -> r{x}   ~> case ... of ...; _ -> t{?}

i.e. create a wildcard branch with the same RHS, with bound variables replaced by holes, or

case ... of ...; C x -> r{x}   ~> case ... of ...; _ -> ?

i.e. delete the RHS entirely. In the case where a wildcard already exists, case ... of ...; C x -> r{x}; _ -> s then one must "merge" r and s. There are a few cases to consider here: what if neither/one/both of r and s are holes, what if they are the same, what if they have the same few outer layers but then differ, what if they are completely different. I think the only choices simple enough to be sensible to expose to a student are "always delete r{?} and keep s" or "if s is a hole, keep r{?}, otherwise keep s"

case ... of ...; C x -> r{x}; ...; _ -> s   ~> case ... of ...; _ -> MERGE(r{?},s)

where

MERGE(r,s) = s

or

MERGE(r,?) = r
MERGE(_,s) = s

Interaction with other actions

As well as each AST node having actions of their own, they also need to respond to actions taken elsewhere (mostly due to types of subterms changing). In the implementation this is dealt with mostly via "smartholes". The relevant effect here is when the type of the scrutinee changes. Note that this may well happen often during writing a program: one may start with a hole, then insert a pattern match (giving a hole scrutinee), then fill in the scrutinee (changing its type).

The general problem is different types have different constructors and thus require different branches. In line with the current implementation (which requires every branch to exist, and does not have wildcards), we will simply remove all current branches and re-create the correct ones. For consistency with creating a Case in the first place, this will either be zero branches (hole type or empty type) or one wildcard branch (everything else). However, since wildcards do not bind anything, we have the option to preserve the RHS of wildcard branches (if they exist in both the before and after state) -- it is as yet unclear whether we want to do this. This can destroy interesting parts of the program on the RHS, but this is also the state today so it is at least not a step backward. I suspect (but have no concrete data) that almost every instance of a scrutinee changing type will be during initial program writing when nothing has been put on the RHS of branches, and almost all the rest will be mistakes which will be quickly undone.

There is also another problem: the type of the scrutinee may stay the same, but that type's definition may be edited. There are three cases here: adding a new constructor, deleting an old constructor, and changing parameters of an existing constructor. The first two are dealt with simply: add a new wildcard branch if there is not one already (implementation choice: alternatively could add an explicitly-patterned branch) or delete that branch (respectively). Changing arguments is similar: if add a argument we can update the patterns to bind another variable; if removing a argument we update and also remove mention of it on the RHS (replacing with a hole); if changing the type of a argument we put every mention of it in a hole (as otherwise the program will be ill-typed).

Ordering

I suggest that the wildcard comes last, thus the order of other branches does not matter (in the sense of not changing the result of evaluation). (This is true because we do not have "deep" matching, and thus all patterns other than wildcard are pairwise disjoint.) For consistency we could enforce ordering matching the type definition. If not, e.g. if "insert" adds the branch just before the wildcard, then we should also add an action to reorder branches -- in particular, one would not then have to think ahead to split/insert in the correct order to get the final outcome desired. One argument for making ordering changable is for aesthetics/documentation: it may be preferable to put branches in a particular order for reasons other than evaluation. (Potentially we could even allow the wildcard to be reordered, if we didn't rigidly tie priority for evaluation to syntactic ordering, though I suspect that may be confusing.)

Coverage checking

I do not propose to take long-distance-information into consideration, as in (from "Lower your guards" 4.1):

f True = 1
f x = case x of 
   False -> 2
   True -> UNREACHABLE

Thus coverage checking is trivial for these shallow patterns, and is subsumed by the typing rules above.

Implementation details

There are some choices to make, which have been flagged up (EDIT: and answered) inline above.

These choices are:

Not in spec

Or-patterns

Whilst a main motivation for this FR is pattern matching on large primitive types, which "or-patterns" won't help with, they are a refinement of the "wildcard" idea, at least for finite types. I briefly describe them here, with a view to deciding "these do not fit our vision" or "let's spec this out more fully". (Note that they unfortunately do not subsume wildcards, since they will only match a finite number of constructors.)

Consider the type of weekdays data Day = Monday | ... | Sunday, and the function isWeekend. Before this FR, writing isWeekend is rather tedious: it requires:

We can here get away with having two named branches and a wildcard. However, there is still duplication in the two branches, and (depending on our choices for what happens when editing types) if we added a new day it would automatically and silently be a non-weekend, which may be surprising!

An "or-pattern" is a way to have a branch match multiple constructors, and reduce to the same RHS (see ML, or https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0522-or-patterns.rst). One can think of a wildcard as an or-pattern which lists every constructor that has not been mentioned in a named branch. In this example we could write something like

isWeekend : Day -> Bool
isWeekend = λd . case d of
  Monday | Tuesday | Wednesday | Thursday | Friday -> False
  Saturday | Sunday -> True

There are some choices to be made when constructors have arguments: for data D = A Int | B Bool Int | C the following are obviously nonsense

case e of
  A x | B x y -> ...
  C -> ...

case e of
  A x | B x _ -> ...
  C -> ...

case e of
  A x | C -> ...
  B x y -> ...

but the following may be accepted (is easy with plain ADTs, is harder for Haskell with GADTs introducing type equalities) (using an _ to mean "unnamed/ignored variable"), but we do not have to accept them:

case e of
  A x | B _ x -> ...
  C -> ...

case e of
  A _ | C -> ...
  B x y -> ...

There is a choice whether to allow or-patterns nestedly, or only at top level.

With or-patterns, there should be an option to "merge" two patterns (keeping only the common variables, considering name and type). However, there is the question of how to merge the RHS -- perhaps only offer if one of them is a hole?

Deep matching

We briefly describe "deep" matching. If this seems like something we wish to include, it deserves its own FR. A deep match is matching on more than one level of the scrutinee. Thus patterns are recursive: pat ::= con pat ... pat | var. There is a design choice whether to allow a plain variable (note a wildcard would then be a plain variable which is ignored) or whether to enforce the top level being a constructor.

The typing rules are essentially the same, except that more variables can be brought into scope (their types are easy to calculate, since we know the type of the root pattern) and the coverage checks change: above, where we say "[Ci,...] is a list of all the constructors of the type `T`" we should say something like "the patterns cover every possibility, and none of them are redundant". This is somewhat tricky to state formally, I shall simply refer to prior work for now: http://moscova.inria.fr/~maranget/papers/warn/warn.pdf https://www.cl.cam.ac.uk/~nk480/pattern-popl09.pdf https://people.cs.kuleuven.be/~tom.schrijvers/Research/papers/icfp2015.pdf https://www.microsoft.com/en-us/research/uploads/prod/2020/03/lyg_ext.pdf

For both evaluation, the order of branches now matters, since multiple patterns may match the same scrutinee. As a simple example,

case C True False of
C True x -> 1
C x False -> 2

where the result depends on the priority of the branches. More subtly, if the scrutinee was C (letrec x:A=x in x) False, then the priority of the branches would affect the termination of the program. Similarly (see GMTM, section 8.1) the order of patterns in one line matters (at least for a lazy case):

case C (letrec x:A=x in x) False of
C True True -> 1
C x False -> 2
C False True -> 3

If we attempt to evaluate and first ask "does this match the first branch, in the first argument" we will attempt to evaluate a divergent term to WHNF. If we instead asked "does this match the first branch, in the second argument" we would see clearly not, and then (say) move on to "does this match the second branch, in the ... argument" and see it does, in both, returning 2.

What order should be chosen? The classical choice is "branches top-to-bottom, arguments left-to-right, as written in the source". One possible alternative (that I don't know of prior work on) is to enforce the existance of meets: i.e. whenever multiple patterns match a term, then there is a unique "most specific one", and this is the one that is matched. One drawback (other than complexity) is that this forces more branches to exist, giving bigger programs, however "or-patterns" can minimise this problem.

The evaluation rule is now case e of ... ; p{xs} -> r{xs} ; ... ~> r{as} where p is a pattern that binds the variables xs and is the highest-priority pattern which matches e, giving assignments of terms as to xs. Here "highest priority" is discussed above, when considering ordering. "Matches, giving assignments" can be formalised by considering patterns as a subset of terms and saying "there is a substitution of the terms as for the variables xs such p{as} = e" I think when having deep matching, the semantics should be either fully lazy ("evaluate as much as needed") or fully strict ("always fully evaluate"), but not head-nomalise ("force one layer of constructors").

For actions, depending on our choices about ordering, there may need to be an action to reorder. There would also need to be an action to make patterns: when focussed on one variable in one pattern, it should split one option out, leaving a variable to catch all the other cases or the sole remaining constructor if there is only one option (e.g. change x into Just y and Nothing, or d into Wednesday and d'), duplicating the rest of the branch (if that variable is used in the RHS, it would need to be substituted or let-bound). Conversely, an action for deleting or merging is needed, however the precise spec here is not yet clear. In both cases, there is the issue of what to do with a "clash": consider

case e of
  C True y -> ...
  C x    False -> ...

and then deleting True, or splitting False from y and True from x.

Discussion

None

Future work

See "not in spec"

dhess commented 1 year ago

Thanks for putting this together.

My initial thoughts:

Wildcard branches

Eval rules and ordering

Or patterns

Overall, I'm not inclined to implement this:

Deep matching

brprice commented 1 year ago

The implementation of this is started at #1049. It needs another pass when we finalise this FR to ensure that the implementation and spec line up.

dhess commented 1 year ago

Where did this end up wrt. #1049?

brprice commented 1 year ago

Thanks for digging through the old issues!

We have implemented wildcards (and thus matching on primitives). I have edited the FR above to clarify what choices we made with the implementation details.


WRT your comment above:

For non-primitive types, I would like to elide wildcards from match-related Actions in beginner mode(s).

This has not been done, but would be easy to add

[thoughts about branch ordering]

This has indeed all been left for future work

Or-patterns

Agreed not to implement

Deep matching

As you mentioned, this is future work.

georgefst commented 1 year ago

While we're on the topic, one behaviour I've noticed, which I couldn't quite tell whether it's intentional or not (I don;t think I see it in the spec, but the relevant section is quite complex), is that the initial branches in a match depend on the order of actions:

I would think the latter behaviour is more useful, and we should only default to wildcards for primitive scrutinees.

brprice commented 1 year ago

While we're on the topic, one behaviour I've noticed, which I couldn't quite tell whether it's intentional or not (I don;t think I see it in the spec, but the relevant section is quite complex), is that the initial branches in a match depend on the order of actions:

* Starting with a `match` node, then filling the scrutinee, results in a single wildcard branch.

* Inserting a `match` node in a non-hole puts that expression in the scrutinee and generates the full set of non-wildcard branches.

This is intentional-in-the-implementation. The critical interactions are (a) creating a new match creates all branches and (b) changing the (synthesised) type of the scrutinee ensures that the branches are consistent. To be explicit about the mechanics of the current implementation: the two sequences are

I would think the latter behaviour (i.e. filling in a hole-scrutinee will update the branches) is more useful, and we should only default to wildcards for primitive scrutinees.

I think I have a mild preference agreeing with you, but this could do with more thought about pedagogy and ergonomics-for-a-beginner. However, I am not sure of the precise spec that you would like here. Let's enumerate a few scenarios:

What would you like to see in each of these cases? What is the general rule, and is it consistent with "just do all branch modifications in the typechecker"?


For the result of changing a scrutinee, the relevant part of the spec is

In the implementation this is dealt with mostly via "smartholes". The relevant effect here is when the type of the scrutinee changes. Note that this may well happen often during writing a program: one may start with a hole, then insert a pattern match (giving a hole scrutinee), then fill in the scrutinee (changing its type).

The general problem is different types have different constructors and thus require different branches. In line with the current implementation (which requires every branch to exist, and does not have wildcards), we will simply remove all current branches and re-create the correct ones.

This is perhaps poorly-written, and technically differs slightly from the implementation (indeed, your example is actually called out specifically here!). The issue is that the implementation does not actually react to "the scrutinee('s type) changed", but simply typechecks the resulting program; since a case-of-hole has just a wildcard branch, and this is consistent with any type, we do not recreate the branches.

brprice commented 10 months ago

Closing as the main feature was implemented in #1049. There are other possible choices, enhancements and future work, but they would be best served by a new FR. Some future work is tracked in #334.