gvanrossum / patma

Pattern Matching

Should we allow unpacking raw iterators? #7

Open gvanrossum opened 4 years ago

gvanrossum commented 4 years ago

For example,

from random import randint

def throw(n=1):
    for i in range(n):
        yield randint(1, 6)

match throw(2):
    case [1, 1]: print("Snake eyes")
    case [4, 2] | [2, 4]: print("Easy six")
    case x: print("Some throw:", list(x))

We sure would want the last case to print two numbers. But that would require some kind of buffering of the match values.

viridia commented 4 years ago

As discussed in our conversation, there are a couple of different approaches, but none of them is completely satisfactory.

Idea 1: Buffered Iterators

The first idea is what I would call a 'buffered iterator', which is a wrapper around an iterator that allows the iteration sequence to be re-started. This is similar to a token lookahead buffer in a parser. The buffer saves each iterated value in a list, and when it is restarted, it consumes from the list first before consuming more items from the inner iterator.

I would suggest that this class could be added to 'itertools' on its own merits, independently of its use in match statements. It is similar in concept to the existing 'tee' function.
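A minimal sketch of such a wrapper (the names `BufferedIterator` and `rewind` are illustrative, not a proposed itertools API):

```python
class BufferedIterator:
    """Wrap an iterator so it can be rewound and re-consumed.

    Every value pulled from the underlying iterator is saved in a
    buffer; after rewind(), iteration replays the buffer before
    consuming fresh items from the inner iterator.
    """

    def __init__(self, it):
        self._it = iter(it)
        self._buffer = []
        self._pos = 0  # replay position within the buffer

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos < len(self._buffer):
            value = self._buffer[self._pos]   # replay a buffered value
        else:
            value = next(self._it)            # may raise StopIteration
            self._buffer.append(value)
        self._pos += 1
        return value

    def rewind(self):
        self._pos = 0
```

For example, wrapping `iter([4, 2])`, consuming both values, and calling `rewind()` lets `list()` see `[4, 2]` again, even though the inner iterator is exhausted.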

Knowing when to wrap iterators is tricky: basically, any time you see a destructuring pattern in a match pattern (either top-level or nested) and the input value is an iterator rather than a re-iterable container (a container can simply be iterated again from the start), you need to wrap.

However, as pointed out, this means that in the final case (in the example above), where you are not destructuring the iterator, you'll get the wrapper and not the original object, because the original iterator is no longer in a state where you can retrieve the first few items. There's no way to 'rewind'. (Heisenberg's iterators.)

Whether this is a problem or not depends on user expectations: does the user expect to get exactly the same type of object in the non-destructuring case, or will any iterator that yields the same sequence of values do?

Idea 2: Iterate only once

Instead of trying all of the match patterns one at a time, process them simultaneously. This can be done using standard parsing techniques as described in the dragon book.

Basically, each match pattern can be transformed into an NFA (nondeterministic finite automaton). Then you can construct a single NFA consisting of all the match patterns as alternate choices. Using the algorithm outlined in the dragon book, convert the combined NFA into a DFA (deterministic finite automaton) and then into shift/reduce parse tables, or whatever representational form is suitable for the matching engine implementation.

So for example, if you have two patterns [1, 2] and [1, 3], the DFA will look like this:

State 1:
* input 1 - proceed to state 2, which is the union of the two match patterns
* anything else - fail
* end of input - fail

State 2:
* input 2 - proceed to state 3
* input 3 - proceed to state 4
* anything else - fail
* end of input - fail

State 3:
* end of input - succeed first match pattern
* anything else - fail

State 4:
* end of input - succeed second match pattern
* anything else - fail

Note: 'fail' in this case means fall through to the default case.
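As a rough illustration, the four states above can be encoded as a transition table and driven by a tiny loop (a hypothetical encoding for exposition, not a proposed implementation):

```python
END = "end-of-input"  # sentinel symbol appended after the real input

# (state, symbol) -> next state. Missing entries mean "fail",
# i.e. fall through to the default case.
TRANSITIONS = {
    (1, 1): 2,                  # both patterns start with 1
    (2, 2): 3,                  # [1, 2] continues
    (2, 3): 4,                  # [1, 3] continues
    (3, END): "match [1, 2]",   # state 3 accepts the first pattern
    (4, END): "match [1, 3]",   # state 4 accepts the second pattern
}

def run_dfa(values):
    """Feed values (then END) through the table; None means fail."""
    state = 1
    for symbol in list(values) + [END]:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return None
    return state
```

With this table, `run_dfa([1, 2])` and `run_dfa([1, 3])` each reach their accepting state, while `[1, 4]` or a too-short input like `[1]` fails. Note that each input value is consumed exactly once, which is the whole point of the approach.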

The table generator will need to be careful to resolve ambiguities in favor of the earliest declared match pattern.

Because the input grammar is so simple (no recursion, limited repetition), constructing an NFA should not be difficult; the only tricky bit is the * (rest) operator.

Theoretically this algorithm should be faster and more efficient than testing each of the matches individually. However, it will take longer to construct the parse tables (presumably done at compilation time) and will require some memory to store the graph. You might be able to cut down on this by using a recursive algorithm to evaluate the DFA directly instead of converting it to parsing tables.

Unfortunately, this technique still does not handle the non-destructuring case - once the iterator has been iterated, there's no way to rewind it.

That being said, this approach might be worth taking for other reasons, such as performance as mentioned. It may be worth experimenting with a pure-python implementation of this algorithm.

viridia commented 4 years ago

Also, we want to be careful about terminology here: the problem is not unpacking raw iterables, but raw iterators. It might be a design choice to support the former but not the latter.

gvanrossum commented 4 years ago

But you may still have to buffer some of the values, e.g. in this case:

match x:
  case [1, 2]: ...
  case [a, b]: ...

And IMO we should not assign to a and b if the first case matches.
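A sketch of the buffering this implies when x is an iterator (hand-compiled for just these two cases; `classify` is a hypothetical name): the items are pulled from the iterator exactly once into a list, each case is tested against that buffer, and a and b are bound only when the second case is actually chosen.

```python
import itertools

def classify(x):
    """Try case [1, 2] first, then case [a, b], buffering the iterator."""
    # Pull at most 3 items: a third item proves the input is too long
    # for either two-element pattern.
    buf = list(itertools.islice(iter(x), 3))
    if buf == [1, 2]:
        return "matched [1, 2]"          # a and b are never bound
    if len(buf) == 2:
        a, b = buf                       # bind only when this case is chosen
        return f"matched [a, b] with a={a}, b={b}"
    return "no match"
```

So `classify(iter([1, 2]))` takes the first case without ever binding a and b, while `classify(iter([3, 4]))` falls through to the second case and binds from the buffer rather than re-reading the (now exhausted) iterator.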

viridia commented 4 years ago

What happens if the [a, b] case has a guard condition that fails? Are a and b still modified? Personally, I don't think you should allow existing in-scope variables to be mutated in match expressions; you should only be able to declare new variables that are scoped to the case suite.

With the parser-table approach, what you'd have to do is defer binding until you finally decide which match expression you are going to choose. (And if the guard condition fails, choose another one and re-bind.)

One way to do this would be to keep extra metadata on your shift/reduce stack that remembers the length of each matched subsequence, and then, when you reach an acceptance state, use that information to figure out which variable binds to which stack slot. Something like "'a' binds to the last slot of the first subsequence..."

However, something you said in the other thread (about match patterns being a purely runtime construct) is incompatible with the parser-table approach: it only makes sense to compute an NFA/DFA if you can do it in advance, since it's an expensive computation. So this whole suggestion may be moot.

I also think that people generally pass around iterables more than they pass around iterators (generators being the notable outlier). I think the use case for supporting iterables is stronger than the use case for supporting iterators, and not supporting the latter is an easy rule to explain.

gvanrossum commented 4 years ago

What happens if the [a, b] case has a guard condition that fails?

One could philosophize about this forever, and in the discussion leading up to the Walrus operator this space was explored extensively. In the end we decided against introducing new scopes, and the walrus will happily override an existing local (or nonlocal or global, if there was such a declaration!); the bindings by the walrus also persist afterwards. This is all the same as what we do with for-loop control variables. I think we should use the same rules here.

match patterns being a purely runtime construct

I didn't mean that the compiler couldn't optimize things. Arguably "2+2" is a runtime construct too, but the compiler still pre-evaluates it and simply loads the constant "4".

the use case for supporting iterables is stronger than the use case for supporting iterators

But note that the formal definition of iterable is such that an iterator is also an iterable. And perhaps more to the point dicts and sets are also iterables. Do you really want dict(p=1, q=2) to match case [a, b] binding a = 'p', b = 'q'? (CORRECTED) I don't.

That's why my current prototype insists on the value being an instance of collections.abc.Sequence.
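A sketch of the kind of runtime guard this implies (the extra exclusion of str/bytes is my assumption for illustration, not necessarily what the prototype does):

```python
from collections.abc import Sequence

def eligible_for_sequence_pattern(value):
    """True if value should be destructurable by a [...] pattern."""
    # Real sequences (list, tuple, range, ...) qualify; text types are
    # excluded so "ab" doesn't silently match case [a, b]. Dicts, sets,
    # generators, and other bare iterators are not Sequences at all.
    return isinstance(value, Sequence) and not isinstance(value, (str, bytes))
```

Under this check, `[1, 2]` and `(1, 2)` qualify, while `dict(p=1, q=2)`, a generator such as `throw(2)`, and `"ab"` do not.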

gvanrossum commented 4 years ago

Whoops, I initially gave the wrong example for the dict binding. It will bind the keys, not the values!