hylang / hyrule

Add Clojure's seq abstraction #32

Open · gilch opened this issue 7 years ago

gilch commented 7 years ago

We're running into occasional annoying problems because the ubiquitous Python iterators are mutable. Clojure doesn't have this problem because it uses the immutable seq abstraction instead. A seq can still have the laziness of an iterable, but getting the next item won't change what the first seq means.
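
For example, with a plain Python iterator, merely reading a value changes what every holder of that iterator sees:

>>> it = iter([1, 2, 3])
>>> next(it)
1
>>> list(it)  # the 1 is gone for anyone else holding `it`
[2, 3]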

I've been playing around with a Python implementation of seq. Here's a rough attempt. This could easily be translated to Hy.

from collections.abc import Iterable

class LazyCons:
    """An immutable cons cell whose cdr is computed lazily and cached."""
    __slots__ = ['car', 'cdr']
    def __init__(self, car, cdr):
        self.car = lambda: car
        def lazy_cdr():
            # Compute the cdr once, then replace this thunk with a
            # constant function so later calls return the cached value.
            res = cdr()
            self.cdr = lambda: res
            return res
        self.cdr = lazy_cdr

class Seq(LazyCons, Iterable):
    """A lazy, immutable view of an iterable."""
    __slots__ = ()
    def __init__(self, iterable):
        it = iter(iterable)
        car = next(it)  # raises StopIteration if the iterable is empty
        super().__init__(car, lambda: Seq(it))
    def __iter__(self):
        def iterator():
            seq = self
            while True:
                yield seq.car()
                try:
                    seq = seq.cdr()
                except StopIteration:
                    # End of the underlying iterable; catching it here keeps
                    # PEP 479 (Python 3.7+) from raising RuntimeError.
                    return
        return iterator()

And a function to demonstrate the laziness.

def verboseRange(*stop_startstopstep):
    for i in range(*stop_startstopstep):
        print('yielding', i)
        yield i

And the demonstration:

>>> spam = Seq(iter(verboseRange(5)))  # note iter() call
yielding 0
>>> spam.car()
0
>>> spam.car()
0
>>> spam.cdr()
yielding 1
<__main__.Seq object at 0x0000019C4D773D68>
>>> spam.cdr().car()
1
>>> list(spam)
yielding 2
yielding 3
yielding 4
[0, 1, 2, 3, 4]
>>> spam.car()
0

That iter() call isn't required, but it demonstrates converting a mutable iterator into an immutable seq (provided nothing else has a reference to that iterator).

Some more ideas:

Clojure often converts collections to seqs automatically. Rather than add that to a lot of Hy functions, we might instead have a short reader macro to do it. Although (seq xs) is not that long to begin with.

We could also change the __repr__ of a Seq to print up to the first 20 items or something. This would make seqs a lot more friendly in the REPL, and lazy Python iterables too, especially when combined with the reader macro.
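
Here's a minimal sketch of such a __repr__, assuming a cutoff of 20 items (the helper name and output format are just illustrative). Since a Seq caches whatever it has realized, peeking at the head for display is harmless:

from itertools import islice

def _seq_repr(self):
    # Realize at most 21 items; the extra one tells us whether an
    # ellipsis is needed. Realized items stay cached in the cons chain.
    head = list(islice(self, 21))
    if len(head) > 20:
        return 'Seq({} ...)'.format(' '.join(map(repr, head[:20])))
    return 'Seq({})'.format(' '.join(map(repr, head)))

Seq.__repr__ = _seq_repr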

Kodiologist commented 7 years ago

How does this proposal compare to @tuturto's hy.contrib.sequences?

gilch commented 7 years ago

They look pretty different to me. They both have the property of caching the calculated values, but @tuturto's sequences seem more like an alternative to generators. My seq abstraction would probably make that obsolete, since you could just use seq on a normal yield generator (or genexpr) to make it cache its values. My seq works on any iterable or iterator.
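
For instance, using the Seq and verboseRange definitions from above, wrapping a genexpr in a Seq caches its values, so iterating twice runs the generator only once:

>>> squares = Seq(x * x for x in verboseRange(3))
yielding 0
>>> list(squares)
yielding 1
yielding 2
[0, 1, 4]
>>> list(squares)  # cached; the generator isn't re-run
[0, 1, 4]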

tuturto commented 7 years ago

They're pretty different. Mine is a (possibly recursive) sequence defined in terms of its index. Think of the Fibonacci sequence:

=> (defseq fibonacci [n]
...    (if (= n 0) 0
...        (= n 1) 1
...        (+ (get fibonacci (- n 1))
...           (get fibonacci (- n 2)))))

=> (get fibonacci 6)
8

One can iterate over it, but the iterator is mutable (because it's just a normal iterator):

=> (setv fibs (iter fibonacci))
=> (first fibs)
0
=> (first fibs)
1
=> (first fibs)
1
=> (first fibs)
2

gilch's approach takes any iterator and creates a new immutable iterator:

=> (setv fibs (Seq fibonacci))
=> (first fibs)
0
=> (first fibs)
0
=> (first fibs)
0

However, since it doesn't define the __getitem__ magic method, indexed access is not supported.
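
For illustration, indexed access could be grafted on by walking the cdr chain; this _seq_getitem helper is hypothetical, and each lookup is O(n):

def _seq_getitem(self, index):
    # Walk `index` cdrs, realizing (and caching) cells along the way.
    # Negative indices aren't handled.
    seq = self
    try:
        for _ in range(index):
            seq = seq.cdr()
    except StopIteration:
        raise IndexError(index)
    return seq.car()

Seq.__getitem__ = _seq_getitem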

So they're solving related, but slightly different problems.

gilch commented 7 years ago

> gilch's approach takes any iterator and creates a new immutable iterator:

More precisely, it takes any iterable and makes a new lazy, immutable iterable from it. Iterables are the things you can call iter() on to get an iterator. It so happens that in Python, all iterators are themselves iterables (their __iter__() method returns self), so you can do things like this:

>>> tuple([1,2,3,4,5])
(1, 2, 3, 4, 5)
>>> tuple(iter(iter(iter([1,2,3,4,5]))))
(1, 2, 3, 4, 5)
>>> spam = iter([1,2,3])
>>> iter(spam) is spam
True
>>> iter(iter(iter(spam))) is spam
True

A better-optimized Seq implementation would similarly avoid creating a new Seq when it's constructed from an existing Seq, or even from an iterator returned by a Seq.
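
As a rough sketch, the first case could be handled by a factory function (the name seq here is illustrative); the case of an iterator returned by a Seq is trickier, since the iterator would have to remember which cell it is positioned at:

def seq(iterable):
    # A Seq is immutable, so sharing it is safe; don't wrap it again.
    if isinstance(iterable, Seq):
        return iterable
    return Seq(iterable)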

You can use a Seq to convert a (mutable) iterator to an immutable iterable and still use it in any Python function that needs an iterable. So the Python interop is pretty seamless.
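
For example, a Seq can be handed to ordinary Python functions repeatedly without being consumed:

>>> spam = Seq(iter(range(5)))
>>> sum(spam)
10
>>> max(spam)  # spam wasn't consumed by sum()
4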

If you have an iterator but want to use it more than once, the usual way of handling that in Python is to realize the iterator into a tuple, which is immutable but still iterable. This doesn't work well if the iterator is very large, and not at all if it's infinite. The itertools.tee() function is a lazy alternative, but the iterators it returns are still mutable.
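
To make the contrast concrete, the iterators tee() returns advance independently, but each one is still consumed as you read it:

>>> from itertools import tee
>>> a, b = tee(iter([1, 2, 3]))
>>> next(a)
1
>>> next(a)  # `a` has mutated; a second read gives a different answer
2
>>> next(b)  # `b` is independent, but just as mutable
1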

I think we could also add a cdrs() method to Seq that gets a Seq of all the cdrs, like how {}.items() gives you a different view of the data.
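
A possible sketch of that method, assuming the Seq class above (the helper name _seq_cdrs is hypothetical); each element of the returned Seq is itself a Seq, and, like Seq itself, it realizes the first tail eagerly:

def _seq_cdrs(self):
    def tails():
        seq = self
        while True:
            try:
                seq = seq.cdr()
            except StopIteration:
                return  # no more tails
            yield seq
    return Seq(tails())

Seq.cdrs = _seq_cdrs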

See Clojure's seq abstraction for what else we could do with this. The LazyCons class is inspired by Clojure's lazy-seq.

tuturto commented 7 years ago

I would really love to see this in Hy. While I like using iterators, I know firsthand that some cases are hard to work with and can introduce subtle bugs. We should come up with good naming, though, to separate these two from each other. Since @gilch's is closer to what Clojure's sequences are, I would propose renaming my solution to something different.

refi64 commented 7 years ago

This would be pretty great, and it seems like it wouldn't cause any trouble with Python integration.

refi64 commented 7 years ago

Actually, there is one thing that worries me: how is the memory usage here? In Python, iterators are often used to process large data sequences, and I'm just not sure how the storage of numerous attached functions and such would work out...

gilch commented 7 years ago

We'd have to design it carefully. My proof of concept might have a bit of a problem there. I haven't tested it enough (and have already uncovered other problems). But in principle, if you don't keep a reference to the head, the garbage collector is free to reclaim any lazy cons cells you've passed. So besides a small fixed overhead for the current value, you're not taking up any more memory than any other lazy iterator.

I think Python can reuse the function bytecode between closures, with only the captured values changing, so I'm not too worried about the functions taking up too much memory. They're also generated on the fly, so it's still lazy.