micropython / micropython-lib

Core Python libraries ported to MicroPython

itertools.tee wrongly programmed #580

Open rkompass opened 1 year ago

rkompass commented 1 year ago

The crux of the tee() function of itertools is that it duplicates/multiplies the generator so that you have independent generator copies: if one is depleted (StopIteration exception), the others are still able to generate the sequence.

MP's itertools tee definition:

def tee(iterable, n=2):
    return [iter(iterable)] * n

does not achieve this. My very first example using tee, from the book "Clean Code in Python" by Mariano Anaya, thus led to an error.

CircuitPython's itertools has practically the same definition -> same problem.

tee in CPython has to store the objects that some, but not all, of the generator copies have already produced. It is coded in C. I see no other way to replicate this behaviour than to grow a list of those values. A deque with the option to retrieve intermediate objects would be optimal, which we do not have?!
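
The failure mode is easy to reproduce. A minimal sketch (broken_tee and gen are stand-in names for the current MP definition and an example generator):

```python
def broken_tee(iterable, n=2):
    # current MP definition: n references to one and the same iterator
    return [iter(iterable)] * n

def gen():
    yield from range(3)

a, b = broken_tee(gen())
print(a is b)    # True: both "copies" are the same object
print(list(a))   # [0, 1, 2]
print(list(b))   # []  -- already exhausted by consuming a
```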

mattytrentini commented 1 year ago

I guess that's why the CPython implementation is written in C and is ~500 LOC. 😳

We have deque improvements (#440) but - if I understand your suggestion - we'd have to add index? Pretty easy. But it's becoming a 'heavy' solution...

rkompass commented 1 year ago

I just now found a pure Python implementation here: https://twitter.com/dontusethiscode/status/1504544685402079233. I have to admit that I only understand 30% of the code :-( The buffer is there; it is appended to in next, or popleft()'ed from. I don't see how it keeps track of which of the generator copies has to take from within the buffer.

rkompass commented 1 year ago

I wrote a new tee function and, in a rush, completed itertools, together with tests. Will do a pull request in the next days.

sosi-deadeye commented 1 year ago

This gentleman loves generators and closures.

A method to probe whether an object is a generator:

def gen():
    """
    This is a generator function
    """
    yield

def is_generator(obj):
    """
    Return True if obj is a generator.
    """
    return iter(obj) is iter(obj)

def is_iterable(obj):
    """
    Return True if obj is iterable.
    """
    try:
        iter(obj)
    except TypeError:
        return False
    else:
        return True
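
For what it's worth, the iter(obj) is iter(obj) check returns True for any iterator (generator objects included), not only for generators; a quick self-contained demonstration (count is a hypothetical example generator function):

```python
def is_generator(obj):
    # True for iterators (a generator object is its own iterator),
    # False for containers; raises TypeError for non-iterables
    return iter(obj) is iter(obj)

def count():
    yield 1

print(is_generator(count()))           # True: iter() on a generator returns the generator itself
print(is_generator([1, 2]))            # False: each iter() call on a list builds a fresh iterator
print(is_generator(reversed([1, 2])))  # True: reversed() returns an iterator, too
```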

Here is an example in pure Python, which also works with MicroPython. Not everything is tested. I do not know if __slots__ has an effect in MicroPython.

from collections import deque

class Iterator:
    """
    Implementation of an Iterator which uses a deque
    """

    __slots__ = ("buffer", "parent")

    def __init__(self, parent, buffer):
        """
        Keeps a reference to buffer and parent.
        """
        self.buffer = buffer
        self.parent = parent

    def __iter__(self):
        return self

    def __next__(self):
        """
        Next object from buffer.
        """
        if not self.buffer:
            self.parent.next()

        return self.buffer.popleft()

class BufferedIterator:
    __slots__ = ("buffer", "iterator")

    def __init__(self, iterable, n, buffer_len=512):
        """
        Initialize class with buffers and convert the iterable to an iterator
        """
        self.buffer = [deque((), buffer_len) for _ in range(n)]
        self.iterator = iter(iterable)

    def next(self):
        """
        Fill all buffers with the next object.
        """
        obj = next(self.iterator)
        for buffer in self.buffer:
            buffer.append(obj)

    def iterators(self):
        """
        Return the iterators.
        Each iterator gets its own buffer
        """
        return [iter(Iterator(self, buffer)) for buffer in self.buffer]

def tee(iterable, n=2):
    if iter(iterable) is iter(iterable):
        return BufferedIterator(iterable, n).iterators()
    else:
        return [iter(iterable) for _ in range(n)]

# some tests        

def gen():
    yield from range(10)

a, b = tee((1, 2, 3))
for x in a:
    ...

for y in b:
    ...

print("Last x and y:", x, y)
del a, b, x, y

print("tee with generator")
c, d = tee(gen())

for x, y in zip(c, d):
    print(f"{x=}|{y=}")

del c, d, x, y
rkompass commented 1 year ago

Hello @sosi-deadeye, thanks for your ideas and code. I didn't know that the test if iter(iterable) is iter(iterable): may help to avoid buffers if we have lists or ranges as the original iterables. This is a cool idea to save memory.

Will this always work? I.e., if iter(iterable) returns distinct objects, can I be sure that consuming one of these iterators does not consume the others too, or can they differ as objects and still consume from the same source? As an alternative one could test with isinstance(iterable, (range, tuple, list)):.
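
For the built-in containers at least, the behaviour can be checked directly: each iter() call on a list returns a fresh, independent iterator, while iter() on an iterator returns the object itself (whether every custom iterable guarantees this independence is a different question). A small sketch:

```python
nums = [1, 2, 3]
i1, i2 = iter(nums), iter(nums)
print(i1 is i2)        # False: two distinct iterator objects
print(list(i1))        # [1, 2, 3]
print(list(i2))        # [1, 2, 3] -- consuming i1 did not touch i2

it = iter(nums)
print(iter(it) is it)  # True: iter() on an iterator is the identity
```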

Your is_generator(obj) test raises an exception if you test a generator.

If I test a true iterator your tee function gives a wrong result:

class iterx:
    def __init__(self, n=10):
        self._last = n
        self._i = 1
    def __iter__(self):
        return self
    def __next__(self):
        if self._i > self._last:
            raise StopIteration
        r = self._i
        self._i += 1
        return r

it1, it2, it3 = tee(iterx(1000), 3)
itmin = min(it1)
print(list([itmin, max(it2), sum(it3)]))

gives [1, 1000, 381184] instead of the correct [1, 1000, 500500]. ???

My tee() function at present is:

def tee(iterable, n=2):
    it = iter(iterable)
    if n < 1:
        return ()
    elif n == 1:
        return (it,)
    buf = []                            # Buffer, contains stored values from itr
    ibuf = [0]*n                        # Indices of the individual generators, could be array('H', [0]*n)     
    def gen(k):                         #   but we have no 0 in ibuf in MP
        nonlocal buf, ibuf              # These are bound to the generators as closures 
        while True:
            if ibuf[k] < len(buf):      # We get an object stored in the buffer.
                r = buf[ibuf[k]]
                ibuf[k] += 1
                if ibuf[k] == 1:        # If we got the first object in the buffer, 
                    if 0 not in ibuf:   #   then check if other generators do not wait anymore on it
                        buf.pop(0)      #   so it may be popped left. Afterwards decrease all indices by 1.
                        for i in range(n):
                            ibuf[i] -= 1
            elif ibuf[k] == len(buf):
                try:
                    r = next(it)
                    buf.append(r)
                    ibuf[k] += 1
                except StopIteration:
                    return
            yield r                     # The returned generators are not thread-safe. For that the access to the
    return tuple(gen(i) for i in range(n))   #   shared buf and ibuf should be protected by locks.

Now that I learnt that generators are not iterators (iter(gen) raises an exception), I think that the iterators returned by tee should not be generators, and I'm afraid this function is not correct.

rkompass commented 1 year ago

Correction: Your is_generator(obj) is o.k.; I mixed up the generator function and the generator object. The problem with tee returning generators does not exist.
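
A quick check of the corrected understanding (g is a hypothetical example): a generator object is its own iterator, so iter() on it neither raises nor creates a new object.

```python
def g():
    yield 1

go = g()
print(iter(go) is go)  # True: a generator object is an iterator
print(next(go))        # 1
```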

I now included a test

    if iter(iterable) is not iter(iterable):       # save buffer for special cases that iterable is range, tuple, list ...
        return [iter(iterable) for _ in range(n)]  #   that have independent iterators

but I'm not sure if it would be better to test against isinstance(iterable, (range, tuple, list)).

The wrong result with your tee() function is due to the fact that you instantiate BufferedIterator always with buffer_len=512. Exceeding this limit is not detected. By the way, having a distinct buffer for each of the iterators returned by tee wastes memory.
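
The silent data loss can be seen with a bounded deque directly: in CPython (and, if I read the docs right, in MicroPython too unless overflow checking is enabled via the flags argument), appending to a full deque discards the oldest element without any error, which is exactly how the slowest of the three iterators in the 1000-element test lost values:

```python
from collections import deque

d = deque((), 3)   # bounded deque, maxlen=3
for i in range(5):
    d.append(i)    # no error when full: the leftmost item is dropped
print(list(d))     # [2, 3, 4] -- 0 and 1 are silently gone
```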

Another point: I decided to use plain lists directly for buffering, as MP's deque from collections is effectively a list anyway. This should save some memory, but has the drawback that the code should be updated as soon as MP has a real deque.

sosi-deadeye commented 1 year ago

The trick with iter is that it returns the iterator. A called generator function returns a generator, which knows its state and is also an iterator. This is true for CPython, PyPy and MicroPython.

If the argument is an iterator, iter returns the same object. If the object is an iterable, but not an iterator, iter will return a different object, which is an iterator.

iter does not consume anything. It's just a thin wrapper around the iterable that tracks the position. The call next(iterator) consumes the iterator; it is consumed indirectly if you use it in a for-loop or call list(iterator) or similar.

This test also works: id(iter(obj)) == id(obj). The same memory address means the same object. The is operator compares identity, which (in CPython) is the memory address.

Here is some code to test iterables and iterators:

class Iterable:
    def __init__(self):
        self.countdown = list(range(10))

    def __len__(self):
        return len(self.countdown)

    def __getitem__(self, index):
        return self.countdown[index]

    def __repr__(self):
        return "<Iterable>"

class Iterator:
    def __init__(self):
        self.countdown = list(range(10))
        self.pos = 0
        self.size = len(self.countdown)

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos < self.size:
            obj = self.countdown[self.pos]
            self.pos += 1
            return obj
        else:
            raise StopIteration

    def __repr__(self):
        return "<Iterator>"

def generator_function():
    yield 42

generator = generator_function()
iterators = [Iterator(), generator]
iterables = [
    Iterable(),
    dict(),
    set(),
    list(),
    tuple(),
    str("Hello"),
    bytes(),
    bytearray(),
    range(10),
]

for obj in iterators + iterables:
    mem_id1 = id(iter(obj))
    mem_id2 = id(obj)
    # print(mem_id1, mem_id2)

    # iter(obj) is iter(obj)
    # is better
    if mem_id1 == mem_id2:
        kind = "iterator"
    else:
        kind = "iterable"

    print(obj, "is an", kind)

print()
print("checking if generator is exhausted")
# generator yields only 42
for value in generator:
    print(value)

The wrong result with your tee() function is due to the fact that you instantiated BufferedIterator always with buffer_len=512.

Limiting it was my intention. The problem is that we don't know the size of the objects which are kept in the buffer. If the distances between the two (or more) iterators get too big, you have too many objects in RAM.

Maybe it's better not to have a limit and to push the responsibility to the programmer, plus a note in the documentation.

Another point: I decided to use lists directly for buffering, as the deque from collections is a list. This should save some memory but has the drawback that as soon as MP has a real deque, the code should be updated.

My first implementation was with lists. I changed it to deque because I think we will get in future a better implementation of deque.

but I'm not sure if it would be better to test against isinstance(iterable, (range, tuple, list)).

If you do this, you must test for all possible types. For example, reversed also returns an iterator. Python has many types.

Your implementation works with generators. I haven't tested much, but everything looks right.

rkompass commented 1 year ago

Nice instructive example program, thanks.