python / cpython

The Python programming language
https://www.python.org
Other
63.37k stars 30.33k forks source link

Incorrect optimization in itertools.tee() #123884

Closed rhettinger closed 1 month ago

rhettinger commented 1 month ago

Bug description:

To save a memory allocation, the code path for a tee-in-a-tee incorrectly reuses the outer tee object as the first tee object in the result tuple. This is incorrect. All tee objects in the result tuple should have the same behavior. They are supposed to be "n independent iterators". However, the first one is not independent and it has different behaviors from the others. This is an unfortunate side-effect of an early incorrect optimization. I've now seen this affect real code. It surprising, unhelpful, undocumented, and hard to debug.

Demonstration:

from itertools import tee

def demo(i):
    it = iter('abcdefghi')
    [outer_tee] = tee(it, 1)
    inner_tee = tee(outer_tee, 10)[i]
    return next(inner_tee), next(outer_tee)

print('These should all give the same result:')
for i in range(10):
    print(i, demo(i))

This outputs:

These should all give the same result:
0 ('a', 'b')
1 ('a', 'a')
2 ('a', 'a')
3 ('a', 'a')
4 ('a', 'a')
5 ('a', 'a')
6 ('a', 'a')
7 ('a', 'a')
8 ('a', 'a')
9 ('a', 'a')

There is a test for the optimization -- it wasn't an accident. However, the optimization itself is a bug against the published specification in the docs and against general expectations.

        a, b = tee('abc')
        c, d = tee(a)
        self.assertTrue(a is c)

Linked PRs

serhiy-storchaka commented 1 month ago

Agree, this is a bug. But fixing it can break a user code. For example:

while True:
    it, peek = tee(it)
    if next(peek).isdigit():
        it, num = parsenum(it)
        continue
    it, peek = tee(it)
    if next(peek).isalpha():
        it, word = parseword(it)
        continue
    it, peek = tee(it)
    if next(peek) in '+-*/':
        op = next(it)
        continue

Without optimization, you will get a growing chain of the tee objects with linearly growing computation time and stack consumption of next(). This may be the original purpose of this optimization, not only performance or memory use.

rhettinger commented 1 month ago

Nested tee() calls self-flatten just like nested partial() calls. So no, there would not be a "growing chain of the tee objects".

y5c4l3 commented 1 month ago

Without optimization, you will get a growing chain of the tee objects with linearly growing computation time and stack consumption of next(). This may be the original purpose of this optimization, not only performance or memory use.

teeobject is copyable, the problem is that the tuple construction loop in tee_impl only starts from 1. The 0-th element is a copyable iterator acquired by GetIter, but teeobject is self iterator AND copyable, which causes the bug here.

y5c4l3 commented 1 month ago

@rhettinger It seems that the re-use of outer_tee isn't valid, since the doc said:

Once a tee() has been created, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed. Once a tee() has been created, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed.

serhiy-storchaka commented 1 month ago

This behavior was introduced in ad983e79d6f215235d205245c2599620e33cf719, and it was explicitly documented from the beginning, so it was considered as a feature or a fair price.

Nested tee() calls self-flatten just like nested partial() calls. So no, there would not be a "growing chain of the tee objects".

It depends on your expectations. Do you expect tee(x for x in it) and tee(it) produce the same result? There are two shorcuts in the tee constructor -- omitting the tee_fromiterable() call if the iterator is copyable and omitting copying the first item in a tuple. The first one allows to avoid the growing chain of the tee objects, but it goes against the published specification and the general expectations. It all does not matter if you do not use the original iterator.

BTW, your use of "outer" and "inner" for iterators is confusing. I would call them in the opposite way -- the nested one is inner.

picnixz commented 1 month ago

Automatic backports failed so someone should take care of them manually.

bedevere-app[bot] commented 3 weeks ago

GH-125081 is a backport of this pull request to the 3.13 branch.

bedevere-app[bot] commented 3 weeks ago

GH-125153 is a backport of this pull request to the 3.12 branch.

serhiy-storchaka commented 3 weeks ago

There was something wrong with the backports -- @bedevere-app usually reports on the original PR page, not on the issue page.

Indeed, the topic of the backport PR should look like "[3.13] gh-{issue-number}: {text} (GH-{pr-number})". These PRs have the issue number in wrong place and do not have the original PR number.