Closed bartbroere closed 1 year ago
Treating iterators as sized by giving them a length hint has been discussed many times on Discuss and the Python-Ideas mailing list, and been rejected each time.
I expect that these objections will apply to length hints just as much as len()
itself. For brevity, I will just use "length" to mean "either __len__
or __length_hint__
.
The inherent mutability of iterators breaks the common expectation that mere iteration will not change the length of a a sequence. Consider an iterator it
that refers back to the sequence "abcd"
. How do we determine the length of the iterator?
Both options are reasonable. Both will be surprising under some circumstances.
Consider an example of where both options fail. Using length()
just as a short hand for either getting the length or the length hint:
it = iter("abcd")
n = length(it) # len() or operator.length_hint()
L = list(it)
assert length(it) == n # Succeeds under option 1, fails under option 2.
assert length(it) == length(L) # Fails under option 1, succeeds under option 2.
Both assertions are reasonable invariants but no matter which strategy we use for the length of an iterator, we're going to break one of the invariants and annoy people who expect us to use the other strategy.
The problem here is that the concept of a length (with or without the "hint" suffix) is not a good fit to iterators. Iterators have two distinct concepts:
Some of the itertools convert their input to tuples before doing their combinatory logic.
If that a guaranteed feature of the itertools, or merely a leaky abstraction and current implementation detail which is likely to change in the future?
This could be a nice creature comfort, because it's easy to create a Cartesian product with itertools that's just not feasible to run. Now it can be spotted a bit earlier.
Who is going to make that decision? Where is it going to be spotted? Here is an iterator which is clearly infeasible to run to completion, but we can easily grab the first few values:
>>> it = product(range(100000000), repeat=20)
>>> next(it)
(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
If you are talking about preventing list(it)
from running, that's fine for what it is worth. But if product
was to look at the length hint (1e160) and refuse to run, that would be bad.
To add onto the last point, the length hint is by nature something which is unreliable. You could have a length hint which is larger than the real length, making list(it)
cause an error despite being able to feasibly run.
As for optimizations, it has also been tried a bit before without good results. The big question here is how much performance is lost in general trying to optimize for a case which is very special.
It is also worth noting that in the examples you've given and the examples where it would make sense to have a length hint, you could probably just have a length hint value which you calculate alongside it manually, and that that would probably suffice for your needs. Consumers could then support a length hint argument and fall back to trying to get the length hint from the iterable if the user does not provide the length hint.
Thanks for the suggestion, but I am going to decline this.
To be correct, a __length_hint__
on an iterator would have to change its size as the iterator gets consumed. That would entail adding a new field to the object, increasing its size (which is undesirable). The new number of steps field would also need to updated on every iteration (reducing performance which is also undersirable).
Also to be correct, the computation would have to handle sizes larger than sys.maxsize
. That is no fun and that is why the corresponding computations in Modules/mathmodule.c
are large and complicated.
A __length_hint__
is mainly only useful when running an iterator to completion and bringing the entire result into memory (i.e. calling list
on the ). That use pattern is uncommon with the combinatoric iterators. People usually just loop over the tuples one at a time, taking full advantage of the iterator protocol — that is almost the entire point of the itertools module ;-)
- an iterator would have to change its size as the iterator gets consumed. That would entail adding a new field to the object, increasing its size (which is undesirable). The new number of steps field would also need to updated on every iteration (reducing performance which is also undersirable).
Huh? They're talking about product
& co, which already has and updates indices:
From that, the length hint could just be computed if/when requested, no?
- mainly only useful when running an iterator to completion and bringing the entire result into memory (i.e. calling
list
on the ). That use pattern is uncommon with the combinatoric iterators.
Take tqdm
, their motivating example. That's used like for p in tqdm(product(...)):
, indeed still looping over the tuples one at a time, not bringing the entire result into memory. It just shows a progress bar. And with a length hint, it could show an estimated percentage.
Thanks for all the replies! I'm glad I didn't put too much work into the PR yet. This comment is just a reply and not meant to re-open the issue.
Reading all the replies, it turns out there are objections to implementing __length_hint__
as well. Some of these were the same as for __len__
but I've also learned about some new objections. I'm going to use the two options of @stevendaprano to structure my reply:
- We can always return the length of the underlying sequence, i.e. 4 in this example. But that will violate the expectation that the length of the iterator will equal the length of the iterator turned into a list.
- Or we can do more work and have the iterator track the starting length (that of the underlying sequence) less the number of values already yielded.
In the PR I started implementing option 1. I have to admit that reading all of the comments above I would now prefer to have the length hint report the number of items left in the iterator (option 2). This prevents the presizing of lists that are too big as @SimpleArt pointed out was a problem with option 1. This has the downside that the iterator has to start keeping track of more state as @rhettinger explained.
Who is going to make that decision? Where is it going to be spotted? Here is an iterator which is clearly infeasible to run to completion, but we can easily grab the first few values:
>>> it = product(range(100000000), repeat=20) >>> next(it) (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
If you are talking about preventing list(it) from running, that's fine for what it is worth. But if product was to look at the length hint (1e160) and refuse to run, that would be bad.
There's one thing I need to clarify here, about who is going to make that decision: I had the end user in mind. They run a script, see a progress bar, and make the decision they have to change their script. I didn't have CPython in mind as the one making the decision or spotting that something might not be feasible.
Summarizing: the most important objection I think is that whatever number will be in the length hint, it will always leave some people or some code confused.
So for now I will stick with calculating sizes at runtime when I want to get an idea of the progress.
I might implement option 2 for fun though:
combinationsobject
and permutationsobject
etc., so it might be less than we think).itertools
module).The problem here is that the concept of a length (with or without the "hint" suffix) is not a good fit to iterators.
Aren't iterators the main target of length hints? PEP 424's abstract mentions them twice and doesn't mention any other kind of type:
Abstract CPython currently defines a
__length_hint__
method on several types, such as various iterators. This method is then used by various other functions (such as list) to presize lists based on the estimate returned by__length_hint__
. Types which are not sized, and thus should not define__len__
, can then define__length_hint__
, to allow estimating or computing a size (such as many iterators).
The stated objective, presizing based on the hint, also makes it clear that it's your option 2, it's about the iterator's "remaining" elements. A consumer like list
has nothing to do with the iterator's underlying source. It doesn't have access to that source. It has the iterator. When you do list(iterator)
, why would it want to presize based on the size of anything other than the number of elements that iterator
will produce?
That's also how the existing iterator length hints that I've seen behave, for example:
it = iter([1, 2, 3, 4])
for x in it:
print(x, it.__length_hint__())
Output:
1 3
2 2
3 1
4 0
Some of the itertools convert their input to tuples before doing their combinatory logic.
If that a guaranteed feature of the itertools, or merely a leaky abstraction and current implementation detail which is likely to change in the future?
It's documented for product
:
Before
product()
runs, it completely consumes the input iterables, keeping pools of values in memory to generate the products.
For the others, I'd say it's an implementation detail but it's rather unlikely to change.
Some of the
itertools
convert their input to tuples before doing their combinatory logic. This means that the size of the inputs is known at this time, and therefore the size of the itertool is "knowable". This issue proposes adding a length hint in these cases.Pitch
Having a length hint could improve progress reporting, as shown below with
tqdm
(pending PR https://github.com/tqdm/tqdm/pull/1426):This could be a nice creature comfort, because it's easy to create a Cartesian product with
itertools
that's just not feasible to run. Now it can be spotted a bit earlier.Having length hints could also help presizing lists, if they are created from an itertool. I have not tested this yet though.
Immediately after submitting this issue, I'll submit a work-in-progress pull request to demonstrate it. The PR will still have some TODOs, but I think it might help the discussion. If we reach an agreement on how to do this, of course I can finish the work.
Previous discussion
There's been some discussion previously about the length of
itertools
. Most of the proposals revolve around adding__len__
to someitertools
.Reading back these issues from 2015 and 2016, one objection stood out: Adding
__len__
would change truthyitertools
to falseyitertools
in cases where the itertool is empty.Adding the length as a hint hopefully is less objectionable. It shouldn't break any existing code.
Here's links to some of the previous discussion:
Feel free to respond with more issues if I've missed some important discussion.
This is PEP 424 that specs the
__length_hint__
: https://peps.python.org/pep-0424/I'm looking forward to receiving replies. Thanks for your time!
Linked PRs