python / cpython

The Python programming language
https://www.python.org
Other
63.15k stars 30.24k forks source link

Enhancement: Adding length *hints* to itertools that have a known size #101789

Closed bartbroere closed 1 year ago

bartbroere commented 1 year ago

Some of the itertools convert their input to tuples before doing their combinatory logic. This means that the size of the inputs is known at this time, and therefore the size of the itertool is "knowable". This issue proposes adding a length hint in these cases.

Pitch

Having a length hint could improve progress reporting, as shown below with tqdm (pending PR https://github.com/tqdm/tqdm/pull/1426):

>>> from itertools import product
>>> from more_itertools import consume
>>> from tqdm import tqdm
>>> from operator import length_hint
>>> from string import ascii_letters
>>> consume(tqdm(product(ascii_letters, ascii_letters)))
100%|███████████████████████████████████████| 2704/2704 [00:00<00:00, 7848718.35it/s]
>>> length_hint(product(ascii_letters, ascii_letters)) 
2704

This could be a nice creature comfort, because it's easy to create a Cartesian product with itertools that's just not feasible to run. Now it can be spotted a bit earlier.

Having length hints could also help presizing lists, if they are created from an itertool. I have not tested this yet though.

Immediately after submitting this issue, I'll submit a work-in-progress pull request to demonstrate it. The PR will still have some TODOs, but I think it might help the discussion. If we reach an agreement on how to do this, of course I can finish the work.

Previous discussion

There's been some discussion previously about the length of itertools. Most of the proposals revolve around adding __len__ to some itertools.

Reading back these issues from 2015 and 2016, one objection stood out: Adding __len__ would change truthy itertools to falsey itertools in cases where the itertool is empty.

Adding the length as a hint hopefully is less objectionable. It shouldn't break any existing code.

Here's links to some of the previous discussion:

Feel free to respond with more issues if I've missed some important discussion.

This is PEP 424 that specs the __length_hint__: https://peps.python.org/pep-0424/

I'm looking forward to receiving replies. Thanks for your time!

Linked PRs

stevendaprano commented 1 year ago

Treating iterators as sized by giving them a length hint has been discussed many times on Discuss and the Python-Ideas mailing list, and been rejected each time.

I expect that these objections will apply to length hints just as much as len() itself. For brevity, I will just use "length" to mean "either __len__ or __length_hint__.

The inherent mutability of iterators breaks the common expectation that mere iteration will not change the length of a a sequence. Consider an iterator it that refers back to the sequence "abcd". How do we determine the length of the iterator?

  1. We can always return the length of the underlying sequence, i.e. 4 in this example. But that will violate the expectation that the length of the iterator will equal the length of the iterator turned into a list.
  2. Or we can do more work and have the iterator track the starting length (that of the underlying sequence) less the number of values already yielded.

Both options are reasonable. Both will be surprising under some circumstances.

Consider an example of where both options fail. Using length() just as a short hand for either getting the length or the length hint:

it = iter("abcd")
n = length(it)  # len() or operator.length_hint()
L = list(it)
assert length(it) == n  # Succeeds under option 1, fails under option 2.
assert length(it) == length(L)  # Fails under option 1, succeeds under option 2.

Both assertions are reasonable invariants but no matter which strategy we use for the length of an iterator, we're going to break one of the invariants and annoy people who expect us to use the other strategy.

The problem here is that the concept of a length (with or without the "hint" suffix) is not a good fit to iterators. Iterators have two distinct concepts:

stevendaprano commented 1 year ago

Some of the itertools convert their input to tuples before doing their combinatory logic.

If that a guaranteed feature of the itertools, or merely a leaky abstraction and current implementation detail which is likely to change in the future?

This could be a nice creature comfort, because it's easy to create a Cartesian product with itertools that's just not feasible to run. Now it can be spotted a bit earlier.

Who is going to make that decision? Where is it going to be spotted? Here is an iterator which is clearly infeasible to run to completion, but we can easily grab the first few values:

>>> it = product(range(100000000), repeat=20)
>>> next(it)
(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

If you are talking about preventing list(it) from running, that's fine for what it is worth. But if product was to look at the length hint (1e160) and refuse to run, that would be bad.

SimpleArt commented 1 year ago

To add onto the last point, the length hint is by nature something which is unreliable. You could have a length hint which is larger than the real length, making list(it) cause an error despite being able to feasibly run.

As for optimizations, it has also been tried a bit before without good results. The big question here is how much performance is lost in general trying to optimize for a case which is very special.

It is also worth noting that in the examples you've given and the examples where it would make sense to have a length hint, you could probably just have a length hint value which you calculate alongside it manually, and that that would probably suffice for your needs. Consumers could then support a length hint argument and fall back to trying to get the length hint from the iterable if the user does not provide the length hint.

rhettinger commented 1 year ago

Thanks for the suggestion, but I am going to decline this.

pochmann commented 1 year ago
  • an iterator would have to change its size as the iterator gets consumed. That would entail adding a new field to the object, increasing its size (which is undesirable). The new number of steps field would also need to updated on every iteration (reducing performance which is also undersirable).

Huh? They're talking about product & co, which already has and updates indices:

https://github.com/python/cpython/blob/b652d40f1c88fcd8595cd401513f6b7f8e499471/Modules/itertoolsmodule.c#L2193-L2199

From that, the length hint could just be computed if/when requested, no?

  • mainly only useful when running an iterator to completion and bringing the entire result into memory (i.e. calling list on the ). That use pattern is uncommon with the combinatoric iterators.

Take tqdm, their motivating example. That's used like for p in tqdm(product(...)):, indeed still looping over the tuples one at a time, not bringing the entire result into memory. It just shows a progress bar. And with a length hint, it could show an estimated percentage.

bartbroere commented 1 year ago

Thanks for all the replies! I'm glad I didn't put too much work into the PR yet. This comment is just a reply and not meant to re-open the issue.

Reading all the replies, it turns out there are objections to implementing __length_hint__ as well. Some of these were the same as for __len__ but I've also learned about some new objections. I'm going to use the two options of @stevendaprano to structure my reply:

  1. We can always return the length of the underlying sequence, i.e. 4 in this example. But that will violate the expectation that the length of the iterator will equal the length of the iterator turned into a list.
  2. Or we can do more work and have the iterator track the starting length (that of the underlying sequence) less the number of values already yielded.

In the PR I started implementing option 1. I have to admit that reading all of the comments above I would now prefer to have the length hint report the number of items left in the iterator (option 2). This prevents the presizing of lists that are too big as @SimpleArt pointed out was a problem with option 1. This has the downside that the iterator has to start keeping track of more state as @rhettinger explained.

Who is going to make that decision? Where is it going to be spotted? Here is an iterator which is clearly infeasible to run to completion, but we can easily grab the first few values:

>>> it = product(range(100000000), repeat=20)
>>> next(it)
(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

If you are talking about preventing list(it) from running, that's fine for what it is worth. But if product was to look at the length hint (1e160) and refuse to run, that would be bad.

There's one thing I need to clarify here, about who is going to make that decision: I had the end user in mind. They run a script, see a progress bar, and make the decision they have to change their script. I didn't have CPython in mind as the one making the decision or spotting that something might not be feasible.

Summarizing: the most important objection I think is that whatever number will be in the length hint, it will always leave some people or some code confused.

So for now I will stick with calculating sizes at runtime when I want to get an idea of the progress.

I might implement option 2 for fun though:

pochmann commented 1 year ago

The problem here is that the concept of a length (with or without the "hint" suffix) is not a good fit to iterators.

Aren't iterators the main target of length hints? PEP 424's abstract mentions them twice and doesn't mention any other kind of type:

Abstract CPython currently defines a __length_hint__ method on several types, such as various iterators. This method is then used by various other functions (such as list) to presize lists based on the estimate returned by __length_hint__. Types which are not sized, and thus should not define __len__, can then define __length_hint__, to allow estimating or computing a size (such as many iterators).

The stated objective, presizing based on the hint, also makes it clear that it's your option 2, it's about the iterator's "remaining" elements. A consumer like list has nothing to do with the iterator's underlying source. It doesn't have access to that source. It has the iterator. When you do list(iterator), why would it want to presize based on the size of anything other than the number of elements that iterator will produce?

That's also how the existing iterator length hints that I've seen behave, for example:

it = iter([1, 2, 3, 4])
for x in it:
    print(x, it.__length_hint__())

Output:

1 3
2 2
3 1
4 0

Some of the itertools convert their input to tuples before doing their combinatory logic.

If that a guaranteed feature of the itertools, or merely a leaky abstraction and current implementation detail which is likely to change in the future?

It's documented for product:

Before product() runs, it completely consumes the input iterables, keeping pools of values in memory to generate the products.

For the others, I'd say it's an implementation detail but it's rather unlikely to change.