python / cpython

The Python programming language
https://www.python.org
Other
63.17k stars 30.25k forks source link

itertools.count.(peek, consume) #122586

Closed dg-pb closed 2 months ago

dg-pb commented 2 months ago

Feature or enhancement

Proposal:

count().peek() takes no arguments. count().consume(iterable) takes 1 Iterable argument and consumes it while incrementing the counter with each item.

import itertools
counter = itertools.count()
print(counter.peek())       # 0
counter.consume(range(10))
print(counter.peek())       # 10
print(next(counter))        # 10
print(next(counter))        # 11

Compared to my previous attempt (https://github.com/python/cpython/pull/120483):

  1. This implementation is simpler as it re-uses already existent correct counting
  2. It is thread-safe
  3. It adds to flexibility of count object and can be occasionally useful in synchronisation.

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

https://discuss.python.org/t/itertools-count-enhancements/57945

Linked PRs

terryjreedy commented 2 months ago

Your example defines but does not use itl, giving a NameError. It implies the intent, but does not give a complete specification, nor a real rationale. What would consume(-10) do?

The linked discussion had 3 respondents, none coredevs. 2 posted once, apparently negatively.

ZeroIntensity commented 2 months ago

Your example defines but does not use itl, giving a NameError. It implies the intent, but does not give a complete specification, nor a real rationale. What would consume(-10) do?

The linked discussion had 3 respondents, none coredevs. 2 posted once, apparently negatively.

I do agree that this needs more discussion, but I will say that it can be difficult at times to gauge whether something has been "approved" on discourse, since people have varying opinions.

dg-pb commented 2 months ago

Your example defines but does not use itl, giving a NameError. It implies the intent, but does not give a complete specification, nor a real rationale. What would consume(-10) do?

Thank you, corrected the example and updated specification.

The linked discussion had 3 respondents, none coredevs. 2 posted once, apparently negatively.

That is very correct. I did not have any proper replies to this.

dg-pb commented 2 months ago

I do agree that this needs more discussion, but I will say that it can be difficult at times to gauge whether something has been "approved" on discourse, since people have varying opinions.

Just to clarify. I do not think that this is approved.

I did PR as implementation is sometimes best proposal. Especially when it is small(not much time wasted if rejected) and and simple (answers all questions with minimal effort).

This is very simple functionality adding minimal complexity, but it has 1 issue. It is not consistent with other itertools objects in a sense that none of the other "functions" have methods.

Although this is not ideal, but benefits provided might be enough to justify exception here.

picnixz commented 2 months ago

I like the peek() since you cannot retrieve the value of the counter without advancing it (which is a bit annoying if you need to use it afterwards for instance). However, I don't like the consume since it can be replaced by a simple loop (maybe not as efficient, but I'd say it's about the rule of not having all 3-liners in the stdlib).

dg-pb commented 2 months ago

However, I don't like the consume since it can be replaced by a simple loop (maybe not as efficient, but I'd say it's about the rule of not having all 3-liners in the stdlib).

From iterator element counting POV, the whole point of this is performance. So I would say that the closer parallel to this is sum, which is also a simple loop but performance benefits justify it, as opposed to random 3-liner. Of course, it all depends on how useful the performance benefit is. I.e. How frequently it is used and what is the wider impact.

From personal experience, such counting is very often used in other iterator recipes and having efficient counting function can often increase the performance of those significantly. 20% - 70%.

One of many examples (I can provide many more, but would be more useful if others had any):

def takewhile_select(selectors, iterable):
    """Take while parallel elements in selectors are true
    Examples:
        >>> list(takewhile_select([1, 1, 0, 0], [0, 1, 2, 3]))
        [0, 1]
    """
    stop = ilen(takewhile(bool, selectors))
    return islice(iterable, stop)

a = pred = [1] * 100_000
%timeit consume(takewhile_select(pred, a))    # 6.9 ms (loop)
%timeit consume(takewhile_select(pred, a))    # 4.5 ms (more_itertools.ilen)
%timeit consume(takewhile_select(pred, a))    # 2.1 ms (cpython implementation)

Of course, I am also trying t sell this as an improvement to count which is thread safe and can be occasionally useful in synchronisation. However, it has been a while since I worked in that area and can only offer made up example (see discourse).

I can see how it can be realistically useful nevertheless. E.g. counter is used to count total tasks done and is used by many threads. Say if a certain event happens then it counts as 10 tasks done but it doesn't need to do them. Then counter.consume(range(10)) is a performant and convenient way to increment the counter.

picnixz commented 2 months ago

So I would say that the closer parallel to this is sum, which is also a simple loop but performance benefits justify it, as opposed to random 3-liner

Not really. sum also has optimized paths depending on the inputs (and also different results since it uses compensated sums). It's not just a 3-liner in C (unless you were talking about the idiom sum(1 for _ in a)).

E.g. counter is used to count total tasks done and is used by many threads. Say if a certain event happens then it counts as 10 tasks done but it doesn't need to do them. Then counter.consume(range(10)) is a performant and convenient way to increment the counter.

Well... yes but at that point it's for small iterables whereas all your examples are for large iterables (I also hit that kind of observation when improving fnmatch.filter where for small iterables, a loop is preferrable). So you can just do for _ in range(10): next(counter).

picnixz commented 2 months ago

(Hard to reply when you are editing your answers actually)

dg-pb commented 2 months ago

Not really. sum also has optimized paths depending on the inputs (and also different results since it uses compensated sums).

Compensated sums came much later than the sum itself. And optimised paths is a performance improvement in the same way how this offers performance improvement as well. It is a fair parallel:

sum(iterable)
# -
s = 0
fro el in iterable:
    s += el
# versus
ilen(iterable)
# -
s = 0
for _ in iterable:
    s += 1

Well... yes but at that point it's for small iterables whereas all your examples are for large iterables

counter.consume is more performant for increments of any size. Looping over count produces integers on every iteration, which can be avoided by counter.consume('.' * 10) or several other efficient sequence creation that will be faster than calling counter.__next__.

(Hard to reply when you are editing your answers actually)

I know. My bad.

dg-pb commented 2 months ago

I meant b'.' * 10. Good reference for some investigation into these is: https://discuss.python.org/t/repeat-compound-statement/58944

picnixz commented 2 months ago

Well.. I don't really have a strong opinion. The only thing I would find useful with consume is that it actually returns the number of consumed items. It's a bit like calling write() and getting the number of bytes that were written. Personally, I'm +1 on peek() because there is no way to get its value and not modify the state and +0 on consume (though I'd prefer another name...). But instead of a method, we could have a property (for peek, like value) but I don't know if it was already rejected (and if so, then it's a bit hard to support your proposal).

picnixz commented 2 months ago

(Not really stdlib since it's C only, but it's a standard feature I'd say)

dg-pb commented 2 months ago

But instead of a method, we could have a property (forpeek, likevalue`) but I don't know if it was already rejected (and if so, then it's a bit hard to support your proposal).

I am open to different API to it. Initial idea was value property, but went with a method as it needs to do a bit more than simply get the attribute (i.e. check mode and convert to python object in one case), and picked peek given it is primarily an iterator, and looking at the next value is a lookahead (e.g. more_itertools.peekable). But from users POV it could as well be a value property.

If this goes through, I think the owner of the code is in the best position to say what is most appropriate.

Thanks for review.

ZeroIntensity commented 2 months ago

(Not really stdlib since it's C only, but it's a standard feature I'd say)

Oh, TIL. Is the stdlib label supposed to only extend to Python code? What about the pure-C modules in the stdlib?

terryjreedy commented 2 months ago

https://github.com/python/cpython/issues/122586#issuecomment-2265070962

[This proposal] is not consistent with other itertools objects in a sense that none of the other "functions" have methods.

The problem is much deeper. A) Both functions are generic itertools functions that have nothing to do with count in particular. B) both have been previously rejected. Briefly, and I probably omit something important: B1) consume is a one-liner and rarely needed. B2) In general, peeking slows iteration, and is also rarely needed.

Iteration is an intentionally minimal generic access method that will be kept minimal. 'Generators' conceptually subclass 'iterators' to add more access, with 'send' and 'throw' methods.

If one wants more access for an indexable sequence, for instance, use index access instead. Peek anywhere and reset the index to any value as needed. For a virtual sequence like count, which has a hidden internal index, copy the equivalent generator code and modify as wanted.

Although this is not ideal, but benefits provided might be enough to justify exception here.

It would be a mess. If count.peek were added for your special need, others WILL request/demand/beg for .peek to be added elsewhere for their special needs. Perhaps set_iterator.peek, where set_iterator is the class of iter(set()).

Most proposers think their idea is great, but even when they are, people must be willing to accept that we must reject most anyway as not fitting into Python's overall design. That said, it might be sensible to propose that the itertools recipes section include, if not already and not already rejected, a Peeker class that makes any iterator peekable. Also look to see what is in the 3rd party moreitertools package.

dg-pb commented 2 months ago

A) Both functions are generic itertools functions that have nothing to do with count in particular.

I think only one of them is applicable to this. i.e. peek.

At a much higher cost it can be wrapped in another object more_itertools.peekable-like, but it would be a very big overkill for the intended use of this, where theoretically yes, it does peek at the next object, but practically, it just the returns the number of a current count.

B) both have been previously rejected.

I am not so certain about this.

Briefly, and I probably omit something important: B1) consume is a one-liner and rarely needed.

This is different to consume one-liner. It simply does a completely different thing. Yes, it does consume, but the point of it is that it increments the counter while doing it. Thus, it can be used both as element counting utility and also a method to efficiently increment counter without actually iterating over it.

The closest rejected idea was my proposal of ilen addition, but this is simpler and not adding new function to the namespace, while extends functionality of existing object. So from my POV it is significantly different to give it a try.

B2) In general, peeking slows iteration, and is also rarely needed.

Peek'ing is not meant here to be used on every iteration. But rather to retrieve the current count by the workload manager (after a worker was using counter) without incrementing the count (so that it can be passed on to the next worker unmodified).

This has been (sort of) rejected on discourse, but it was at a time when one was able to use __reduce__ to extract the value, so although the idea is the same, the environment has slightly changed.

ZeroIntensity commented 2 months ago

In light of the usual discourse stock answer: does this need changes in itertools directly, or is it trivial enough to implement yourself?

dg-pb commented 2 months ago

In light of the usual discourse stock answer: does this need changes in itertools directly, or is it trivial enough to implement yourself?

I do have my own efficient implementation already. I am not putting this effort so that I can have this for myself in 2 years. I genuinely think that efficient counting implementation is a good addition.

I appreciate that others might not share my POV.

rhettinger commented 2 months ago

Sorry, I'm going to decline this feature request for the reasons listed by @terryjreedy and because it is at odds with the design of the itertools module (mostly functional, side-effect free except for consuming the inputs, mostly easily modeled with simple generators, etc).

Consider using the functions in the more-itertools package. Also, I recommend not working so hard to avoid using the core Python language. In this case, c=0 and c += 1 meet a variety of counting use cases.