Closed — ghost closed this issue 2 years ago.
See my comment in 2019.
Here is the implementation of `minmax` that I use. Because it is written in Python, not C, a single pass in Python is slower than two passes in C, so I only call the one-pass function for iterator arguments.

I have tried to match the existing API for `max` and `min`, except that I require two defaults for the empty-argument case.
```python
def minmax(*values, defaults=(), key=None):
    """Return the minimum and maximum value.

    minmax(iterable, *[, defaults=(obj, obj), key=func]) -> (minimum, maximum)
    minmax(arg1, arg2, *args, *[, key=func]) -> (minimum, maximum)

    *minmax* is similar to the built-ins *min* and *max*, but can
    work with a single pass over the values, allowing it to work with
    iterators as well as sequences and other iterables.

    With a single iterable argument, return the smallest and largest
    values in the iterable.

    >>> minmax([3, 2, 1, 6, 5, 4])
    (1, 6)

    If the iterable is empty, ValueError will be raised if *defaults*
    is missing or an empty tuple. Otherwise, *defaults* should be a
    tuple of exactly two items to be returned.

    >>> minmax([], defaults=(0, 9))
    (0, 9)

    With two or more arguments, return the smallest and largest
    argument.

    >>> minmax(4, 5, 6, 1, 2, 3)
    (1, 6)

    The optional keyword-only argument ``key`` specifies a key function:

    >>> minmax('aaa', 'bbbb', 'c', 'dd', 'eee', key=len)
    ('c', 'bbbb')
    """
    if len(values) == 0:
        raise TypeError('minmax expected at least one argument, but got zero')
    elif len(values) == 1:
        values = values[0]
    elif defaults != ():
        raise TypeError(
            "Cannot specify defaults with multiple positional arguments"
        )
    # Validate the defaults.
    if defaults != ():
        if not isinstance(defaults, tuple):
            raise TypeError('expected a tuple of defaults, but got %s'
                            % type(defaults).__name__)
        elif len(defaults) != 2:
            raise ValueError('expected 2 defaults but got %d'
                             % len(defaults))
    if iter(values) is values:
        # Iterator argument, so we swap to an algorithm that only makes one
        # pass over the values. This makes 3*ceil(N/2) comparisons for an
        # iterator of N items.
        return _minmax_iter(values, defaults=defaults, key=key)
    else:
        # For speed, we can use the builtin min() and max() functions.
        # The total number of comparisons here is 2N-2.
        if defaults == ():
            minimum = min(values, key=key)
            maximum = max(values, key=key)
        else:
            minimum = min(values, default=defaults[0], key=key)
            maximum = max(values, default=defaults[1], key=key)
        return (minimum, maximum)


def _minmax_iter(values, *, defaults=(), key=None):
    """Single pass version of minmax suitable for iterators.

    >>> _minmax_iter(iter([4, 3, 2, 1, 5, 9, 8, 7, 6]))
    (1, 9)

    For an iterator with N elements, the number of comparisons is
    3*ceil(N/2), which is approximately 25% fewer than used by two
    passes for the min and max.
    """
    assert type(defaults) is tuple
    assert defaults == () or len(defaults) == 2
    if key is None:
        it = ((value, value) for value in values)
    else:
        it = ((key(value), value) for value in values)
    try:
        keyed_min, minimum = next(it)
    except StopIteration:
        # Empty iterator.
        if defaults == ():
            raise ValueError('minmax argument is empty') from None
        else:
            return defaults
    keyed_max, maximum = keyed_min, minimum
    for a in it:
        b = next(it, a)
        if a[0] > b[0]:
            a, b = b, a
        if a[0] < keyed_min:
            keyed_min, minimum = a
        if b[0] > keyed_max:
            keyed_max, maximum = b
    return (minimum, maximum)
```
`itertools` would also be a good place for this if we didn't want it as a builtin.
> `itertools` would also be a good place for this
The itertools module is for functions that produce iterators, not ones that consume them. If anything, min/max would go in the statistics module to be used along with quantiles and other summary statistics.
That said, I don't think this should be in the standard library at all. I published a Python min/max recipe over a decade ago. It is mostly only a minor optimization (and cute trick), almost never a bottleneck in real code.
What might be of more interest is a single-pass statistics module function that computes many statistics in one go: `summary(data) -> count, total, mean, min, max, stdev, pstdev, ...`
However, it seems that the early design decision for the statistics module was to opt for separate functions, applied in separate passes, using only pure-Python implementations, with efficiency being only a secondary concern. The decision was that for anything more advanced or optimized, a user should opt for scipy/numpy or other more powerful tooling.
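As an illustration only (nothing like this exists in the statistics module; the name `summary` and its return shape are made up for this sketch), a pure-Python one-pass version could use Welford's online algorithm for the variance:

```python
import math

def summary(data):
    """One pass over *data*, returning
    (count, total, mean, min, max, stdev, pstdev).
    Hypothetical sketch, not a statistics-module function."""
    count = 0
    total = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    minimum = maximum = None
    for x in data:
        count += 1
        total += x
        if minimum is None or x < minimum:
            minimum = x
        if maximum is None or x > maximum:
            maximum = x
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count == 0:
        raise ValueError('summary() arg is an empty iterable')
    stdev = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    pstdev = math.sqrt(m2 / count)
    return (count, total, mean, minimum, maximum, stdev, pstdev)

print(summary(iter([2, 4, 4, 4, 5, 5, 7, 9])))
```

Because it works item by item, it accepts a one-shot iterator just as well as a list, which is exactly the case where separate passes fail.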
The main use case for a single pass minmax function would be iterators, since two separate passes over an iterator is not going to work :-)
But your point is taken. I withdraw the suggestion about itertools.
I don't think that a 25% speed up is a "minor optimization". But you are correct, it is unlikely to be an application bottleneck. And I don't think it is merely a cute trick, it is one of the algorithms given in "Introduction to Algorithms" by Cormen, Leiserson and Rivest. (The same Rivest of MD5 and RSA fame.)
> single-pass statistics module function that computed many statistics in a single pass
That's an interesting idea. CAS calculators have similar functionality (although I don't know if they implement it in a single pass).
As for performance, the advantage over two calls of `min()` and `max()` is trivial. For random data you will need to perform two comparisons for almost every item. It is not a strong argument.
For iterators it can be written as

```python
functools.reduce(lambda x, y: (min(x[0], y), max(x[1], y)), iterator, (next(iterator),)*2)
```

or

```python
functools.reduce(lambda x, y: ((y, x[1]) if y < x[0] else (x[0], y) if y > x[1] else x), iterator, (next(iterator),)*2)
```

or

```python
functools.reduce(lambda x, y: sorted([*x, y])[::2], [1, 2, 3, 4, 5], (3, 3))
```
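As a quick sanity check on the first variant (note the subtlety: the initializer expression `(next(iterator),)*2` is evaluated before `reduce` starts iterating, so it consumes the first element and seeds the running `(min, max)` pair with it):

```python
import functools

# Fold each new item into a running (min, max) pair.
iterator = iter([4, 3, 2, 1, 5, 9, 8, 7, 6])
result = functools.reduce(lambda x, y: (min(x[0], y), max(x[1], y)),
                          iterator, (next(iterator),) * 2)
print(result)  # (1, 9)
```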
> The main use case for a single pass minmax function would be iterators.
In reality, this is almost never needed. We do routinely consume iterators with min() or max(), but it is rare to need both and not also need some other information such as size.
> For random data you will need to perform two comparisons for almost every item. It is not a strong argument.
Yes, I was just about to post this. And if the extreme values occur early in the dataset, there is essentially no relevant reduction in work.
Everybody likes this cute recipe (that's why I posted it many years ago), but almost no one needs it (otherwise, I would have proposed it a decade ago).
I think we should pass on this one.
In Python a more direct approach seems to perform better:
_object = object()
def minmax_oda(iterable, *args, key = None, default = _object):
iterable = (iterable, *args) if args else iterable
iterator = iter(iterable)
try:
minitem = maxitem = next(iterator)
except StopIteration:
if default is _object:
raise ValueError('minmax() arg is an empty sequence')
return default
if key is None:
for x in iterator:
if x > maxitem:
maxitem = x
elif x < minitem:
minitem = x
else:
maxval = minval = key(minitem)
for x in iterator:
if (val := key(x)) > maxval:
maxval, maxitem = val, x
elif val < minval:
minval, minitem = val, x
return (minitem, maxitem)
> As for performance, the advantage over two calls of `min()` and `max()` is trivial. For random data you will need to perform two comparisons for almost every item. It is not a strong argument.

It is possible to optimize the suggested implementation. The dream scenario would be to almost halve the time it takes to call `min(x)` and `max(x)`. This is almost achieved for the key function case above.
> What might be of more interest is a single-pass statistics module function that computed many statistics in a single pass: `summary(data) -> count, total, mean, min, max, stdev, pstdev, ...`
This would be a great idea. Even a Python implementation that calculates summary statistics in one pass may prove useful. The statistics module documentation says "[i]t is aimed at the level of graphing and scientific calculators" and as Steven said, "CAS calculators have similar functionality".
Would the statistics module welcome any C implementations? The goal of the statistics module would not change: the aim would not be to compete with specialized libraries such as NumPy (in passing, NumPy also lacks a one-pass summary-statistics function, despite demand for one).
> I think we should pass on this one.
I am still very much for having a one-pass `minmax` somewhere in the standard library. Even if it's not a built-in, I believe its relative usage would be higher than that of many, many functions in the standard library. The same could be said for built-ins. For applications involving a lot of data, perhaps non-numeric, it can be 60 seconds of waiting versus 100 seconds, which I think is significant.

However, I do also understand your points of view.
I'm going to decline the suggestion, but @stevendaprano can reopen if he thinks it's important for the statistics module.
Since we already have `min()` and `max()`, this is mainly being sold as an optimization. For data of a size where `min()` takes 60 seconds, numpy would likely be a much better choice. It is likely that other costs will dominate (i.e. loading the data), and that some other processing such as size, sum, etc. will be needed.
Thank you for your consideration @rhettinger.
Thank you for the suggestion.
> For random data you will need to perform two comparisons for almost every item. It is not a strong argument.

> Yes, I was just about to post this. And if the extreme values occur early in the dataset, there is essentially no relevant reduction in work.
@serhiy-storchaka @rhettinger Neither of you were talking about the solution looking at element pairs, were you? That does only 1.5 comparisons per element, instead of 2. Regardless of randomness and when the extremes occur. "need to perform two" is incorrect.
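The 1.5-comparisons claim is easy to check empirically. In this sketch, `Counted` and `pairwise_minmax` are throwaway helpers written for illustration (not code from any of the patches in this thread): a wrapper counts every `<` comparison, and the pairwise algorithm is the CLRS one — order each pair internally first, then compare the smaller against the running min and the larger against the running max.

```python
import random

class Counted:
    """Wrapper that counts every < comparison made on its instances."""
    comparisons = 0
    def __init__(self, value):
        self.value = value
    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value

def pairwise_minmax(items):
    # One pass, two items at a time: exactly 3 comparisons per pair.
    it = iter(items)
    lo = hi = next(it)
    for a in it:
        b = next(it, a)   # if one item remains, pair it with itself
        if b < a:
            a, b = b, a
        if a < lo:
            lo = a
        if hi < b:
            hi = b
    return lo, hi

data = [Counted(v) for v in range(1000)]
random.shuffle(data)

Counted.comparisons = 0
lo, hi = pairwise_minmax(data)
one_pass = Counted.comparisons   # 3 * 500 = 1500, i.e. 1.5 per item

Counted.comparisons = 0
lo2, hi2 = min(data), max(data)
two_pass = Counted.comparisons   # 999 + 999 = 1998, i.e. ~2 per item

print(one_pass, two_pass)
```

The counts are independent of the data order, which is exactly the point: the pairwise scheme beats two passes even when the extremes appear early.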
numpy has a `minmax` for `ndarray` that has seen some use. `minmax` felt like a natural extension following the builtin `divmod`.

If two operators `x, y = a // b, a % b` can be enough to justify a function call, I don't see why the heavier two passes `mn, mx = min(it), max(it)` can't justify a single-pass version, especially when the iterator can only be consumed once.

Late to the party. Just saying.
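That last point is worth spelling out: two separate passes over an iterator are not merely slower, they are wrong, because the first pass exhausts the stream:

```python
it = iter([3, 1, 4, 1, 5])
mn = min(it)        # first pass consumes the whole iterator
try:
    mx = max(it)    # second pass sees an empty stream
except ValueError:  # "max() arg is an empty sequence"
    mx = None
print(mn, mx)  # 1 None
```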
### Feature or enhancement

A `minmax` function that computes and returns (as a 2-tuple) the minimum and maximum items of an iterable in one pass. It should have the same interface as `min` and `max`.

### Pitch

Many applications require computing both the minimum and maximum items of an iterable. Python code that finds the min and max in one pass tends to be slower than the two-pass algorithm of applying both `min` and `max`. An implementation of `minmax` can be written by making simple modifications to `min_max` in `bltinmodule.c`.

Here are some benchmarks with `x = range(10_000_000)` and `f = lambda x: -x`. The benefits of the one-pass approach are more pronounced when a key function is provided.

In this graph, `x = range(n)` where `n` ranges over `[2 ** k for k in range(27)]`. The `O(n)` complexity of all four approaches is evident. However, the one-pass algorithm is clearly more efficient.

### Previous discussion
### Possible Implementation

This is the implementation used for the above benchmarks. It was written using `min_max` from `bltinmodule.c` as a basis. It looks long primarily because there are two virtually identical `while` loops, depending on whether a key function was provided or not. It has the same interface as the `min` and `max` built-in functions.