python / cpython

The Python programming language
https://www.python.org
Other
63.11k stars 30.22k forks source link

minmax function: (minitem, maxitem) in one pass #92720

Closed ghost closed 2 years ago

ghost commented 2 years ago

Feature or enhancement

A minmax function that computes and returns (as a 2-tuple) the minimum and maximum items of an iterable in one pass. It should have the same interface as min and max.

Pitch

Many applications require computing both the minimum and maximum items of an iterable. Python code that finds the min and max in one pass tends to be slower than the two-pass algorithm of applying both min and max. An implementation of minmax can be written by making simple modifications to min_max in bltinmodule.c.

Here are some benchmarks with x = range(10_000_000) and f = lambda x: -x. The benefits of the one-pass approach are more pronounced when a key function is provided.

# min(x), max(x)
295 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# minmax(x) 
211 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# min(x, key = f), max(x, key = f)
1.22 s ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# minmax(x, key = f)
725 ms ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In this graph, x = range(n) where n ranges in [2 ** k for k in range(27)]. The O(n) complexity of all four approaches is evident. However, a one-pass algorithm is obviously more efficient. bench

Previous discussion

Possible Implementation This is the implementation used for the above benchmarks. It was written using min_max from bltinmodule.c as a basis. It looks long primarily because there are two virtually identical while loops depending on whether a key function was provided or not. It has the same interface as the min and max built-in functions.

static PyObject *
myutils_minmax(PyObject *self, PyObject *args, PyObject *kwds)
{
    const char *name = "minmax";
    PyObject *minitem, *maxitem, *item, *result, *v, *iterator;
    const int positional = PyTuple_Size(args) > 1;

    PyObject *emptytuple, *defaultval = NULL, *keyfunc = NULL;
    static char *kwlist[] = {"key", "default", NULL};
    int ret;

    if (positional) {
        v = args;
    }
    else if (!PyArg_UnpackTuple(args, name, 1, 1, &v)) {
        if (PyExceptionClass_Check(PyExc_TypeError)) {
            PyErr_Format(PyExc_TypeError, "%s expected at least 1 argument, "
                                          "got 0", name);
        }
        return NULL;
    }

    emptytuple = PyTuple_New(0);                       /* own ref (new) */
    if (emptytuple == NULL) {
        return NULL;
    }
    ret = PyArg_ParseTupleAndKeywords(emptytuple, kwds, "|$OO:minmax",
                                      kwlist, &keyfunc, &defaultval);
    Py_DECREF(emptytuple);                             /* discard ref */
    if (!ret) {
        return NULL;
    }

    if (positional && defaultval != NULL) {
        PyErr_Format(PyExc_TypeError,
            "Cannot specify a default for %s() with multiple "
            "positional arguments", name);
        return NULL;
    }

    iterator = PyObject_GetIter(v);                    /* own ref (new) */
    if (iterator == NULL) {
        return NULL;
    }
                                                       /* own ref (new) */
    if (( minitem = maxitem = PyIter_Next(iterator) ) == NULL) {
        if (defaultval != NULL) {
            Py_INCREF(defaultval); /* own it */
            Py_DECREF(iterator);
            return defaultval;
        }
        else {
            PyErr_Format(PyExc_ValueError,
                         "%s() arg is an empty sequence", name);
            Py_DECREF(iterator);
            return NULL;
        }

    }
    Py_INCREF(maxitem);

    /* two paths */
    /* (1) keyfunc is NULL or Py_None. */
    if (keyfunc == NULL || keyfunc == Py_None) {
        int cmp_max, cmp_min;
        while (( item = PyIter_Next(iterator) )) {     /* own new ref */

            cmp_max = PyObject_RichCompareBool(item, maxitem, Py_GT);
            cmp_min = PyObject_RichCompareBool(item, minitem, Py_LT);

            if (cmp_max < 0 || cmp_min < 0) {          /* error */
                goto Fail_it_item;
            }
            else if (cmp_max > 0) {                    /* item > maxitem */
                Py_DECREF(maxitem);
                maxitem = item;
            }
            else if (cmp_min > 0) {                    /* item < minitem */
                Py_DECREF(minitem);
                minitem = item;
            }
            /* neither maxitem nor minitem were assigned item */
            else {
                Py_DECREF(item);
            }
        }
        /* error indicator set? */
        if (PyErr_Occurred()) {
            goto Fail_it;
        }
        result = PyTuple_Pack(2, minitem, maxitem);    /* own ref (new) */
        Py_DECREF(maxitem);
        Py_DECREF(minitem);
        Py_DECREF(iterator);

        /* transfer ownership of reference */
        return result;

        Fail_it_item:
            Py_DECREF(item);
        Fail_it:
            Py_XDECREF(minitem);
            Py_XDECREF(maxitem);
            Py_DECREF(iterator);
            return NULL;
    }
    /* (2) keyfunc is not NULL nor Py_None. */
    else {
        int cmp_max, cmp_min;
        PyObject *minval, *maxval, *val;
                                                        /* own ref */
        minval = maxval = PyObject_CallOneArg(keyfunc, minitem);
        if (minval == NULL) {
            Py_DECREF(minitem);
            Py_DECREF(maxitem);
            Py_DECREF(iterator);
            return NULL;
        }
        Py_INCREF(maxval);

        while (( item = PyIter_Next(iterator) )) {      /* own ref (new) */
            val = PyObject_CallOneArg(keyfunc, item);   /* own ref */
            if (val == NULL) {
                goto Fail_it_item_2;
            }

            cmp_max = PyObject_RichCompareBool(val, maxval, Py_GT);
            cmp_min = PyObject_RichCompareBool(val, minval, Py_LT);

            if (cmp_max < 0 || cmp_min < 0) {           /* error */
                goto Fail_it_item_and_val_2;
            }
            else if (cmp_max > 0) {                     /* val > maxval */
                Py_DECREF(maxitem);
                Py_DECREF(maxval);
                maxitem = item;
                maxval = val;
            }
            else if (cmp_min > 0) {                     /* val < minval */
                Py_DECREF(minitem);
                Py_DECREF(minval);
                minitem = item;
                minval = val;
            }
            /* neither maxitem nor minitem were assigned item */
            else {
                Py_DECREF(item);
                Py_DECREF(val);
            }
        }
        /* error indicator set? */
        if (PyErr_Occurred()) {
            goto Fail_it_2;
        }

        result = PyTuple_Pack(2, minitem, maxitem);     /* own new ref */
        Py_DECREF(maxitem);
        Py_DECREF(minitem);
        Py_DECREF(maxval);
        Py_DECREF(minval);
        Py_DECREF(iterator);

        /* transfer ownership of reference */
        return result;

        Fail_it_item_and_val_2:
            Py_DECREF(val);
        Fail_it_item_2:
            Py_DECREF(item);
        Fail_it_2:
            Py_XDECREF(minitem);
            Py_XDECREF(maxitem);
            Py_XDECREF(minval);
            Py_XDECREF(maxval);
            Py_DECREF(iterator);
            return NULL;
    }
}

PyDoc_STRVAR(minmax_doc,
"minmax(iterable, *[, default=obj, key=func]) -> tuple\n\
minmax(arg1, arg2, *args, *[, key=func]) -> tuple\n\
\n\
With a single iterable argument, return its smallest and \n\
largest item. The default keyword-only argument specifies \n\
an object to return if the provided iterable is empty. This\n\
may not be a tuple.\n\
With two or more arguments, return the smallest and largest\n\
arguments.");
stevendaprano commented 2 years ago

See my comment in 2019.

stevendaprano commented 2 years ago

Here is the implementation of minmax that I use. Because it is written in Python, not C, it is slower to make a single pass in Python than two passes in C, so I only call the one-pass function for iterator arguments.

I have tried to match the existing API for max and min, except that I require two defaults for the empty argument case.

def minmax(*values, defaults=(), key=None):
    """Return the minimum and maximum value.

    minmax(iterable, *[, defaults=(obj, obj), key=func]) -> (minimum, maximum)
    minmax(arg1, arg2, *args, *[, key=func]) -> (minimum, maximum)

    *minmax* is similar to the built-ins *min* and *max*, but can
    work with a single pass over the values, allowing it to work with
    iterators as well as sequences and other iterables.

    With a single iterable argument, return the smallest and largest
    values in the iterable.

    >>> minmax([3, 2, 1, 6, 5, 4])
    (1, 6)

    If the iterable is empty, ValueError will be raised if *defaults*
    is missing or an empty tuple. Otherwise, *defaults* should be a
    tuple of exactly two items to be returned.

    >>> minmax([], defaults=(0, 9))
    (0, 9)

    With two or more arguments, return the smallest and largest
    argument.

    >>> minmax(4, 5, 6, 1, 2, 3)
    (1, 6)

    The optional keyword-only argument ``key`` specifies a key function:

    >>> minmax('aaa', 'bbbb', 'c', 'dd', 'eee', key=len)
    ('c', 'bbbb')

    """
    if len(values) == 0:
        raise TypeError('minmax expected at least one argument, but got zero')
    elif len(values) == 1:
        values = values[0]
    elif defaults != ():
        raise TypeError(
            "Cannot specify defaults with multiple positional arguments"
            )

    # Validate the defaults.
    if defaults != ():
        if not isinstance(defaults, tuple):
            raise TypeError('expected a tuple of defaults, but got %s' 
                            % type(defaults).__name__)
        elif len(defaults) != 2:
            raise ValueError('expected 2 defaults but got %d'
                             % len(defaults))

    if iter(values) is values:
        # Iterator argument, so we swap to an algorithm that only makes one
        # pass over the values. This makes 3*ceil(N/2) comparisons for an
        # iterator of N items.
        return _minmax_iter(values, defaults=defaults, key=key)
    else:
        # For speed, we can use the builtin min() and max() functions.
        # The total number of comparisons here is 2N-2.
        if defaults == ():
            minimum = min(values, key=key)
            maximum = max(values, key=key)
        else:
            minimum = min(values, default=defaults[0], key=key)
            maximum = max(values, default=defaults[1], key=key)
        return (minimum, maximum)

def _minmax_iter(values, *, defaults=(), key=None):
    """Single pass version of minmax suitable for iterators.

    >>> _minmax_iter(iter([4, 3, 2, 1, 5, 9, 8, 7, 6]))
    (1, 9)

    For an iterator with N elements, the number of comparisons is
    3*ceil(N/2), which is approximately 50% fewer than used by two
    passes for the min and max.
    """
    assert type(defaults) is tuple
    assert defaults == () or len(defaults) == 2
    if key is None:
        it = ((value, value) for value in values)
    else:
        it = ((key(value), value) for value in values)
    try:
        keyed_min, minimum = next(it)
    except StopIteration:
        # Empty iterator.
        if defaults == ():
            raise ValueError('minmax argument is empty') from None
        else:
            return defaults
    keyed_max, maximum = keyed_min, minimum
    for a in it:
        b = next(it, a)
        if a[0] > b[0]:
            a, b = b, a
        if a[0] < keyed_min:
            keyed_min, minimum = a
        if b[0] > keyed_max:
            keyed_max, maximum = b
    return (minimum, maximum)
stevendaprano commented 2 years ago

itertools would also be a good place for this is we didn't want it as a builtin.

rhettinger commented 2 years ago

itertools would also be a good place for this

The itertools module is for functions that produce iterators, not ones that consume them. If anything, min/max would go in the statistics module to be used along with quantiles and other summary statistics.

That said, I don't think this should be in the standard library at all. I published a Python min/max recipe over a decade ago. It is mostly only a minor optimization (and cute trick), almost never a bottleneck in real code.

What night be of more interest is a single-pass statistics module function that computed many statistics in a single pass: summary(data) -> count, total, mean, min, max, stdev, pstdev, ...

However, it seems that the early design decisions for the statistics module was to opt for separate functions, to be applied in separate passes, using only pure python implementations, with efficiency being only of secondary concern. The decision was that for anything more advanced or optimized, a user should opt for scipy/numpy or other more powerful tooling.

stevendaprano commented 2 years ago

The main use case for a single pass minmax function would be iterators, since two separate passes over an iterator is not going to work :-)

But your point is taken. I withdraw the suggestion about itertools.

I don't think that a 25% speed up is a "minor optimization". But you are correct, it is unlikely to be an application bottleneck. And I don't think it is merely a cute trick, it is one of the algorithms given in "Introduction to Algorithms" by Cormen, Leiserson and Rivest. (The same Rivest of MD5 and RSA fame.)

single-pass statistics module function that computed many statistics in a single pass

That's an interesting idea. CAS calculators have similar functionality (although I don't know if they implement it in a single pass).

serhiy-storchaka commented 2 years ago

As for performance, the advantage over two calls of min() and max() is trivial. For random data you will need to perform two comparisons for almost every item. It is not a strong argument.

For iterators it can be written as

functools.reduce(lambda x, y: (min(x[0], y), max(x[1], y)), iterator, (next(iterator),)*2)

or

functools.reduce(lambda x, y: ((y, x[1]) if y < x[0] else (x[0], y) if y > x[1] else x), iterator, (next(iterator),)*2)

or

functools.reduce(lambda x, y: sorted([*x, y])[::2], [1, 2, 3, 4, 5], (3, 3))
rhettinger commented 2 years ago

The main use case for a single pass minmax function would be iterators.

In reality, this is almost never needed. We do routinely consume iterators with min() or max(), but it is rare to need both and not also need some other information such as size.

For random data you will need to perform two comparisons for almost every item. It is not a strong argument.

Yes, I was just about to post this. And if the extreme values occur early in the dataset, there is essentially no relevant reduction in work.

Everybody likes this cute recipe (that's why I posted it many years ago), but almost no one needs it (otherwise, I would have proposed it a decade ago).

I think we should pass on this one.

ghost commented 2 years ago

In Python a more direct approach seems to perform better:

_object = object()
def minmax_oda(iterable, *args, key = None, default = _object):
    iterable = (iterable, *args) if args else iterable

    iterator = iter(iterable)
    try:
        minitem = maxitem = next(iterator)
    except StopIteration:
        if default is _object:
            raise ValueError('minmax() arg is an empty sequence')
        return default

    if key is None:
        for x in iterator:
            if x > maxitem:
                maxitem = x
            elif x < minitem:
                minitem = x
    else:
        maxval = minval = key(minitem)
        for x in iterator:
            if (val := key(x)) > maxval:
                maxval, maxitem = val, x
            elif val < minval:
                minval, minitem = val, x

    return (minitem, maxitem)

bench

ghost commented 2 years ago

As for performance, the advantage over two calls of min() and max() is trivial. For random data you will need to perform two comparisons for almost every item. It is not a strong argument.

It is possible to optimize the suggested implementation. The dream scenario would be to almost halve the time it takes to call min(x) and max(x). This is almost achieved for the key function case above.

What night be of more interest is a single-pass statistics module function that computed many statistics in a single pass: summary(data) -> count, total, mean, min, max, stdev, pstdev, ...

This would be a great idea. Even a Python implementation that calculates summary statistics in one pass may prove useful. The statistics module documentation says "[i]t is aimed at the level of graphing and scientific calculators" and as Steven said, "CAS calculators have similar functionality".

Would the statistics module welcome any C implementations? The goal of the statistics module would not change: the aim would not be to compete with specialized libraries such as NumPy (in passing, NumPy also lacks supply for the demand for one-pass statistics).

I think we should pass on this one.

I am still very much for having a one-pass minmax somewhere in the standard library. If not a built-in I believe it's relative usage is higher than many, many functions in the standard library. The same could be said for built-ins. For applications involving a lot of data, perhaps non-numeric, it can be 60 seconds of waiting versus 100 seconds, which I think is significant.

However, I do also understand your points of views.

rhettinger commented 2 years ago

I'm going to decline the suggestion, but @stevendaprano can reopen if he thinks it's important for the statistics module.

Since we already have min() and max(), this is mainly being sold as a optimization. For data of a size when min() takes 60 seconds, numpy would likely be a much better choice. It is likely that other costs will dominate (i.e. loading the data), and it is likely that some other processing such as size, sum, etc will be needed.

ghost commented 2 years ago

Thank you for your consideration @rhettinger.

rhettinger commented 2 years ago

Thank you for the suggestion.

pochmann commented 2 years ago

For random data you will need to perform two comparisons for almost every item. It is not a strong argument.

Yes, I was just about to post this. And if the extreme values occur early in the dataset, there is essentially no relevant reduction in work.

@serhiy-storchaka @rhettinger Neither of you were talking about the solution looking at element pairs, were you? That does only 1.5 comparisons per element, instead of 2. Regardless of randomness and when the extremes occur. "need to perform two" is incorrect.

kaltu commented 1 week ago

numpy has a minmax for ndarray that has seen some use. minmax felt like a natural extension following builtin divmod. if two operators x, y = a / b, a % b can be enough to justify a function call, don't see why heavier two passes mn, mx = min(it), max(it) can't justify a single pass version, especially when the iterator can only be consumed once. late to the party. just saying.