python / cpython

The Python programming language

https://www.python.org

Other

62.47k stars 29.99k forks source link

Add zero value support for statistics.geometric_mean() #112540

Closed milthorpe closed 9 months ago

milthorpe commented 9 months ago

Bug report

Bug description:

The implementation of statistics.geometric_mean using logarithms requires that all input values must be positive. However, a real geometric mean is defined for all sets of non-negative real values. The geo mean of any set of numbers containing zero is itself zero.

from statistics import geometric_mean
geometric_mean([1.0, 2.0, 0.0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/milthorpe/miniconda3/lib/python3.11/statistics.py", line 489, in geometric_mean
    raise StatisticsError('geometric mean requires a non-empty dataset '
statistics.StatisticsError: geometric mean requires a non-empty dataset containing positive numbers

I believe geometric_mean should return 0 if any of the input values are zero. (It should continue to return a StatisticsError if any of the input values are negative.)

CPython versions tested on:

3.11

Operating systems tested on:

Linux

Linked PRs

gh-112880

mattprodani commented 9 months ago

Working on a PR for this. On one hand, having 0 values error out seems to follow a minority opinion that geometric means should only be computed on positive values. But as an example, geometric means are used to computes averages of financial returns. Having a single 0 return makes the entire return 0 and shpuld therefore return 0. Up for discussion. Seems that scipy implements a similar approach in properly dealing with 0 values

terryjreedy commented 9 months ago

There clearly is a design choice, and could have been and could be made otherwise, but the current behavior is intentional and [documented]: (https://docs.python.org/3/library/statistics.html#statistics.geometric_mean) "Raises a StatisticsError if the input dataset is empty, if it contains a zero, or if it contains a negative value." A PR might be premature.

It seems to me that financial returns can be negative as well as 0, and hence not appropriate.

Given that the exception is documented, it can be caught or avoided (if not all(iterable): geomean = 0.0) or the data filtered.

@rhettinger

mattprodani commented 9 months ago

That makes sense. On the other hand classical geometric models do not allow year-over-year returns to be less than 0. (0 in this case means full loss rather than no profit). The purely mathematical definition of a geometric mean seems to support that as well. There doesn’t seem to be an important benefit in supporting that case though apart from conforming with a definition.

On Thu, Nov 30, 2023 at 01:45 Terry Jan Reedy @.***> wrote:

There clearly is a design choice, and could have been and could be made otherwise, but the current behavior is intentional and [documented]: ( https://docs.python.org/3/library/statistics.html#statistics.geometric_mean https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.python.org_3_library_statistics.html-23statistics.geometric-5Fmean&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=xLvTLSyq28zoBOn75eKlRw&m=oGptnDw08PNiM10qrHhXbVUjhryrqYKCq8OEhQ8tWh5REgodDKuciQ2WxtYsWsu5&s=l1RnN8X6hxZswC0dX91SpYsZi8uf3QFdmOom08j6DQM&e=) "Raises a StatisticsError if the input dataset is empty, if it contains a zero, or if it contains a negative value." A PR might be premature.

It seems to me that financial returns can be negative as well as 0, and hence not appropriate.

Given that the exception is documented, it can be caught or avoided (if not all(iterable): geomean = 0.0) or the data filtered.

@rhettinger https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_rhettinger&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=xLvTLSyq28zoBOn75eKlRw&m=oGptnDw08PNiM10qrHhXbVUjhryrqYKCq8OEhQ8tWh5REgodDKuciQ2WxtYsWsu5&s=liWE7Y-TfKCq1cafbFxOsZyrzG8oKfY9HVQeX0J0yqQ&e=

— Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_python_cpython_issues_112540-23issuecomment-2D1833197984&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=xLvTLSyq28zoBOn75eKlRw&m=oGptnDw08PNiM10qrHhXbVUjhryrqYKCq8OEhQ8tWh5REgodDKuciQ2WxtYsWsu5&s=2B5Nl8JhNHWvhTNibsbKABA_OVyZ-jWixHlEnd-QWhM&e=, or unsubscribe https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ALVS2RA5PTTPB3HOQQBDXOTYHATO7AVCNFSM6AAAAABAAGZJ6WVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMZTGE4TOOJYGQ&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=xLvTLSyq28zoBOn75eKlRw&m=oGptnDw08PNiM10qrHhXbVUjhryrqYKCq8OEhQ8tWh5REgodDKuciQ2WxtYsWsu5&s=ZqJdgUzAToEeqQ9_KY-a2fnw7NbVyrak0Thy7MXMkaQ&e= . You are receiving this because you commented.Message ID: @.***>

rhettinger commented 9 months ago

This definitely isn't a bug. It was the intended, documented, and tested behavior. The API supported common use cases, allowed for a simple and fast implementation, and corresponded to how some other popular tools behaved, MS Excel's geomean() function for example.

That said, we can entertain this as a feature request for Python 3.13.

On the plus side, it matches the OP's preferred definition of a geometric mean prod(data) ** (1 / len(data)). And there is precedent with scipy.stats.gmean. And it matches a general notion of a mean as being the single repeated input that keeps some measure constant (the sum in the case of the arithmetic mean and the product in the case of a geometric mean).

On the minus side, it doesn't fit well with the typical goal of being a "measure of central tendency" where a zero input corresponds to a rate of -100% effectively nullifying the other inputs. In such a case, raising an error is likely more helpful than harmful. Some current users may be relying on the error being raised to detect erroneous inputs (such as entering 0.0 instead of 1.0 for 0% growth in one period). There is also an important implementation obstacle — I don't see a way to implement zero-in/zero-out support without damaging the performance of all existing, working code. In general, we try to avoid making everyone pay for a feature than almost no one will benefit from.

Unless there is a compelling use case that is not being supported, I think we should leave this as-is. The MS Excel precedent teaches us most users will be fine with what we've offered.

Side note: In the case of the geometric mean, Wolfram Alpha provides a cautionary tale of sticking too closely to a formula without considering use cases: https://www.wolframalpha.com/input?i=geometric+mean+10%2C+14%2C+-5%2C+12

milthorpe commented 9 months ago

Thanks for the detailed reply, and sorry that I did not read the unit tests before filing this - as you say, it was clearly intended behavior, and I was hewing too closely to the mathematical definition. My use case was a script that computes different means of benchmark performance measures across many benchmark platforms, where '0' is interpreted as 'did not work on this platform'. My immediate fix was to check for zeros in the input before calling geometric_mean, which is an easy workaround and will probably be the solution for similar use cases. I'm having a hard time finding other real-world uses of the geo mean that would accept a (non-erroneous) zero input; financial returns (e.g. stock cancelled) and the UN Human Development Index are a couple, but again, it's probably simpler to check for zeros and work around these unusual cases. Your argument that existing code may be relying on the error to detect erroneous input is compelling, so I would be happy to close this issue as 'won't fix'.

rhettinger commented 9 months ago

@mdickinson Do you have any thoughts on whether geometric_mean() with a zero input should return a zero output or raise an error? Scipy does the former. MS Excel does the latter. I can see an argument for either way but at instinctive level I'm drawn toward returning 0.0 instead of raising an error.

FWIW, an implementation turned out easier than I expected:

def geometric_mean(data):
    """Convert data to floats and compute the geometric mean.

    Raises a StatisticsError if the input dataset is empty
    of if it contains a negative value.

    Returns zero if the product of inputs is zero.

    No special efforts are made to achieve exact results.
    (However, this may change in the future.)

    >>> round(geometric_mean([54, 24, 36]), 9)
    36.0
    """
    n = 0
    found_zero = False
    def count_positive(iterable):
        nonlocal n, found_zero
        for n, x in enumerate(iterable, start=1):
            if x > 0.0 or math.isnan(x):
                yield x
            elif x == 0.0:
                found_zero = True
            else:
                raise StatisticsError('no negs', x)
    total = fsum(map(log, count_positive(data)))
    if not n:
        raise StatisticsError('empty dataset')
    if math.isnan(total):
        return math.nan
    if found_zero:
        return 0.0
    return exp(total / n)

Like before, an empty input or negative input always raises StatisticsError. The only new cases are 1) a Nan beats a zero and 2) a zero beats an infinity.

mdickinson commented 9 months ago

Returning zero sounds reasonable to me in this case. My mental model for this is log -> arithmetic mean -> exp, but allowing log to take on the extreme values of -inf and inf for inputs of 0.0 and inf.

With that model, and assuming IEEE 754-like rules for the mean computation we'd get:

if the input sequence contains any negative values (including negative infinity, but not including negative zero), then raise
if the input sequence contains zero but not positive infinity, the result should be zero
if the input sequence contains positive infinity but not zero, the result should be positive infinity
if the input sequence contains both positive infinity and zero, the result should be an error (for the usual reasons that multiplying zero by infinity is undefined)

eendebakpt commented 9 months ago

... On the minus side, it doesn't fit well with the typical goal of being a "measure of central tendency" where a zero input corresponds to a rate of -100% effectively nullifying the other inputs. In such a case, raising an error is likely more helpful than harmful. Some current users may be relying on the error being raised to detect erroneous inputs (such as entering 0.0 instead of 1.0 for 0% growth in one period). There is also an important implementation obstacle — I don't see a way to implement zero-in/zero-out support without damaging the performance of all existing, working code. In general, we try to avoid making everyone pay for a feature than almost no one will benefit from.

@rhettinger With #112880 the performance of geometric_mean has actually improved for small cases. The reason is that a slow path in fmean is avoided. (in the new implementation we get the size for free, so we can use fsum instead of fmean

geometric_mean 5: Mean +- std dev: [main] 1.46 us +- 0.04 us -> [pr] 1.30 us +- 0.05 us: 1.12x faster
geometric_mean 100: Mean +- std dev: [main] 8.67 us +- 0.08 us -> [pr] 12.1 us +- 0.0 us: 1.40x slower

Geometric mean: 1.12x slower

For larger cases the additional checks in count_positive make the code indeed slower.

Benchmark

``` import pyperf runner = pyperf.Runner() setup=""" from statistics import geometric_mean x=[1,2,3,4,5] y=list(range(1,101)) """ runner.timeit(name=f"geometric_mean 5", stmt=f"geometric_mean(x)", setup=setup) runner.timeit(name=f"geometric_mean 100", stmt=f"geometric_mean(y)", setup=setup) ```