Closed milthorpe closed 9 months ago
Working on a PR for this. On one hand, having 0 values error out seems to follow a minority opinion that geometric means should only be computed on positive values. But as an example, geometric means are used to computes averages of financial returns. Having a single 0 return makes the entire return 0 and shpuld therefore return 0. Up for discussion. Seems that scipy implements a similar approach in properly dealing with 0 values
There clearly is a design choice, and could have been and could be made otherwise, but the current behavior is intentional and [documented]: (https://docs.python.org/3/library/statistics.html#statistics.geometric_mean) "Raises a StatisticsError if the input dataset is empty, if it contains a zero, or if it contains a negative value." A PR might be premature.
It seems to me that financial returns can be negative as well as 0, and hence not appropriate.
Given that the exception is documented, it can be caught or avoided (if not all(iterable): geomean = 0.0
) or the data filtered.
@rhettinger
That makes sense. On the other hand classical geometric models do not allow year-over-year returns to be less than 0. (0 in this case means full loss rather than no profit). The purely mathematical definition of a geometric mean seems to support that as well. There doesn’t seem to be an important benefit in supporting that case though apart from conforming with a definition.
On Thu, Nov 30, 2023 at 01:45 Terry Jan Reedy @.***> wrote:
There clearly is a design choice, and could have been and could be made otherwise, but the current behavior is intentional and [documented]: ( https://docs.python.org/3/library/statistics.html#statistics.geometric_mean https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.python.org_3_library_statistics.html-23statistics.geometric-5Fmean&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=xLvTLSyq28zoBOn75eKlRw&m=oGptnDw08PNiM10qrHhXbVUjhryrqYKCq8OEhQ8tWh5REgodDKuciQ2WxtYsWsu5&s=l1RnN8X6hxZswC0dX91SpYsZi8uf3QFdmOom08j6DQM&e=) "Raises a StatisticsError if the input dataset is empty, if it contains a zero, or if it contains a negative value." A PR might be premature.
It seems to me that financial returns can be negative as well as 0, and hence not appropriate.
Given that the exception is documented, it can be caught or avoided (if not all(iterable): geomean = 0.0) or the data filtered.
— Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_python_cpython_issues_112540-23issuecomment-2D1833197984&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=xLvTLSyq28zoBOn75eKlRw&m=oGptnDw08PNiM10qrHhXbVUjhryrqYKCq8OEhQ8tWh5REgodDKuciQ2WxtYsWsu5&s=2B5Nl8JhNHWvhTNibsbKABA_OVyZ-jWixHlEnd-QWhM&e=, or unsubscribe https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ALVS2RA5PTTPB3HOQQBDXOTYHATO7AVCNFSM6AAAAABAAGZJ6WVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMZTGE4TOOJYGQ&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=xLvTLSyq28zoBOn75eKlRw&m=oGptnDw08PNiM10qrHhXbVUjhryrqYKCq8OEhQ8tWh5REgodDKuciQ2WxtYsWsu5&s=ZqJdgUzAToEeqQ9_KY-a2fnw7NbVyrak0Thy7MXMkaQ&e= . You are receiving this because you commented.Message ID: @.***>
This definitely isn't a bug. It was the intended, documented, and tested behavior. The API supported common use cases, allowed for a simple and fast implementation, and corresponded to how some other popular tools behaved, MS Excel's geomean() function for example.
That said, we can entertain this as a feature request for Python 3.13.
On the plus side, it matches the OP's preferred definition of a geometric mean prod(data) ** (1 / len(data))
. And there is precedent with scipy.stats.gmean
. And it matches a general notion of a mean as being the single repeated input that keeps some measure constant (the sum in the case of the arithmetic mean and the product in the case of a geometric mean).
On the minus side, it doesn't fit well with the typical goal of being a "measure of central tendency" where a zero input corresponds to a rate of -100%
effectively nullifying the other inputs. In such a case, raising an error is likely more helpful than harmful. Some current users may be relying on the error being raised to detect erroneous inputs (such as entering 0.0
instead of 1.0
for 0%
growth in one period). There is also an important implementation obstacle — I don't see a way to implement zero-in/zero-out support without damaging the performance of all existing, working code. In general, we try to avoid making everyone pay for a feature than almost no one will benefit from.
Unless there is a compelling use case that is not being supported, I think we should leave this as-is. The MS Excel precedent teaches us most users will be fine with what we've offered.
Side note: In the case of the geometric mean, Wolfram Alpha provides a cautionary tale of sticking too closely to a formula without considering use cases: https://www.wolframalpha.com/input?i=geometric+mean+10%2C+14%2C+-5%2C+12
Thanks for the detailed reply, and sorry that I did not read the unit tests before filing this - as you say, it was clearly intended behavior, and I was hewing too closely to the mathematical definition.
My use case was a script that computes different means of benchmark performance measures across many benchmark platforms, where '0' is interpreted as 'did not work on this platform'. My immediate fix was to check for zeros in the input before calling geometric_mean
, which is an easy workaround and will probably be the solution for similar use cases. I'm having a hard time finding other real-world uses of the geo mean that would accept a (non-erroneous) zero input; financial returns (e.g. stock cancelled) and the UN Human Development Index are a couple, but again, it's probably simpler to check for zeros and work around these unusual cases.
Your argument that existing code may be relying on the error to detect erroneous input is compelling, so I would be happy to close this issue as 'won't fix'.
@mdickinson
Do you have any thoughts on whether geometric_mean()
with a zero input should return a zero output or raise an error? Scipy does the former. MS Excel does the latter. I can see an argument for either way but at instinctive level I'm drawn toward returning 0.0
instead of raising an error.
FWIW, an implementation turned out easier than I expected:
def geometric_mean(data):
"""Convert data to floats and compute the geometric mean.
Raises a StatisticsError if the input dataset is empty
of if it contains a negative value.
Returns zero if the product of inputs is zero.
No special efforts are made to achieve exact results.
(However, this may change in the future.)
>>> round(geometric_mean([54, 24, 36]), 9)
36.0
"""
n = 0
found_zero = False
def count_positive(iterable):
nonlocal n, found_zero
for n, x in enumerate(iterable, start=1):
if x > 0.0 or math.isnan(x):
yield x
elif x == 0.0:
found_zero = True
else:
raise StatisticsError('no negs', x)
total = fsum(map(log, count_positive(data)))
if not n:
raise StatisticsError('empty dataset')
if math.isnan(total):
return math.nan
if found_zero:
return 0.0
return exp(total / n)
Like before, an empty input or negative input always raises StatisticsError. The only new cases are 1) a Nan
beats a zero and 2) a zero beats an infinity.
Returning zero sounds reasonable to me in this case. My mental model for this is log -> arithmetic mean -> exp, but allowing log to take on the extreme values of -inf
and inf
for inputs of 0.0
and inf
.
With that model, and assuming IEEE 754-like rules for the mean computation we'd get:
... On the minus side, it doesn't fit well with the typical goal of being a "measure of central tendency" where a zero input corresponds to a rate of
-100%
effectively nullifying the other inputs. In such a case, raising an error is likely more helpful than harmful. Some current users may be relying on the error being raised to detect erroneous inputs (such as entering0.0
instead of1.0
for0%
growth in one period). There is also an important implementation obstacle — I don't see a way to implement zero-in/zero-out support without damaging the performance of all existing, working code. In general, we try to avoid making everyone pay for a feature than almost no one will benefit from.
@rhettinger With #112880 the performance of geometric_mean
has actually improved for small cases. The reason is that a slow path in fmean
is avoided. (in the new implementation we get the size for free, so we can use fsum
instead of fmean
geometric_mean 5: Mean +- std dev: [main] 1.46 us +- 0.04 us -> [pr] 1.30 us +- 0.05 us: 1.12x faster
geometric_mean 100: Mean +- std dev: [main] 8.67 us +- 0.08 us -> [pr] 12.1 us +- 0.0 us: 1.40x slower
Geometric mean: 1.12x slower
For larger cases the additional checks in count_positive
make the code indeed slower.
Bug report
Bug description:
The implementation of
statistics.geometric_mean
using logarithms requires that all input values must be positive. However, a real geometric mean is defined for all sets of non-negative real values. The geo mean of any set of numbers containing zero is itself zero.I believe
geometric_mean
should return 0 if any of the input values are zero. (It should continue to return aStatisticsError
if any of the input values are negative.)CPython versions tested on:
3.11
Operating systems tested on:
Linux
Linked PRs