thouis / numpy-trac-migration

numpy Trac to github issues migration

Wrong result when calculating the mean masking nan values on a 64 bit system (Trac #2170) #5960

Open numpy-gitbot opened 11 years ago

numpy-gitbot commented 11 years ago

Original ticket http://projects.scipy.org/numpy/ticket/2170 on 2012-06-19 by trac user knopfra, assigned to unknown.

The array provided as a pickle file at http://dl.dropbox.com/u/30592272/data.pkl.gz can be used to reproduce the following issue (64 bit Linux system, Python 2.7.3, numpy 1.6.1). The array has 15606478 elements.

In [1]: import numpy as np

In [2]: a = np.load('data.pkl')

In [3]: np.nanmin(a)
Out[3]: 4.715836

In [4]: np.nanmax(a)
Out[4]: 4.7189121

In [5]: idx = np.where(np.isfinite(a))

In [6]: a[idx].mean()
Out[6]: 4.1792714738680736

In [7]: from scipy.stats import nanmean

In [8]: nanmean(a)
Out[8]: 4.1792714738680727

The mean value obtained is clearly wrong: it falls below the minimum reported by nanmin. On a 32 bit system the result is the following:

In [1]: import numpy as np

In [2]: a = np.load('data.pkl')

In [3]: idx = np.where(np.isfinite(a))

In [4]: a[idx].mean()
Out[4]: 4.7184738182116019

and this time the mean value is correct.

numpy-gitbot commented 11 years ago

trac user knopfra wrote on 2012-06-20

An additional comment that reduces the priority of the issue reported:

Why the 64 bit and 32 bit versions give different results is still not clear.

numpy-gitbot commented 11 years ago

trac user ziotom78 wrote on 2012-06-20

I see that there was at least one other report of this kind (#465). That bug was closed as "wontfix", because the developers did not want to force an implicit conversion of floating-point types from 32 to 64 bit in the NumPy code.

However, it seems clear to me that this bug arises from the kind of algorithm used to calculate the average: with a large number of samples, the running sum grows so large that each new addend perturbs only its least significant digits, incurring a significant numerical error in the result.
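A minimal sketch of this failure mode on synthetic data (not the original dataset; the constant 4.7 and the array size are stand-ins for the ticket's values). `np.cumsum` is used to force a strict left-to-right float32 accumulation, since modern NumPy's `sum` uses pairwise summation, which largely hides the drift:

```python
import numpy as np

# Synthetic stand-in for the dataset: ~15.6 million float32 values
# near the range reported by nanmin/nanmax in the ticket.
a = np.full(15_600_000, 4.7, dtype=np.float32)

# Strict sequential float32 accumulation (cumsum cannot be reordered):
# once the running sum is large, adding 4.7 only nudges the lowest
# bits of the accumulator, and the total drifts far from the truth.
naive_mean = np.cumsum(a, dtype=np.float32)[-1] / a.size

# Accumulating in float64 keeps all the significant digits.
mean64 = a.sum(dtype=np.float64) / a.size

print(naive_mean)  # drifts well below 4.7, much like the ticket's 4.179...
print(mean64)      # close to 4.7
```

The same data, the same dtype, and only the accumulator's precision differs, which is why the ticket's two machines could disagree so badly.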

Instead of accumulating the sum and then dividing by the number of elements, one might use the recurrence relation used e.g. by GSL in the implementation of gsl_stats_mean (http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/mean_source.c). The value of the variable "mean" is always of the same order of magnitude as the result, and therefore catastrophic loss of significant digits is far less likely to occur. (I asked knopfra to apply this formula to his dataset, and he reported to me that even 32-bit arithmetic yields the correct answer, 4.718489.)
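A sketch of that recurrence, mean_i = mean_{i-1} + (x_i - mean_{i-1}) / i (the function name and the synthetic data below are mine, not GSL's or the reporter's):

```python
import numpy as np

def running_mean(values):
    """Streaming mean via the recurrence mean += (x - mean) / i.

    The accumulator stays on the same order of magnitude as the
    data, so it never grows large enough to swamp new samples,
    and a float32 accumulator remains accurate.
    """
    mean = np.float32(0.0)
    for i, x in enumerate(values, start=1):
        mean += (np.float32(x) - mean) / np.float32(i)
    return mean

# Synthetic data spanning the range reported in the ticket.
rng = np.random.default_rng(0)
data = rng.uniform(4.7158, 4.7189, size=100_000).astype(np.float32)

m32 = running_mean(data)           # float32 recurrence
m64 = data.mean(dtype=np.float64)  # float64 reference
print(m32, m64)                    # the two agree closely
```

The trade-off is one division per element instead of one at the end, which is why a plain accumulating sum is usually preferred when a wide-enough accumulator is available.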