Catastrophic accuracy loss in large float32 array for nanmean and nanstd

prutschman-iv commented 1 month ago

Describe the bug

Starting somewhere between 10 million and 50 million elements, the bn.nanmean and bn.nanstd functions appear to experience a catastrophic loss of accuracy with float32 data.

To Reproduce

This code creates float32 arrays of increasing size and compares the NumPy and Bottleneck results for nanmean and nanstd:

import numpy as np
import bottleneck as bn

print(f'{np.__version__=} {bn.__version__=}')

million = 10**6
for size in (million, 10 * million, 50 * million, 100 * million):
    # Uniform random values in [0, 1), stored as float32.
    rand_data = np.random.random(size=size).astype(np.float32)
    print(f"{size}")
    # Each row prints the NumPy result first, then the Bottleneck result.
    print("    mean\t", np.nanmean(rand_data), bn.nanmean(rand_data))
    print("     std\t", np.nanstd(rand_data), bn.nanstd(rand_data))

When I run it, I get:

np.__version__='1.24.0' bn.__version__='1.4.1'
1000000
    mean         0.5003439 0.5003493428230286
     std         0.28887847 0.28882330656051636
10000000
    mean         0.49992886 0.49994951486587524
     std         0.28866056 0.28725674748420715
50000000
    mean         0.5000019 0.33554431796073914
     std         0.28868446 0.30973944067955017
100000000
    mean         0.4999724 0.16777215898036957
     std         0.2886786 0.38657501339912415
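
A possible reading of the two largest cases, offered as an assumption rather than a confirmed diagnosis: the Bottleneck means match 2^24 divided by the element count (2^24 / 50,000,000 ≈ 0.33554432 and 2^24 / 100,000,000 ≈ 0.16777216). That is what a naive running sum kept in float32 would give once the accumulator reaches 2^24 = 16777216, because from there adjacent float32 values are 2 apart and adding any value below 1 rounds back to the same number. The sketch below emulates float32 rounding with the standard-library struct module so it runs without NumPy or Bottleneck. The helpers to_f32 and naive_f32_sum are hypothetical names for illustration, not Bottleneck code.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_f32_sum(values, start=0.0):
    """Left-to-right sum whose accumulator is rounded to float32 after each add.

    Assumption: this mimics a plain float32 running sum. It is not taken
    from the Bottleneck source.
    """
    acc = to_f32(start)
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc


LIMIT = 2.0 ** 24  # 16777216.0

# Start at the limit to avoid looping over tens of millions of elements:
# adding 1000 values of 0.75 leaves the accumulator unchanged.
stuck = naive_f32_sum([0.75] * 1000, start=LIMIT)

# Means that a sum stuck at 2**24 would yield for the two largest sizes,
# rounded to float32 for comparison with the reported Bottleneck values.
stuck_mean_50m = to_f32(LIMIT / 50_000_000)
stuck_mean_100m = to_f32(LIMIT / 100_000_000)

print(stuck == LIMIT, stuck_mean_50m, stuck_mean_100m)
```

If this reading is right, the mean would keep shrinking as the array grows, since the numerator stays fixed at 2^24 while the count increases.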

Versions:

Package           Version
----------------- --------------------
astropy           6.1.4
astropy-iers-data 0.2024.10.14.0.32.55
Bottleneck        1.4.1
numpy             1.24.0
packaging         24.1
pip               24.0
pyerfa            2.0.1.4
PyYAML            6.0.2
setuptools        69.2.0
wheel             0.43.0

Expected behavior

I expected the differences between numpy and Bottleneck to be zero, or at least small relative to the size of the result.
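
To put a number on "small relative to the size of the result", here is a minimal sketch that computes the relative difference between the NumPy and Bottleneck means reported above. The values are copied from the output in this report. The threshold REL_TOL and the helper relative_error are illustrative assumptions, not anything defined by NumPy or Bottleneck.

```python
# (size, numpy_mean, bottleneck_mean), copied from the reported output.
reported_means = [
    (1_000_000, 0.5003439, 0.5003493428230286),
    (10_000_000, 0.49992886, 0.49994951486587524),
    (50_000_000, 0.5000019, 0.33554431796073914),
    (100_000_000, 0.4999724, 0.16777215898036957),
]

REL_TOL = 1e-3  # hypothetical acceptance threshold, chosen for illustration


def relative_error(reference: float, value: float) -> float:
    """Absolute difference scaled by the magnitude of the reference."""
    return abs(value - reference) / abs(reference)


verdicts = []
for size, np_mean, bn_mean in reported_means:
    err = relative_error(np_mean, bn_mean)
    ok = err <= REL_TOL
    verdicts.append(ok)
    print(f"{size:>11,d}  rel. error {err:.2e}  {'ok' if ok else 'FAIL'}")
```

With this threshold the 1 million and 10 million cases pass and the 50 million and 100 million cases fail. Only the means are checked here; the std values could be compared the same way.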

Additional context

I encountered this while trying to track down https://github.com/astropy/astropy/issues/17185. https://github.com/astropy/astropy/issues/11492 may be related, but there the accuracy loss appeared smaller.

rdbisme commented 1 month ago

This might be related: https://github.com/pydata/bottleneck/issues/164

rdbisme commented 1 month ago

Does this solve the problem? https://github.com/pydata/bottleneck/pull/414

Catastrophic accuracy loss in large float32 array for nanmean and nanstd #462