pydata / bottleneck

Fast NumPy array functions written in C
BSD 2-Clause "Simplified" License

[BUG] bottleneck gives erroneous standard deviation in Pandas with float32 array. #443

Open jamespreed opened 9 months ago

jamespreed commented 9 months ago

When bottleneck is installed in an environment with pandas, it causes pandas to return an incorrect result for .std() on a constant float32 array (the result should be 0.0).

To Reproduce

First install pandas. The result is the same whether pandas is installed with conda or with pip.

conda create -n testenv python=3.11 conda-forge::pandas==2.2.0 -y
conda activate testenv

Running the following code gives the expected result:

import pandas as pd

print(pd.Series([271.46] * 150000, dtype='float32').std())
# prints: 0.0

Now in the same environment, install bottleneck. It is the only additional package installed.

conda install bottleneck -y

Running the same code gives an incorrect result:

import pandas as pd

print(pd.Series([271.46] * 150000, dtype='float32').std())
# prints: 0.229433074593544
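
One possible explanation, offered as a sketch rather than a confirmed diagnosis: if the sum and mean are accumulated in float32, rounding error builds up over 150000 additions, the computed mean drifts away from the element value, and the deviations are no longer zero. The pure-Python model below imitates a two-pass standard deviation with a float32 accumulator by rounding every intermediate result to single precision. The helper names `f32` and `two_pass_std` are hypothetical, and the two-pass float32 scheme is an assumption about the mechanism, not code taken from bottleneck or pandas.

```python
import math
import struct

_F32 = struct.Struct("f")


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return _F32.unpack(_F32.pack(x))[0]


def two_pass_std(value, n, ddof=1, single_precision=True):
    """Sample std of n copies of `value` using a two-pass algorithm.

    With single_precision=True every intermediate result is rounded to
    float32, imitating a float32 accumulator. This is an assumed model
    of the failure, not bottleneck's actual implementation.
    """
    rnd = f32 if single_precision else (lambda x: x)
    v = f32(value)  # the stored element is float32 in both cases

    # Pass 1: accumulate the sum, then derive the mean.
    total = 0.0
    for _ in range(n):
        total = rnd(total + v)
    mean = rnd(total / n)

    # Pass 2: accumulate squared deviations from that mean.
    sq_sum = 0.0
    for _ in range(n):
        d = rnd(v - mean)
        sq_sum = rnd(sq_sum + rnd(d * d))
    return math.sqrt(sq_sum / (n - ddof))


naive = two_pass_std(271.46, 150_000, single_precision=True)
wide = two_pass_std(271.46, 150_000, single_precision=False)
print("float32 accumulator:", naive)
print("float64 accumulator:", wide)
```

In this model the float64 accumulator returns exactly 0.0, while the float32 accumulator returns a nonzero value. That is consistent with the symptom reported here, but it does not prove that this is what bottleneck does internally.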

Version info: Windows 11, Python 3.11, conda 23.5.2

Output from conda list:

# packages in environment at C:\Users\JamesReed\miniconda3\envs\testenv:
#
# Name                    Version                   Build  Channel
blas                      1.0                         mkl
bottleneck                1.3.7           py311hd7041d2_0
bzip2                     1.0.8                he774522_0
ca-certificates           2023.12.12           haa95532_0
intel-openmp              2023.1.0         h59b6b97_46320
libffi                    3.4.4                hd77b12b_0
mkl                       2023.1.0         h6b88ed4_46358
mkl-service               2.4.0           py311h2bbff1b_1
mkl_fft                   1.3.8           py311h2bbff1b_0
mkl_random                1.2.4           py311h59b6b97_0
numpy                     1.26.3          py311hdab7c0b_0
numpy-base                1.26.3          py311hd01c5d8_0
openssl                   3.0.13               h2bbff1b_0
pandas                    2.2.0           py311hf63dbb6_0    conda-forge
pip                       23.3.1          py311haa95532_0
python                    3.11.7               he1021f5_0
python-dateutil           2.8.2              pyhd3eb1b0_0
python-tzdata             2023.3             pyhd3eb1b0_0
python_abi                3.11                    2_cp311    conda-forge
pytz                      2023.3.post1    py311haa95532_0
setuptools                68.2.2          py311haa95532_0
six                       1.16.0             pyhd3eb1b0_1
sqlite                    3.41.2               h2bbff1b_0
tbb                       2021.8.0             h59b6b97_0
tk                        8.6.12               h2bbff1b_0
tzdata                    2023d                h04d1e81_0
ucrt                      10.0.20348.0         haa95532_0
vc                        14.2                 h21ff451_1
vc14_runtime              14.38.33130         h82b7239_18    conda-forge
vs2015_runtime            14.38.33130         hcb4865c_18    conda-forge
wheel                     0.41.2          py311haa95532_0
xz                        5.4.5                h8cc25b3_0
zlib                      1.2.13               h8cc25b3_0

Additional context

I also reported the bug in the pandas GitHub repo: https://github.com/pandas-dev/pandas/issues/57505

rdbisme commented 8 months ago

Would you be able to do a git bisect to see whether this is a regression that was introduced recently, or a bug that has always been there?

rdbisme commented 8 months ago

This might be related: https://github.com/pydata/bottleneck/issues/164

rdbisme commented 8 months ago

Also you can check if this fixes your problem: https://github.com/pydata/bottleneck/pull/414

jamespreed commented 7 months ago

> Would you be able to do a git bisect to see whether this is a regression that was introduced recently, or a bug that has always been there?

I am really sorry, I don't know how to do that :(
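
For anyone picking this up, here is a rough sketch of a check script that could drive `git bisect run` in a clone of the bottleneck repository. `git bisect run` treats exit code 0 as good, 125 as "skip this commit", and other codes from 1 to 127 as bad. The file name `check_std.py`, the `--bisect` flag, and the function names are hypothetical. The script calls `bn.nanstd` directly, on the assumption that this is the function pandas delegates to, and it assumes bottleneck has been rebuilt and reinstalled from the checked-out commit before each run.

```python
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def exit_code_for(std_value, tol=0.0):
    """Map a computed std of a constant array to a bisect exit code."""
    # NaN compares False here, so it is classified as bad.
    return GOOD if std_value <= tol else BAD


def run_check():
    """Compute the std of the array from this report with bottleneck."""
    try:
        import numpy as np
        import bottleneck as bn
    except Exception:
        # The build or import is broken at this commit: ask bisect to skip it.
        return SKIP
    data = np.full(150_000, 271.46, dtype=np.float32)
    # ddof=1 matches the default of pandas Series.std().
    return exit_code_for(float(bn.nanstd(data, ddof=1)))


# Only act as a bisect driver when explicitly asked to (hypothetical flag).
if __name__ == "__main__" and "--bisect" in sys.argv[1:]:
    sys.exit(run_check())
```

A session would then look roughly like this: `git bisect start`, `git bisect bad` on the current commit, `git bisect good` with a release known to return 0.0 (if one exists), and finally `git bisect run` with a command that rebuilds bottleneck and then runs `python check_std.py --bisect`. If no release returns 0.0, the bug is not a regression and bisecting will not help.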