pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Rolling variance is negative #52407

Open bhigy opened 1 year ago

bhigy commented 1 year ago

Pandas version checks

Reproducible Example

import pandas as pd

A = [0.00000000e+00, 0.00000000e+00, 3.16188252e-18, 2.95781651e-16,
    2.23153542e-51, 0.00000000e+00, 0.00000000e+00, 5.39943432e-48,
    1.38206260e-73, 0.00000000e+00]
ts = pd.DataFrame(A)
print(ts.rolling(window=3, center=True).var(ddof=1))

Issue Description

I am trying to compute a rolling variance and, for some reason, some of the values I obtain are negative. For example, running the code above gives me:

              0
0           NaN
1  3.332500e-36
2  2.885385e-32
3  2.885385e-32
4  2.916226e-32
5 -5.473822e-48
6 -5.473822e-48
7 -5.473822e-48
8 -5.473822e-48
9           NaN

Expected Behavior

All values should be non-negative and match the following array (except for the NaN values at the beginning and the end):

[0.00000000e+000 3.33250036e-036 2.88538519e-032 2.88538519e-032
 2.91622617e-032 1.65991678e-102 9.71796366e-096 9.71796366e-096
 9.71796366e-096 6.36699010e-147]

Right now, only the values at indices 1 to 4 appear to be correct.
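As an independent cross-check of the expected values, each complete window can be recomputed from scratch with the standard library instead of an incremental update. This is only a sketch: the helper name `reference_rolling_var` is hypothetical, and it covers only the eight complete windows (rows 1 to 8 of the centered output), not the edge rows.

```python
from statistics import variance

A = [0.00000000e+00, 0.00000000e+00, 3.16188252e-18, 2.95781651e-16,
     2.23153542e-51, 0.00000000e+00, 0.00000000e+00, 5.39943432e-48,
     1.38206260e-73, 0.00000000e+00]


def reference_rolling_var(values, window=3):
    """Sample variance (ddof=1) of every complete window, recomputed per window.

    Hypothetical helper for comparison only; it is not part of pandas.
    """
    return [variance(values[i:i + window])
            for i in range(len(values) - window + 1)]


ref = reference_rolling_var(A)

# ref[k] corresponds to row k + 1 of the centered pandas output.
for row, value in enumerate(ref, start=1):
    print(row, value)
```

Because each window is handled on its own, a large value that has already left the window cannot influence later results, so every entry is non-negative.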

Installed Versions

INSTALLED VERSIONS
------------------
commit : 4c07e0769c3838bba11729e5f37c1ace4a291c84
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-38-generic
Version : #39~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 17 21:16:15 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.0.dev0+408.g4c07e0769c
numpy : 1.24.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

reddyrg1 commented 1 year ago

take

manjalc commented 1 year ago

The issue is probably in the roll_var() function in _libs/window/aggregations.pyx

topper-123 commented 1 year ago

Yes, this is clearly a bug. A PR would be welcome.

kaixiongg commented 1 month ago

Hello, I'm encountering the same issue. Any updates on this? It seems like roll_var already uses the Welford method combined with Kahan summation for more stable precision. Is there any algorithm that offers even greater stability than that?
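For readers unfamiliar with the approach mentioned above, here is a minimal pure-Python sketch of a sliding-window variance using Welford-style add/remove updates. It is not pandas' actual Cython code: the function name `rolling_var_welford` and the `clamp` parameter are hypothetical, and Kahan compensation is left out for brevity. The removal step subtracts from the running sum of squared deviations, so once a comparatively large value leaves the window, cancellation can leave that sum slightly below zero. Clamping at zero hides the sign but does not recover the lost digits.

```python
A = [0.00000000e+00, 0.00000000e+00, 3.16188252e-18, 2.95781651e-16,
     2.23153542e-51, 0.00000000e+00, 0.00000000e+00, 5.39943432e-48,
     1.38206260e-73, 0.00000000e+00]


def rolling_var_welford(values, window=3, ddof=1, clamp=False):
    """Variance of each complete window via incremental add/remove updates.

    Illustrative sketch only (hypothetical helper, no Kahan compensation).
    """
    if window <= ddof:
        raise ValueError("window must be larger than ddof")
    n, mean, m2 = 0, 0.0, 0.0  # count, running mean, sum of squared deviations
    out = []
    for i, x in enumerate(values):
        if i >= window:
            # Remove the value that slides out of the window.
            old = values[i - window]
            n -= 1
            if n == 0:
                mean, m2 = 0.0, 0.0
            else:
                delta = old - mean
                mean -= delta / n
                m2 -= delta * (old - mean)
        # Add the value that enters the window.
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if i >= window - 1:
            var = m2 / (n - ddof)
            if clamp:
                var = max(var, 0.0)  # masks a negative sign, not the precision loss
            out.append(var)
    return out


print(rolling_var_welford([1.0, 2.0, 3.0, 4.0, 5.0]))
print(rolling_var_welford(A, clamp=True))
```

On well-conditioned input such as `[1.0, 2.0, 3.0, 4.0, 5.0]` every window has variance 1.0. For data spanning many orders of magnitude, recomputing a window from scratch is slower but avoids carrying rounding error from values that have already left the window.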

kaixiongg commented 1 month ago

take

You've already been dealing with this issue for a year without any updates. Could anyone raise this issue again?

rhshadrach commented 3 weeks ago

> You've already been dealing with this issue for a year without any updates.

This is typical on the issue tracker. I think it's safe to consider any assigned issue that hasn't seen action for over a month to be abandoned. Any contributor is welcome to pick this up.

> Could anyone raise this issue again?

This issue is still open; it does not need to be raised again.