pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Window.Rolling.std changes significantly #48950

Closed: MichalRIcar closed this issue 1 year ago

MichalRIcar commented 2 years ago

Pandas version checks

Reproducible Example

from pandas import Series as pSe
import numpy  as np

# DATA
mu = 9e-06
si = 5e-06

T0 = [100 for i in range(10)]
T1 = list(np.random.normal(mu, si, 8000))

t = 15

# STD1
T = pSe(T0 + T1)
r1 = T.rolling(t).std().iloc[-1]

# STD2
T = pSe(T1)
r2 = T.rolling(t).std().iloc[-1]

# DIFF (%)
(r2/r1-1)*100

Issue Description

Hello,

I have run into problematic behavior of the rolling window std (and only std; the mean always works fine) on a real-world data set. The data is 9000 observations long, where only the very first 10 values have a different magnitude and the rest of the data oscillates around the mean and std of its distribution.

However, we have found that for the last observations of the data set, the rolling std with window length t = 15 was 2 times higher than on the same data set with just the very first observation removed (only 1 observation was dropped from the very beginning of the data; the remaining 8999 values were identical).

In other words, on the real-world data, the rolling std produced a 2x higher result at the very end of the data (the 9000th observation) compared to the very same data set with only the very first observation removed.

I have tried to reproduce it with artificial data, and I was able to get a difference of 0.5-15% (the variation comes from the randomly generated data), as shown below.


Expected Behavior

Rolling window std with a given window length (e.g. 15) should be as consistent as the rolling mean is, i.e. it should produce the same results regardless of small changes far away (in the data sense).

Installed Versions

INSTALLED VERSIONS
------------------
commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python : 3.9.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19043
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.4.3
numpy : 1.20.3
pytz : 2021.3
dateutil : 2.8.2
setuptools : 58.0.4
pip : 21.2.4
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.29.0
pandas_datareader : None
bs4 : 4.10.0
bottleneck : 1.3.2
brotli :
fastparquet : None
fsspec : 2021.10.1
gcsfs : None
markupsafe : 1.1.1
matplotlib : 3.5.2
numba : 0.54.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None
phofl commented 2 years ago

Hi, thanks for your report. Can you trim this down, e.g. by using a smaller window size? Also, could you use non-random data and post the expected and actual outputs?

MichalRIcar commented 2 years ago

Hello,

I trimmed down the data, and the result behaves the same as with a sample size of 9K. Originally I used 9K because it is pretty impressive that one observation (the very first one) out of 9K can affect the std at the very last, 9000th position, which is very surprising in both statistical and mathematical terms.

The new example is below:

from pandas import Series as pSe
import numpy  as np

T0 = [100]

T1 = [1.6308261305290625e-05,
      9.805335698851142e-06,
      1.408486858146467e-05,
      1.1797357972446935e-05,
      1.653045722049812e-05,
      9.01110885313083e-06,
      6.5327497504041435e-06,
      8.96507788785512e-06,
      1.2225732429792821e-05,
      9.839358450287202e-06]

t = 3

# STD1
T = pSe(T0 + T1)
s1 = T.rolling(t).std().iloc[-1]

# STD2
T = pSe(T1)
s2 = T.rolling(t).std().iloc[-1]

# DIFF (%)
(s2/s1-1)*100

The result is a 44% difference between s2 and s1 (the std at the last observation), even though the rolling window is only 3.

krasch commented 1 year ago

When printing the standard deviations for your example, one can see that they are all different (of course, entry number 2 is expected to be different).

>>> pandas.set_option("display.precision", 10)
>>> pSe(T0+T1).rolling(t).std()

0               NaN
1               NaN
2     57.7350193806
3      0.0000030736
4      0.0000017632
5      0.0000020312
6      0.0000036019
7      0.0000050625
8      0.0000007301
9      0.0000025851
10     0.0000011711
dtype: float64

>>> pSe(T1).rolling(t).std()

0             NaN
1             NaN
2    0.0000033052
3    0.0000021415
4    0.0000023670
5    0.0000038014
6    0.0000052064
7    0.0000014178
8    0.0000028565
9    0.0000016878
dtype: float64

The absolute difference between the respective entries is tiny, though (of course, a relative difference of 44% is impressive). Maybe this is some floating-point precision issue?
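One way to check this hypothesis is to recompute each window from scratch and compare it against the sliding-window result (a minimal sketch, reusing T0, T1 and t from the example above): if the per-window recomputation stays stable while rolling().std() drifts, the gap comes from the windowed update rather than from the data itself.

import numpy as np
import pandas as pd

# Series from the example above: one huge value followed by tiny values.
T = pd.Series(T0 + T1)
t = 3

# Reference: recompute the sample std for every window from scratch.
# rolling().apply evaluates each window independently, so no state from
# the huge first observation is carried into later windows.
batch = T.rolling(t).apply(lambda w: np.std(w, ddof=1), raw=True)
sliding = T.rolling(t).std()    # pandas' sliding-window implementation

# A large relative gap here points at the windowed update, not the data.
print(((sliding - batch).abs() / batch).max())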

krasch commented 1 year ago

When simplifying the example even further, one can see that there seem to be subtle differences between the two cases:

>>> import pandas as pd

>>> s1 = pd.Series([100_000.0, 0.1, 0.2, 0.3, 0.4])
>>> s2 = pd.Series([0.0, 0.1, 0.2, 0.3, 0.4])

>>> s1

0    100000.0
1         0.1
2         0.2
3         0.3
4         0.4
dtype: float64

>>> s2

0    0.0
1    0.1
2    0.2
3    0.3
4    0.4
dtype: float64

For the rolling mean with window size 3, one can already observe small differences in the numeric representation, but the results are as expected.


>>> s1.rolling(window=3).mean()

0             NaN
1             NaN
2    33333.433333
3        0.200000
4        0.300000
dtype: float64

>>> s2.rolling(window=3).mean()

0    NaN
1    NaN
2    0.1
3    0.2
4    0.3
dtype: float64

Whereas for the rolling std there is indeed an unexpected, tiny difference at the last two entries of the series.


>>> s1.rolling(window=3).std()

0             NaN
1             NaN
2    57734.940316
3        0.099998
4        0.099998
dtype: float64

>>> s2.rolling(window=3).std()

0    NaN
1    NaN
2    0.1
3    0.1
4    0.1
dtype: float64
phofl commented 1 year ago

Yep, this is most certainly a numerical issue. We made some improvements there in either 1.2 or 1.3. I guess we have to dig into the Cython implementation to figure this out. We are already using Welford's method for increased numerical stability.

MichalRIcar commented 1 year ago

Hello, any updates on the issue? Sorry for the push, but this randomness in the results propagates into many tasks on our side, so it is very difficult to continue: debugging our algorithm results is very hard or almost impossible unless the random std calculation is fixed.

On another note, from a math perspective: if rolling().mean() is always correct (as far as I have checked) and rolling().std() has noise in its results, then the only significant difference between the two in terms of mathematical operations is the square root. Otherwise, both use the same operations: subtraction and division.

Do you think the square root operation can cause a numerical issue in the floating-point (Cython) sense?
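One way to test whether the square root is to blame (a small check using the simplified series from the earlier comment) is to compare the rolling variance directly; if the variances already disagree before any square root is applied, the root itself is not the culprit.

import pandas as pd

# Simplified series from the earlier comment.
s1 = pd.Series([100_000.0, 0.1, 0.2, 0.3, 0.4])
s2 = pd.Series([0.0, 0.1, 0.2, 0.3, 0.4])

# Variances before any square root is taken. If these already differ,
# the discrepancy originates in the sliding-window variance update,
# not in the sqrt step applied afterwards.
print(s1.rolling(window=3).var())
print(s2.rolling(window=3).var())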

phofl commented 1 year ago

The underlying algorithms are different. Investigations are welcome

MichalRIcar commented 1 year ago

Thanks for the quick response.

I've checked and see that you are using Welford's method, which is by definition a numerically stable way to compute var/std, so no luck there.

I've checked rolling.py, and all the function inputs are very nicely standardized (Window etc.), so this issue must be at a much deeper level. In that case, sadly, I believe the only efficient way to fix it is for the author of the code to do so.

phofl commented 1 year ago

Yep, this is probably at the Cython level.

phofl commented 1 year ago

This is most certainly related to https://pandas.pydata.org/docs/whatsnew/v1.3.0.html#removed-artificial-truncation-in-rolling-variance-and-standard-deviation

Your range of significant digits is too large. Not sure if there is a way around it

cc @mroeschke for another look

MichalRIcar commented 1 year ago

I am not sure I agree, as I see that the rolling.std results on real-world data are extremely volatile even when we round the input vectors hard. I will use an extreme example that clearly shows it must be something other than the number of significant digits.

import numpy as np
import pandas

x = [1000000000, 0.1, 0.2, 0.3, 0.4, 0.5]
np.std(x[3:], ddof=1)                       # numpy sample std of the last 3 values
pandas.Series(x).rolling(3).std().iloc[-1]  # pandas rolling std at the same position

The numpy result is as expected → 0.1; rolling.std gives a seemingly random value → 11.31.

phofl commented 1 year ago

The value 1000000000 is squared internally, which leads to a loss of precision. You can check out the Cython implementation; suggestions are welcome. You cannot compare the numpy results with our sliding window calculations (with regard to numerical precision); those are completely different algorithms.
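To make the loss of precision concrete (a minimal float64 illustration, not the actual Cython code): squaring 1e9 gives a quantity around 1e18, and at that magnitude float64 cannot represent a contribution as small as 0.01, so the small term is silently absorbed and cannot be recovered when the large term is later subtracted back out.

import numpy as np

big = np.float64(1e9) ** 2      # 1e18
small = np.float64(0.1) ** 2    # 0.01

# float64 carries ~15-16 significant decimal digits, so adding 0.01 to
# 1e18 is a no-op; the small contribution is lost for good.
print(big + small == big)       # True: the small term is absorbed
print((big + small) - big)      # 0.0 instead of 0.01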

MichalRIcar commented 1 year ago

So rolling std doesn't provide the actual std when the input's variance is higher than some factor... this should be very prominently documented.

MichalRIcar commented 1 year ago

As mentioned, we see this randomness on many real-world data sets from real processes, and these data are statistically perfectly valid.

MichalRIcar commented 1 year ago

In other words, if pandas provides statistical tools, then they have to follow the math. If they don't, they are wrong. And if they are wrong for computer-science reasons, which perhaps can hardly be fixed, then there must be a big red warning, because the result is truly incorrect for statistical and mathematical purposes. And as you may imagine, some pandas applications are used in important sectors.

phofl commented 1 year ago

This is clearly stated in our user guide https://pandas.pydata.org/docs/user_guide/window.html#overview

MichalRIcar commented 1 year ago

"You can not compare the numpy results with out sliding window calculations"

On that note I strictly disagree. I would rather follow an approach where an algorithm is judged by its output, not by its inner logic. Reasoning the other way would let us justify any algorithm with the excuse that its inner logic is different.

MichalRIcar commented 1 year ago

Anyway, thank you very much for your thoughts and time, much appreciated! I assume we will not move forward here. It is worth mentioning that the pandas team is doing an absolutely amazing job, and working with pandas in the daily routine is a great experience, thank you for that!

MichalRIcar commented 1 year ago

Problem solved by using engine='numba', which provides correct results regardless of the input.

x = [1000000000, 0.1, 0.2, 0.3, 0.4, 0.5]
pandas.Series(x).rolling(window=3).apply(numba_std, engine='numba', raw=True)  # numba_std is defined below

The result → 0.1, i.e. the same as np.std(x[3:], ddof=1).

where I defined numba_std based on Welford's algorithm as follows:

import numpy as np

def numba_std(X):
    # Welford's online algorithm: M is the running mean, S accumulates
    # the sum of squared differences from the current mean.
    M, S = 0.0, 0.0
    for i, x in enumerate(X):
        M0 = M
        M = M + (x - M) / (i + 1)
        S = S + (x - M) * (x - M0)
    return np.sqrt(S / i)   # i == len(X) - 1, so this is the ddof=1 std

Note: the reason for defining numba_std() is that the numba engine doesn't support kwargs, and np.std() has ddof set to 0 by default.
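For completeness, the same per-window recomputation also works with the default apply engine (just slower), since rolling().apply evaluates every window independently regardless of the engine; a small cross-check sketch, assuming numba_std from above:

import numpy as np
import pandas as pd

x = [1000000000, 0.1, 0.2, 0.3, 0.4, 0.5]

# Default (non-numba) engine: each window is handed to numba_std as a
# numpy array, so nothing from the huge first value leaks into later windows.
print(pd.Series(x).rolling(window=3).apply(numba_std, raw=True))

# Reference for the last window.
print(np.std(x[3:], ddof=1))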

So I believe this is the final proof that it is not about the window size or numerical precision but a problem of the Cython implementation.

phofl commented 1 year ago

This does not prove anything, since that is not a sliding window calculation. You are computing the variance for every window completely from scratch, which is a totally different approach from the one we take in the Cython implementations. The numerical instability becomes obvious when you read up on how sliding window calculations work. This is documented as well.
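For readers who want to see where the instability comes from, here is a simplified pure-Python sketch of an add/remove style sliding-window variance (an illustration of the general scheme, not the actual pandas Cython code): once a very large value has been added and later removed, the rounding error it introduced can remain in the accumulators.

import numpy as np

def sliding_var(x, window, ddof=1):
    """Simplified add/remove sliding-window variance (illustration only)."""
    mean = 0.0
    ssqdm = 0.0   # running sum of squared differences from the mean
    n = 0
    out = []
    for i, val in enumerate(x):
        # add the incoming value (Welford-style update)
        n += 1
        delta = val - mean
        mean += delta / n
        ssqdm += delta * (val - mean)
        # remove the value that just dropped out of the window
        if i >= window:
            old = x[i - window]
            n -= 1
            delta = old - mean
            mean -= delta / n
            ssqdm -= delta * (old - mean)
        out.append(np.nan if i < window - 1 else ssqdm / (n - ddof))
    return out

x = [1e9, 0.1, 0.2, 0.3, 0.4, 0.5]
print(sliding_var(x, window=3))   # later windows are far off once the huge value has left
print(np.std(x[3:], ddof=1))      # per-window reference for the last window: 0.1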

MichalRIcar commented 1 year ago

I believe we are touching on a more philosophical approach to coding. Your reasoning is, to me, beside the point: pandas std claims to provide the standard deviation, which it clearly doesn't, as the result is extremely noisy in real-world applications. Thus, I have to conclude that applying the current Cython sliding-window approach to functions like std is fundamentally wrong, as the results are sometimes multiples of the true mathematical value. No computer-science argument can justify that; it simply mustn't be used. Rather use a slower approach, or remove it from the library completely. But as I said, this is the way we work and what we believe; we are PhDs in machine learning, so obsessed with statistics, and I believe your focus is elsewhere.

Once again, thank you very much for the discussion. We have our solution and will gladly pay the price of slower computation rather than have wrong tests.

Best, M

jreback commented 1 year ago

@MichalRIcar please try to be constructive.

Welford's algorithm is used here:

https://en.m.wikipedia.org/wiki/Algorithms_for_calculating_variance

This is open source software; if you can do better, please do.

But comments like "delete it from the lib" are totally unhelpful to anyone.

MichalRIcar commented 1 year ago

I am using Welford's algorithm just above; the numba_std function is the pure mathematical form of it.

I believe I am being constructive, as I described how we work on our side: we really do delete functions that are not consistent and rather use a slower form of them.