Closed MichalRIcar closed 1 year ago
Hi, thanks for your report. Can you trim this down, e.g. using a smaller window size? Also, could you use non-random data and post expected and actual outputs?
Hello,
I trimmed down the data, and the result behaves identically to the 9K sample. Originally I used 9K observations because it is quite striking that a single observation (the very first one) out of 9000 can impact the std value of the very last, 9000th one, which is surprising in both statistical and mathematical terms.
The new example is below:
from pandas import Series as pSe
import numpy as np
T0 = [100]
T1 = [1.6308261305290625e-05,
9.805335698851142e-06,
1.408486858146467e-05,
1.1797357972446935e-05,
1.653045722049812e-05,
9.01110885313083e-06,
6.5327497504041435e-06,
8.96507788785512e-06,
1.2225732429792821e-05,
9.839358450287202e-06]
t = 3
# STD1
T = pSe(T0 + T1)
s1 = T.rolling(t).std().iloc[-1]
# STD2
T = pSe(T1)
s2 = T.rolling(t).std().iloc[-1]
# DIFF (%)
(s2/s1-1)*100
The result is a 44% difference between s2 and s1 (the std value at the last observation), even though the rolling window is only 3.
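For reference, the exact std of the last window can be recomputed from scratch (a minimal sketch, assuming NumPy is available; the three values are copied from T1 above):

```python
import numpy as np

# last three values of T1 (copied from the example above)
last_window = [8.96507788785512e-06,
               1.2225732429792821e-05,
               9.839358450287202e-06]

# two-pass sample std (ddof=1) of the window, computed from scratch
exact = np.std(last_window, ddof=1)
print(exact)  # ~1.6878e-06, matching s2 within rounding
```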
When printing the standard deviations for your example, one can see that they are all different (of course, entry number 2 is expected to differ).
>>> pandas.set_option("display.precision", 10)
>>> pSe(T0+T1).rolling(t).std()
0 NaN
1 NaN
2 57.7350193806
3 0.0000030736
4 0.0000017632
5 0.0000020312
6 0.0000036019
7 0.0000050625
8 0.0000007301
9 0.0000025851
10 0.0000011711
dtype: float64
>>> pSe(T1).rolling(t).std()
0 NaN
1 NaN
2 0.0000033052
3 0.0000021415
4 0.0000023670
5 0.0000038014
6 0.0000052064
7 0.0000014178
8 0.0000028565
9 0.0000016878
dtype: float64
The absolute differences between the respective entries are tiny, though (although the relative difference of 44% is, of course, striking). Maybe this is a floating point precision issue?
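As a minimal illustration of the precision limits at play (a sketch unrelated to pandas internals): float64 carries roughly 16 significant digits, so a small value that passes through a much larger one loses its low-order digits:

```python
# near 1e9 the spacing between adjacent float64 values is ~1.2e-7,
# so 0.1 cannot survive a round trip through 1e9 unchanged
roundtrip = (1e9 + 0.1) - 1e9
print(roundtrip)  # 0.10000002384185791, not 0.1
```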
When simplifying the example even further one can see that there seem to be subtle differences between the two cases:
>>> import pandas as pd
>>> s1 = pd.Series([100_000.0, 0.1, 0.2, 0.3, 0.4])
>>> s2 = pd.Series([0.0, 0.1, 0.2, 0.3, 0.4])
>>> s1
0 100000.0
1 0.1
2 0.2
3 0.3
4 0.4
dtype: float64
>>> s2
0 0.0
1 0.1
2 0.2
3 0.3
4 0.4
dtype: float64
For rolling mean with window size 3, one can already observe small differences in the numeric representation, but the results are as expected.
>>> s1.rolling(window=3).mean()
0 NaN
1 NaN
2 33333.433333
3 0.200000
4 0.300000
dtype: float64
>>> s2.rolling(window=3).mean()
0 NaN
1 NaN
2 0.1
3 0.2
4 0.3
dtype: float64
Whereas for rolling std there is indeed an unexpected, tiny difference at the last two entries of the series.
>>> s1.rolling(window=3).std()
0 NaN
1 NaN
2 57734.940316
3 0.099998
4 0.099998
dtype: float64
>>> s2.rolling(window=3).std()
0 NaN
1 NaN
2 0.1
3 0.1
4 0.1
dtype: float64
Yep, this is most certainly a numerical issue. We made some improvements there in either 1.2 or 1.3. I guess we have to dig into the Cython implementation to figure this out; we are already using Welford's method for increased numerical stability.
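For intuition, here is a simplified sketch of a sliding-window variance with incremental (Welford-style) add/remove updates. This is an illustration only, not pandas' actual Cython kernel, but it shows the mechanism: once a huge value has passed through the running aggregates, the rounding residue of the ~1e18-scale cancellations stays behind and pollutes later, tiny windows:

```python
import numpy as np

def rolling_var_online(x, w):
    # Illustrative sliding-window sample variance (ddof=1): each step adds the
    # entering value to running aggregates and removes the leaving one,
    # instead of recomputing the window from scratch.
    # NOTE: a simplified sketch, not pandas' actual Cython implementation.
    n, mean, m2 = 0, 0.0, 0.0
    out = [float("nan")] * len(x)
    for i, xi in enumerate(x):
        n += 1                       # add the entering value (Welford update)
        d = xi - mean
        mean += d / n
        m2 += d * (xi - mean)
        if n > w:                    # remove the value leaving the window
            old = x[i - w]
            d = old - mean
            mean -= d / (n - 1)
            m2 -= d * (old - mean)
            n -= 1
        if i >= w - 1:
            out[i] = m2 / (n - 1)
    return out

x = [1e9, 0.1, 0.2, 0.3, 0.4, 0.5]
inc = rolling_var_online(x, 3)   # incremental; m2 once held ~1e18-scale terms
exact = np.var(x[-3:], ddof=1)   # ~0.01, recomputed from scratch
```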
Hello, any updates on the issue? Sorry for the push, but this randomness in the results propagates into many of our tasks, so it is very difficult to continue: debugging our algorithms' results is very hard, almost impossible, unless the erratic std calculation is fixed.
On another note, from a math perspective: if rolling().mean() is always correct (as far as I have checked) while rolling().std() is noisy, the only significant mathematical difference between the two operations is the square root. Otherwise both use the same operations, subtraction and division.
Do you think the square root operation could cause a numerical issue at the floating point (Cython) level?
The underlying algorithms are different. Investigations are welcome.
Thanks for the quick response.
I've checked and see that you are using Welford's method, which is by definition a numerically stable computation of var/std, so no luck there.
I've checked rolling.py, and all the function inputs are nicely standardized (Window etc.), so this issue must sit at a much deeper level. In that case, sadly, I believe the only efficient way to fix it is for the author of the code to do so.
Yep, this is probably at the Cython level.
This is most certainly related to https://pandas.pydata.org/docs/whatsnew/v1.3.0.html#removed-artificial-truncation-in-rolling-variance-and-standard-deviation
Your range of significant digits is too large. Not sure if there is a way around it.
cc @mroeschke for another look
I am not sure I agree, as we see that real-world rolling.std results are extremely volatile even when we round the input vectors aggressively. Here is an extreme example that clearly shows it must be something other than significant digits.
import numpy as np
import pandas

x = [1000000000, 0.1, 0.2, 0.3, 0.4, 0.5]
np.std(x[3:], ddof=1)                       # exact std of the last 3 values
pandas.Series(x).rolling(3).std().iloc[-1]  # rolling std at the same position
The numpy result is as expected → 0.1, whereas rolling.std gives a spurious value → 11.31.
The value 1000000000 is squared internally, which leads to a loss of precision. You can check out the Cython implementation; suggestions are welcome. You cannot compare the numpy results with our sliding window calculations (with regard to numerical precision). Those are completely different algorithms.
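The effect of squaring can be demonstrated without pandas at all. A sketch using the textbook sum-of-squares shortcut (not pandas' actual formula, but it exhibits the same cancellation at x² scale):

```python
import numpy as np

vals = np.array([1e9 + 0.1, 1e9 + 0.2, 1e9 + 0.3])
n = len(vals)

# naive shortcut: var = (sum(x^2) - sum(x)^2 / n) / (n - 1)
# the squares are ~1e18, where adjacent float64 values are hundreds apart,
# so the 0.01-scale signal is rounded away before the subtraction
naive = (np.sum(vals ** 2) - np.sum(vals) ** 2 / n) / (n - 1)

# subtracting the mean first keeps the deviations at ~0.1 scale
stable = np.var(vals, ddof=1)  # close to the true value, 0.01
```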
So rolling std doesn't actually provide the std when the input has variance above some threshold... this should be very prominently documented.
As mentioned, we see this randomness on a lot of real-world data from real processes, and those data are statistically perfectly valid.
In other words, if pandas provides statistical tools, it has to follow the math. If it doesn't, it is wrong. And if it is wrong for computer science reasons, and perhaps can hardly be fixed, then there must be a big red warning, as it is truly incorrect for statistical and mathematical purposes. As you may imagine, some pandas applications are used in important sectors.
This is clearly stated in our user guide https://pandas.pydata.org/docs/user_guide/window.html#overview
"You can not compare the numpy results with out sliding window calculations"
On that note I strictly disagree - I rather follow an approach where the decision about algo is the output, not its inner logic. If we would use such logic that would lead to justifying any algo with an excuse that inner logic is different.
Anyway, thank you very much for your thoughts and time, much appreciated! I assume we will not move forward here. Still, it is worth mentioning that the pandas team is doing an absolutely amazing job, and the daily experience with the library is excellent. Thank you for that!
Problem solved by using engine='numba', which provides correct results regardless of the input.
import numpy as np
import pandas

x = [1000000000, 0.1, 0.2, 0.3, 0.4, 0.5]
pandas.Series(x).rolling(window=3).apply(numba_std, engine='numba', raw=True)
The result → 0.1, i.e. the same as np.std(x[3:], ddof=1),
where I defined numba_std based on Welford's algorithm as follows:
import numpy as np

def numba_std(X):
    M, S = 0.0, 0.0
    for i, x in enumerate(X):
        M0 = M
        M = M + (x - M) / (i + 1)   # Welford running mean
        S = S + (x - M) * (x - M0)  # running sum of squared deviations
    return np.sqrt(S / i)           # after the loop i == len(X) - 1, i.e. ddof=1
Note: the reason for defining numba_std() is that the numba engine doesn't support kwargs, and np.std() has ddof set to 0 by default.
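As a sanity check, here is a self-contained copy of the function (run as plain Python rather than through the numba engine), compared against NumPy's two-pass result on a single window:

```python
import numpy as np

# self-contained copy of numba_std (single-pass Welford; assumes len(X) >= 2)
def numba_std(X):
    M, S = 0.0, 0.0
    for i, x in enumerate(X):
        M0 = M
        M = M + (x - M) / (i + 1)   # running mean
        S = S + (x - M) * (x - M0)  # running sum of squared deviations
    return np.sqrt(S / i)           # i == len(X) - 1 here, so this is ddof=1

window = np.array([0.3, 0.4, 0.5])
print(numba_std(window), np.std(window, ddof=1))  # both ~0.1
```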
So I believe this is the final proof that it is not about the window size or numeric precision but a problem in the Cython implementation.
This does not prove anything, since it is not a sliding window calculation: you are computing the variance of every window completely from scratch, which is a totally different approach from the one taken in the Cython implementation. The numerical instability becomes obvious once you read up on how sliding window calculations work. This is documented as well.
I believe we are touching on a more philosophical question about coding. Your reasoning is, to me, beside the point: pandas' std claims to provide the standard deviation, which it clearly doesn't, as the result is extremely noisy in real-world applications. Thus I have to conclude that applying the current Cython sliding window approach to functions like std is fundamentally wrong, as the results are sometimes multiples of the true mathematical value. No computer science argument can justify that; it simply mustn't be used. Rather use a slower approach, or remove it from the library completely. But as I said, this is the way we work and what we believe; we are PhDs in machine learning, so we are obsessed with statistics, and I believe your focus is elsewhere.
Once again, thank you very much for the discussion. We have our solution and will gladly pay the price of slower computation rather than have wrong tests.
Best, M
@MichalRIcar please try to be constructive
Welford's algorithm is used here:
https://en.m.wikipedia.org/wiki/Algorithms_for_calculating_variance
this is open source software; if you can do better, please do
but comments like "delete it from lib" are totally unhelpful to anyone
I am using Welford's algorithm just above; the numba_std function is the pure mathematical form of it.
I believe I am being constructive, as I described how we work on our side: we really do delete functions that are not consistent and use a slower form of them instead.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Hello,
I have come across problematic behavior of the rolling window std (and only std, as mean always works fine) on a real-world data set: 9000 observations where only the very first 10 values have a different magnitude, while the rest of the data oscillates around the mean and std of its distribution.
We found that for the last observations of the data set, the rolling std with window t = 15 was 2x higher than for the very same data set with only the first observation removed (one value removed from the very beginning; the remaining 8999 values were identical). In other words, the rolling std at the very end of the data (the 9000th observation) changed by a factor of 2 when a single distant value was dropped.
I tried to reproduce it with artificial data and got differences of 0.5-15% (the variation comes from the randomly generated data), as shown below.
Expected Behavior
Rolling window std with a given window (e.g. 15) should be as consistent as the rolling mean: it should produce the same results regardless of small changes that are far away (in the data sense) from the window.
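The expected invariance is easy to state as a check (a sketch in plain Python): a from-scratch per-window computation depends only on the values inside the window, so dropping a distant leading value cannot change it:

```python
import math

def window_std(vals, ddof=1):
    # from-scratch two-pass sample std of one window
    m = sum(vals) / len(vals)
    return math.sqrt(sum((v - m) ** 2 for v in vals) / (len(vals) - ddof))

a = [1e9, 0.1, 0.2, 0.3, 0.4, 0.5]   # with a distant huge leading value
b = a[1:]                            # same data, leading value dropped
w = 3

# the last window is [0.3, 0.4, 0.5] in both cases, so a from-scratch
# computation is bit-for-bit identical; this is the consistency expected here
assert window_std(a[-w:]) == window_std(b[-w:])
```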
Installed Versions