Closed: seth-p closed this issue 10 years ago.
A couple questions:

The `ewmcorr` instability problem is due to the fact that `ewmvar` of a constant series isn't always identically zero, but rather is sometimes a tiny positive or negative number. See the `rolling_var` stuff, which 'fixes' the numerical issues in part by using the Welford algorithm.
1) yes; 2) can you post an example?
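For context on the Welford approach mentioned above, here is a minimal sketch (illustrative only, not pandas' actual implementation; the function names are invented) contrasting the textbook sum-of-squares formula with Welford's single-pass update:

```python
def naive_var(xs, ddof=1):
    # Textbook (sum(x^2) - sum(x)^2/n) / (n - ddof) formulation: suffers
    # catastrophic cancellation when the mean is large relative to the spread.
    n = len(xs)
    s = sum(xs)
    ss = sum(x * x for x in xs)
    return (ss - s * s / n) / (n - ddof)

def welford_var(xs, ddof=1):
    # Welford's single-pass update: accumulates squared deviations from a
    # running mean, so a constant series yields exactly 0.0.
    n = 0
    mean = m2 = 0.0
    for x in xs:
        n += 1
        delta = x - mean          # deviation from the old mean
        mean += delta / n
        m2 += delta * (x - mean)  # old delta times deviation from new mean
    return m2 / (n - ddof)

data = [1e8 + 1.0, 1e8 + 2.0, 1e8 + 3.0]  # true sample variance: 1.0
print(welford_var(data))                   # 1.0 exactly
print(naive_var(data))                     # cancellation error
print(welford_var([5.1] * 4))              # 0.0 exactly
```

The naive formula loses the small spread in the rounding of the huge squared sums, while Welford's update works only with deviations and stays exact here.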
Re: 1). Should the answer depend on whether `bias` is True or False (for the `ewm*` functions), or whether `ddof` is 0 or 1 (for the others)? Perhaps one would want the result to be either `NaN` or `0` depending on the value of `bias`/`ddof`. I'm really not sure, so I want to make sure others are before delving into the code...
Re: 2), I added an example to the end of the initial comment. (Sorry, I guess I should have added it as a separate comment.)
Re: rounding near 0, `algos.roll_var()` deals with tiny negative errors very simply:

```
if val < 0:
    val = 0
```
It doesn't deal with tiny positive errors. (The "smartness" of the algorithm has to do with the rolling aspect of the problem. I don't think it has any particular smartness for dealing with constant series (other than the test for negative errors mentioned above).)
Here's an example showing that `rolling_var` also has numerical problems:

```python
In [30]: rolling_var(Series([1., 5., 5., 5.]), window=3)
Out[30]:
0             NaN
1             NaN
2    5.333333e+00
3    8.881784e-16
dtype: float64
```
The final value should be identically 0.0.
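The residual above plausibly comes from the incremental add/remove updates that a rolling implementation performs as the window slides. A toy sketch (not pandas' actual `roll_var`; the names here are invented) shows how the removal step can leave floating-point residue behind in an otherwise constant window:

```python
import math

def rolling_var_incremental(xs, window, ddof=1):
    # Toy moving variance: maintain a running mean and sum of squared
    # deviations (ssqdm), updating as each value enters and leaves the
    # window. The removal step does not exactly undo the earlier
    # additions, so tiny residue can accumulate in ssqdm.
    out, n, mean, ssqdm = [], 0, 0.0, 0.0
    for i, x in enumerate(xs):
        n += 1                      # add the incoming value
        d = x - mean
        mean += d / n
        ssqdm += d * (x - mean)
        if i >= window:             # remove the outgoing value
            old = xs[i - window]
            prev = mean
            mean = (n * mean - old) / (n - 1)
            ssqdm -= (old - prev) * (old - mean)
            n -= 1
        out.append(math.nan if i + 1 < window else ssqdm / (n - ddof))
    return out

res = rolling_var_incremental([1.0, 5.0, 5.0, 5.0], window=3)
# res[3] is typically a tiny value on the order of 1e-15 rather than 0.0
```

Recomputing each window from scratch avoids the drift but costs O(window) per step, which is exactly the trade-off the incremental algorithms try to escape.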
This instability plays out in inconsistent results for the correlation of a constant series with itself:

```python
In [33]: rolling_corr(Series([0., 5., 5., 5.]), window=3)
Out[33]:
0   NaN
1   NaN
2     1
3   NaN
dtype: float64

In [34]: rolling_corr(Series([1., 5., 5., 5.]), window=3)
Out[34]:
0   NaN
1   NaN
2     1
3     0
dtype: float64
```
In both cases the final value should be the correlation of `[5., 5., 5.]` with itself, yet one produces `NaN` and the other `0`.
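The flip-flopping follows from how the correlation is assembled: corr = cov / sqrt(var_x * var_y), and for a constant window the exact answer is 0/0. Whether each computed moment rounds to exactly zero or to a tiny residual then decides between `NaN`, `0`, and `1`. A small numpy illustration (the residual values below are hypothetical stand-ins, not pandas output):

```python
import math
import numpy as np

def corr_from_moments(cov, var_x, var_y):
    # Correlation assembled from second moments, as the rolling/ewm
    # functions effectively do; 0/0 propagates as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return float(cov / np.sqrt(var_x * var_y))

eps = 8.881784e-16  # a hypothetical rounding residual

print(corr_from_moments(0.0, 0.0, 0.0))  # exact zeros: nan
print(corr_from_moments(eps, eps, eps))  # matching residuals: close to 1.0
print(corr_from_moments(0.0, eps, eps))  # zero cov, residual var: 0.0
```

So three different roundings of the same underlying 0/0 yield three different answers, which is the inconsistency seen above.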
CC'ing @snth, @jaimefrio, and @kdiether, who appear to have worked on this/related code.
The check for `val < 0` in the `rolling_var` function is there because it was in the previous algorithm, and I preferred to play it safe, because a negative value would be a catastrophic failure if computing `rolling_std`, although it should be mostly unnecessary with the new algorithm.
I don't think that rounding down very small values of `rolling_var` to 0 is a good idea. I have quickly hacked the implementation to keep track of consecutive identical values, see #7916. I have only tested it with your small example above, and have made no assessment of the performance impact. But I feel that if the changes are to be made in `rolling_var`, something like this should be the way to go. Rounding down to zero within `rolling_corr` wouldn't feel that bad, but maybe that is just me.
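A pure-Python sketch of what such a consecutive-identical-values check could look like (illustrative only; this is not the #7916 patch, and the function name is invented): if the trailing run of equal values covers the whole window, the variance is exactly 0.0, with no floating-point accumulation involved at all.

```python
import math

def rolling_var_constant_aware(values, window, ddof=1):
    out = []
    run = 0           # length of the run of identical trailing values
    prev = None
    for i, x in enumerate(values):
        run = run + 1 if x == prev else 1
        prev = x
        if i + 1 < window:
            out.append(math.nan)
        elif run >= window:
            out.append(0.0)       # whole window is constant: exact zero
        else:
            # fallback: recompute this window directly
            w = values[i + 1 - window:i + 1]
            m = sum(w) / window
            out.append(sum((v - m) ** 2 for v in w) / (window - ddof))
    return out

print(rolling_var_constant_aware([1.0, 5.0, 5.0, 5.0], window=3))
# the final entry is exactly 0.0
```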
The other thing I can commit to doing is implementing a variant of Welford's method for `rolling_cov` (see e.g. the stable one-pass algorithm here), so that when the rolling covariance of a series with itself is computed, the operations lead to the exact same value as computing the rolling variance of the series. This will not solve the problem of `0 / 0` being `NaN`, but at least the correlation of a series with itself will either be `1` or `NaN`, but not `0`.
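The covariance-of-a-series-with-itself point can be seen in a one-pass co-moment sketch (a generic Welford-style update, not pandas' code): when both inputs are the same series, each update step degenerates term by term into Welford's variance update, so `cov(x, x)` comes out bit-for-bit identical to `var(x)`.

```python
def online_cov(xs, ys, ddof=1):
    # Single-pass covariance via a running co-moment c. With ys is xs,
    # dx * (y - mean_y) becomes delta * (x - new_mean), i.e. exactly
    # Welford's variance accumulator.
    n = 0
    mean_x = mean_y = c = 0.0
    for x, y in zip(xs, ys):
        n += 1
        dx = x - mean_x           # deviation from the old mean of x
        mean_x += dx / n
        mean_y += (y - mean_y) / n
        c += dx * (y - mean_y)    # old dx times deviation from new mean_y
    return c / (n - ddof)

xs = [1.0, 2.0, 3.0]
print(online_cov(xs, xs))               # 1.0, the sample variance of xs
print(online_cov(xs, [4.0, 6.0, 8.0]))  # 2.0
```

Since the self-covariance and the variance are literally the same arithmetic, their ratio in the correlation is exactly 1 whenever the variance is nonzero.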
Sounds good. I think that if you add an explicit check for constant series, then there's no need to round small results to 0.
Yeah, I have no strong opinion about whether the correlation of a constant series with itself should be `1` or `NaN` -- at least it shouldn't be `0` or `-1`.
Regarding the general inconsistencies between `NaN` and `0` for the second moments of a single value, on further reflection I think it makes sense that the result be `NaN` for unbiased statistics (`bias=False` or `ddof>=1`) but `0` for biased statistics (`bias=True` or `ddof=0`).
I will go through the list of functions above and see how the behavior accords with this principle.
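For what it's worth, NumPy already follows this convention for its non-windowed moments, which can serve as a sanity check on the principle:

```python
import warnings
import numpy as np

# A single observation: the biased estimator divides by n - 0 = 1 and is
# exactly 0.0; the unbiased one divides by n - 1 = 0, so the result is
# undefined and comes back as NaN (NumPy emits a RuntimeWarning).
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    biased = np.var([5.0], ddof=0)
    unbiased = np.var([5.0], ddof=1)

print(biased)    # 0.0
print(unbiased)  # nan
```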
After further review (and once https://github.com/pydata/pandas/issues/7912 is addressed), I think that all biased (`bias=True` or `ddof=0`) second moments return `0` for a single value, while unbiased (`bias=False` or `ddof=1`) ones return `NaN`, except for the following, which need to be fixed:

- `rolling_var` and `rolling_std` return `0.0`; and
- `expanding_std` and `expanding_var` produce `ValueError: min_periods (2) must be <= window (1)`.

@jaimefrio, I'm going to include in https://github.com/pydata/pandas/pull/7926 a simple fix for the `rolling/expanding_var/std` problems mentioned in the immediately prior comment. This is so that I can do consistency checks across the various `ewm`/`expanding`/`rolling` functions. (I hope to have it checked in by tomorrow.) Feel free to do something fancier w/ `algos.roll_var()`...
The `rolling_var`/`std` fix is simply to modify that `if` statement that has a `# pathological case` comment? We probably just need to replace it by an `if nobs <= ddof: val = NaN`, as the 0 value for a single-observation biased variance will come up naturally, especially with the recent changes I have in the pipeline.

How will users actually get to specify this parameter? Right now it is hardcoded to unbiased estimation, `ddof = 1`.
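In Python terms the proposed guard amounts to something like the following (a sketch of the idea only, with invented names; the real change would live in the Cython `roll_var` kernel):

```python
import math

def finalize_var(ssqdm, nobs, ddof=1):
    # With nobs <= ddof the denominator is <= 0, so the estimator is
    # undefined: return NaN instead of a special-cased "pathological"
    # branch. A single observation with ddof=0 then yields 0.0 naturally,
    # because the sum of squared deviations (ssqdm) of one value is 0.
    if nobs <= ddof:
        return math.nan
    return ssqdm / (nobs - ddof)

print(finalize_var(0.0, nobs=1, ddof=1))  # nan
print(finalize_var(0.0, nobs=1, ddof=0))  # 0.0
print(finalize_var(8.0, nobs=3, ddof=1))  # 4.0
```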
I have the fix for `rolling_var` almost done, but I still need to write some tests for it. I also have a branch where I am implementing a stable algorithm for skewness and kurtosis, including a similar check for "all non-NaNs in window are identical."
I have lost track of all the changes you are trying to pull together, but just so you know, I silently approve of your efforts from a distance... ;)
Yeah, sorry, I've been encountering lots of little inconsistencies and edge cases that I've been trying to clean up. Some of the stuff I did is already in master (#7603, #7738, #7766, #7896, #7898 (which is obsolete in view of #7977), and #7934); #8059 is ready to be merged; and #7926 should be ready soon -- it has a lot of new consistency checks for the `ewm*`, `expanding_*`, and `rolling_*` functions, which have helped me identify many of these issues (e.g. #8084 and #8086, which I don't have any immediate plans to work on). Things should be a bit cleaner / more stable once I finish #7926.
As for `ddof`, some of the `rolling_` functions support it, e.g. `rolling_var`, though it's not documented. I think it would be best to make it explicit and documented in all the relevant `rolling_` and `expanding_` functions. I actually make use of it in my new tests in #7926. I'll let you know when that's ready...
OK, I think I'm pretty much done w/ https://github.com/pydata/pandas/pull/7926, though I still need to rebase it after #8059 is merged into master.
Note the `test_rolling_consistency()` and `test_expanding_consistency()` functions. They each do two things for various arguments (`window`, `min_periods`, `center`):

- call `self._test_moments_consistency()`, to test mathematical consistency between the various moments; and
- compare the `rolling/expanding_xyz()` functions with `rolling/expanding_apply()` applied to `Series/DataFrame.xyz()`.

You'll see that several tests are commented out with comments of the form `# restore once ...`. These indicate tests that are currently failing (at least in some cases), but should pass once `...` is done/fixed.
@jaimefrio, note in particular that I commented out the test for the correlation of a series with itself being identically `1.`, because `rolling_cov(x, x)` is not identically equal to `rolling_var(x)`, resulting in correlations that are not identically `1.`. As you observed, replacing `algos.roll_var()` with an `algos.roll_cov()`, and implementing `rolling_var()` in terms of it (the way `ewmcov()` and `ewmvar()` are now defined), should fix the problem.
@jaimefrio, note also that the tests call `_test_moments_consistency()` with both unbiased (the default) and biased (with `ddof=0`) versions of `expanding/rolling_std/var()`, though unfortunately there's currently no way to call a biased version of `expanding/rolling_cov()`. (One could use the identity cov(x,y) = 0.5*(var(x+y) - var(x) - var(y)) with biased versions of var(), but that doesn't help in testing consistency...) Ideally the putative `algos.roll_cov()` would take a `ddof` parameter, so that one could calculate unbiased and biased versions of all the second moments.
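The identity itself is easy to check numerically with NumPy's biased estimators (the arrays below are arbitrary example data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 4.0])

# Biased covariance directly...
cov_direct = np.cov(x, y, bias=True)[0, 1]
# ...and via the identity cov(x,y) = 0.5*(var(x+y) - var(x) - var(y));
# np.var defaults to ddof=0, i.e. the biased variance.
cov_identity = 0.5 * (np.var(x + y) - np.var(x) - np.var(y))

print(cov_direct, cov_identity)  # both equal 2/3
```

Note the identity only holds with matching `ddof` on all three variance terms, which is exactly why it can't substitute for a real `ddof` parameter on the covariance itself.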
These tests are rather slow, and can probably be trimmed a bit (i.e. fewer distinct `window` and `min_periods` values), but I've found them extremely useful in identifying bugs and inconsistencies.
OK, #8059 has been merged into master, and I've updated #7926, which I think is now ready to be merged. @jaimefrio, you may find #7926's `test_rolling_consistency()` function helpful in testing.
I noticed (in https://github.com/pydata/pandas/issues/7884) that:

- `ewmvar`, `ewmstd`, `ewmvol`, `ewmcov`, `rolling_var`, and `rolling_std` return `0.0` for a single value (assuming `min_periods=0`); whereas
- `Series.std`, `Series.var`, `ewmcorr`, `expanding_cov`, `expanding_corr`, `rolling_cov`, and `rolling_corr` all return `NaN` for a single value; and
- `expanding_std` and `expanding_var` produce `ValueError: min_periods (2) must be <= window (1)`.

I think all of these should return `NaN` for a single value. At any rate, I would expect greater consistency one way or the other.

Mildly related, when calculating the correlation of a constant series with itself, `Series.corr()`, `expanding_corr`, and `rolling_corr` return `NaN`, while `ewmcorr` sometimes returns `NaN`, sometimes `1`, and sometimes `-1`, due to numerical accuracy issues. Actually, as shown in a separate comment below, `rolling_corr` also produces inconsistent results for a constant subseries following different prior values.

Inconsistencies in calculating second moments of a single point:

Instability in `ewmcorr` of a constant series with itself: