pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Inconsistencies in calculating second moments of a single value #7900

Closed seth-p closed 10 years ago

seth-p commented 10 years ago

I noticed (in https://github.com/pydata/pandas/issues/7884) that ewmvar, ewmstd, ewmvol, ewmcov, rolling_var, and rolling_std return 0.0 for a single value (assuming min_periods=0), whereas Series.std, Series.var, ewmcorr, expanding_cov, expanding_corr, rolling_cov, and rolling_corr all return NaN for a single value. expanding_std and expanding_var raise ValueError: min_periods (2) must be <= window (1).

I think all of these should return NaN for a single value. At any rate, I would expect greater consistency one way or the other.

Mildly related: when calculating the correlation of a constant series with itself, Series.corr(), expanding_corr, and rolling_corr return NaN, while ewmcorr sometimes returns NaN, sometimes 1, and sometimes -1, due to numerical accuracy issues. Actually, as shown in a separate comment below, rolling_corr also produces inconsistent results for a constant subseries following different prior values.

Inconsistencies in calculating second moments of a single point:

Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:45:13) [MSC v.1600 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from pandas import Series, ewmvar, ewmstd, ewmvol, ewmcov, rolling_var, rolling_std, ewmcorr, expanding_cov, expanding_corr, expanding_std, expanding_var, rolling_cov, rolling_corr
In [2]: s = Series([1.])

In [3]: ewmvar(s, halflife=2., min_periods=0)
Out[3]:
0    0
dtype: float64

In [4]: ewmstd(s, halflife=2., min_periods=0)
Out[4]:
0    0
dtype: float64

In [5]: ewmvol(s, halflife=2., min_periods=0)
Out[5]:
0    0
dtype: float64

In [6]: ewmcov(s, s, halflife=2., min_periods=0)
Out[6]:
0    0
dtype: float64

In [7]: rolling_var(s, window=2, min_periods=0)
Out[7]:
0    0
dtype: float64

In [8]: rolling_std(s, window=2, min_periods=0)
Out[8]:
0    0
dtype: float64

In [9]: s.std()
Out[9]: nan

In [10]: s.var()
Out[10]: nan

In [11]: ewmcorr(s, s, halflife=2., min_periods=0)
Out[11]:
0   NaN
dtype: float64

In [12]: expanding_cov(s, s, min_periods=0)
Out[12]:
0   NaN
dtype: float64

In [13]: expanding_corr(s, s, min_periods=0)
Out[13]:
0   NaN
dtype: float64

In [16]: rolling_cov(s, s, window=3, min_periods=0)
Out[16]:
0   NaN
dtype: float64

In [17]: rolling_corr(s, s, window=3, min_periods=0)
Out[17]:
0   NaN
dtype: float64

In [14]: expanding_std(s, min_periods=0)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-7320ee579e6a> in <module>()
----> 1 expanding_std(s, min_periods=0)

C:\Python34\lib\site-packages\pandas\stats\moments.py in f(arg, min_periods, freq, center, **kwargs)
    825             return func(arg, window, minp, **kwds)
    826         return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
--> 827                                center=center, **kwargs)
    828
    829     return f

C:\Python34\lib\site-packages\pandas\stats\moments.py in _rolling_moment(arg, window, func, minp, axis, freq, center, how, args, kwargs, **kwds)
    345         result = np.apply_along_axis(calc, axis, values)
    346     else:
--> 347         result = calc(values)
    348
    349     rs = return_hook(result)

C:\Python34\lib\site-packages\pandas\stats\moments.py in <lambda>(x)
    339     arg = _conv_timerule(arg, freq, how)
    340     calc = lambda x: func(x, window, minp=minp, args=args, kwargs=kwargs,
--> 341                           **kwds)
    342     return_hook, values = _process_data_structure(arg)
    343     # actually calculate the moment. Faster way to do this?

C:\Python34\lib\site-packages\pandas\stats\moments.py in call_cython(arg, window, minp, args, kwargs, **kwds)
    823         def call_cython(arg, window, minp, args=(), kwargs={}, **kwds):
    824             minp = check_minp(minp, window)
--> 825             return func(arg, window, minp, **kwds)
    826         return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
    827                                center=center, **kwargs)

C:\Python34\lib\site-packages\pandas\stats\moments.py in <lambda>(*a, **kw)
    604                                how='median')
    605
--> 606 _ts_std = lambda *a, **kw: _zsqrt(algos.roll_var(*a, **kw))
    607 rolling_std = _rolling_func(_ts_std, 'Unbiased moving standard deviation.',
    608                             check_minp=_require_min_periods(1))

C:\Python34\lib\site-packages\pandas\algos.pyd in pandas.algos.roll_var (pandas\algos.c:28990)()

C:\Python34\lib\site-packages\pandas\algos.pyd in pandas.algos._check_minp (pandas\algos.c:16394)()

ValueError: min_periods (2) must be <= window (1)

In [15]: expanding_var(s, min_periods=0)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-28e2018aee68> in <module>()
----> 1 expanding_var(s, min_periods=0)

C:\Python34\lib\site-packages\pandas\stats\moments.py in f(arg, min_periods, freq, center, **kwargs)
    825             return func(arg, window, minp, **kwds)
    826         return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
--> 827                                center=center, **kwargs)
    828
    829     return f

C:\Python34\lib\site-packages\pandas\stats\moments.py in _rolling_moment(arg, window, func, minp, axis, freq, center, how, args, kwargs, **kwds)
    345         result = np.apply_along_axis(calc, axis, values)
    346     else:
--> 347         result = calc(values)
    348
    349     rs = return_hook(result)

C:\Python34\lib\site-packages\pandas\stats\moments.py in <lambda>(x)
    339     arg = _conv_timerule(arg, freq, how)
    340     calc = lambda x: func(x, window, minp=minp, args=args, kwargs=kwargs,
--> 341                           **kwds)
    342     return_hook, values = _process_data_structure(arg)
    343     # actually calculate the moment. Faster way to do this?

C:\Python34\lib\site-packages\pandas\stats\moments.py in call_cython(arg, window, minp, args, kwargs, **kwds)
    823         def call_cython(arg, window, minp, args=(), kwargs={}, **kwds):
    824             minp = check_minp(minp, window)
--> 825             return func(arg, window, minp, **kwds)
    826         return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
    827                                center=center, **kwargs)

C:\Python34\lib\site-packages\pandas\algos.pyd in pandas.algos.roll_var (pandas\algos.c:28990)()

C:\Python34\lib\site-packages\pandas\algos.pyd in pandas.algos._check_minp (pandas\algos.c:16394)()

ValueError: min_periods (2) must be <= window (1)

In [20]: show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.14.1
nose: 1.3.3
Cython: 0.20.2
numpy: 1.9.0b1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.3.0
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.4
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None

Instability in ewmcorr of a constant series with itself:

In [1]: from pandas import Series, ewmcorr, expanding_corr, rolling_corr

In [2]: s = Series([5., 5., 5.])

In [3]: s.corr(s)
Out[3]: nan

In [4]: expanding_corr(s, s)
Out[4]:
0   NaN
1   NaN
2   NaN
dtype: float64

In [5]: rolling_corr(s, s, window=3)
Out[5]:
0   NaN
1   NaN
2   NaN
dtype: float64

In [6]: ewmcorr(s, s, halflife=3.)
Out[6]:
0   -1
1   -1
2   -1
dtype: float64

In [9]: ewmcorr(Series([3., 3., 3.]), halflife=3.)
Out[9]:
0   NaN
1     1
2     1
dtype: float64
seth-p commented 10 years ago

A couple questions:

  1. Do you agree that the second moment functions should all return NaN for a single value?
  2. Is there an existing function in Pandas that rounds tiny numbers near 0 to 0? The ewmcorr instability problem is due to the fact that ewmvar of a constant series isn't always identically zero, but rather is sometimes a tiny positive or negative number.
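(For illustration only: a clamp along those lines might look like the sketch below. clip_near_zero is a hypothetical helper, not an existing pandas function, and the tolerance is arbitrary.)

import numpy as np

def clip_near_zero(values, tol=1e-12):
    # Hypothetical helper: snap anything within +/- tol of zero to exactly 0.0.
    values = np.asarray(values, dtype=float)
    return np.where(np.abs(values) < tol, 0.0, values)

clip_near_zero([8.881784e-16, -4.4e-16, 1.0])   # -> array([0., 0., 1.])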
jreback commented 10 years ago

See the rolling_var stuff, which 'fixes' the numerical-issues part by using the Welford algorithm.

1) Yes. 2) Can you post an example?
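(For reference, a minimal sketch of Welford's streaming variance in its textbook form; the windowed Cython roll_var is a variant of this idea, not this exact code.)

def welford_var(values, ddof=1):
    # Welford's streaming variance: one pass, no large cancelling sums.
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # note: uses the updated mean
    return float('nan') if n <= ddof else m2 / (n - ddof)

welford_var([1., 5., 5., 5.])      # sample variance, approximately 4.0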

seth-p commented 10 years ago

Re: 1). Should the answer depend on whether bias is True or False (for the ewm* functions), or whether ddof is 0 or 1 (for the others)? Perhaps one would want the result to be either NaN or 0 depending on the value of bias/ddof. I'm really not sure, so I want to make sure others are before delving into the code...

Re: 2), I added an example to the end of the initial comment. (Sorry, I guess I should have added it as a separate comment.)

seth-p commented 10 years ago

Re: rounding near 0, algos.roll_var() deals with tiny negative errors very simply:

                if val < 0:
                    val = 0

It doesn't deal with tiny positive errors. (The "smartness" of the algorithm has to do with the rolling aspect of the problem; I don't think it has any particular smartness for dealing with constant series, other than the check for negative errors mentioned above.)

seth-p commented 10 years ago

Here's an example showing that rolling_var also has numerical problems.

In [30]: rolling_var(Series([1., 5., 5., 5.]), window=3)
Out[30]:
0             NaN
1             NaN
2    5.333333e+00
3    8.881784e-16
dtype: float64

The final value should be identically 0.0.
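(That residual is a few machine epsilons of rounding noise rather than a real variance; for comparison, 4 * eps prints with the same leading digits. Illustrative check only.)

import numpy as np
4 * np.finfo(float).eps   # 8.881784197001252e-16 -- same magnitude as the residual above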

This instability plays out in inconsistent results for the correlation of a constant series with itself:

In [33]: rolling_corr(Series([0., 5., 5., 5.]), window=3)
Out[33]:
0   NaN
1   NaN
2     1
3   NaN
dtype: float64

In [34]: rolling_corr(Series([1., 5., 5., 5.]), window=3)
Out[34]:
0   NaN
1   NaN
2     1
3     0
dtype: float64

In both cases the final value should be the correlation of [5., 5., 5.] with itself, yet one produces NaN and the other 0.
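(A small illustration of why the correlation becomes arbitrary once the second moments of a constant window carry rounding residuals. Only the variance residual below is taken, to displayed precision, from Out[30] above; the covariance residuals are made-up examples.)

import numpy as np

var_resid = 8.881784e-16                             # rolling_var residual displayed in Out[30]
cov_resid = np.array([0.0, var_resid, -var_resid])   # hypothetical covariance residuals

# corr = cov / sqrt(var_x * var_y); with var_x == var_y == var_resid this is cov / var_resid
with np.errstate(invalid='ignore'):
    print(cov_resid / var_resid)                     # [ 0.  1. -1.]  -- pure noise
    print(np.array(0.0) / np.array(0.0))             # nan -- what exact zeros would give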

seth-p commented 10 years ago

CC'ing @snth, @jaimefrio, and @kdiether, who appear to have worked on this/related code.

jaimefrio commented 10 years ago

The check for val < 0 in the rolling_var function is there because it was in the previous algorithm and I preferred to play it safe: a negative value would be a catastrophic failure when computing rolling_std. It should be mostly unnecessary with the new algorithm.

I don't think that rounding very small rolling_var values down to 0 is a good idea. I have quickly hacked the implementation to keep track of consecutive identical values, see #7916. I have only tested it with your small example above and have made no assessment of the performance impact. But I feel that if the changes are to be made in rolling_var, something like this should be the way to go. Rounding down to zero within rolling_corr wouldn't feel that bad, but maybe that is just me.
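(A pure-Python sketch of that idea, for illustration only; the function name is mine and this is not the actual #7916 patch, which does the bookkeeping inside the streaming Cython loop.)

import numpy as np

def rolling_var_exact_zero(values, window, ddof=1):
    # Toy rolling variance: if every non-NaN value in the window is identical,
    # return exactly 0.0 instead of whatever rounding noise the update leaves.
    values = np.asarray(values, dtype=float)
    out = np.full(len(values), np.nan)
    for i in range(window - 1, len(values)):
        w = values[i - window + 1:i + 1]
        w = w[~np.isnan(w)]
        if len(w) <= ddof:
            continue                       # not enough observations -> NaN
        out[i] = 0.0 if np.all(w == w[0]) else w.var(ddof=ddof)
    return out

rolling_var_exact_zero([1., 5., 5., 5.], window=3)   # [nan, nan, 5.333..., 0.0]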

The other thing I can commit to doing is implementing a variant of Welford's method for rolling_cov (see e.g. the stable one-pass algorithm here), so that when the rolling covariance of a series with itself is computed, the operations lead to exactly the same value as computing the rolling variance of the series. This will not solve the problem of 0 / 0 being NaN, but at least the correlation of a series with itself will be either 1 or NaN, not 0.
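(A sketch of that approach, assuming the standard one-pass covariance update; when y is x, every floating-point operation matches the streaming variance update, so cov(x, x) comes out identical to var(x). This is an illustration, not the eventual pandas implementation.)

import numpy as np

def streaming_cov(x, y, ddof=1):
    # One-pass (Welford-style) covariance.
    mean_x = mean_y = c = 0.0
    n = 0
    for xi, yi in zip(x, y):
        n += 1
        dx = xi - mean_x           # deviation from the *old* mean of x
        mean_x += dx / n
        mean_y += (yi - mean_y) / n
        c += dx * (yi - mean_y)    # uses the *updated* mean of y
    return np.nan if n <= ddof else c / (n - ddof)

s = [1., 5., 5., 5.]
streaming_cov(s, s)                # operation for operation, the same as a Welford-style var(s)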

seth-p commented 10 years ago

Sounds good. I think that if you add an explicit check for constant series, then there's no need to round small results to 0.

Yeah, I have no strong opinion about whether the correlation of a constant series with itself should be 1 or NaN -- at least it shouldn't be 0 or -1.

seth-p commented 10 years ago

Regarding the general inconsistencies between NaN and 0 for the second moments of a single value, on further reflection I think it makes sense that the result be NaN for unbiased statistics (bias=False or ddof>=1) but 0 for biased statistics (bias=True or ddof=0).

I will go through the list of functions above and see how the behavior accords with this principle.

seth-p commented 10 years ago

After further review (and once https://github.com/pydata/pandas/issues/7912 is addressed), I think that all biased (bias=True or ddof=0) second moments return 0 for a single value, while unbiased (bias=False or ddof=1) ones return NaN, except for the following, which need to be fixed:

seth-p commented 10 years ago

@jaimefrio, I'm going to include in https://github.com/pydata/pandas/pull/7926 a simple fix for the rolling/expanding_var/std problems mentioned in the immediately prior comment. This is so that I can do consistency checks across the various ewm/expanding/rolling functions. (I hope to have it checked in by tomorrow.) Feel free to do something fancier w/ algos.roll_var()...

jaimefrio commented 10 years ago

The rolling_var/std fix is simply to modify that if statement that has a # pathological case comment? We probably just need to replace it with an if nobs <= ddof: val = NaN, as the 0 value for single-observation biased variance will come up naturally, especially with the recent changes I have in the pipeline.
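(In other words, roughly the following decision at the end of the window loop; a sketch of the rule, not the actual Cython.)

import numpy as np

def finalize_var(ssqdm, nobs, ddof):
    # ssqdm: sum of squared deviations from the mean accumulated for this window
    if nobs <= ddof:
        return np.nan              # too few observations for this ddof
    return ssqdm / (nobs - ddof)   # single observation with ddof=0: 0.0 / 1 == 0.0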

How will users actually get to specify this parameter? Right now it is hardcoded to unbiased estimation, ddof = 1.

I have the fix for rolling_var almost done, but I still need to write some tests for it. I also have a branch where I am implementing a stable algorithm for skewness and kurtosis, including a similar check for "all non-NaNs in window are identical."

I have lost track of all the changes you are trying to pull together, but just so you know, I silently approve of your efforts from the distance... ;)

seth-p commented 10 years ago

Yeah, sorry, I've been encountering lots of little inconsistencies and edge cases that I've been trying to clean up. Some of the stuff I did is already in master (#7603, #7738, #7766, #7896, #7898 (which is obsolete in view of #7977), and #7934); #8059 is ready to be merged; and #7926 should be ready soon -- it has a lot of new consistency checks for the ewm*, expanding_* and rolling_* functions, which have helped me identify many of these issues (e.g. #8084 and #8086, which I don't have any immediate plans to work on). Things should be a bit cleaner / more stable once I finish #7926.

As for ddof, some of the rolling_ functions support it, e.g. rolling_var, though it's not documented. I think it would be best to make it explicit and documented in all the relevant rolling_ and expanding_ functions. I actually make use of it in my new tests in #7926. I'll let you know when that's ready...

seth-p commented 10 years ago

OK, I think I'm pretty much done w/ https://github.com/pydata/pandas/pull/7926, though I still need to rebase it after #8059 is merged into master.

Note the test_rolling_consistency() and test_expanding_consistency() functions. They each do two things for various arguments (window, min_periods, center):

You'll see that several tests are commented out with comments of the form # restore once .... These indicate tests that are currently failing (at least in some cases), but should pass once ... is done/fixed.

@jaimefrio, note in particular that I commented out the test for the correlation of a series with itself being identically 1. because rolling_cov(x,x) is not identically equal to rolling_var(x), resulting in correlations that are not identically 1.. As you observed, replacing algos.roll_var() with an algos.roll_cov(), and implementing rolling_var() in terms of it (the way ewmcov() and ewmvar() are now defined) should fix the problem.

@jaimefrio, note also that the tests call _test_moments_consistency() with both unbiased (the default) and biased (with ddof=0) versions of expanding/rolling_std/var(), though unfortunately currently there's no way to call a biased version of expanding/rolling_cov(). (One could use the identity cov(x,y)=0.5*(var(x+y)-var(x)-var(y)) with biased versions of var(), but that doesn't help in testing consistency...) Ideally the putative algos.roll_cov() would take a ddof parameter, so that one could calculate unbiased and biased versions of all the second moments.
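(A quick numerical check of that identity, with ddof=0 throughout; illustrative only.)

import numpy as np

rng = np.random.RandomState(0)
x, y = rng.randn(100), rng.randn(100)
lhs = np.cov(x, y, ddof=0)[0, 1]
rhs = 0.5 * (np.var(x + y) - np.var(x) - np.var(y))   # np.var defaults to ddof=0
print(np.isclose(lhs, rhs))                           # True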

These tests are rather slow, and can probably be trimmed a bit (i.e. fewer distinct window and min_periods values), but I've found them extremely useful in identifying bugs and inconsistencies.

seth-p commented 10 years ago

OK, #8059 has been merged into master, and I've updated #7926, which I think is now ready to be merged. @jaimefrio, you may find #7926's test_rolling_consistency() function helpful in testing.