pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Rolling skewness and kurtosis fail on a sample of all equal values #5749

Closed by yieldsfalsehood 10 years ago

yieldsfalsehood commented 10 years ago

For a sample of data like this:

d = pd.Series([1] * 25)

Both of these throw an exception (during an attempt to divide by zero):

pd.rolling_skew(d, window=25)
pd.rolling_kurt(d, window=25)

The issue is in algos.pyx. There are no checks for what amounts to zero variance in the data. If one value occurs at least as many times in a row as the size of the window, the entire rolling computation fails, rather than just returning NaN for that one period (which is what I'd expect). For reference, scipy gives a kurtosis of -3 and a skewness of 0 (plus a warning) for this situation. That is not what I'd expect either: the central moments are all zero, so both statistics reduce to 0 / 0 and are undefined.

>>> from scipy import stats
>>> stats.kurtosis([1,1,1,1,1,1,1])
-3.0
>>> stats.skew([1,1,1,1,1,1,1])
/usr/lib/python2.7/dist-packages/scipy/stats/stats.py:1067: RuntimeWarning: invalid value encountered in double_scalars
  vals = np.where(zero, 0, m3 / m2**1.5)
0.0
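
As a rough illustration of why these values are undefined, here is a minimal pure-Python sketch that computes population skewness and excess kurtosis from central moments and returns NaN when the variance is zero. The helper names (central_moment, skew_and_kurtosis) are made up for this example and are not part of pandas or scipy. The warning line above suggests scipy substitutes 0 for the undefined ratio, which would explain the reported 0 and -3 (that is, 0 - 3).

```python
import math


def central_moment(values, k):
    """Return the k-th central moment of a non-empty sequence."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skew_and_kurtosis(values):
    """Population skewness and excess kurtosis; NaN when variance is zero.

    Hypothetical helper for illustration only.
    """
    m2 = central_moment(values, 2)
    if m2 == 0:
        # m3 / m2**1.5 and m4 / m2**2 would both be 0 / 0 here.
        return math.nan, math.nan
    m3 = central_moment(values, 3)
    m4 = central_moment(values, 4)
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


constant = skew_and_kurtosis([1] * 7)        # zero-variance sample
varied = skew_and_kurtosis([1, 2, 3, 4, 10])  # right-skewed sample
```

Here constant comes back as a pair of NaNs instead of raising, while varied is finite.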

Below is the approach I was taking to weed out any possible divide-by-zero issues. I'll submit a proper pull request tomorrow; in the meantime, this is here in case I can get any feedback. In particular, I'd like to know whether these added conditions are enough (I think the kurtosis could still break) and how to add some tests for both of these.

diff --git a/pandas/algos.pyx b/pandas/algos.pyx
index 08ec707..78b619f 100644
--- a/pandas/algos.pyx
+++ b/pandas/algos.pyx
@@ -1160,7 +1160,7 @@ def roll_skew(ndarray[double_t] input, int win, int minp):

                 nobs -= 1

-        if nobs >= minp:
+        if nobs >= minp and not (x == 0 and xx == 0) and nobs != 2:
             A = x / nobs
             B = xx / nobs - A * A
             C = xxx / nobs - A * A * A - 3 * A * B
@@ -1227,7 +1227,7 @@ def roll_kurt(ndarray[double_t] input,

                 nobs -= 1

-        if nobs >= minp:
+        if nobs >= minp and not (x == 0 and xx == 0) and nobs != 2:
             A = x / nobs
             R = A * A
             B = xx / nobs - R
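
On whether the added conditions are enough: for d = pd.Series([1] * 25) the running sums are x = 25 and xx = 25, so the (x == 0 and xx == 0) test does not fire, while B = xx / nobs - A * A is still 0. One possible alternative, sketched below in plain Python rather than Cython, is to test the variance term B directly. The function rolling_skew here is hypothetical and only mirrors the variable names from the diff. The final formula is the standard bias-adjusted sample skewness and is an assumption, since that line is not shown in the diff.

```python
import math


def rolling_skew(values, window, min_periods=None):
    """Pure-Python sketch of rolling sample skewness using running sums.

    Hypothetical illustration; not the pandas implementation.
    """
    minp = window if min_periods is None else min_periods
    out = []
    x = xx = xxx = 0.0  # running sums of v, v**2 and v**3
    nobs = 0
    for i, v in enumerate(values):
        x += v
        xx += v * v
        xxx += v * v * v
        nobs += 1
        if i >= window:
            # Drop the observation that just left the window.
            old = values[i - window]
            x -= old
            xx -= old * old
            xxx -= old * old * old
            nobs -= 1

        result = math.nan
        if nobs >= minp and nobs >= 3:
            A = x / nobs                                   # mean
            B = xx / nobs - A * A                          # variance
            C = xxx / nobs - A * A * A - 3 * A * B         # third central moment
            if B > 0:
                # Assumed formula: bias-adjusted sample skewness.
                R = math.sqrt(B)
                result = math.sqrt(nobs * (nobs - 1.0)) * C / ((nobs - 2) * R ** 3)
        out.append(result)
    return out


flat = rolling_skew([1.0] * 25, window=25)
mixed = rolling_skew([1.0, 1.0, 1.0, 2.0, 5.0], window=3)
```

With this guard, flat contains only NaN and no exception is raised, and mixed has NaN for the all-equal window but a finite value once the window contains different values. With floating-point input, B can come out as a tiny nonzero number for a constant window, so a small tolerance instead of B > 0 may be needed in practice.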

jreback commented 10 years ago

yep...prob nice to have some nice edge tests for these

yieldsfalsehood commented 10 years ago

I sent in a pull request for this - https://github.com/pydata/pandas/pull/5760