pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: expanding/rolling_skew/kurt() inconsistent with Series.skew/kurt() #8086

Closed seth-p closed 6 years ago

seth-p commented 10 years ago

Note that for a constant series, Series.skew() returns 0, while rolling/expanding_skew() return NaN.

In [532]: s = Series([1]*4)

In [533]: s.skew()
Out[533]: 0

In [534]: expanding_skew(s)
Out[534]:
0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64

In [535]: rolling_skew(s, 4)
Out[535]:
0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64

While rolling/expanding_kurt() similarly return NaN for a constant series, Series.kurt() produces an error.

In [541]: expanding_kurt(s)
Out[541]:
0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64

In [542]: rolling_kurt(s, 4)
Out[542]:
0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64

In [543]: s.kurt()
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-543-e41f2ec435bd> in <module>()
----> 1 s.kurt()

C:\Python34\lib\site-packages\pandas\core\generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
   3783                                               skipna=skipna)
   3784                 return self._reduce(f, axis=axis,
-> 3785                                     skipna=skipna, numeric_only=numeric_only)
   3786             stat_func.__name__ = name
   3787             return stat_func

C:\Python34\lib\site-packages\pandas\core\series.py in _reduce(self, op, axis, skipna, numeric_only, filter_type, **kwds)
   2007                 filter_type=None, **kwds):
   2008         """ perform a reduction operation """
-> 2009         return op(_values_from_object(self), skipna=skipna, **kwds)
   2010
   2011     def _reindex_indexer(self, new_index, indexer, copy):

C:\Python34\lib\site-packages\pandas\core\nanops.py in _f(*args, **kwargs)
     46                                 'this dtype'.format(f.__name__.replace('nan',
     47                                                                        '')))
---> 48             return f(*args, **kwargs)
     49         return _f
     50

C:\Python34\lib\site-packages\pandas\core\nanops.py in nankurt(values, axis, skipna)
    504     D = _zero_out_fperr(D)
    505
--> 506     result = (((count * count - 1.) * D / (B * B) - 3 * ((count - 1.) ** 2)) /
    507               ((count - 2.) * (count - 3.)))
    508     if isinstance(result, np.ndarray):

ZeroDivisionError: float division by zero
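
A minimal sketch of why the expression in the last traceback frame fails on a constant series. The traceback shows only the final formula, so the definitions of `B` and `D` below (biased second and fourth central moments) are an assumption, and `kurt_from_moments` and `kurt_or_nan` are hypothetical names, not pandas functions. With plain Python floats, a constant input gives `B == 0`, so the `D / (B * B)` term becomes `0.0 / 0.0`.

```python
import math


def kurt_from_moments(values):
    """Evaluate the expression from the traceback with plain Python floats.

    Assumption: B and D are the biased 2nd and 4th central moments.
    """
    count = float(len(values))
    mean = sum(values) / count
    B = sum((v - mean) ** 2 for v in values) / count
    D = sum((v - mean) ** 4 for v in values) / count
    return (((count * count - 1.0) * D / (B * B) - 3 * ((count - 1.0) ** 2)) /
            ((count - 2.0) * (count - 3.0)))


def kurt_or_nan(values):
    """Hypothetical guard: return NaN whenever a denominator would be zero."""
    try:
        return kurt_from_moments(values)
    except ZeroDivisionError:
        return math.nan


constant = [1, 1, 1, 1]
try:
    kurt_from_moments(constant)
    raised = False
except ZeroDivisionError:
    raised = True

print(raised)                 # True: 0.0 / 0.0 raises with Python floats
print(kurt_or_nan(constant))  # nan
```

The guard also covers fewer than four values, where `(count - 2.) * (count - 3.)` is zero. Returning NaN there is one possible convention, not necessarily the one pandas should adopt.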

seth-p commented 10 years ago

My 2c: there really shouldn't be two different implementations of skew/kurtosis functions.

jreback commented 10 years ago

@seth-p maybe edit the top section with a code-reference so that when this is addressed the section can be fixed (as you have commented for #7926)

seth-p commented 10 years ago

@jreback, afraid I'm not sure I follow your last comment. Do you mean the following?

As a test for a fix, un-comment the following lines in test_moments.py:

        #(mom.expanding_skew, lambda v: Series(v).skew(), 3), # restore once GH 8086 is fixed
        #(mom.expanding_kurt, lambda v: Series(v).kurt(), 4), # restore once GH 8086 is fixed
        #(mom.rolling_skew, lambda v: Series(v).skew(), 3), # restore once GH 8086 is fixed
        #(mom.rolling_kurt, lambda v: Series(v).kurt(), 4), # restore once GH 8086 is fixed

jreback commented 10 years ago

@seth-p I realize now why you commented these out

jaimefrio commented 10 years ago

The solution to this is to modify the behavior of Series.skew() and Series.kurt() so that they return np.nan instead of 0, right?

seth-p commented 10 years ago

I think so, assuming they are supposed to return unbiased estimates. Though to be honest I haven't examined them closely beyond observing the inconsistencies noted above.

jaimefrio commented 10 years ago

I was looking into this to remove the commented tests for the rolling versions. There are a bunch of tests making sure that the return value is 0, and the code itself explicitly sets things to 0. So even if it is a bug, it was clearly considered a feature when the code was written. Not sure whether changing the behavior can be happily done without breaking someone's code...

jreback commented 10 years ago

Related to this: https://github.com/pydata/pandas/pull/7928

I wouldn't return nan unless you have all missing values or it's empty. 0 is the calculated value, no?
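
The two conventions under discussion can be put side by side in a small sketch. It uses the adjusted Fisher-Pearson sample skewness; whether that matches what either pandas implementation computes is an assumption here, and `adjusted_skew` and `constant_result` are hypothetical names introduced only for illustration.

```python
import math


def adjusted_skew(values, constant_result=math.nan):
    """Adjusted Fisher-Pearson sample skewness.

    constant_result is a hypothetical knob: the value returned when the
    variance is zero, i.e. when the ratio m3 / m2**1.5 would be 0/0.
    """
    n = len(values)
    if n < 3:
        return math.nan  # fewer than 3 points: the (n - 2) correction is undefined
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # biased 2nd central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # biased 3rd central moment
    if m2 == 0:
        return constant_result
    return math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5


print(adjusted_skew([1, 1, 1, 1]))                       # nan
print(adjusted_skew([1, 1, 1, 1], constant_result=0.0))  # 0.0
print(adjusted_skew([1, 2, 3, 4]))                       # 0.0 (symmetric data)
```

For a constant series both the numerator and the denominator are zero, so 0 is a chosen value rather than a computed one; the sketch only makes that choice explicit.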

seth-p commented 10 years ago

I think, ideally, the commented-out consistency tests (https://github.com/pydata/pandas/issues/8086#issuecomment-55687801) should be there. Obviously some of the cases are 0 / NaN, but others aren't.

Separately, I think we should add to _test_moments_consistency() the additional parameters skew_unbiased=None, kurt_unbiased=None, skew_biased=None, and kurt_biased=None, along with consistency checks between them and the corresponding lower-order moments (biased or unbiased, as appropriate), i.e. comparable to this test for biased variance estimates:

                if var is var_biased:
                    # check that biased var(x) == mean(x^2) - mean(x)^2
                    mean_x2 = mean(x * x)
                    assert_equal(var_x, mean_x2 - (mean_x * mean_x))
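
A standalone sketch of that kind of check, extended by one order to show what a comparable test for a biased third moment might look like. It uses plain Python rather than the test-suite helpers, and `mean`, `biased_var`, and `biased_third_moment` are hypothetical names, not functions from test_moments.py.

```python
def mean(xs):
    return sum(xs) / len(xs)


def biased_var(xs):
    """Biased (divide by N) second central moment."""
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])


def biased_third_moment(xs):
    """Biased (divide by N) third central moment."""
    m = mean(xs)
    return mean([(x - m) ** 3 for x in xs])


x = [1.0, 2.0, 4.0, 8.0]
mean_x = mean(x)
mean_x2 = mean([v * v for v in x])
mean_x3 = mean([v ** 3 for v in x])

# biased var(x) == mean(x^2) - mean(x)^2
var_from_raw = mean_x2 - mean_x * mean_x
# biased 3rd central moment == mean(x^3) - 3*mean(x)*mean(x^2) + 2*mean(x)^3
m3_from_raw = mean_x3 - 3 * mean_x * mean_x2 + 2 * mean_x ** 3

assert abs(biased_var(x) - var_from_raw) < 1e-9
assert abs(biased_third_moment(x) - m3_from_raw) < 1e-9
```

A biased skewness check would then divide the third moment by the biased variance raised to the power 1.5, which is where the constant-series case needs an explicit decision.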

seth-p commented 10 years ago

Regarding 0 vs NaN, even if for now we are keeping the disparate behavior of rolling/expanding_kurt/skew() and Series.skew/kurt(), I would leave those commented-out tests there (still commented out), as an aspirational goal...

seth-p commented 10 years ago

@jreback, if we need to divide by a denominator that is 0, then I think we should return NaN (analogous to calculating an unbiased standard deviation, where we need to divide by N-1 when N=1).
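
The N-1 analogy can be sketched in a few lines of plain Python. This is an illustration of the convention, not pandas code; `var_ddof` is a hypothetical name, and the `ddof` parameter is modeled on the usual "delta degrees of freedom" argument.

```python
import math


def var_ddof(values, ddof=1):
    """Variance of the non-NaN values with divisor N - ddof.

    Returns NaN when that divisor is zero or negative.
    """
    valid = [v for v in values if not math.isnan(v)]
    n = len(valid)
    if n - ddof <= 0:
        return math.nan
    mean = sum(valid) / n
    return sum((v - mean) ** 2 for v in valid) / (n - ddof)


print(var_ddof([5.0]))          # nan: N - 1 == 0
print(var_ddof([5.0], ddof=0))  # 0.0: the biased estimate is defined
print(var_ddof([1.0, 3.0]))     # 2.0
```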

jreback commented 10 years ago

@seth-p yep that too! I believe var/std do that as well; the result shouldn't be filled with 0 by default (though it is ATM)

seth-p commented 10 years ago

@jreback, no, at least in master, calculating unbiased (ddof=1, the default) var/std will produce NaN for a single non-NaN value (since it wants to divide by N-1 = 0). This should be the case for rolling/expanding_var(), ewmvar() (with bias=False, the default), and Series.var(). If that's not the case, and any of these produces 0 for a single non-NaN value, then I missed something in my tests...

jreback commented 10 years ago

@seth-p what I mean is that var/std DO provide a nan when dividing by 0. These should all be consistent (hence this issue).

seth-p commented 10 years ago

Ah, sorry, I misunderstood you. Yes, all the var/std functions do produce a NaN when dividing by 0...

jreback commented 10 years ago

As far as the tests go, yes, I think we will have to change those tests and test for nan in those edge cases.

TomAugspurger commented 6 years ago

In [133]: s.rolling(4).kurt()
Out[133]:
0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64

In [134]: s.kurt()
Out[134]: 0