Closed seth-p closed 6 years ago
My 2c: there really shouldn't be two different implementations of skew/kurtosis functions.
@seth-p maybe edit the top section with a code-reference so that when this is addressed the section can be fixed (as you have commented for #7926)
@jreback, afraid I'm not sure I follow your last comment. Do you mean the following?
As a test for a fix, un-comment the following lines in test_moments.py
:
#(mom.expanding_skew, lambda v: Series(v).skew(), 3), # restore once GH 8086 is fixed
#(mom.expanding_kurt, lambda v: Series(v).kurt(), 4), # restore once GH 8086 is fixed
#(mom.rolling_skew, lambda v: Series(v).skew(), 3), # restore once GH 8086 is fixed
#(mom.rolling_kurt, lambda v: Series(v).kurt(), 4), # restore once GH 8086 is fixed
@seth-p I realize now why you commented these out
The solution to this is to modify the behavior of Series.skew()
and Series.kurt()
so that they return np.nan
instead of 0
, right?
I think so, assuming they are supposed to return unbiased estimates. Though to be honest I haven't examined them closely beyond observing the inconsistencies noted above.
I was looking into this to remove the commented tests for the rolling versions. There are a bunch of tests making sure that the return value is 0, and the code itself explicitly sets things to 0. So even if it is a bug, it was clearly considered a feature when the code was written. Not sure whether changing the behavior can be happily done without breaking someone's code...
relatd to this: https://github.com/pydata/pandas/pull/7928
I wouldn't return nan
unless you have all missing values or its empty. 0 is the calculated value no?
I think, ideally, the commented-out consistency tests (https://github.com/pydata/pandas/issues/8086#issuecomment-55687801) should be there. Obviously some of the cases are 0
/ NaN
, but others aren't.
Separately, I think should add to _test_moments_consistency()
additional parameters skew_unbiased=None, kurt_unbiased=None
and skew_biased=None, kurt_biased=None
, and consistency checks between them and the corresponding lower-order moments (biased or unbiased, as appropriate), i.e. comparable to this test for biased variance estimates:
if var is var_biased:
# check that biased var(x) == mean(x^2) - mean(x)^2
mean_x2 = mean(x * x)
assert_equal(var_x, mean_x2 - (mean_x * mean_x))
Regarding 0
vs NaN
, even if for now we are keeping the disparate behavior ofrolling/expanding_kurt/skew()
and Series.skew/kurt()
, I would leave those commented-out tests there (still commented out), as an aspirational goal...
@jreback, if need to divide by a denominator that is 0, then I think should return NaN
(analogous to calculating an unbiased standard deviation where need to divide by N-1
when N=1
).
@seth-p yep that too! I believe var/std do that as well shouldn't be filled by 0 by default (though is ATM)
@jreback, no, at least in master, calculating unbiased (ddof=1
, the default) var/std will produce NaN
for a single non-NaN
value (since it wants to divide by N-1 = 0
). This should be the case for rolling/expanding_var()
, ewmvar()
(with bias=False
, the default), and Series.var()
. If that's not the case, and any of these produces 0
for a single non-NaN
value, then I missed something in my tests...
@seth-p what I mean is that var/std
DO provide a nan
when dividing by 0. These should all be consistent (hence this issue).
Ah, sorry, I misunderstood you. Yes, all the var/std
functions do produce a NaN
when dividing by 0...
as far as the tests go, yes I think will have to change those tests and test for nan in those edge cases.
In [133]: s.rolling(4).kurt()
Out[133]:
0 NaN
1 NaN
2 NaN
3 NaN
dtype: float64
In [134]: s.kurt()
Out[134]: 0
Note that for a constant series,
Series.skew()
returns0
, whilerolling/expanding_skew()
returnNaN
.While
rolling/expanding_kurt()
similarly returnNaN
for a constant series,Series.kurt()
produces an error.