Open j7168908jx opened 6 months ago
Good point. I can reproduce this with your example. Scaling it up to a larger input distribution alleviates the issue, though.
Your example is a sweet spot for this error. When the distribution is rescaled to be larger, the zeroing out stops happening very quickly, because the O(count^2) and O(count^3) terms in the numerator and denominator equations counteract the very small m4 and m2^2 and lift the results above the 1e-14 threshold.
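To make the scaling effect concrete, here is a minimal sketch in plain Python. It assumes the numerator and denominator have the textbook unbiased-kurtosis shape, count*(count+1)*(count-1)*m4 and (count-2)*(count-3)*m2^2. The data (`pattern`, spread `d`) and the helper `kurt_terms` are made up for illustration and are not taken from the issue or from nanops.py.

```python
THRESHOLD = 1e-14  # cutoff mentioned in this thread


def kurt_terms(values):
    """Return (numerator, denominator) of the unbiased kurtosis ratio.

    Assumed shape, not copied from pandas:
      numerator   = n (n + 1) (n - 1) * m4
      denominator = (n - 2) (n - 3) * m2**2
    where m2 and m4 are sums of squared / fourth-power deviations.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    m4 = sum((v - mean) ** 4 for v in values)
    return n * (n + 1) * (n - 1) * m4, (n - 2) * (n - 3) * m2 ** 2


d = 1e-5                      # hypothetical, very small spread
pattern = [0.0] * 9 + [d]     # hypothetical low-variance shape

num_small, den_small = kurt_terms(pattern)        # 10 points
num_large, den_large = kurt_terms(pattern * 100)  # same shape, 1000 points

for label, num, den in [("n=10", num_small, den_small),
                        ("n=1000", num_large, den_large)]:
    print(label, num, den,
          "below threshold:", abs(num) < THRESHOLD, abs(den) < THRESHOLD)
```

With ten points both terms fall far below 1e-14 and would be zeroed. With the same shape repeated to 1000 points, the count factors push both terms above the cutoff.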
Doing a check of the form (pseudocode)
count < 100 and abs(frexp(denominator) - frexp(numerator)) < 24
before doing the zeroing out should alleviate this issue, but I would like to hear someone else's opinion before putting in a PR.
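As a rough, runnable version of that pseudocode: the sketch below reads frexp(x) as "the binary exponent of x", which is an assumption, since math.frexp actually returns a (mantissa, exponent) pair. The function name `exponents_close` and its keyword defaults are hypothetical. How the result should gate the zeroing is not spelled out above, so the sketch only evaluates the predicate.

```python
import math


def exponents_close(numerator, denominator, count,
                    max_count=100, max_exp_gap=24):
    """Evaluate the proposed check for a single numerator/denominator pair."""
    # math.frexp(x) returns (mantissa, exponent) with x == mantissa * 2**exponent
    _, num_exp = math.frexp(numerator)
    _, den_exp = math.frexp(denominator)
    return count < max_count and abs(den_exp - num_exp) < max_exp_gap


# Both terms tiny but of similar magnitude, small count
print(exponents_close(3e-15, 1e-15, count=50))
# Same terms, but the count is above the limit
print(exponents_close(3e-15, 1e-15, count=200))
# Terms that differ by many binary orders of magnitude
print(exponents_close(1.0, 1e-15, count=50))
```

One possible reading is that the zeroing is skipped whenever this returns True, so that two genuinely small but comparable terms are left alone.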
Another note: the kurtosis formulation then still deviates from the scipy implementation by 3, up to a distribution size of about 10x your example, using the same shape as your example.
I was not able to iron out that instability, though.
> Another note: the kurtosis formulation then still deviates from the scipy implementation by 3, up to a distribution size of about 10x your example, using the same shape as your example.
> I was not able to iron out that instability, though.
Do you mean that the difference between their outputs is roughly 3? If you have not set bias=False in scipy or polars, the difference here will be roughly 3.
> Do you mean that the difference between their outputs is roughly 3?
Exactly.
> If you have not set bias=False in scipy or polars, the difference here will be roughly 3.
I did not, so then that's also explained. Then I see no issues with my solution anymore.
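For readers comparing libraries, here is a small plain-Python sketch of the two estimators involved, written from the standard formulas and not taken from either library. The function names and the sample are hypothetical. As far as I know, scipy.stats.kurtosis defaults to bias=True, while pandas reports the bias-corrected value, so the two only agree once bias=False is passed.

```python
def biased_excess_kurtosis(values):
    """Population-style excess kurtosis: m4 / m2**2 - 3 (central moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


def unbiased_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (needs at least 4 values)."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((v - mean) ** 2 for v in values)  # sum of squared deviations
    s4 = sum((v - mean) ** 4 for v in values)  # sum of fourth-power deviations
    ratio = n * (n + 1) * (n - 1) * s4 / ((n - 2) * (n - 3) * s2 ** 2)
    adjustment = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return ratio - adjustment


sample = [0.0, 0.0, 0.0, 0.0, 1.0]  # hypothetical data
print(biased_excess_kurtosis(sample))    # about 0.25
print(unbiased_excess_kurtosis(sample))  # about 5.0
```

The size of the gap depends on the sample size and the data, so a difference of about 3 is specific to the example discussed here and not a general constant.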
Why not apply Welford's method for skew and kurt?
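For context, a one-pass (Welford-style) update of the higher moments could look roughly like the sketch below. It uses the commonly cited single-pass recurrences for the running sums M2, M3 and M4. The function name is hypothetical, it returns the biased excess kurtosis, and it is not how pandas computes kurtosis today.

```python
def online_excess_kurtosis(values):
    """Single-pass biased excess kurtosis via running central-moment sums."""
    n = 0
    mean = 0.0
    M2 = M3 = M4 = 0.0
    for x in values:
        n_prev = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n_prev
        mean += delta_n
        # Update the highest moment first: each update uses the old lower sums
        M4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
               + 6 * delta_n2 * M2 - 4 * delta_n * M3)
        M3 += term1 * delta_n * (n - 2) - 3 * delta_n * M2
        M2 += term1
    if n == 0 or M2 == 0:
        return float("nan")  # kurtosis is undefined without any variance
    return n * M4 / (M2 * M2) - 3


print(online_excess_kurtosis([0.0, 0.0, 0.0, 0.0, 1.0]))  # about 0.25
```

Whether this would remove the need for the 1e-14 cutoff is a separate question that the sketch does not answer.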
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The output of the pandas kurtosis function is incorrect. After some simple debugging I found a comment at core/nanops.py line 1360, in function nankurt, saying that to fix #18044 it manually zeros out values less than 1e-14, which is improper in any case. This affects any data that has little variance but many data points.
Expected Behavior
Output of provided example:
Expected output: roughly 14.9161, which is the correct value for the unbiased estimate (pandas's default behaviour).
Installed Versions