pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Wrong kurtosis outcome due to inadequate fix to previous issues #57972

Open j7168908jx opened 6 months ago

j7168908jx commented 6 months ago

Reproducible Example

import polars as pl
import pandas as pd
import numpy as np
import scipy.stats as st

data = np.array([-2.05191341e-05,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00, -4.10391103e-05,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00])

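print(pl.Series(data).kurtosis())  # polars: biased estimator by default
print(pd.Series(data).kurt())      # pandas: unbiased estimator by default
print(st.kurtosis(data))           # scipy: biased estimator by default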

Issue Description

The output of the pandas kurtosis function is incorrect.

After some simple debugging, I found a comment in core/nanops.py (line 1360, in the function nankurt) saying that, to fix #18044, values smaller than 1e-14 are manually zeroed out, which is improper in any case. This affects any dataset that has many points but very little variance.
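To make the effect concrete, here is a minimal pure-Python sketch of a kurtosis computation with such a cutoff. It is an illustration only, not the pandas source: the function names, the `apply_threshold` flag, and the exact structure are assumptions, and only the 1e-14 cutoff comes from the description above.

```python
# Hypothetical stand-in for the behaviour described in the issue.
# This is NOT the pandas implementation; names and structure are assumptions.

THRESHOLD = 1e-14  # cutoff mentioned in the issue


def zero_out(value, threshold=THRESHOLD):
    """Treat anything smaller in magnitude than the threshold as exactly zero."""
    return 0.0 if abs(value) < threshold else value


def sample_kurtosis(values, apply_threshold=True):
    """Unbiased excess kurtosis built from summed central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)  # sum of squared deviations
    m4 = sum((v - mean) ** 4 for v in values)  # sum of fourth-power deviations
    numerator = n * (n + 1) * (n - 1) * m4
    denominator = (n - 2) * (n - 3) * m2 ** 2
    adjustment = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    if apply_threshold:
        numerator = zero_out(numerator)
        denominator = zero_out(denominator)
    if denominator == 0:
        return 0.0  # a zeroed denominator collapses the result to 0
    return numerator / denominator - adjustment


# Same 29 values as the reproducible example, written compactly.
data = [-2.05191341e-05] + [0.0] * 4 + [-4.10391103e-05] + [0.0] * 23

print(sample_kurtosis(data, apply_threshold=True))   # cutoff applied
print(sample_kurtosis(data, apply_threshold=False))  # cutoff skipped
```

With the cutoff applied, the denominator for this data falls below 1e-14 and the sketch returns 0.0. With the cutoff skipped, it returns a clearly nonzero value.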

Expected Behavior

Output of provided example:

14.916104870028523
0.0
14.916104870028551

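Expected output: pandas should return a nonzero value in line with the other two libraries instead of 0.0. The value of roughly 14.9161 is what polars and scipy report with their default (biased) estimator; pandas uses the unbiased estimator by default, so its correct result will differ somewhat from that figure.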

Installed Versions

INSTALLED VERSIONS
------------------
commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-1010-nvidia-lowlatency
Version : #10-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 26 00:40:27 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.22.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.2.0
gcsfs : None
matplotlib : 3.8.3
numba : 0.59.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

dontgoto commented 6 months ago

Good point. I can reproduce this with your example. Scaling it up to larger input distributions alleviates the issue, though.

Your example is a sweet spot for this error. When the distribution is rescaled to be larger, the zeroing out stops happening very quickly, because the O(count^3) and O(count^2) factors in the numerator and denominator expressions, respectively, lift the very small m4 and m2^2 terms above the 1e-14 threshold.
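The scaling argument can be checked numerically. The sketch below assumes that "rescaling to be larger" means repeating the same 29-point shape several times, and that the numerator and denominator have the usual unbiased-kurtosis form built from summed central moments. The helper name `kurtosis_terms` is hypothetical.

```python
# Illustrative only: the formulas below are an assumed form of the
# numerator/denominator, not a copy of the pandas source.

THRESHOLD = 1e-14  # cutoff mentioned in the issue

# Same 29 values as the reproducible example, written compactly.
base = [-2.05191341e-05] + [0.0] * 4 + [-4.10391103e-05] + [0.0] * 23


def kurtosis_terms(values):
    """Return (numerator, denominator) of the unbiased kurtosis ratio."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    m4 = sum((v - mean) ** 4 for v in values)
    numerator = n * (n + 1) * (n - 1) * m4
    denominator = (n - 2) * (n - 3) * m2 ** 2
    return numerator, denominator


# Repeat the same shape and report whether the denominator would be zeroed.
for copies in (1, 2, 10):
    num, den = kurtosis_terms(base * copies)
    print(copies, len(base) * copies, num, den, abs(den) < THRESHOLD)
```

In this sketch only the single-copy case has a denominator below the cutoff; two copies of the data are already enough to push it above 1e-14.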

Adding a check of the form (pseudocode) count < 100 and abs(frexp(denominator) - frexp(numerator)) < 24 before the zeroing out should alleviate this issue, but I would like to hear someone else's opinion before putting in a PR.
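One possible reading of that pseudocode is sketched below. It assumes that frexp refers to the binary exponent returned by `math.frexp`, and that the zeroing is skipped whenever the check holds. Both points are interpretations, and the function names and sample magnitudes are hypothetical.

```python
import math

THRESHOLD = 1e-14  # cutoff mentioned in the issue


def binary_exponent(value):
    """Exponent e such that value == m * 2**e with 0.5 <= |m| < 1."""
    return math.frexp(value)[1]


def magnitudes_are_comparable(count, numerator, denominator):
    """Assumed reading of the proposed check from the comment above."""
    exponent_gap = abs(binary_exponent(denominator) - binary_exponent(numerator))
    return count < 100 and exponent_gap < 24


def zero_out_guarded(count, numerator, denominator):
    """Apply the cutoff only when the guard does not hold (assumed semantics)."""
    if magnitudes_are_comparable(count, numerator, denominator):
        return numerator, denominator  # small but consistent: keep both
    numerator = 0.0 if abs(numerator) < THRESHOLD else numerator
    denominator = 0.0 if abs(denominator) < THRESHOLD else denominator
    return numerator, denominator


# Illustrative magnitudes, not values taken from the issue.
print(zero_out_guarded(29, 6e-14, 3e-15))  # similar scale, left untouched
print(zero_out_guarded(29, 1.0, 1e-30))    # far apart, denominator zeroed
```

The design idea in this reading is that a tiny denominator is only suspicious when it is tiny relative to the numerator, not when both terms are small because the data itself is small.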

dontgoto commented 6 months ago

Another note: the kurtosis formulation then still deviates from the scipy implementation by 3, up to a distribution size of about 10x your example, using the same shape as your example.

I was not able to iron out that instability, though.

j7168908jx commented 6 months ago

> Another note: the kurtosis formulation then still deviates from the scipy implementation by 3, up to a distribution size of about 10x your example, using the same shape as your example.
>
> I was not able to iron out that instability, though.

Do you mean that the difference between their outputs is roughly 3? If you have not set bias=False in scipy or polars, the difference here will be roughly 3.
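The gap between the two estimators can be reproduced without any library. The sketch below uses the standard textbook formulas for the biased and bias-corrected excess kurtosis; the function names are hypothetical. With scipy itself, the equivalent comparison would be st.kurtosis(data) against st.kurtosis(data, bias=False).

```python
# Pure-Python comparison of the biased and bias-corrected excess kurtosis.
# Function names are illustrative, not part of any library.


def excess_kurtosis_biased(values):
    """Population-style estimator: n * m4 / m2**2 - 3 with summed moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    m4 = sum((v - mean) ** 4 for v in values)
    return n * m4 / m2 ** 2 - 3


def excess_kurtosis_unbiased(values):
    """Standard sample-size correction applied to the biased estimator."""
    n = len(values)
    g2 = excess_kurtosis_biased(values)
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)


# Same 29 values as the reproducible example, written compactly.
data = [-2.05191341e-05] + [0.0] * 4 + [-4.10391103e-05] + [0.0] * 23

biased = excess_kurtosis_biased(data)
unbiased = excess_kurtosis_unbiased(data)
print(biased, unbiased, unbiased - biased)
```

For this 29-point sample the two estimators differ by roughly 3, and the gap shrinks as the sample grows because the correction factor approaches 1.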

dontgoto commented 6 months ago

> Do you mean that the difference of their output is roughly 3?

Exactly

> If you have not set bias=False in scipy or polars, the difference here will be roughly 3.

I did not, so that is explained as well. I see no remaining issues with my solution, then.

kaixiongg commented 4 weeks ago

Why not apply Welford's method for skew and kurt?
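For reference, a Welford-style single-pass update extended to the third and fourth central moments might look like the sketch below. It is an illustration of the general technique, not a proposed pandas patch: the function name is hypothetical, it returns the biased excess kurtosis, and it does not by itself change how a fixed cutoff would be applied.

```python
def streaming_excess_kurtosis(values):
    """One-pass (Welford-style) biased excess kurtosis."""
    n = 0
    mean = 0.0
    m2 = m3 = m4 = 0.0  # running sums of central-moment powers
    for x in values:
        n_prev = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term = delta * delta_n * n_prev
        mean += delta_n
        # Update the higher moments first, since they use the old m2 and m3.
        m4 += (term * delta_n2 * (n * n - 3 * n + 3)
               + 6 * delta_n2 * m2
               - 4 * delta_n * m3)
        m3 += term * delta_n * (n - 2) - 3 * delta_n * m2
        m2 += term
    if m2 == 0:
        return float("nan")  # constant input: kurtosis is undefined
    return n * m4 / (m2 * m2) - 3


# Same 29 values as the reproducible example, written compactly.
data = [-2.05191341e-05] + [0.0] * 4 + [-4.10391103e-05] + [0.0] * 23

print(streaming_excess_kurtosis(data))
```

On the example data this agrees with the biased value of roughly 14.9161 reported by polars and scipy above.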