pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: qcut returns incorrect results #58240

Closed YerdnY closed 1 month ago

YerdnY commented 5 months ago

Reproducible Example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(1, 182), columns=['val'])
pd.qcut(df.val, 10, labels=False).value_counts()

Issue Description

The code above produces the following output:

val
0    19
7    19
1    18
2    18
3    18
4    18
5    18
8    18
9    18
6    17
Name: count, dtype: int64

The issue is that the bins contain three distinct item counts: 17, 18, and 19. I expect no more than two distinct counts. Ideally there would be one, but that is only possible if the input size is divisible by the number of bins.
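The expectation of at most two distinct counts follows from simple division. A minimal sketch of that arithmetic, where `ideal_bin_sizes` is a hypothetical helper written for illustration and not part of pandas:

```python
def ideal_bin_sizes(n_items, n_bins):
    """Split n_items into n_bins groups whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_bins)
    # 'extra' bins get one additional item; the rest get 'base' items.
    return [base + 1] * extra + [base] * (n_bins - extra)


sizes = ideal_bin_sizes(181, 10)
print(sizes)               # one bin of 19, nine bins of 18
print(sorted(set(sizes)))  # the distinct sizes
```

For 181 items and 10 bins, 181 = 10 * 18 + 1, so one bin holds 19 items and the other nine hold 18.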

Expected Behavior

The same code produces the following, correct output in pandas 2.1.4:

val
0    19
1    18
2    18
3    18
4    18
5    18
6    18
7    18
8    18
9    18
Name: count, dtype: int64

Installed Versions

C:\ProgramData\anaconda3\envs\quant2\Lib\site-packages\_distutils_hack\__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.11.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : English_United States.1252

pandas : 2.2.1
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.1.1
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.20.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : 2.0.25
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

kymcglyn commented 5 months ago

take

ashorkey-umich commented 5 months ago

take

Nrezhang commented 4 months ago

I found that this seems to be a floating-point issue that appears when an array of floats is passed to Index(). These are the bins computed in the function qcut:

0.0      1.0
0.1     19.0
0.2     37.0
0.3     55.0
0.4     73.0
0.5     91.0
0.6    109.0
0.7    127.0
0.8    145.0
0.9    163.0
1.0    181.0

However, when the bins are converted with Index(bins) in the argument of _bins_to_cut, the values become:

Index([1.0, 19.0, 37.0, 55.00000000000001, 73.0, 91.0, 109.00000000000001, 126.99999999999999, 145.0, 163.0, 181.0])

The problem is the edge at 126.99999999999999. Any suggestions for how to resolve this?
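To see why that single edge matters, here is a minimal pure-Python sketch that counts the values 1 to 181 per bin for both sets of edges quoted in this comment. It assumes right-closed bins with the lowest edge included, which is my understanding of how qcut assigns values; `bin_counts` is a hypothetical helper, not pandas code.

```python
from bisect import bisect_left
from collections import Counter


def bin_counts(values, edges):
    """Count values per right-closed bin (edges[i], edges[i+1]], lowest edge included."""
    counts = Counter()
    for v in values:
        # bisect_left finds the first edge >= v; the bin index is one less.
        # max(..., 0) puts a value equal to the lowest edge into bin 0.
        idx = max(bisect_left(edges, v) - 1, 0)
        counts[idx] += 1
    return [counts[i] for i in range(len(edges) - 1)]


values = range(1, 182)
exact = [1.0, 19.0, 37.0, 55.0, 73.0, 91.0, 109.0, 127.0, 145.0, 163.0, 181.0]
perturbed = [1.0, 19.0, 37.0, 55.00000000000001, 73.0, 91.0,
             109.00000000000001, 126.99999999999999, 145.0, 163.0, 181.0]

exact_counts = bin_counts(values, exact)
perturbed_counts = bin_counts(values, perturbed)
print(exact_counts)
print(perturbed_counts)
```

With the perturbed edges, 127 is greater than 126.99999999999999, so it moves from bin 6 into bin 7. That leaves bin 6 with 17 items and bin 7 with 19, matching the reported counts. The edges at 55.00000000000001 and 109.00000000000001 err upward, so the integers 55 and 109 stay in their original bins.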

yuanx749 commented 4 months ago

https://github.com/pandas-dev/pandas/blob/bfe5be01fef4eaecf4ab033e74139b0a3cac4a39/pandas/core/reshape/tile.py#L339-L341

It is caused by floating-point rounding in the np.linspace call in qcut.

import numpy as np

# The quantile grid for 10 bins, printed with enough digits to expose rounding
quantiles = np.linspace(0, 1, 11)
with np.printoptions(precision=20):
    print(quantiles)

Note that the output is actually:

[0.                  0.1                 0.2
 0.30000000000000004 0.4                 0.5
 0.6000000000000001  0.7000000000000001  0.8
 0.9                 1.                 ]
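The same rounding can be reproduced without NumPy. The sketch below assumes the grid is built as index times step, which is consistent with the values printed above but is not taken from NumPy's source. It compares that against dividing each index by 10 directly.

```python
# Assumption: an evenly spaced grid built as index * step. This matches the
# printed values above but is an illustration, not NumPy's implementation.
step = 1 / 10
by_multiplication = [i * step for i in range(11)]

# Dividing each index by 10 rounds once per value instead of reusing a
# step that is already rounded.
by_division = [i / 10 for i in range(11)]

differing = [i for i in range(11) if by_multiplication[i] != by_division[i]]
for i in differing:
    print(i, repr(by_multiplication[i]), repr(by_division[i]))
```

The two grids differ only at indices 3, 6, and 7, the same positions where the printed quantiles deviate from 0.3, 0.6, and 0.7.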
yuanx749 commented 1 month ago

Thanks to @rob-sil, it seems this issue has been solved in #59409 @mroeschke