pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: rank is not supported for large_string[pyarrow] dtype #59088

Open cvm-a opened 1 week ago

cvm-a commented 1 week ago

Pandas version checks

Reproducible Example

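import pandas as pd
import pyarrow as pa

# Every column is stored as string[pyarrow];
# use pa.large_string() for the large_string[pyarrow] variant.
df = pd.DataFrame(
    {
        "a": list(range(10)) * 2,
        "b": list(range(20, 30)) * 2,
        "c": list(range(50, 70)),
    },
    dtype=pd.ArrowDtype(pa.string()),
)

# Raises TypeError: rank is not supported for string[pyarrow] dtype
result = df.groupby(["a", "b"], as_index=False)["c"].rank(
    method="first", ascending=False, na_option="bottom"
)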

Issue Description

This is similar to #51996. Grouping a DataFrame and calling rank on a column whose dtype is string[pyarrow] or large_string[pyarrow] raises one of the following errors, respectively:

TypeError: rank is not supported for string[pyarrow] dtype
TypeError: rank is not supported for large_string[pyarrow] dtype

Expected Behavior

I would expect rank to work for string[pyarrow] just as it works for the pandas "string" dtype, using lexicographic ordering.
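To make the expectation concrete, here is a minimal pure-Python sketch of the ranking I have in mind for method="first", ascending=False, na_option="bottom". It is not how pandas implements rank; the helper name group_rank_first_desc and the use of None for missing values are assumptions made for illustration.

```python
from collections import defaultdict


def group_rank_first_desc(keys, values):
    """Rank strings within each group in descending lexicographic order.

    Ties keep their order of appearance and None values are ranked last.
    """
    groups = defaultdict(list)
    for pos, key in enumerate(keys):
        groups[key].append(pos)

    ranks = [0.0] * len(values)
    for positions in groups.values():
        present = [p for p in positions if values[p] is not None]
        missing = [p for p in positions if values[p] is None]
        # sorted() is stable even with reverse=True, so equal strings
        # keep their original relative order.
        ordered = sorted(present, key=lambda p: values[p], reverse=True) + missing
        for rank, pos in enumerate(ordered, start=1):
            ranks[pos] = float(rank)
    return ranks


# Same data as the reproducible example, written as strings.
a = [str(i) for i in range(10)] * 2
b = [str(i) for i in range(20, 30)] * 2
c = [str(i) for i in range(50, 70)]

expected_ranks = group_rank_first_desc(list(zip(a, b)), c)
print(expected_ranks)
```

Each (a, b) group holds two rows, for example "50" and "60", so the row with the lexicographically larger string gets rank 1.0 and the other gets 2.0.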

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.6.final.0
python-bits : 64
OS : Darwin
OS-release : 23.5.0
Version : Darwin Kernel Version 23.5.0: Wed May 1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.24.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 70.0.0
pip : 23.2.1
Cython : 3.0.10
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : 5.2.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.24.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.5.0
gcsfs : 2024.5.0
matplotlib : 3.9.0
numba : 0.57.1
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None