pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: user defined function does not work with `DataFrame.apply` #59908

Open shobsi opened 2 months ago

shobsi commented 2 months ago

Pandas version checks

Reproducible Example

>>> import pandas as pd
>>> df = pd.DataFrame(
...     [[1]],
...     index=pd.MultiIndex.from_frame(
...         pd.DataFrame(
...             {
...                 "idx0": pd.Series("a", dtype=pd.StringDtype()),
...                 "idx1": pd.Series(100, dtype=pd.Int64Dtype())
...             }
...         )
...     ),
... )
>>> df.loc[("a", 100)].name
('a', np.int64(100))
>>> def foo(row): return row.name[1].item()
...
>>> foo(df.loc[("a", 100)])
100
>>> df.apply(foo, axis=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/google/home/shobs/code/bigframes1/venv/lib/python3.12/site-packages/pandas/core/frame.py", line 10374, in apply
    return op.apply().__finalize__(self, method="apply")
           ^^^^^^^^^^
  File "/usr/local/google/home/shobs/code/bigframes1/venv/lib/python3.12/site-packages/pandas/core/apply.py", line 916, in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/shobs/code/bigframes1/venv/lib/python3.12/site-packages/pandas/core/apply.py", line 1063, in apply_standard
    results, res_index = self.apply_series_generator()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/shobs/code/bigframes1/venv/lib/python3.12/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<stdin>", line 1, in foo
AttributeError: 'int' object has no attribute 'item'

Issue Description

I have a Series-processing UDF that works when I pass it a DataFrame row directly, but fails when I call it through DataFrame.apply with axis=1. In the second case the value in the row's index label appears to arrive as a plain Python int rather than a NumPy scalar, so the call to .item() fails.

Expected Behavior

A user-defined function should behave the same whether a DataFrame row is passed to it directly or via DataFrame.apply.
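
Until the two code paths agree, a UDF can be written so that it does not depend on which scalar type it receives. The following is a minimal sketch of such a workaround, not a fix for the underlying behavior. The helper name as_python_scalar and the stand-in row objects are hypothetical and exist only for illustration; a real pandas row exposes its index label through the same .name attribute.

```python
from types import SimpleNamespace


def as_python_scalar(value):
    """Return value.item() when the object provides it, else the value itself."""
    item = getattr(value, "item", None)
    return item() if callable(item) else value


def foo(row):
    # row.name is the row's index label; with a two-level MultiIndex it is a tuple.
    return as_python_scalar(row.name[1])


class FakeNumpyInt:
    """Hypothetical stand-in for a NumPy integer scalar (only .item() is modeled)."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


# Stand-ins for the two kinds of row described in the report.
row_from_loc = SimpleNamespace(name=("a", FakeNumpyInt(100)))  # label holds a scalar with .item()
row_from_apply = SimpleNamespace(name=("a", 100))              # label holds a plain int

print(foo(row_from_loc), foo(row_from_apply))
```

Both calls return the same plain Python value, so the UDF no longer cares which path produced the row.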

Installed Versions

commit : 0691c5cf90477d3503834d983f69350f250a6ff7
python : 3.12.0b4
python-bits : 64
OS : Linux
OS-release : 6.9.10-1rodete4-amd64
Version : #1 SMP PREEMPT_DYNAMIC Debian 6.9.10-1rodete4 (2024-08-02)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.3
numpy : 2.1.1
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 23.1.2
Cython : None
sphinx : None
IPython : 8.27.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.9.0
html5lib : None
hypothesis : None
gcsfs : 2024.9.0post1
jinja2 : None
lxml.etree : None
matplotlib : 3.9.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.23.2
psycopg2 : None
pymysql : None
pyarrow : 17.0.0
pyreadstat : None
pytest : 8.3.3
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.1
sqlalchemy : 2.0.35
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None

rhshadrach commented 2 months ago

Thanks for the report. This is due to the use of lib.fast_zip in MultiIndex._values. That uses PyArray_GETITEM, which returns a Python int instead of the NumPy scalar. In addition, in MultiIndex._values we cast any ExtensionDtype to object dtype. Patching these two so that the OP example works, I'm only seeing a handful of test failures, mostly in join.

Without a MultiIndex, df.index._values returns an IntegerArray whose elements, when indexed, are NumPy scalars. I'm wondering if we want to make this change to MultiIndex for the sake of consistency.
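
The loss of the scalar type can be illustrated with public NumPy alone. This sketch does not call lib.fast_zip or PyArray_GETITEM; it only assumes that converting an integer array to object dtype is a fair analogy for what happens to the level values, and shows that the converted elements no longer have .item().

```python
import numpy as np

arr = np.array([100], dtype=np.int64)

native = arr[0]                # indexing a typed array yields a NumPy scalar
boxed = arr.astype(object)[0]  # object-dtype conversion stores plain Python ints

print(type(native).__name__, hasattr(native, "item"))
print(type(boxed).__name__, hasattr(boxed, "item"))
```

The first element still supports .item(), while the second is a built-in int, which matches the AttributeError in the traceback above.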

cc @jbrockmendel