pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.72k stars 17.94k forks source link

PERF: selection inside a row using a list of labels 30 slower than selection by the same individual labels #60203

Open xstreck1 opened 3 days ago

xstreck1 commented 3 days ago

Pandas version checks

Reproducible Example

The code below accesses two columns either together or separately:

import pandas as pd
import numpy as np
import timeit

# Create large DataFrame
n_rows = 100_000
df = pd.DataFrame({
    'a': np.random.random(n_rows),
    'b': np.random.random(n_rows)
})

# Method 1: Access columns inside function
def sum_inside(row):
    return row[0] + row[1]

# Method 2: Pass values directly
def sum_outside(a, b):
    return a + b

# Time Method 1
t1 = timeit.timeit(lambda: df.apply(lambda row: sum_inside(row[['a','b']].values), axis=1), number=1)

# Time Method 2  
t2 = timeit.timeit(lambda: df.apply(lambda row: sum_outside(row['a'], row['b']), axis=1), number=1)

print(f"Method 1 (access inside): {t1:.4f} seconds")
print(f"Method 2 (pass values): {t2:.4f} seconds")

Output:

Method 1 (access inside): 15.6217 seconds
Method 2 (pass values): 0.5135 seconds

Using iloc does not suffer the same issue. Replace:

t1 = timeit.timeit(lambda: df.apply(lambda row: sum_inside(row.iloc[0:2].values), axis=1), number=1)
t2 = timeit.timeit(lambda: df.apply(lambda row: sum_outside(row.iloc[0], row.iloc[1]), axis=1), number=1)

Output:

Method 1 (access inside): 1.5293 seconds
Method 2 (pass values): 0.8811 seconds

Installed Versions

INSTALLED VERSIONS ------------------ commit : 0691c5cf90477d3503834d983f69350f250a6ff7 python : 3.11.7 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.26100 machine : AMD64 processor : AMD64 Family 23 Model 49 Stepping 0, AuthenticAMD byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : English_Germany.1252 pandas : 2.2.3 numpy : 1.26.3 pytz : 2024.1 dateutil : 2.8.2 pip : 24.0 Cython : None sphinx : 8.1.3 IPython : 8.22.1 adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.3 blosc : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : 2024.2.0 html5lib : None hypothesis : None gcsfs : None jinja2 : 3.1.2 lxml.etree : 4.9.4 matplotlib : 3.8.3 numba : 0.59.0 numexpr : None odfpy : None openpyxl : None pandas_gbq : None psycopg2 : None pymysql : None pyarrow : 14.0.2 pyreadstat : None pytest : 8.2.0 python-calamine : None pyxlsb : None s3fs : None scipy : 1.12.0 sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlsxwriter : None zstandard : None tzdata : 2024.1 qtpy : None pyqt5 : None

Prior Performance

No response

rhshadrach commented 2 days ago

Thanks for the report. This can be reproduced on just a Series.

ser = pd.Series({"a": 1, "b": 2})

%timeit row[["a", "b"]]
# 103 μs ± 273 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit row["a"], row["b"]
# 1.76 μs ± 12.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

I haven't checked, but likely looking up a single label has a fastpath, and requires less validation. Certainly any perf improvement is always welcome, but I'm not sure there is an issue here.

I presume the example in the OP is just a simplified version of the operation you're trying to perform. In any case, avoiding apply at all costs is the best way to get good performance out of pandas.