pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

PERF: DataFrame fragmentation when calling DataFrame.assign() with large kwargs #60075

Open y-koj opened 1 month ago

y-koj commented 1 month ago

Pandas version checks

Reproducible Example

When running the following code, pandas emits a PerformanceWarning describing DataFrame fragmentation.

import pandas as pd
import numpy as np
from timeit import timeit

N = 1_000_000
df = pd.DataFrame({"x": np.random.rand(N)})

# this part causes fragmentation
dict = {
    "x_" + str(key): np.random.rand(N) for key in range(100)
}
df = df.assign(**dict)

# defragmented DataFrame
df2 = df.copy()

print(timeit(lambda: df.sort_values(by="x_99"), number=10))  # 4.16105025 seconds
print(timeit(lambda: df2.sort_values(by="x_99"), number=10))  # 2.6979567919999994 seconds

Note

I suspect this problem is caused by the for loop in the DataFrame.assign() implementation: https://github.com/pandas-dev/pandas/blob/0691c5cf90477d3503834d983f69350f250a6ff7/pandas/core/frame.py#L5238-L5240
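The linked lines can be paraphrased roughly as follows (a hedged sketch, not the actual pandas source; `assign_sketch` is a made-up name):

```python
import pandas as pd

def assign_sketch(df, **kwargs):
    """Sketch of what DataFrame.assign() does internally (paraphrased)."""
    data = df.copy(deep=False)
    for k, v in kwargs.items():
        # One column-by-column insertion per kwarg; each insert adds a new
        # block to the frame, which is what causes the fragmentation.
        data[k] = v(data) if callable(v) else v
    return data
```

Each pass through the loop triggers an internal `insert`, so 100 kwargs mean 100 inserts, which is exactly the pattern the PerformanceWarning complains about.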

Full program output

$ python pd.py
/Users/yk/tmp/pd.py:11: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  df = df.assign(**dict)
4.16105025
2.6979567919999994

Installed Versions

INSTALLED VERSIONS
------------------
commit : 0691c5cf90477d3503834d983f69350f250a6ff7
python : 3.9.6
python-bits : 64
OS : Darwin
OS-release : 24.0.0
Version : Darwin Kernel Version 24.0.0: Tue Sep 24 23:38:45 PDT 2024; root:xnu-11215.1.12~1/RELEASE_ARM64_T8122
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.UTF-8
pandas : 2.2.3
numpy : 2.0.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.2
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.9.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.4
lxml.etree : None
matplotlib : 3.9.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None

Prior Performance

No response

samukweku commented 1 month ago

The error message points you a bit in the right direction:

PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

A better approach (to avoid the fragmentation), as the warning suggests, is to join all of the new columns at once, wrapping the dict in a DataFrame before concatenating:

pd.concat([df, pd.DataFrame(_dict)], axis=1)
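Spelled out against the reproducer above (a sketch; `new_cols` stands for the reporter's dict of new columns, and N is reduced to keep it quick):

```python
import numpy as np
import pandas as pd

N = 1_000
df = pd.DataFrame({"x": np.random.rand(N)})
new_cols = {"x_" + str(key): np.random.rand(N) for key in range(100)}

# Wrap the dict in a DataFrame and concatenate once, instead of inserting
# 100 columns one at a time; this emits no PerformanceWarning.
out = pd.concat([df, pd.DataFrame(new_cols)], axis=1)
```

Putting `df` first preserves the column order that `df.assign(**new_cols)` would produce.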

y-koj commented 1 month ago

The error message points you a bit in the right direction:

That's right. But in my opinion, df.assign() would be more useful if it concatenated the kwargs internally to avoid fragmentation. Although there is a concern about overhead for small kwargs, it is worth considering.

samukweku commented 1 month ago

Cc @mroeschke @phofl for your thoughts

rhshadrach commented 3 weeks ago

I'm positive on using concat in the general case.

Although there is a concern about overhead for small kwargs, it is worth considering.

For len(kwargs) == 1, we should not use concat. For len(kwargs) == 3, I'm seeing the perf gap between setattr and concat already quite narrow (6%). For length 2, I'm seeing 34%, and this appears to be independent of the number of columns in the input DataFrame (due to CoW?).
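A rough way to measure that crossover (a sketch; the sizes and repeat counts are arbitrary, and the timings are machine-dependent):

```python
from timeit import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.rand(10_000)})

for n in (1, 2, 3):
    kwargs = {f"c{i}": np.random.rand(len(df)) for i in range(n)}
    # Per-column insertion via assign vs. a single concat of all new columns.
    t_assign = timeit(lambda: df.assign(**kwargs), number=200)
    t_concat = timeit(
        lambda: pd.concat([df, pd.DataFrame(kwargs)], axis=1), number=200
    )
    print(f"len(kwargs)={n}: assign {t_assign:.4f}s  concat {t_concat:.4f}s")
```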

phofl commented 3 weeks ago

concat will keep the columns separate with CoW; you still have to copy. That makes it quite cheap, but it won't help much with fragmentation.
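One way to see this is to inspect the block count through the internal block manager (`_mgr` and `nblocks` are private pandas internals, so this is only a diagnostic sketch):

```python
import numpy as np
import pandas as pd

# Concatenate 51 single-column frames at once: fast, and it emits no
# PerformanceWarning, but the result likely keeps one block per input
# frame, i.e. it is still fragmented until something consolidates it.
parts = [pd.DataFrame({"x": np.random.rand(1_000)})]
parts += [pd.DataFrame({f"x_{i}": np.random.rand(1_000)}) for i in range(50)]
combined = pd.concat(parts, axis=1)

consolidated = combined.copy()  # copy() consolidates the blocks

print(combined._mgr.nblocks, consolidated._mgr.nblocks)
```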