Open y-koj opened 1 month ago
The error message points you a bit in the right direction:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
A better approach (to avoid the fragmentation), as suggested, is to concatenate:
pd.concat([_dict, df], axis=1)
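As a sketch of that suggestion (column names and data here are invented for illustration): build all the new columns as one DataFrame first, then join them with a single concat instead of inserting them one at a time.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(3)})

# Build every new column up front, then join once. This avoids one
# internal insert (and the fragmentation it causes) per column.
new_cols = pd.DataFrame({f"col{i}": np.full(3, i) for i in range(150)})
result = pd.concat([df, new_cols], axis=1)
```

The single concat produces the same wide frame that 150 separate insertions would, without pandas warning about fragmentation along the way.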
> The error message points you a bit in the right direction:

That's right.
But in my opinion, df.assign() will be more useful if it concatenates kwargs internally to avoid fragmentation. Although there is concern about overhead for small kwargs, it is worth considering.
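A hypothetical sketch of what "concatenate kwargs internally" could look like (this is not the pandas implementation; the helper name is invented, and it deliberately ignores two things real assign supports: overwriting existing columns and letting a later kwarg refer to an earlier one):

```python
import pandas as pd

def assign_via_concat(df, **kwargs):
    # Hypothetical helper, not pandas code: evaluate every kwarg
    # (callables receive the original frame), then join all new columns
    # with one concat instead of one insert per column.
    # Caveats vs. DataFrame.assign: existing columns are not overwritten
    # in place, and kwargs cannot see columns added by earlier kwargs.
    new = {k: (v(df) if callable(v) else v) for k, v in kwargs.items()}
    return pd.concat([df, pd.DataFrame(new, index=df.index)], axis=1)

out = assign_via_concat(pd.DataFrame({"a": [1, 2]}),
                        b=lambda d: d["a"] * 2, c=3)
```

The caveats are exactly why this is not a drop-in change: the per-column loop in assign exists so each kwarg can see the frame as updated by the previous ones.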
Cc @mroeschke @phofl for your thoughts
I'm positive on using concat in the general case.
> Although there is concern about overhead for small kwargs, it is worth considering.
For len(kwargs) == 1, we should not use concat. For len(kwargs) == 3, I'm seeing the perf gap between setattr and concat quite narrow already (6%). For length 2, I'm seeing 34%, and this appears to be independent of the number of columns in the input DataFrame (due to CoW?).
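The comparison above can be reproduced with a small benchmark along these lines (sizes and column names are arbitrary, and the actual timings will vary by machine and pandas version; the two code paths should at least produce identical frames):

```python
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((1_000, 10)), columns=[f"c{i}" for i in range(10)])
new = {f"n{i}": np.ones(1_000) for i in range(2)}  # len(kwargs) == 2

def via_setitem():
    # What assign effectively does today: one insert per new column.
    out = df.copy()
    for k, v in new.items():
        out[k] = v
    return out

def via_concat():
    # The proposed alternative: join all new columns in one concat.
    return pd.concat([df, pd.DataFrame(new, index=df.index)], axis=1)

t_set = timeit.timeit(via_setitem, number=100)
t_cat = timeit.timeit(via_concat, number=100)
print(f"setitem: {t_set:.4f}s  concat: {t_cat:.4f}s")
```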
concat will keep the columns separate with CoW; you still have to copy. That makes it quite cheap, but it won't help much with fragmentation.
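One way to observe this is to peek at the internal block layout before and after the copy() that the warning message recommends. `_mgr.nblocks` is private pandas internals, so this is for inspection only:

```python
import numpy as np
import pandas as pd

# Join five single-column frames; with CoW, concat can keep each source
# column as its own internal block instead of copying into one array.
parts = [pd.DataFrame({f"c{i}": np.arange(3)}) for i in range(5)]
joined = pd.concat(parts, axis=1)

# copy() is the documented way to get a de-fragmented frame.
defragged = joined.copy()

# Private attribute, used here only to inspect the layout.
print("blocks after concat:", joined._mgr.nblocks)
print("blocks after copy:  ", defragged._mgr.nblocks)
```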
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this issue exists on the latest version of pandas.
[ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
When running the following code, pandas emits a PerformanceWarning describing DataFrame fragmentation.
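The reporter's snippet is not shown here; a minimal sketch that should reproduce the warning on a recent pandas (column names invented) is to hand assign enough new columns that its internal per-column inserts fragment the frame:

```python
import warnings
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(3)})

# assign inserts each new column one at a time; once the frame holds
# more than ~100 internal blocks, pandas emits the fragmentation warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df.assign(**{f"col{i}": i for i in range(150)})

warned = any(issubclass(w.category, pd.errors.PerformanceWarning)
             for w in caught)
```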
Note
I suspect this problem is caused by the for loop in the DataFrame.assign() implementation: https://github.com/pandas-dev/pandas/blob/0691c5cf90477d3503834d983f69350f250a6ff7/pandas/core/frame.py#L5238-L5240
Full program output
Installed Versions
Prior Performance
No response