Thanks for the list! I was just planning to look at ASV for CoW today. An investigation of a few items from the list:
`reshape.Unstack.time_full_product`

```python
import numpy as np
import pandas as pd

m = 100
n = 1000
levels = np.arange(m)
index = pd.MultiIndex.from_product([levels] * 2)
columns = np.arange(n)
values = np.arange(m * m * n).reshape(m * m, n)
df = pd.DataFrame(values, index, columns)
%timeit df.unstack()
```
I am using pandas 2.2.0 to easily be able to switch between the default and CoW mode. With that, I can reproduce a slowdown in `unstack`. Profiling this with CoW enabled, the slowdown seems to be caused by a slower DataFrame construction of the final result in `Unstacker.get_result`. I assume this is because the DataFrame constructor now copies an input numpy array, and in this case we can pass `copy=False`, as we know we created the new values ourselves.
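For illustration, a minimal sketch of the idea (not the actual `Unstacker.get_result` code): when the values array was freshly allocated inside the reshape, nobody else holds a reference to it, so the defensive copy can safely be skipped:

```python
import numpy as np
import pandas as pd

# new_values stands in for the array the unstacker builds internally;
# since we just created it ourselves, no defensive copy is needed.
new_values = np.arange(12).reshape(4, 3)
result = pd.DataFrame(new_values, copy=False)  # avoids the CoW default copy
```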
`arithmetic.FrameWithFrameWide.time_op_different_blocks op=; shape=(1000000, 10)`

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

n_rows, n_cols = 1_000_000, 10

# construct dataframe with 2 blocks
arr1 = np.random.randn(n_rows, n_cols // 2).astype("f8")
arr2 = np.random.randn(n_rows, n_cols // 2).astype("f4")
df = pd.concat([DataFrame(arr1), DataFrame(arr2)], axis=1, ignore_index=True)
df._consolidate_inplace()

arr1 = np.random.randn(n_rows, max(n_cols // 4, 3)).astype("f8")
arr2 = np.random.randn(n_rows, n_cols // 2).astype("i8")
arr3 = np.random.randn(n_rows, n_cols // 4).astype("f8")
df2 = pd.concat(
    [DataFrame(arr1), DataFrame(arr2), DataFrame(arr3)],
    axis=1,
    ignore_index=True,
)
df2._consolidate_inplace()

%timeit df > df2
```
This I can also reproduce, and a profile indicates the slowdown comes entirely from the actual `>` operation on the underlying data. The one difference seems to be that with CoW, the underlying values are row major in the frame orientation (column major in the transposed `Block.values`), while without CoW they have the expected column-major layout.
Simpler example:

```python
>>> pd.options.mode.copy_on_write = False
>>> arr = np.random.randn(n_rows, n_cols)
>>> df = pd.concat([pd.DataFrame(arr), pd.DataFrame(arr)], axis=1, ignore_index=True)
>>> df._mgr.blocks[0].values.shape
(10, 1000000)
>>> df._mgr.blocks[0].values.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

>>> pd.options.mode.copy_on_write = True
>>> df = pd.concat([pd.DataFrame(arr), pd.DataFrame(arr)], axis=1, ignore_index=True)
>>> df._mgr.blocks[0].values.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
```
So something seems to go wrong in the DataFrame constructor: with CoW enabled, it should make a copy by default, but when making this copy, it does not seem to use the correct C vs F order.
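A minimal illustration of how a default numpy copy can keep the "wrong" memory order (my reconstruction, not the pandas constructor code): `np.array` defaults to `order='K'`, which preserves the layout of the input, so copying an F-ordered view stays F-ordered unless C order is requested explicitly:

```python
import numpy as np

arr = np.random.randn(1_000_000, 10)   # user-facing shape (rows, cols), C order
block_values = arr.T                   # blocks are stored transposed: (cols, rows), F order

copy_default = np.array(block_values)       # order='K' keeps the F layout
print(copy_default.flags["C_CONTIGUOUS"])   # False -- the problematic layout

copy_c = np.array(block_values, order="C")  # explicitly request row-major
print(copy_c.flags["C_CONTIGUOUS"])         # True -- the expected block layout
```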
`series_methods.ToNumpy.time_to_numpy`

```python
import numpy as np
import pandas as pd

N = 1_000_000
ser = pd.Series(np.random.randn(N))
%timeit ser.to_numpy()
```

gives

```
In [80]: %timeit ser.to_numpy()
846 ns ± 57.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)  # without CoW
2.38 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)   # with CoW
```
In this case there is some extra work being done to return a read-only view, so it might just be that (I haven't checked in detail; from a quick profile nothing really stood out apart from the body of `to_numpy` itself taking a bit longer).
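For context, a quick way to see that read-only behaviour (assuming CoW is enabled; `to_numpy()` then still returns a view, but one flagged non-writeable):

```python
import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = True  # the pandas 2.x opt-in

ser = pd.Series(np.random.randn(1_000_000))
arr = ser.to_numpy()           # a view, marked read-only under CoW
print(arr.flags["WRITEABLE"])  # False: mutating arr can't silently alter ser
```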
Starting a list here of what is a false positive / expected:

- both `to_dict` cases are only as fast as they are because of the item_cache, which is something we removed, so nothing we can do
- also item_cache: iterating over the columns once is actually twice as fast now, but doing it repeatedly now scales linearly; this is expected (see the sketch after this list)
- `groupby.Fillna.time_srs_bfill` and fill are actually faster now if you increase the object size; the actual object is tiny, so it's probably the constant cost that caused the increase (we are 20% faster if we create an object that runs for around 100ms)
- `time_loc` scales well; it's only on small DataFrames that we see the slowdown
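A rough sketch of the item_cache point (my illustration, timings indicative only): repeated `df[col]` lookups used to hit a cached Series, while now each lookup constructs a fresh one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000, 100))

# First pass over the columns: now faster (no cache bookkeeping to maintain).
for col in df.columns:
    _ = df[col]

# Repeated passes: previously almost free thanks to the item_cache; now each
# df[col] builds a new Series, so total time scales with the number of passes.
for _ in range(10):
    for col in df.columns:
        _ = df[col]
```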
For the several `time_stack` ones (e.g. `reshape.ReshapeMaskedArrayDtype.time_stack - dtype='Float64'`), it seems that the CoW PR only gave a small perf regression (which I think is again due to the item cache being gone), but those benchmarks show a much bigger slowdown some commits later (I see @rhshadrach and @DeaMariaLeon were discussing those yesterday on Slack). Given that those are not related to CoW, I am marking them as "done" in the top post here.
For the various `arithmetic.NumericInferOps.time_add - dtype=<class 'numpy.int16'>` cases (also for `time_multiply` and `time_subtract`, I assume): the small slowdown is also due to the item_cache removal. The fact that we only see it for the lower-bitwidth dtypes is because the actual addition operation for those is much faster, so the overhead of getting the column becomes relatively bigger, making it noticeable in the benchmark (see the sketch below). Marked those as "done".
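To make the relative-overhead argument concrete (illustrative only): the raw elementwise op on a narrow dtype moves a quarter of the bytes of the int64 equivalent, while the fixed per-column cost stays the same:

```python
import numpy as np

a16 = np.ones(500_000, dtype="int16")  # 1 MB of data
a64 = np.ones(500_000, dtype="int64")  # 4 MB of data

# %timeit a16 + a16   # much cheaper: 4x less memory traffic than int64
# %timeit a64 + a64
# With the same fixed per-column overhead added on top (e.g. from the
# item_cache removal), the int16 benchmark shows a larger *relative* slowdown.
```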
For `strings.Construction.time_construction`, the reason for the slowdown is that `Series(nparr)` now makes a copy by default, so that is expected (given the benchmark does not use `copy=False`).
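For reference, the opt-out looks like this (a sketch; under CoW the constructor copies the input array by default):

```python
import numpy as np
import pandas as pd

data = np.array([str(i) * 5 for i in range(1_000)], dtype=object)

ser_copy = pd.Series(data)              # CoW default: defensive copy of `data`
ser_view = pd.Series(data, copy=False)  # opts out; caller promises not to mutate `data`
```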
dtype = "float"
axis = 0
values = np.random.randn(100000, 4)
df = pd.DataFrame(values).astype(dtype)
%timeit df.sum(axis=axis)
This was also because of the constructor copying the array without creating the correct columnar layout, so it should be fixed on main with https://github.com/pandas-dev/pandas/pull/57459 (confirmed locally; in a few days we should also see this in the ASV results).
`series_methods.Fillna.time_fillna - dtype='string'`

```python
import numpy as np
import pandas as pd

dtype = "string"
N = 10**6
data = np.array([str(i) * 5 for i in range(N)], dtype=object)
na_value = pd.NA
fill_value = data[0]
ser = pd.Series(data, dtype=dtype)
ser[::2] = na_value
%timeit ser.fillna(fill_value)
```
This is a slowdown that I can reproduce, and it is due to the additional `hasna` check that we added (I am linking to the 2.2 code, so it's clearer what is run with CoW vs before).
So for CoW, we added an additional check to see if there are actually NAs to fill, and if not, we just return the original values plus the correct refs. But for the case where we are actually filling NAs, this `_hasna` check gives a significant overhead (around 30% in this example, at least for the string dtype, since that's the only dtype that shows a regression in the ASV results).
Now, that also only affects our own object-backed string dtype, so given we have pyarrow-backed strings as a faster alternative if you care about performance, this is probably an OK trade-off (since the `hasna` check gives other benefits, like avoiding a copy when there are no NAs).
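Schematically, the CoW fast path looks like this (my sketch, not the actual pandas internals):

```python
import pandas as pd

ser = pd.Series(["a", pd.NA, "c"], dtype="string")

mask = ser.isna().to_numpy()
if not mask.any():
    # no NAs: hand back the original values with the right refs, no copy
    result = ser
else:
    # NAs present: here the extra hasna check was pure overhead
    result = ser.fillna("x")
```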
Although after taking a quick look at `isna` for strings, it's also easy to speed it up, to offset the additional overhead a bit -> https://github.com/pandas-dev/pandas/pull/57733
I've looked through the most recent benchmarks, confirmed https://github.com/pandas-dev/pandas/issues/57431#issuecomment-1978212744, and checked off the box.
PR #56633 may have induced a performance regression. If it was a necessary behavior change, this may have been expected and everything is okay.
Please check the links below. If any ASVs are parameterized, the combinations of parameters that a regression has been detected for appear as subbullets.
Subsequent benchmarks may have skipped some commits. The link below lists the commits that are between the two benchmark runs where the regression was identified.
Commit Range
cc @phofl