One design question that was still left open is what to do with Series, for which we currently have a `SingleBlockManager`. The two obvious options I can think of:

1. Add a `SingleArrayManager` counterpart of the current `SingleBlockManager`.
2. Get rid of the manager for Series altogether, and let a Series hold its array and index directly.

A while ago I thought the second option could be an attractive simplification (because in the end, a Series "just" consists of an array and an index, so why would it need a manager?). But that was probably a bit naive ;) The Manager still does quite a few things, and moreover, doing a SingleArrayManager keeps the changes more limited (we can still see later whether getting rid of Single(Block/Array)Manager is an option we want to explore, independently of the BlockManager vs ArrayManager debate). And for implementing certain features consistently between Series and DataFrame, having both backed by a manager is probably useful.
Now, for the actual `SingleArrayManager`, some thoughts:

For Series we currently have `SingleBlockManager`, which is actually a subclass of `BlockManager`. But many methods of the BlockManager are not written to work with a SingleBlockManager, which means the SingleBlockManager carries quite a few methods that are never used / would error if used, which I don't find a very nice design pattern. An alternative for `SingleArrayManager` could be to not subclass `ArrayManager` itself (but only a base Manager class). We could of course still have a mixin to share those parts that can be shared (see the sketch below).

I am currently testing out the approach of a separate SingleArrayManager class to see what is needed to implement it fully.
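To make that alternative concrete, here is a rough sketch of the class layout (names and attributes are illustrative only, not the actual pandas internals): the shared logic lives in a base class, and `SingleArrayManager` sits next to `ArrayManager` rather than inheriting methods it can never use.

```python
# Illustrative sketch only -- not the actual pandas internals API.
from typing import List

import numpy as np


class BaseManager:
    """Methods shared by the DataFrame and Series array-based managers."""

    arrays: List[np.ndarray]
    _axes: list

    @property
    def shape(self):
        return tuple(len(ax) for ax in self._axes)


class ArrayManager(BaseManager):
    """Backs a DataFrame: one 1D array per column."""

    def __init__(self, arrays, axes):
        self.arrays = arrays   # list of 1D arrays, one per column
        self._axes = axes      # [index, columns]


class SingleArrayManager(BaseManager):
    """Backs a Series: a single 1D array plus an index."""

    def __init__(self, array, index):
        self.arrays = [array]  # keep the list form so shared code paths work
        self._axes = [index]

    @property
    def array(self):
        return self.arrays[0]
```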
cc @jreback @jbrockmendel @TomAugspurger
Posting here some results from parts of the benchmark suite. I always ran the benchmark on a commit where I changed the default to ArrayManager (HEAD) vs the previous commit with the normal default of BlockManager (HEAD~1). So using the `diff` formatting to get some color: green is slower and red is faster for the ArrayManager.
The "am-benchmarks" branch I am using is currently master + the 2 perf-related PRs I opened today + the commit to change the default.
First batch of results are benchmarks related to reshaping / take / reindex:
join_merge
$ asv continuous -f 1.01 -b join_merge HEAD~1 HEAD
- The slowdowns are mainly in the `concat(.., axis=0)` cases. Many of the join/merge/merge_asof cases show some speedup.
- The `concat(.., axis=1)` benchmarks give a huge difference (factor "0.00" for after), but this is largely explained by `copy` (the BlockManager-based method copies in this case of no reindexing, while the current ArrayManager version doesn't do that (yet)). If I add a copy, it becomes a more modest speedup of 2-3x (instead of >100x).
- The `concat(.., axis=0)` case that shows a 2x slowdown with ArrayManager is the expected case, I think. That specific benchmark uses a DataFrame with a single float dtype and 200 columns (a single block) and concats it 20 times, so concatenating 20 2D arrays is always going to be faster than concatenating 200 times 20 1D arrays (the benchmark case has 200 columns, and does `pd.concat([df] * 20, axis=0)`). From profiling the operation, almost all time is spent in the actual numpy concatenation routine (for both cases), so I don't expect there is much room for improvement here.

reindex
$ asv continuous -f 1.01 -b reindex HEAD~1 HEAD
There is one case (`frame_methods.Reindex.time_reindex_axis0`) with a slowdown factor of 1.8. But since this is reindexing a wide (1000 columns) and homogeneously dtyped (1 block) DataFrame, I think this is a "worst case", and so an acceptable slowdown IMO.

reshape
$ asv continuous -f 1.01 -b reshape HEAD~1 HEAD
- Not everything in `reshape.py` shows a difference (eg it includes cases like `pd.cut`, so that's expected), and for `unstack` it shows somewhat varying results: some cases are faster, some are slower (it depends on the number of columns, the dtypes, whether the BlockManager version could take the fast implementation or not, etc).
- The case with the biggest slowdown for unstack (`reshape.Unstack` with int dtype) creates a DataFrame with a shape of 100 rows × 100000 columns (so a very wide DataFrame, and thus a slowdown is expected / acceptable IMO). For the "full product" case (no missing values get introduced), the ArrayManager is "only" 5 times slower compared to reshaping the single block. When missing values get introduced, the difference gets bigger, but that is potentially something that can be further improved.
- `reshape.SimpleReshape.time_unstack` is similar, but there the slowdown is much smaller (x1.29), because it produces a less wide dataframe (100 rows × 400 columns) and uses floats (in which case we don't need to check for upcasting if nulls get introduced).
- The `stack` case can probably be optimized, as currently it goes through a conversion of the DataFrame to a numpy array (which is more expensive for a non-single-block ArrayManager).

Are there any consistent benchmark results that don't make sense to you? (xref #40066)
> If I add a copy, it becomes a more modest speedup of 2-3x (instead of >100x).

can you mark the ones that are significantly affected by this?

Big picture, can you give an update on what is left to implement (e.g. I'm guessing JSON and PyTables)?
> Are there any consistent benchmark results that don't make sense to you?
I didn't see anything particularly strange, up to now. (In the next batch about reductions, I noticed some strange speed-ups that probably point to an issue in the BlockManager implementation, but I will comment on that in detail in the next post, probably tomorrow.)
> If I add a copy, it becomes a more modest speedup of 2-3x (instead of >100x).
>
> can you mark the ones that are significantly affected by this?
Only the ones doing a simple `concat(.., axis=1)`, so the 4 `ConcatDataFrames` benchmarks (the 4 that show the biggest speedup, so at the bottom of the list). Other benchmarks like `merge` that also have a speedup don't have that issue, since those already did a reindexing (and thus a copy) anyway.
But note that, if we decide to go with copy-on-write, we don't need to add a copy to preserve behaviour.
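(As a quick way to see whether a given `concat(.., axis=1)` result copied its inputs, one can check memory sharing directly; the snippet below is only illustrative, and the outcome depends on the manager and pandas version.)

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": np.arange(5.0)})
df2 = pd.DataFrame({"b": np.arange(5.0)})

res = pd.concat([df1, df2], axis=1)
# True means the result reuses df1's data (no copy was made for this column)
print(np.shares_memory(res["a"].to_numpy(), df1["a"].to_numpy()))
```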
> Big picture, can you give an update on what is left to implement (e.g. I'm guessing JSON and PyTables)?
Yes, with the several indexing-related PRs I opened today, we are running most of the tests. The biggest chunks of skipped tests are related to JSON and PyTables IO. Related to IO, there is also Parquet (but that depends on downstream libraries) and pickling.
Next to that, there are still various smaller corner cases to fix (the `TODO(ArrayManager)` skipped tests), but in the grand scheme of things, the ones that are left are relatively minor, I think. I also need to get back to `concat(.., axis=0)` with reindexing (#39612), and there are still some usages of `apply_with_block` that ideally would be cleaned up (although that doesn't prevent ArrayManager from being usable).
For performance related issues, I mainly need to get back to the element-wise ops I started with a while ago (#39772, and related PRs).
Some more benchmark results:
groupby
$ asv continuous -f 1.01 -b groupby HEAD~1 HEAD
A few things are significantly slower:
- The apply benchmarks (`groupby.Apply.time_scalar_function_single_col(4)` and `groupby.Apply.time_scalar_function_multi_col(4)`) are 2-3x slower. This would be fixed by https://github.com/pandas-dev/pandas/pull/40171 (supporting ArrayManager in libreduction for dataframes), but that's pending a decision on getting rid of libreduction altogether or not.
- The `groupby.GroupManyLabels.time_sum(1000)` case is 16x slower. This is a case of a small but wide dataframe (shape of 1000x1000), and it's the per-column overhead of checking dtypes etc in `BaseGrouper._cython_operation`. https://github.com/pandas-dev/pandas/pull/40317 already improves this a little bit, but there is certainly still room to further cut down this overhead by using more specialized dtype checks.

What's faster:
- `groupby.Apply.time_copy_overhead_single_col(4)` is considerably faster (4x), but that might actually be an (existing) bug / inconsistency between libreduction fast_apply and the fallback python apply. With ArrayManager (taking the python fallback), it seems to infer it's doing a transform (same indexed result) and thus doesn't reorder the result / doesn't prepend the index with the groupby keys. Will open an issue about this -> https://github.com/pandas-dev/pandas/issues/40446
- The `GroupByMethods` benchmarks are slightly faster with ArrayManager, which from a quick look seems to stem from a smaller overhead in `_get_numeric_data`, but I think this overhead for BlockManager is also only visible because of the relatively small size of the data.

Reductions (stat_ops)
$ asv continuous -f 1.01 -b stat_ops HEAD~1 HEAD
`DataFrame.corrwith(..., method="pearson", axis=1)` is a lot slower. But in general, this is a tiny but wide dataframe, where the fixed overhead is significant. If I increase the size of the dataset, both have more or less the same performance.
Reductions with `axis=1` (so taking eg the mean of each row, instead of each column) are slower, as can be expected (since this needs to go through a transpose / conversion to a numpy array).
The `FrameMultiIndexOps` are 1.2-1.8x slower. These are basically `groupby` benchmarks, and thus show that the built-in groupby aggregations are slightly slower when done column-wise (but in the future we might be able to optimize this, if we can go all-in on 1D arrays in the cython code, xref #39861).
The `FrameOps` with float dtypes are a bit faster with ArrayManager. From a quick look at the profiles, it seems that the handling of `min_count` is more efficient for 1D arrays (which have scalars as result) than for 2D arrays (probably something that can be optimized).
The `FrameOps` with int dtypes are quite a bit faster with ArrayManager, but that seems to be caused by a suboptimal memory layout in the BlockManager case (making it slower than it could be), so it might be an artefact of the setup code of the benchmark case.
Element-wise ops (arithmetic)
$ asv continuous -f 1.01 -b arithmetic HEAD~1 HEAD
The above is with a branch that combines several WIP changes (https://github.com/pandas-dev/pandas/compare/master...jorisvandenbossche:ops-refactor-combined?expand=1; for some there is already an open PR, eg #40445, #40444, #40396, #39772, for others I still need to open a PR because they depend on other changes).
In general, the element-wise ops are probably the set of operations that can see the biggest impact from performing column-by-column instead of on a single block.
A few first notes:
- `dot()` is slower, but since that's basically a 2D array operation, that's to be expected.
- Some comparison operations are slower (the `IntFrameWithScalar.time_frame_op_with_scalar` benchmarks in the output above with lt/gt/eq/ne/... ops). That's something I need to investigate a bit more in detail, because this slowdown doesn't seem to come from per-column overhead, but actually from numexpr being slower.
- Many of the `FrameWithFrameWide` benchmarks actually don't show up in the above output (meaning that they didn't give a significant difference), and the ones that do show up with the biggest slowdown are the cases with shape (1000, 10000), so wide dataframes. For wide dataframes, some slowdown will be inevitable (the question is of course how much we find acceptable).

UPDATE 2021-04-01: Using latest master + https://github.com/pandas-dev/pandas/pull/40482
$ asv continuous -f 1.01 -b arithmetic HEAD~1 HEAD
Two notes about a few benchmarks that are (unexpectedly) a lot faster with ArrayManager:
- The `IntFrameWithScalar` benchmark is more than 10x faster for the `pow` operations in certain cases. I could reproduce this outside of ASV, and it turns out this is due to numexpr being slower: because of the size of the dataframe, the column-wise op (ArrayManager) doesn't use numexpr, while the block-wise op (BlockManager) does. In an environment without numexpr, both were more or less the same.
- The `IntFrameWithScalar` benchmark with the `floordiv` operation is faster with ArrayManager. This turns out to be because, when using numexpr, we incorrectly end up falling back to `_masked_arith_op` (which is slower), because the operation is not supported by numexpr.

Over the last weeks I have been updating the status of this project (and fixing some regressions), and rerunning the benchmarks. This (long) post gives an overview of the current ASV benchmarks with the ArrayManager.
Technical notes: I always ran the benchmark on a commit where I changed the default to ArrayManager (HEAD) vs the previous commit with the normal default of BlockManager (HEAD~1) -> `asv continuous -f 1.0 HEAD~1 HEAD`. So using the `diff` formatting to get some color: green is slower and red is faster for the ArrayManager.
The "am-benchmarks" branch I am using is master (of yesterday morning) + a few open close-to-mergeable (all-green) ArrayManager related PRs (#41104 (fillna) , #44736 (interpolate), #44791 (eval)) + the commit to change the default.
I am going to split the results here by topic / file (each time with a small discussion, repeating some stuff from above), but the results of the full run are also included at the bottom.
ToNumpy
$ asv continuous -f 1.01 -b ToNumpy HEAD~1 HEAD
The mixed-dtype case (`time_values_mixed_tall`) doesn't show up here, so for that case there is no significant difference (in this case a new array needs to be created for both ArrayManager and BlockManager).

reshape
$ asv continuous -f 1.01 -b reshape HEAD~1 HEAD
The big slowdown is in transposing a (10, 1000) shaped DataFrame with all datetimetz columns. Most of the time here is spent in boxing/unboxing Timestamp scalars (because for datetimetz we use object dtype array). I think it would be relatively straightforward to have a specialized method that avoids going through scalars but uses a native numpy dtype for the transpose operation.
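A minimal sketch of that idea (the helper below is made up for illustration, assumes all columns share the same tz-aware dtype, and is not the pandas implementation):

```python
import numpy as np
import pandas as pd


def transpose_datetimetz(df: pd.DataFrame) -> pd.DataFrame:
    # assumes every column has the same tz-aware datetime64 dtype
    tz = df.dtypes.iloc[0].tz

    # Series.values converts tz-aware data to naive datetime64[ns] (UTC),
    # so the transpose itself stays a plain 2D numpy operation (no Timestamp boxing)
    values = np.column_stack([col.values for _, col in df.items()])
    transposed = values.T  # shape (n_columns, n_rows)

    # rebuild tz-aware columns from the transposed naive UTC values
    data = {
        label: pd.DatetimeIndex(transposed[:, j]).tz_localize("UTC").tz_convert(tz)
        for j, label in enumerate(df.index)
    }
    return pd.DataFrame(data, index=df.columns)
```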
In general, `transpose` is an operation that will of course always do worse when using a columnar layout in an otherwise single-block case.
The case with the biggest slowdown for unstack (`reshape.Unstack` with int dtype) creates a DataFrame with a shape of 100 rows × 100000 columns (so a very wide DataFrame, and thus a slowdown is expected / acceptable IMO). For the "full product" case (no missing values get introduced), the ArrayManager is "only" 5 times slower compared to reshaping the single block.
Many of the reshape benchmarks also don't show up above, meaning there is no significant difference in performance for those.
(The speedup in `reshape.Unstack.time_without_last_row('category')` can be ignored, as that has been optimized for BlockManager on master since then (https://github.com/pandas-dev/pandas/pull/44758).)
reindex
$ asv continuous -f 1.01 -b reindex HEAD~1 HEAD
There is one case (`frame_methods.Reindex.time_reindex_axis0`) with a slowdown factor of ~2. But since this is reindexing a wide (1000 columns) and homogeneously dtyped (1 block) DataFrame, I think this is a "worst case", and so an acceptable slowdown IMO.

other frame_methods (leaving out ToNumpy and reindex from above)
$ asv continuous -f 1.01 -b frame_methods HEAD~1 HEAD
The quantile with axis=1 is much slower (similar to reductions over the rows instead of the columns in general, see below in the "Reductions" section). In this case it will probably be more efficient to calculate the quantiles on the 2D array instead of column-by-column (since we already transpose the DataFrame in case of axis=1); a small sketch of that idea follows below. This is also what happens for the other row-wise reductions, which only show a 1-5x slowdown.
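A small illustration of that idea (plain numpy calls for comparison, not the pandas code path):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 50))

# column-by-column style: one quantile call per row of the transposed data
per_row_loop = pd.Series(
    [np.nanquantile(row, 0.5) for row in df.to_numpy()], index=df.index
)

# single call on the 2D array, reducing along axis=1
vectorized = pd.Series(np.nanquantile(df.to_numpy(), 0.5, axis=1), index=df.index)

assert np.allclose(per_row_loop, vectorized)
```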
`XS.time_frame_xs(0)` selects a row as a Series from a homogeneously dtyped DataFrame with 10,000 columns. So constructing the array for the Series instead of selecting from the 2D block array is of course more expensive.
There are some slowdowns in `isnull`-related benchmarks: 1) those are using square DataFrames with shape (1000, 1000) (take a 10000x100 one instead, and there is hardly a difference anymore), and 2) some overhead can easily be avoided (see eg https://github.com/jorisvandenbossche/pandas/commit/2921441359b9ae09ad040e201f9d262ad630414b, which uses isna_array instead of isna in the ArrayManager method, and optimizes some dtype checks).
The `MaskBool` benchmarks still go through BlockManager.apply, causing some overhead (this is a TODO in ArrayManager).
join_merge
$ asv continuous -f 1.01 -b join_merge HEAD~1 HEAD
- The slowdowns are mainly in the `concat(.., axis=0)` cases.
- The `concat(.., axis=0)` case that shows a 2x slowdown with ArrayManager is the expected case, I think. That specific benchmark uses a DataFrame with a single float dtype and 200 columns (a single block) and concats it 20 times, so concatenating 20 2D arrays is always going to be faster than concatenating 200 times 20 1D arrays (the benchmark case has 200 columns, and does `pd.concat([df] * 20, axis=0)`). From profiling the operation, almost all time is spent in the actual numpy concatenation routine (for both cases), so I don't expect there is much room for improvement here.

groupby
$ asv continuous -f 1.01 -b groupby HEAD~1 HEAD
The `groupby.GroupManyLabels.time_sum(1000)` case is ~13x slower. This is a case of a small but wide dataframe (shape of 1000x1000), and it's the per-column overhead of checking dtypes etc in `BaseGrouper._cython_operation` and `_call_cython_op`. https://github.com/pandas-dev/pandas/pull/44738 shows that this overhead can be reduced a bit.
Also worth noting that if you go from (1000, 1000) to (10,000, 100), the overhead reduces to a 3x slowdown (on current master), and with another 10x more rows, to a 2x slowdown.
I can't reproduce the differences in `GroupByMethods.time_dtype_as_group` in an interactive python session with `%timeit`. For example, the groupby `cumsum()` with 5 columns would show a 3x slowdown, but interactively the two are within 10% of each other. I didn't yet figure out why we see those slowdowns (for some cases, speedups for others) through ASV.
(While looking into this, I also noticed that the actual `group_cumsum` cython algo is only 1-2% of the overall time (for both ArrayManager and BlockManager), so those GroupByMethods benchmarks, due to the size of the dataframe / number of groups, are mostly benchmarking the factorize step of groupby, which is the same for each of the groupby methods.)
Note that, in general, there can be some room for a small speed-up if we simplify our groupby algos for 1D arrays; see the discussion in https://github.com/pandas-dev/pandas/pull/39861.
Reductions (stat_ops)
$ asv continuous -f 1.01 -b stat_ops HEAD~1 HEAD
`DataFrame.corrwith(..., method="pearson", axis=1)` is a lot slower. But in general, this is a tiny but wide dataframe, where the fixed overhead is significant. If I increase the size of the dataset, both have more or less the same performance.
Reductions with `axis=1` (so taking eg the mean of each row, instead of each column) are slower, as can be expected (since this needs to go through a transpose / conversion to a numpy array).
For the rest, column-wise reductions are more or less the same for ArrayManager vs BlockManager on those benchmarks (some a bit faster, some a bit slower)
Element-wise ops (arithmetic)
$ asv continuous -f 1.01 -b arithmetic HEAD~1 HEAD
In general, the element-wise ops are probably the set of operations that see the biggest impact from performing column-by-column instead of on a single block:
`dot()` is slower, but since that's basically a 2D array operation, that's to be expected.
The biggest slowdown is seen in the `FrameWithFrameWide` class. However, many of the cases actually don't show up in the above output (meaning that they didn't give a significant difference), and the ones that do show up with the biggest slowdown are the cases with shape (1000, 10000), so very wide dataframes. For wide dataframes, some slowdown will be inevitable (the question is of course how much we find acceptable).
Similarly, the next set of benchmarks with the biggest slowdown is the `MixedFrameWithSeriesAxis` class, which uses a DataFrame of shape (1000, 1000), i.e. also a wide DataFrame. When changing the shape to (10,000, 100), the difference already reduces from a 7x slowdown to 2x for the slowest case.
For the wide dataframes, the main issue is the per-column overhead (checking dtypes, checking which array op should be used, ...). I think there is quite a bit of room for improvement to reduce this per-column overhead for those operations, to specifically improve the wide dataframe case. Some examples:
- Currently, the functions in `array_ops.py` involve a lot of array type checking (`should_extension_dispatch`), dtype checking, validating of the right operand, etc, that can be optimized or moved upwards (see the sketch after this list). For example, in the `FrameWithFrameWide` case of summing two dataframes (`df + df`), the actual summing (the numpy operation) takes only around 20% of the overall time, while `should_extension_dispatch` and `dispatch_fill_zeros` together already take more time than that (both functions are called in the `arithmetic_op` array op function, where `should_extension_dispatch` is mainly slow because it uses an ABC check, and `dispatch_fill_zeros` is actually a no-op for addition, so should never be called).
- In the step before the array op (`_dispatch_frame_op`), we could avoid broadcasting Series to DataFrame as explored in https://github.com/pandas-dev/pandas/pull/40482, which improves the `time_frame_op_with_series_axis0/1` cases (eg `frame + series`) quite a bit.
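As a toy illustration of the kind of hoisting meant above (stand-in functions only, not the actual pandas ops code): do the dispatch/dtype check once per operation instead of once per column.

```python
import operator

import numpy as np


def expensive_dispatch_check(arr) -> bool:
    # stand-in for should_extension_dispatch / dtype validation done per column
    return not isinstance(arr, np.ndarray)


def op_per_column(left_cols, right_cols, op=operator.add):
    # current column-wise style: checks repeated inside the loop over columns
    out = []
    for left, right in zip(left_cols, right_cols):
        if expensive_dispatch_check(left) or expensive_dispatch_check(right):
            raise NotImplementedError("extension arrays not handled in this toy")
        out.append(op(left, right))
    return out


def op_per_column_hoisted(left_cols, right_cols, op=operator.add):
    # hoisted: if everything is a plain numpy array, check once and run a tight loop
    if any(expensive_dispatch_check(a) for a in (*left_cols, *right_cols)):
        raise NotImplementedError("extension arrays not handled in this toy")
    return [op(left, right) for left, right in zip(left_cols, right_cols)]
```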
One note about `IntFrameWithScalar.time_frame_op_with_scalar`: those use a DataFrame of such a shape that the BlockManager version uses numexpr and the ArrayManager version doesn't (because the minimum number of elements has to be reached per column here, and not per block). It seems that the cases that are ~2x slower for ArrayManager are mostly comparison operations, and apparently numexpr is quite a bit faster for those ops compared to numpy, so the difference in those benchmarks is mostly showing the effect of numexpr. It might be worth investigating whether we should reduce the minimum number of elements specifically for those operations, but it's also worth noting that when using a larger dataframe, the ArrayManager case will also start using numexpr.
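To see how much of such a difference comes from numexpr itself, one can time the same comparison with the `compute.use_numexpr` option toggled (whether numexpr actually kicks in also depends on the internal element-count threshold); a rough sketch:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(50_000, 20)))

pd.set_option("compute.use_numexpr", True)
t_numexpr = timeit.timeit(lambda: df < 50, number=100)

pd.set_option("compute.use_numexpr", False)
t_numpy = timeit.timeit(lambda: df < 50, number=100)

print(f"with numexpr option: {t_numexpr:.3f}s, without: {t_numpy:.3f}s")
```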
The remainder
$ asv continuous -f 1.01 HEAD~1 HEAD
(with the above entries removed)

The `pd.eval` benchmarks using the numexpr engine show a 5 to 10x slowdown. This seems to be due to the fact that our numexpr engine works on the full dataframe as one array (so effectively calling `df.values` for something like `pd.eval("df1 + df2")`). It is this `df.values` call that causes the slowdown, and we will probably need to refactor the numexpr engine to also be able to work column-by-column when dealing with dataframes.
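For reference, the two engines can be compared directly (small usage example; the results are equal, only the evaluation path differs, and the numexpr engine needs numexpr installed):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(1000, 10))
df2 = pd.DataFrame(np.random.randn(1000, 10))

res_numexpr = pd.eval("df1 + df2", engine="numexpr")  # goes through df.values
res_python = pd.eval("df1 + df2", engine="python")    # regular pandas ops

pd.testing.assert_frame_equal(res_numexpr, res_python)
```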
The `rolling.TableMethod.time_apply('table')` benchmark shows a 6x slowdown, but uses a DataFrame of shape (10, 1000), so a very wide dataframe. When changing this to (1000, 10), there is no difference between the two.
The `SelectDtypes.time_select_dtype_...` benchmarks show a ~2x slowdown, because now the dtype of every column instead of every block needs to be checked (i.e. more dtype checks). But for a DataFrame of 50 columns, we are still talking about times around 50µs, so not something to worry about, I think.
All results combined
$ asv continuous -f 1.01 HEAD~1 HEAD
Related to the discussion in https://github.com/pandas-dev/pandas/issues/10556, and following up on the mailing list discussion "A case for a simplified (non-consolidating) BlockManager with 1D blocks" (archive).
Initial proof of concept for a non-consolidating "ArrayManager" (storing the columns of a DataFrame as a list of 1D arrays instead of blocks) is merged in https://github.com/pandas-dev/pandas/pull/36010.
This issue is meant to track the required follow-up work items to get this to a more feature-complete implementation.
Functionality: get all tests passing
- `quantile` / `describe` related (ArrayManager.quantile is not yet implemented) -> https://github.com/pandas-dev/pandas/pull/40189
- `equals` related (ArrayManager.equals is not yet implemented) -> #39721
- `groupby` related tests (there are still a few parts of groupby that directly use the blocks) -> #39885, #40050
- `concat` related (`internals/concat.py` only deals with the simple case when no reindexing is needed for ArrayManager at the moment; the full functionality (similar to what `concatenate_block_managers` / the `JoinUnits` now cover) still needs to be implemented) -> https://github.com/pandas-dev/pandas/pull/39612
- Indexing related (`setitem`, `iset`, `insert` are not yet fully implemented for all corner cases + get indexing tests passing)
- Methods still using the block-wise fallback: `replace`, `where`, `interpolate`, `shift`, `diff`, `downcast`, `putmask`, ... (those could all be refactored one at a time)
- Tests currently skipped with `@td.skip_array_manager_invalid_test`

Design questions:
Performance