Closed jbrockmendel closed 8 months ago
@jbrockmendel the resulting partitions do not necessarily have the same shapes as the input partitions of (even a single one of). We may have to "copartition" along both axes in order to split the binary operation blockwise. Consider df1
and df2
below:
import modin.pandas as pd
ab = pd.DataFrame([[0, 1]], columns=['a', 'b'])
c = pd.DataFrame([2], columns=['c'])
df1 = pd.concat([ab, c], axis=1)
a = pd.DataFrame([3], columns=['a'])
bc = pd.DataFrame([[4, 5]], columns=['b', 'c'])
df2 = pd.concat([a, bc], axis=1)
df1 + df2
df1
has one column partition with columns [a]
and the next with columns [b, c]
. df2
has first column partition with columns [a, b]
and the next with columns [c]
. Because the binary operation joins columns by label, Modin repartitions the second frame that it matches the partitioning of the first one, then does the
The axis=1 reindexing gets more complicated when the two frames have different column labels.
Similarly if the two dataframes each have 3 rows with the same labels, but the first frame has row partition sizes (2, 1) and the second has row partition sizes (1, 2), Modin has to repartition the second frame along the row axis.
You can see the repartition in PandasDataframe.binary_op
[https://github.com/modin-project/modin/blob/88c3c33cf9b8aab5f5b7077555bea9f2d8466534/modin/core/dataframe/pandas/dataframe/dataframe.py#L2459). We do preserve the lengths of the repartitioned frames and eventually pass them into the PandasDataframe
constructors.
@mvashishtha I understand copartitioning may occur. But after co-partitioning, aren't the partition-by-partition operations shape-preserving? That's why im suggesting they be preserved in binary_operation, not in binary_op.
@jbrockmendel I think that binary_op
is doing its best it to preserve the sizes caches at the PandasDataframe
level. I think you might be thinking of the caches at the partition level? We do lose those when we apply a function per partition in binary_operation
here:
and I think you're correct that the binary operations at that point should preserve shape.
As I mentioned here, there are many places like that in Modin, but I doubt we'd get much marginal utility from preserving caches at the partition level in addition to the overall dataframe level. Should we follow up on #4732, where you mentioned that you had an example where losing the partition shape cache was hurting performance?
@jbrockmendel is this bug a dupe of #4727?
edit: never mind, you said you were working @anmyachev on #4740, not this one.
@mvashishtha I understand copartitioning may occur. But after co-partitioning, aren't the partition-by-partition operations shape-preserving? That's why im suggesting they be preserved in binary_operation, not in binary_op.
I might be missing some context, but I would like to add that co-partitioning produces a copy of the right operand which is discarded after the operation.
I mean, if you do e.g. df3 = df1 + df2
, then df3
should have similar shape to df1
, but df2
will still have its original shape even if we repartitioned it to perform the binary operation.
@YarShev i think you said you confirmed this optimization is correct but found it didn't make a huge difference in the use case you were profiling? worth making a PR?
Yes, I didn't see any perf gain from that by profiling our use case so I don't think it is worth making a PR for now.
We do preserve lengths and widths for a binary operation so closing this issue. Feel free to reopen if needed. https://github.com/modin-project/modin/blob/7c835a2761ede41d402c1febd29826c1d0b9512f/modin/core/dataframe/pandas/dataframe/dataframe.py#L3752-L3753
IIUC PandasDataframePartitionManager.binary_operation is used for calling arithmetic/comparison ops partition-by-partition, and the resulting partitions will have the same shapes as the inputs. If this is accurate, then we should be able to have the result partitions copy any cached length/width info on the inputs, right?