modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.89k stars 653 forks source link

PERF: infer length/width in binary_operation? #4738

Closed jbrockmendel closed 8 months ago

jbrockmendel commented 2 years ago

IIUC PandasDataframePartitionManager.binary_operation is used for calling arithmetic/comparison ops partition-by-partition, and the resulting partitions will have the same shapes as the inputs. If this is accurate, then we should be able to have the result partitions copy any cached length/width info on the inputs, right?

mvashishtha commented 2 years ago

@jbrockmendel the resulting partitions do not necessarily have the same shapes as the input partitions of (even a single one of). We may have to "copartition" along both axes in order to split the binary operation blockwise. Consider df1 and df2 below:

import modin.pandas as pd

ab = pd.DataFrame([[0, 1]], columns=['a', 'b'])
c = pd.DataFrame([2], columns=['c'])
df1 = pd.concat([ab, c], axis=1)

a = pd.DataFrame([3], columns=['a'])
bc = pd.DataFrame([[4, 5]], columns=['b', 'c'])
df2 = pd.concat([a, bc], axis=1)

df1 + df2

df1 has one column partition with columns [a] and the next with columns [b, c]. df2 has first column partition with columns [a, b] and the next with columns [c]. Because the binary operation joins columns by label, Modin repartitions the second frame that it matches the partitioning of the first one, then does the

The axis=1 reindexing gets more complicated when the two frames have different column labels.

Similarly if the two dataframes each have 3 rows with the same labels, but the first frame has row partition sizes (2, 1) and the second has row partition sizes (1, 2), Modin has to repartition the second frame along the row axis.

You can see the repartition in PandasDataframe.binary_op[https://github.com/modin-project/modin/blob/88c3c33cf9b8aab5f5b7077555bea9f2d8466534/modin/core/dataframe/pandas/dataframe/dataframe.py#L2459). We do preserve the lengths of the repartitioned frames and eventually pass them into the PandasDataframe constructors.

jbrockmendel commented 2 years ago

@mvashishtha I understand copartitioning may occur. But after co-partitioning, aren't the partition-by-partition operations shape-preserving? That's why im suggesting they be preserved in binary_operation, not in binary_op.

mvashishtha commented 2 years ago

@jbrockmendel I think that binary_op is doing its best it to preserve the sizes caches at the PandasDataframe level. I think you might be thinking of the caches at the partition level? We do lose those when we apply a function per partition in binary_operation here:

https://github.com/modin-project/modin/blob/88c3c33cf9b8aab5f5b7077555bea9f2d8466534/modin/core/dataframe/pandas/partitioning/partition_manager.py#L1290-L1293

and I think you're correct that the binary operations at that point should preserve shape.

As I mentioned here, there are many places like that in Modin, but I doubt we'd get much marginal utility from preserving caches at the partition level in addition to the overall dataframe level. Should we follow up on #4732, where you mentioned that you had an example where losing the partition shape cache was hurting performance?

mvashishtha commented 2 years ago

@jbrockmendel is this bug a dupe of #4727?

edit: never mind, you said you were working @anmyachev on #4740, not this one.

vnlitvinov commented 2 years ago

@mvashishtha I understand copartitioning may occur. But after co-partitioning, aren't the partition-by-partition operations shape-preserving? That's why im suggesting they be preserved in binary_operation, not in binary_op.

I might be missing some context, but I would like to add that co-partitioning produces a copy of the right operand which is discarded after the operation.

I mean, if you do e.g. df3 = df1 + df2, then df3 should have similar shape to df1, but df2 will still have its original shape even if we repartitioned it to perform the binary operation.

jbrockmendel commented 2 years ago

@YarShev i think you said you confirmed this optimization is correct but found it didn't make a huge difference in the use case you were profiling? worth making a PR?

YarShev commented 2 years ago

Yes, I didn't see any perf gain from that by profiling our use case so I don't think it is worth making a PR for now.

YarShev commented 8 months ago

We do preserve lengths and widths for a binary operation so closing this issue. Feel free to reopen if needed. https://github.com/modin-project/modin/blob/7c835a2761ede41d402c1febd29826c1d0b9512f/modin/core/dataframe/pandas/dataframe/dataframe.py#L3752-L3753