modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

Binary operations on Series + DataFrame doesn't work #4578

Open dchigarev opened 2 years ago

dchigarev commented 2 years ago

System information

import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}).T
sr = pd.Series([10, 20, 30])

print(f"Pandas:\n{sr._to_pandas() + df._to_pandas()}")
print(f"Modin:\n{sr + df}")


<details><summary>Output</summary>

Pandas:
    0   1   2
a  11  22  33
Traceback (most recent call last):
  File "t3.py", line 8, in <module>
    print(f"Modin:\n{sr + df}")
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\logging\logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\pandas\series.py", line 163, in __add__
    return self.add(right)
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\logging\logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\pandas\series.py", line 514, in add
    return super(Series, new_self).add(
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\logging\logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\pandas\base.py", line 593, in add
    return self._binary_op(
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\logging\logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\pandas\base.py", line 431, in _binary_op
    new_query_compiler = getattr(self._query_compiler, op)(other, **kwargs)
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\logging\logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\core\dataframe\algebra\binary.py", line 92, in caller
    query_compiler._modin_frame.binary_op(
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\logging\logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\core\dataframe\pandas\dataframe\dataframe.py", line 115, in run_f_on_minimally_updated_metadata
    result = f(self, *args, **kwargs)
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\core\dataframe\pandas\dataframe\dataframe.py", line 2516, in binary_op
    return self.__constructor__(
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\logging\logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\core\dataframe\pandas\dataframe\dataframe.py", line 210, in __init__
    ErrorMessage.catch_bugs_and_request_email(
  File "C:\Users\rp-re\OneDrive\Desktop\rep\modin\modin\error_message.py", line 70, in catch_bugs_and_request_email
    raise Exception(
Exception: Internal Error. Please visit https://github.com/modin-project/modin/issues to file an issue with the traceback and the command that caused this error. If you can't file a GitHub issue, please email bug_reports@modin.org.
Column widths: 1 != 4



</details>


### Describe the problem
The code fails on the [column widths check](https://github.com/modin-project/modin/blob/4ec7f6347903f9133c65ebc5b6e0e15553b98577/modin/core/dataframe/pandas/dataframe/dataframe.py#L204-L219) when constructing the [binary operation result](https://github.com/modin-project/modin/blob/4ec7f6347903f9133c65ebc5b6e0e15553b98577/modin/core/dataframe/pandas/dataframe/dataframe.py#L2516-L2522).
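
The `Column widths: 1 != 4` message suggests an invariant of roughly this shape: the cached per-partition column widths must add up to the number of column labels. The sketch below is a hypothetical stand-in written in plain Python, not Modin's actual code; the function name, the exception type, and the inputs (one partition of width 1 against four labels) are assumptions chosen only to reproduce the same message.

```python
def validate_column_widths(column_labels, column_widths):
    """Raise if the partition widths do not add up to the number of column labels.

    Hypothetical helper for illustration; Modin's real check lives in the
    dataframe constructor linked above and may differ in detail.
    """
    total = sum(column_widths)
    if total != len(column_labels):
        raise ValueError(f"Column widths: {total} != {len(column_labels)}")


# Consistent metadata: one partition holding all three columns.
validate_column_widths([0, 1, 2], [3])

# Inconsistent metadata (assumed inputs): four labels, but a single
# partition that is only one column wide.
try:
    validate_column_widths(["__reduced__", 0, 1, 2], [1])
except ValueError as err:
    message = str(err)

print(message)
```

A mismatch like this would mean the labels and the cached widths were computed from different operands, which fits the explanation in the next paragraph.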

The problem is that [`binary_op`](https://github.com/modin-project/modin/blob/4ec7f6347903f9133c65ebc5b6e0e15553b98577/modin/core/dataframe/pandas/dataframe/dataframe.py#L2489) is designed for `df + df` operations only. Mixing a frame and a series has to be handled by broadcasting the series to every column of the frame instead of attempting to align the shapes of the two operands. We already have the [broadcasting logic inside `Binary`](https://github.com/modin-project/modin/blob/4ec7f6347903f9133c65ebc5b6e0e15553b98577/modin/core/dataframe/algebra/binary.py#L71-L89) operator. That logic is triggered when the `broadcast` parameter is `True`, which happens for `df + series`; for `series + df`, however, the parameter appears to be `False`.

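
For reference, the expected behaviour can be checked in plain pandas (not Modin). The sketch below reuses the operands from the reproducer and shows that `series + df` gives the same result as the reflected frame operation `df.radd(series, axis="columns")`, so operand order should not change whether broadcasting is used.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}).T   # one row labelled "a", columns 0, 1, 2
sr = pd.Series([10, 20, 30])            # index 0, 1, 2

# pandas matches the series index against the frame's column labels.
result = sr + df

# The reflected frame method expresses the same operation explicitly.
reflected = df.radd(sr, axis="columns")

print(result)
print(reflected.equals(result))
```

Both calls produce a one-row frame with the values 11, 22 and 33, matching the `Pandas:` output in the report.
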
vnlitvinov commented 2 years ago

Modin no longer raises an exception here, but the output is still wrong:

>>> print(f"Pandas:\n{sr._to_pandas() + df._to_pandas()}")
Pandas:
    0   1   2
a  11  22  33
>>> print(f"Modin:\n{sr + df}")
Modin:
   __reduced__   0   1   2
0          NaN NaN NaN NaN
1          NaN NaN NaN NaN
2          NaN NaN NaN NaN
a          NaN NaN NaN NaN
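
The shape of this result looks like a frame-to-frame alignment rather than a broadcast. As a rough check in plain pandas, and assuming the series is treated as a one-column frame labelled `__reduced__` (an assumption based only on the column name in the output above), adding the two frames reproduces the same all-NaN pattern:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}).T
sr = pd.Series([10, 20, 30])

# Assumption: the series is handled as a one-column frame named "__reduced__".
sr_as_frame = sr.to_frame("__reduced__")

# Frame + frame aligns on both axes with an outer join. No row or column
# label is shared between the operands, so every cell ends up as NaN.
misaligned = sr_as_frame + df

print(misaligned.shape)
print(misaligned.isna().all().all())
```

If that assumption holds, the remaining bug is that `series + df` still goes through the `df + df` alignment path instead of the broadcasting one.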