modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

BUG: `groupby.ngroup()` fails with experimental (reshuffling) groupby implementation #6083

Open dchigarev opened 1 year ago

dchigarev commented 1 year ago

Modin version checks

Reproducible Example

import modin.pandas as pd
import modin.config as cfg

cfg.ExperimentalGroupbyImpl.put(True)

df = pd.concat(
    [
        pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, 4]}),
        pd.DataFrame({"a": [3, 3, 2, 2], "b": [5, 6, 7, 8]}),
    ]
)
assert df._query_compiler._modin_frame._partitions.shape == (2, 1)

print(df.groupby("a").ngroup())  # IndexError

Issue Description

`groupby.ngroup()` fails with an IndexError when the experimental (reshuffling) groupby implementation is enabled.

Expected Behavior

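`df.groupby("a").ngroup()` should return the group numbers, as it does in pandas. The failure happens because the kernel executing this groupby returns a pandas.Series. A Series has only one axis, so the later lookup of the column axis (`df.axes[axis]` with `axis=1`) raises the IndexError. We would likely need to wrap the result in a DataFrame.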
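The root cause can be reproduced in plain pandas, without Modin. This is only a minimal sketch: `kernel_result` is a hypothetical stand-in for what the groupby kernel returns for one partition, and `to_frame()` is one possible way to wrap it, not necessarily the fix Modin would adopt.

```python
import pandas as pd

# Hypothetical stand-in for a single partition handled by the groupby kernel.
part = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, 4]})

# ngroup() returns a Series, which has exactly one axis (the index).
kernel_result = part.groupby("a").ngroup()

# Same lookup as `index_func = lambda df: df.axes[axis]` in the traceback, with axis=1.
try:
    kernel_result.axes[1]
    raised = False
except IndexError:
    raised = True

# Assumed fix: wrap the Series into a single-column DataFrame so axis 1 exists.
wrapped = kernel_result.to_frame()
columns = wrapped.axes[1]

# The frontend's squeeze(axis=1) can then turn the result back into a Series.
squeezed = wrapped.squeeze(axis=1)
```

With the result wrapped, the column-label lookup succeeds and the `squeeze(axis=1)` call in `groupby.py` has a one-column frame to collapse.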

Error Logs

```python-traceback
Traceback (most recent call last):
  File "t3.py", line 14, in <module>
    print(df.groupby("a").ngroup())  # IndexError
  File "repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "repos/modin/modin/pandas/groupby.py", line 945, in ngroup
    result = result.squeeze(axis=1)
  File "repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "repos/modin/modin/pandas/dataframe.py", line 2139, in squeeze
    if axis == 1 and len(self.columns) == 1:
  File "repos/modin/modin/pandas/base.py", line 4036, in __getattribute__
    attr = super().__getattribute__(item)
  File "repos/modin/modin/pandas/dataframe.py", line 288, in _get_columns
    return self._query_compiler.columns
  File "repos/modin/modin/core/storage_formats/pandas/query_compiler.py", line 89, in <lambda>
    return lambda self: self._modin_frame.columns
  File "repos/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 502, in _get_columns
    columns, column_widths = self._compute_axis_labels_and_lengths(1)
  File "repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "repos/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 588, in _compute_axis_labels_and_lengths
    new_index, internal_idx = self._partition_mgr_cls.get_indices(axis, partitions)
  File "repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "repos/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 926, in get_indices
    new_idx = cls.get_objects_from_partitions(new_idx)
  File "repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "repos/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 867, in get_objects_from_partitions
    return cls._execution_wrapper.materialize(
  File "repos/modin/modin/core/execution/ray/common/engine_wrapper.py", line 92, in materialize
    return ray.get(obj_id)
  File "miniconda3/envs/modinc/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "miniconda3/envs/modinc/lib/python3.8/site-packages/ray/_private/worker.py", line 2380, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::_apply_func() (pid=2005977, ip=10.34.123.21)
  File "repos/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py", line 382, in _apply_func
    result = func(partition, *args, **kwargs)
  File "repos/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 920, in <lambda>
    index_func = lambda df: df.axes[axis]  # noqa: E731
IndexError: list index out of range
```

Installed Versions

Replace this line with the output of pd.show_versions()

dchigarev commented 8 months ago

P2 because demand for this operation is low; we can always fall back to the old implementation.
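The fallback mentioned above could look like the sketch below. It assumes that the `ExperimentalGroupbyImpl` flag from the reproducer is what selects the reshuffling implementation; the flag name may differ in other Modin versions. The sample data is the reproducer's data merged into one frame. The output is not shown because it has not been verified here.

```python
import modin.config as cfg
import modin.pandas as pd

# Disable the experimental (reshuffling) groupby so the old implementation is used.
cfg.ExperimentalGroupbyImpl.put(False)

df = pd.DataFrame({"a": [1, 1, 2, 2, 3, 3, 2, 2], "b": [1, 2, 3, 4, 5, 6, 7, 8]})

# With the flag off, ngroup() should go through the non-experimental code path.
print(df.groupby("a").ngroup())
```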