Open dchigarev opened 1 year ago
Still suffering from this one
#6307 fixed only some of the problematic cases.
Here are more that still trigger the same problem:
import pandas
import modin.pandas as pd
from modin.test.storage_formats.pandas.test_internals import (
    construct_modin_df_by_scheme,
)

md_df = construct_modin_df_by_scheme(
    pandas_df=pandas.DataFrame({"a": [1, 1, 2, 2], "b": [3, 4, 5, 6]}),
    partitioning_scheme={"row_lengths": [2, 2], "column_widths": [2]},
)
md_res = md_df.query("a > 1")
grp_obj = md_res.groupby(md_res["a"])
print(grp_obj.count())  # fails with an assertion error

import pandas
import modin.pandas as pd
from modin.test.storage_formats.pandas.test_internals import (
    construct_modin_df_by_scheme,
)

md_df = construct_modin_df_by_scheme(
    pandas_df=pandas.DataFrame({"a": [1, 1, 2, 2], "b": [3, 4, 5, 6]}),
    partitioning_scheme={"row_lengths": [2, 2], "column_widths": [2]},
)
by_df = construct_modin_df_by_scheme(
    pandas_df=pandas.DataFrame({"a": [1, 1, 2, 2, None, None]}),
    partitioning_scheme={"row_lengths": [2, 2, 2], "column_widths": [1]},
).squeeze()

grp_obj = md_df.groupby(by_df.dropna())
print(grp_obj.count())  # fails with an assertion error
Modin version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest released version of Modin.
[X] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
Issue Description
(see under the 'error logs' spoiler for the reproducer's output)
Lazy filtering of empty partitions appears to misbehave when partition broadcasting is required for such semi-filtered frames.
In the reproducer, we have a source frame md_res that has one empty partition (as a result of the .query() call) and its projection as a column to group on (grp_obj._by), which has no empty partitions because the empties were filtered out when the projection was made.
Since we don't verify partitioning when broadcasting projections of a frame to the frame itself (we assume they are partitioned identically), the broadcasting during groupby results in an error right here, because len(rt_axis_parts) == 1 (without the empty partition) and len(left) == 2 (includes the empty partition):
https://github.com/modin-project/modin/blob/810072cfaf2d7c1fad584b14855574db1b9066b7/modin/core/dataframe/pandas/partitioning/partition_manager.py#L372-L387
Expected Behavior
It's expected to work properly :)
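For anyone trying to picture the mismatch from the issue description, here is a toy model in plain Python. It does not use Modin at all: lists stand in for partitions, and broadcast_pairs is a hypothetical helper that only mimics the "both sides are partitioned identically" assumption. The names left and rt_axis_parts are borrowed from the description purely for readability.

```python
# Toy model only: plain lists stand in for Modin partitions.
rows = [
    {"a": 1, "b": 3},
    {"a": 1, "b": 4},
    {"a": 2, "b": 5},
    {"a": 2, "b": 6},
]

# Two row partitions of two rows each, like "row_lengths": [2, 2].
partitions = [rows[:2], rows[2:]]

# Apply the "a > 1" filter per partition: the first partition
# becomes empty but is still kept in the list.
left = [[r for r in part if r["a"] > 1] for part in partitions]

# Project column "a" and drop empty partitions along the way,
# so this side ends up with fewer partitions than `left`.
rt_axis_parts = [[r["a"] for r in part] for part in left if part]


def broadcast_pairs(left_parts, right_parts):
    """Hypothetical helper: pair partitions positionally,
    assuming both sides are partitioned identically."""
    assert len(left_parts) == len(right_parts), (
        len(left_parts),
        len(right_parts),
    )
    return list(zip(left_parts, right_parts))


print(len(left), len(rt_axis_parts))  # 2 1

try:
    broadcast_pairs(left, rt_axis_parts)
except AssertionError as err:
    print("mismatch:", err)

# Dropping the empties on the left side as well makes the pairing valid.
pairs = broadcast_pairs([p for p in left if p], rt_axis_parts)
print(len(pairs))  # 1
```

In this sketch the pairing only succeeds once empty partitions are dropped on both sides, which is the effect the workaround below relies on.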
BTW, this problem has a simple workaround found by @Egor-Krivov. Users can manually trigger filtering out of empty partitions by calling .iloc with an indexer bigger than the frame's length:
Error Logs
Installed Versions