intel / hdk

A low-level execution library for analytic data processing.
Apache License 2.0

COUNT aggregation fails if the number of groups > 44_739_242 #695

Closed AndreyPavlenko closed 1 year ago

AndreyPavlenko commented 1 year ago

The following code:

import modin.pandas as pd

df = pd.DataFrame({"a": range(44_739_242 + 1)})
print(df["a"].value_counts())

Fails with error:

[info]    0 0 RelAlgExecutor.cpp:699 Check failed: agg

However, it does not fail if the range size is <= 44_739_242.

ienkovich commented 1 year ago

This problem is caused by two consecutive sort nodes. I'll fix the bug, but you should check the tree generated by Modin. There is no point in doing two sorts in a row: the sort is not guaranteed to be stable, so only the last one matters.
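
To illustrate the stability point, here is a minimal sketch in plain Python. It does not use HDK or Modin. The rows, the helper names, and the way an unstable sort is simulated (shuffling before sorting) are all assumptions made for illustration only. With a stable sort, two passes behave like a single multi-key sort. Without a stability guarantee, only the ordering by the last key can be relied on.

```python
import random

# Hypothetical (group key, count) rows, similar in shape to a COUNT result.
rows = [("a", 2), ("b", 1), ("c", 2), ("d", 1), ("e", 3)]


def by_key(row):
    return row[0]


def by_count(row):
    return row[1]


def stable_sort(data, key):
    # Python's built-in sort is guaranteed to be stable.
    return sorted(data, key=key)


def unstable_sort(data, key, seed):
    # Simulated unstable sort (an assumption for this sketch): shuffling
    # first means rows with equal keys may come out in any relative order.
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    return sorted(shuffled, key=key)


# Stable case: sorting by key and then by count is equivalent to a
# single sort on (count, key), so the first sort still has an effect.
two_stable = stable_sort(stable_sort(rows, by_key), by_count)
single_multi_key = sorted(rows, key=lambda r: (r[1], r[0]))

# Unstable case: the order produced by the first sort may be lost among
# rows with equal counts, so only the last sort's ordering is guaranteed.
two_unstable = [
    unstable_sort(unstable_sort(rows, by_key, seed), by_count, seed + 1)
    for seed in range(20)
]

print(two_stable)
print(two_unstable[0])
```

In the unstable case every result is ordered by count, but the relative order of rows with equal counts is arbitrary, which is why the first of two back-to-back sort nodes is redundant.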