Open Egor-Krivov opened 1 year ago
@Egor-Krivov, I will take a look at the root cause. Meanwhile, as a workaround you can call squeeze(axis=1) on the results to get Modin Series.
I do not see a good way to determine the type of the resulting object for this specific case other than adding some logic to groupby.apply like we have in Series.apply. https://github.com/modin-project/modin/blob/8d3db2b4a7fa79716796b33f7cb3673ed729b652/modin/pandas/series.py#L661-L668 That would make groupby.apply slower, since we would have to materialize some data in the main process. @modin-project/modin-core, if you have thoughts on how we can determine the resulting object type in a better way, please let me know.
I am now hitting the same problem on a different benchmark, so both optiver volatility and HM.full depend on this.
@Egor-Krivov, does squeeze(axis=1) as a workaround not suit you?
It is a fix, but:
Do you think doing squeeze(axis=1) after every apply would work? That could solve both problems for me.
You could do something like this:
if IMPL == "pandas":
    import pandas as pd
elif IMPL == "modin":
    import modin.pandas as pd
...
res = df.groupby(['id']).apply(udf1)
if isinstance(res, pd.DataFrame):
    res = res.squeeze(axis=1)
This way you could bypass both problems. squeeze(axis=1) always squeezes a single-column DataFrame to a Series.
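The workaround above can be wrapped in a small helper so it is applied the same way after every apply. This is only a minimal sketch shown with plain pandas; the helper name normalize_apply_result and the sample data are hypothetical, and the explicit column-count check is optional, since squeeze(axis=1) leaves a multi-column DataFrame unchanged anyway.

```python
import pandas as pd


def normalize_apply_result(res):
    """Return a Series when `res` is a single-column DataFrame, else `res` unchanged.

    Hypothetical helper name; it only wraps the squeeze(axis=1) workaround.
    """
    if isinstance(res, pd.DataFrame) and res.shape[1] == 1:
        return res.squeeze(axis=1)
    return res


# Illustrative data, not taken from the benchmarks mentioned in this issue.
df = pd.DataFrame({"id": [1, 1, 2], "x": [1.0, 2.0, 3.0]})

single = df.groupby("id")[["x"]].sum()                           # one-column DataFrame
wide = df.assign(y=0.0).groupby("id")[["x", "y"]].sum()          # two-column DataFrame

as_series = normalize_apply_result(single)   # becomes a Series
unchanged = normalize_apply_result(wide)     # stays a DataFrame
```

With modin.pandas imported as pd instead, the same helper should apply, assuming Modin's DataFrame.squeeze follows the pandas behavior.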
@YarShev I guess in some cases we can get the type using artificial data. For example, the code below would be defined in the _wrap_aggregation function. It is only a draft, but it helps in the cases indicated in this issue.
_type = None

def try_compute_result_type():
    from .dataframe import DataFrame
    from .series import Series

    if type(self._df) is Series:
        return None
    # Two rows of synthetic integers, one value per column of the real frame.
    synthetic_data = [
        [x for x in range(len(self._columns))],
        [x + 1 for x in range(len(self._columns))],
    ]
    test = pandas.DataFrame(synthetic_data, columns=self._columns)
    _type = None
    try:
        _by = self._internal_by
        if len(self._internal_by) == 1:
            _by = self._internal_by[0]
        _type = type(test.groupby(_by).apply(kwargs["agg_func"]))
    except Exception:
        # The UDF may fail on synthetic data; fall back to the default type.
        pass
    # Map the pandas result type to the matching Modin type.
    if _type is pandas.Series:
        _type = Series
    elif _type is pandas.DataFrame:
        _type = DataFrame
    return _type

if "agg_func" in kwargs and not self._squeeze:
    _type = try_compute_result_type()
result = (_type or type(self._df))(
    query_compiler=qc_method(
        groupby_qc,
        by=self._by,
        axis=self._axis,
        groupby_kwargs=self._kwargs,
        agg_args=agg_args,
        agg_kwargs=agg_kwargs,
        drop=self._drop,
        **kwargs,
    )
)
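To try the idea outside of Modin internals, the probing step can be sketched as a standalone function on plain pandas. This is not the draft above verbatim: the function name probe_apply_result_type and its parameters are hypothetical, and the sketch assumes the UDF accepts small integer data. A UDF whose return type depends on the actual values or dtypes could still be misclassified.

```python
import warnings

import pandas as pd


def probe_apply_result_type(columns, by, func):
    """Guess the type returned by groupby(by).apply(func) using synthetic data.

    Hypothetical helper: returns pd.Series, pd.DataFrame (or another type),
    or None when the UDF fails on the synthetic frame.
    """
    n = len(columns)
    # Two rows of small integers, one value per column, as in the draft.
    synthetic = pd.DataFrame(
        [list(range(n)), [x + 1 for x in range(n)]],
        columns=list(columns),
    )
    try:
        with warnings.catch_warnings():
            # Probing should stay quiet even if pandas emits deprecation warnings.
            warnings.simplefilter("ignore")
            return type(synthetic.groupby(by).apply(func))
    except Exception:
        # The UDF could not run on synthetic data; the caller must fall back.
        return None


# A UDF returning a scalar per group yields a Series in pandas.
scalar_type = probe_apply_result_type(["id", "a", "b"], "id", lambda g: g["a"].sum())
# A UDF returning a DataFrame per group yields a DataFrame.
frame_type = probe_apply_result_type(["id", "a", "b"], "id", lambda g: g[["a", "b"]] * 2)
# A UDF that fails on the synthetic frame yields None.
failed = probe_apply_result_type(["id", "a"], "id", lambda g: g["missing"].sum())
```

Returning None on failure keeps the fallback explicit: the caller can then use the default result type, as the draft does with `_type or type(self._df)`.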
Modin version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest released version of Modin.
[X] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
Issue Description
After apply I get pandas.core.series.Series from pandas and modin.pandas.dataframe.DataFrame from Modin.
Expected Behavior
Consistent behavior with pandas.
Error Logs
Installed Versions