BUG: after apply modin returns DataFrame, while pandas returns Series

Egor-Krivov commented 1 year ago

Modin version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest released version of Modin.
[X] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import pandas as pd
import modin.pandas as modin_pd

data = {'id': [1, 1, 2, 2], 'vals': [1, 2, 3, 4]}

def udf1(df):
    return df['vals'].sum()

df = pd.DataFrame(data)
modin_df = modin_pd.DataFrame(data)

print('Pandas return type:', type(df.groupby(['id']).apply(udf1)))
print('Modin return type:', type(modin_df.groupby(['id']).apply(udf1)))

def udf2(df):
    return df['vals'].diff()

print('Pandas return type:', type(df.groupby(['id']).apply(udf2)))
print('Modin return type:', type(modin_df.groupby(['id']).apply(udf2)))

Issue Description

After apply I get pandas.core.series.Series from Pandas and modin.pandas.dataframe.DataFrame from modin.

Expected Behavior

Consistent behavior with pandas

Error Logs

``` Pandas return type: Modin return type: Pandas return type: Modin return type: ```

Installed Versions

INSTALLED VERSIONS ------------------ commit : 2e218d10984e9919f0296931d92ea851c6a6faf5 python : 3.10.9.final.0 python-bits : 64 OS : Linux OS-release : 5.4.0-136-generic Version : #153-Ubuntu SMP Thu Nov 24 15:56:58 UTC 2022 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 1.5.3 numpy : 1.24.2 pytz : 2022.7.1 dateutil : 2.8.2 setuptools : 65.6.3 pip : 22.3.1 Cython : None pytest : None hypothesis : None sphinx : 6.1.3 blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.10.0 pandas_datareader: None bs4 : 4.11.2 bottleneck : None brotli : None fastparquet : None fsspec : 2023.1.0 gcsfs : None matplotlib : 3.7.0 numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 11.0.0 pyreadstat : None pyxlsb : None s3fs : None scipy : 1.10.0 snappy : None sqlalchemy : 1.4.0 tables : None tabulate : None xarray : None xlrd : None xlwt : None zstandard : None tzdata : None

YarShev commented 1 year ago

@Egor-Krivov, I will take a look at the root cause. Meanwhile, as a workaround you can call squeeze(axis=1) on the results to get Modin Series.

YarShev commented 1 year ago

I do not see a good solution to determine the resultant object for such a specific case other than to add some logic to groupby.apply like we have in Series.apply. https://github.com/modin-project/modin/blob/8d3db2b4a7fa79716796b33f7cb3673ed729b652/modin/pandas/series.py#L661-L668 That will get groupby.apply slower as we will have to materialize some data in the main process. @modin-project/modin-core, if you have some thoughts on how we can determine the resultant object in a better way, please let me know.

Egor-Krivov commented 1 year ago

Now suffering from the same problem on different benchmark. So now both optiver volatility and HM.full depend on this

YarShev commented 1 year ago

@Egor-Krivov, does squeeze(axis=1) as a workaround not suit you?

Egor-Krivov commented 1 year ago

It is a fix, but:

I can't have the same code for pandas and modin, I have to have branching
When I am working with new code I have to debug this part manually

Do you think doing squeeze(axis=1) after every apply would work? That could simplify 2 problem for me

YarShev commented 1 year ago

You could do something like that.

if IMPL == "pandas":
    import pandas as pd
elif IMPL == "modin":
    import modin.pandas as pd
...
res = df.groupby(['id']).apply(udf1)
if isinstance(res, pd.DataFrame):
    res = res.squeeze(axis=1)

This way you could bypass both problems. squeeze(axis=1) always squeezes a single column DataFrame to a Series.

anmyachev commented 1 year ago

I do not see a good solution to determine the resultant object for such a specific case other than to add some logic to groupby.apply like we have in Series.apply.

https://github.com/modin-project/modin/blob/8d3db2b4a7fa79716796b33f7cb3673ed729b652/modin/pandas/series.py#L661-L668

That will get groupby.apply slower as we will have to materialize some data in the main process. @modin-project/modin-core, if you have some thoughts on how we can determine the resultant object in a better way, please let me know.

@YarShev I guess in some cases we can get the type using artificial data. For example the code below should be defined in _wrap_aggregation function. It is drafted, but for example, it helps in the cases indicated in this issue.

        _type = None

        def try_compute_result_type():
            from .dataframe import DataFrame
            from .series import Series

            if type(self._df) is Series:
                return None

            synthetic_data = [
                [x for x in range(len(self._columns))],
                [x + 1 for x in range(len(self._columns))],
            ]
            test = pandas.DataFrame(synthetic_data, columns=self._columns)
            _type = None
            try:
                _by = self._internal_by
                if len(self._internal_by) == 1:
                    _by = self._internal_by[0]
                _type = type(test.groupby(_by).apply(kwargs["agg_func"]))
            except:
                pass
            if _type is pandas.Series:
                _type = Series
            elif _type is pandas.DataFrame:
                _type = DataFrame
            return _type

        if "agg_func" in kwargs and not self._squeeze:
            _type = try_compute_result_type()

        result = (_type or type(self._df))(
            query_compiler=qc_method(
                groupby_qc,
                by=self._by,
                axis=self._axis,
                groupby_kwargs=self._kwargs,
                agg_args=agg_args,
                agg_kwargs=agg_kwargs,
                drop=self._drop,
                **kwargs,
            )
        )

modin-project / modin