modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.88k stars 651 forks source link

BUG: after apply modin returns DataFrame, while pandas returns Series #5763

Open Egor-Krivov opened 1 year ago

Egor-Krivov commented 1 year ago

Modin version checks

Reproducible Example

import pandas as pd
import modin.pandas as modin_pd

data = {'id': [1, 1, 2, 2], 'vals': [1, 2, 3, 4]}

def udf1(df):
    return df['vals'].sum()

df = pd.DataFrame(data)
modin_df = modin_pd.DataFrame(data)

print('Pandas return type:', type(df.groupby(['id']).apply(udf1)))
print('Modin return type:', type(modin_df.groupby(['id']).apply(udf1)))

def udf2(df):
    return df['vals'].diff()

print('Pandas return type:', type(df.groupby(['id']).apply(udf2)))
print('Modin return type:', type(modin_df.groupby(['id']).apply(udf2)))

Issue Description

After apply I get pandas.core.series.Series from Pandas and modin.pandas.dataframe.DataFrame from modin.

Expected Behavior

Consistent behavior with pandas

Error Logs

``` Pandas return type: Modin return type: Pandas return type: Modin return type: ```

Installed Versions

INSTALLED VERSIONS ------------------ commit : 2e218d10984e9919f0296931d92ea851c6a6faf5 python : 3.10.9.final.0 python-bits : 64 OS : Linux OS-release : 5.4.0-136-generic Version : #153-Ubuntu SMP Thu Nov 24 15:56:58 UTC 2022 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 1.5.3 numpy : 1.24.2 pytz : 2022.7.1 dateutil : 2.8.2 setuptools : 65.6.3 pip : 22.3.1 Cython : None pytest : None hypothesis : None sphinx : 6.1.3 blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.10.0 pandas_datareader: None bs4 : 4.11.2 bottleneck : None brotli : None fastparquet : None fsspec : 2023.1.0 gcsfs : None matplotlib : 3.7.0 numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 11.0.0 pyreadstat : None pyxlsb : None s3fs : None scipy : 1.10.0 snappy : None sqlalchemy : 1.4.0 tables : None tabulate : None xarray : None xlrd : None xlwt : None zstandard : None tzdata : None
YarShev commented 1 year ago

@Egor-Krivov, I will take a look at the root cause. Meanwhile, as a workaround you can call squeeze(axis=1) on the results to get Modin Series.

YarShev commented 1 year ago

I do not see a good solution to determine the resultant object for such a specific case other than to add some logic to groupby.apply like we have in Series.apply. https://github.com/modin-project/modin/blob/8d3db2b4a7fa79716796b33f7cb3673ed729b652/modin/pandas/series.py#L661-L668 That will get groupby.apply slower as we will have to materialize some data in the main process. @modin-project/modin-core, if you have some thoughts on how we can determine the resultant object in a better way, please let me know.

Egor-Krivov commented 1 year ago

Now suffering from the same problem on different benchmark. So now both optiver volatility and HM.full depend on this

YarShev commented 1 year ago

@Egor-Krivov, does squeeze(axis=1) as a workaround not suit you?

Egor-Krivov commented 1 year ago

It is a fix, but:

  1. I can't have the same code for pandas and modin, I have to have branching
  2. When I am working with new code I have to debug this part manually

Do you think doing squeeze(axis=1) after every apply would work? That could simplify 2 problem for me

YarShev commented 1 year ago

You could do something like that.

if IMPL == "pandas":
    import pandas as pd
elif IMPL == "modin":
    import modin.pandas as pd
...
res = df.groupby(['id']).apply(udf1)
if isinstance(res, pd.DataFrame):
    res = res.squeeze(axis=1)

This way you could bypass both problems. squeeze(axis=1) always squeezes a single column DataFrame to a Series.

anmyachev commented 1 year ago

I do not see a good solution to determine the resultant object for such a specific case other than to add some logic to groupby.apply like we have in Series.apply.

https://github.com/modin-project/modin/blob/8d3db2b4a7fa79716796b33f7cb3673ed729b652/modin/pandas/series.py#L661-L668

That will get groupby.apply slower as we will have to materialize some data in the main process. @modin-project/modin-core, if you have some thoughts on how we can determine the resultant object in a better way, please let me know.

@YarShev I guess in some cases we can get the type using artificial data. For example the code below should be defined in _wrap_aggregation function. It is drafted, but for example, it helps in the cases indicated in this issue.

        _type = None

        def try_compute_result_type():
            from .dataframe import DataFrame
            from .series import Series

            if type(self._df) is Series:
                return None

            synthetic_data = [
                [x for x in range(len(self._columns))],
                [x + 1 for x in range(len(self._columns))],
            ]
            test = pandas.DataFrame(synthetic_data, columns=self._columns)
            _type = None
            try:
                _by = self._internal_by
                if len(self._internal_by) == 1:
                    _by = self._internal_by[0]
                _type = type(test.groupby(_by).apply(kwargs["agg_func"]))
            except:
                pass
            if _type is pandas.Series:
                _type = Series
            elif _type is pandas.DataFrame:
                _type = DataFrame
            return _type

        if "agg_func" in kwargs and not self._squeeze:
            _type = try_compute_result_type()

        result = (_type or type(self._df))(
            query_compiler=qc_method(
                groupby_qc,
                by=self._by,
                axis=self._axis,
                groupby_kwargs=self._kwargs,
                agg_args=agg_args,
                agg_kwargs=agg_kwargs,
                drop=self._drop,
                **kwargs,
            )
        )