modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

rolling and window functions throw Internal Error instead of dropping "nuisance" (non-numeric) columns #4135

Open c3-cjazra opened 2 years ago

c3-cjazra commented 2 years ago

System information

Ray 1.9.2, Modin 0.12.0, Python 3.9.7

Describe the problem

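import modin.pandas as modin_pd

# Mixed dtypes: int, float, bool, and a non-numeric string column 'd'
modin_df = modin_pd.DataFrame([
    {'a': 1, 'b': 2., 'c': True, 'd': 'a'},
    {'a': 5, 'b': 10., 'c': False, 'd': 'b'},
    {'a': 10, 'b': 50., 'c': True, 'd': 'c'}
])

modin_df.rolling(window=2, min_periods=1).min()

throws the following error when the result is displayed: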

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
/python3.9/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

/python3.9/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    392                         if cls is not object \
    393                                 and callable(cls.__dict__.get('__repr__')):
--> 394                             return _repr_pprint(obj, self, cycle)
    395 
    396             return _default_pprint(obj, self, cycle)

/python3.9/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    698     """A pprint that just redirects to the normal repr function."""
    699     # Find newlines and replace them with p.break_()
--> 700     output = repr(obj)
    701     lines = output.splitlines()
    702     with p.group():

/python3.9/site-packages/modin/pandas/dataframe.py in __repr__(self)
    208 
    209             num_cols += len(self.columns) - i
--> 210         result = repr(self._build_repr_df(num_rows, num_cols))
    211         if len(self.index) > num_rows or len(self.columns) > num_cols:
    212             # The split here is so that we don't repr pandas row lengths.

/python3.9/site-packages/modin/pandas/base.py in _build_repr_df(self, num_rows, num_cols)
    195         else:
    196             indexer = row_indexer
--> 197         return self.iloc[indexer]._query_compiler.to_pandas()
    198 
    199     def _update_inplace(self, new_query_compiler):

/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py in to_pandas(self)
    253 
    254     def to_pandas(self):
--> 255         return self._modin_frame.to_pandas()
    256 
    257     @classmethod

/python3.9/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py in to_pandas(self)
   2257         else:
   2258             for axis in [0, 1]:
-> 2259                 ErrorMessage.catch_bugs_and_request_email(
   2260                     not df.axes[axis].equals(self.axes[axis]),
   2261                     f"Internal and external indices on axis {axis} do not match.",

/python3.9/site-packages/modin/error_message.py in catch_bugs_and_request_email(cls, failure_condition, extra_log)
     58     def catch_bugs_and_request_email(cls, failure_condition, extra_log=""):
     59         if failure_condition:
---> 60             raise Exception(
     61                 "Internal Error. "
     62                 "Please email bug_reports@modin.org with the traceback and command that"

Exception: Internal Error. Please email bug_reports@modin.org with the traceback and command that caused this error.
Internal and external indices on axis 1 do not match.

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
/python3.9/site-packages/IPython/core/formatters.py in __call__(self, obj)
    343             method = get_real_method(obj, self.print_method)
    344             if method is not None:
--> 345                 return method()
    346             return None
    347         else:

/python3.9/site-packages/modin/pandas/dataframe.py in _repr_html_(self)
    230         # We use pandas _repr_html_ to get a string of the HTML representation
    231         # of the dataframe.
--> 232         result = self._build_repr_df(num_rows, num_cols)._repr_html_()
    233         if len(self.index) > num_rows or len(self.columns) > num_cols:
    234             # We split so that we insert our correct dataframe dimensions.

/python3.9/site-packages/modin/pandas/base.py in _build_repr_df(self, num_rows, num_cols)
    195         else:
    196             indexer = row_indexer
--> 197         return self.iloc[indexer]._query_compiler.to_pandas()
    198 
    199     def _update_inplace(self, new_query_compiler):

/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py in to_pandas(self)
    253 
    254     def to_pandas(self):
--> 255         return self._modin_frame.to_pandas()
    256 
    257     @classmethod

/python3.9/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py in to_pandas(self)
   2257         else:
   2258             for axis in [0, 1]:
-> 2259                 ErrorMessage.catch_bugs_and_request_email(
   2260                     not df.axes[axis].equals(self.axes[axis]),
   2261                     f"Internal and external indices on axis {axis} do not match.",

/python3.9/site-packages/modin/error_message.py in catch_bugs_and_request_email(cls, failure_condition, extra_log)
     58     def catch_bugs_and_request_email(cls, failure_condition, extra_log=""):
     59         if failure_condition:
---> 60             raise Exception(
     61                 "Internal Error. "
     62                 "Please email bug_reports@modin.org with the traceback and command that"

Exception: Internal Error. Please email bug_reports@modin.org with the traceback and command that caused this error.
Internal and external indices on axis 1 do not match.

Works on pandas:

import pandas as pandas_pd

pandas_df = pandas_pd.DataFrame([
    {'a': 1, 'b': 2., 'c': True, 'd': 'a'},
    {'a': 5, 'b': 10., 'c': False, 'd': 'b'},
    {'a': 10, 'b': 50., 'c': True, 'd': 'c'}
])
display(pandas_df)

    a     b      c  d
0   1   2.0   True  a
1   5  10.0  False  b
2  10  50.0   True  c

pandas_df.rolling(window=2, min_periods=1).min()

----
     a     b    c
0  1.0   2.0  1.0
1  1.0   2.0  0.0
2  5.0  10.0  0.0

mvashishtha commented 2 years ago

@c3-cjazra Thanks for reporting this. I can reproduce it on Modin version 0.13.0+7.g32cff0d8. I believe that the bug is in calculating any rolling result on a dataframe where one of the columns can't be aggregated with the rolling operation because it's the wrong type. Here's a copy-pastable example of Modin failing:

import modin.pandas as pd

df = pd.DataFrame({"a": [1], "b": ["x"]})
df.rolling(window=1).min()

whereas pandas lets you do that and gives the result for just the numerical column:

     a
0  1.0
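
A possible workaround, offered only as a sketch: select the columns that can be aggregated before calling rolling, so the result never has to drop a column implicitly. The example below uses plain pandas so it runs without Modin. Whether the same call through modin.pandas avoids the Internal Error is an assumption that has not been verified in this thread. The choice of "number" and "bool" as the dtypes to keep is also an assumption, picked to match the pandas output shown above (columns a, b and c).

```python
import pandas as pd  # assumption: plain pandas; swap in modin.pandas to try it on Modin

df = pd.DataFrame(
    {
        "a": [1, 5, 10],
        "b": [2.0, 10.0, 50.0],
        "c": [True, False, True],
        "d": ["a", "b", "c"],  # non-numeric "nuisance" column
    }
)

# Keep only the columns a rolling min can aggregate (numeric and boolean),
# dropping the string column explicitly instead of relying on the library.
numeric_df = df.select_dtypes(include=["number", "bool"])

result = numeric_df.rolling(window=2, min_periods=1).min()
print(result)
```

Selecting the columns explicitly also keeps the behavior independent of how a given pandas version treats non-numeric columns in rolling operations.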