modin-project / modin

BUG: series compare() for equal series produces frame with internal/external columns mismatch #5697

Open mvashishtha opened 1 year ago

mvashishtha commented 1 year ago

Reproducible Example

import modin.pandas as pd

s = pd.Series(1)
print(s.compare(s))

Issue Description

Modin raises ValueError: Length mismatch: Expected axis has 0 elements, new values have 2 elements.

Expected Behavior

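pandas does not raise here. With the default align_axis it returns an empty result (an empty DataFrame with columns self and other), and Modin should match that.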
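For reference, a minimal sketch of the pandas side of the comparison. It uses plain pandas only and assumes a version that provides `Series.compare` (1.1 or later; the report above used 1.5.3).

```python
import pandas as pd

s = pd.Series(1)

# Comparing a Series with itself finds no differences, so pandas returns
# an empty result instead of raising.
result = s.compare(s)

print(result)
print(list(result.columns))
```

With the default `align_axis=1` the result is a DataFrame with no rows, whose columns are still labeled `self` and `other`.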

Error Logs

```python-traceback
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [3], in ()
----> 1 s.compare(s)

File ~/software_sources/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/software_sources/modin/modin/pandas/series.py:779, in Series.compare(self, other, align_axis, keep_shape, keep_equal, result_names)
    769 result = self.to_frame().compare(
    770     other.to_frame(),
    771     align_axis=align_axis,
   (...)
    774     result_names=result_names,
    775 )
    776 if align_axis == "columns" or align_axis == 1:
    777     # Pandas.DataFrame.Compare returns a dataframe with a multidimensional index object as the
    778     # columns so we have to change column object back.
--> 779     result.columns = pandas.Index(["self", "other"])
    780 else:
    781     result = result.squeeze().rename(None)

File ~/software_sources/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/software_sources/modin/modin/pandas/dataframe.py:2670, in DataFrame.__setattr__(self, key, value)
   2665 elif is_list_like(value) and key not in ["index", "columns"]:
   2666     warnings.warn(
   2667         SET_DATAFRAME_ATTRIBUTE_WARNING,
   2668         UserWarning,
   2669     )
-> 2670 object.__setattr__(self, key, value)

File ~/software_sources/modin/modin/pandas/dataframe.py:286, in DataFrame._set_columns(self, new_columns)
    277 def _set_columns(self, new_columns):
    278     """
    279     Set the columns for this ``DataFrame``.
    280
   (...)
    284         The new index to set.
    285     """
--> 286     self._query_compiler.columns = new_columns

File ~/software_sources/modin/modin/core/storage_formats/pandas/query_compiler.py:106, in _set_axis.<locals>.set_axis(self, cols)
    105 def set_axis(self, cols):
--> 106     self._modin_frame.columns = cols

File ~/software_sources/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:417, in PandasDataframe._set_columns(self, new_columns)
    415     self._columns_cache = ensure_index(new_columns)
    416 else:
--> 417     new_columns = self._validate_set_axis(new_columns, self._columns_cache)
    418 self._columns_cache = new_columns
    419 if self._dtypes is not None:

File ~/software_sources/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/software_sources/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:351, in PandasDataframe._validate_set_axis(self, new_labels, old_labels)
    349 new_len = len(new_labels)
    350 if old_len != new_len:
--> 351     raise ValueError(
    352         f"Length mismatch: Expected axis has {old_len} elements, "
    353         + f"new values have {new_len} elements"
    354     )
    355 return new_labels

ValueError: Length mismatch: Expected axis has 0 elements, new values have 2 elements
```
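The failing step in the traceback (frame-based compare, then relabeling the columns) can be sketched in plain pandas. This is an illustration under the assumption that pandas behaves the same way on this input, not a reproduction of the Modin code path: when the two frames are equal, `DataFrame.compare` keeps no columns, so there is nothing to relabel as `self` and `other`.

```python
import pandas as pd

s = pd.Series(1)

# Same approach as the series.py frame in the traceback: convert both
# Series to one-column frames and compare those.
frame_result = s.to_frame().compare(s.to_frame())
print(frame_result.shape)

# Relabeling the columns needs exactly as many labels as existing columns,
# so assigning two labels to a frame without columns is rejected.
relabel_error = None
try:
    frame_result.columns = pd.Index(["self", "other"])
except ValueError as exc:
    relabel_error = exc

print(relabel_error)
```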

Installed Versions

INSTALLED VERSIONS
------------------
commit : f4ed1c803c1d0139c5f731f6d723252025645134
python : 3.10.6.final.0
python-bits : 64
OS : Darwin
OS-release : 21.5.0
Version : Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

Modin dependencies
------------------
modin : 0.18.0+92.gf4ed1c803
ray : 2.1.0
dask : 2022.7.1
distributed : 2022.7.1
hdk : None

pandas dependencies
-------------------
pandas : 1.5.3
numpy : 1.23.2
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 63.4.1
pip : 22.3.1
Cython : None
pytest : 7.1.2
hypothesis : None
sphinx : 4.5.0
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : 1.0.9
fastparquet : 0.8.1
fsspec : 2022.7.1
gcsfs : None
matplotlib : 3.5.2
numba : None
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.7
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2022.7.1
scipy : 1.9.0
snappy : None
sqlalchemy : 1.4.39
tables : 3.7.0
tabulate : None
xarray : 2022.6.0
xlrd : 2.0.1
xlwt : None
zstandard : None
tzdata : None

sfc-gh-mvashishtha commented 2 months ago

Update: as of Modin commit 2eff03cd7db298ed283c6ff124dd6fbc7515f66c, the behavior has changed. compare() no longer raises immediately, but the result's internal and external columns disagree, so downstream operations fail:

import modin.pandas as pd

s = pd.Series(1)
result = s.compare(s)
# raises IndexError: positional indexers are out-of-bounds
print(result['self'])

The root cause is that the linked line in modin/pandas/series.py does not handle the case where result is empty: https://github.com/modin-project/modin/blob/2eff03cd7db298ed283c6ff124dd6fbc7515f66c/modin/pandas/series.py#L828
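
One possible way to handle the empty case, sketched in plain pandas rather than Modin. The helper name `compare_series_via_frames` is hypothetical and this is not the actual Modin fix; it only illustrates guarding the relabeling step when the frame comparison returns no columns, for the default `align_axis=1` case.

```python
import pandas as pd


def compare_series_via_frames(left, right):
    """Hypothetical helper, not Modin code: compare two Series through frames."""
    result = left.to_frame().compare(right.to_frame())
    labels = pd.Index(["self", "other"])
    if result.shape[1] == 0:
        # No differences: the frame comparison dropped every column, so build
        # an empty frame with the expected labels instead of relabeling.
        return pd.DataFrame(columns=labels, index=result.index)
    # Differences exist: replace the two-level column index with flat labels.
    result.columns = labels
    return result


same = compare_series_via_frames(pd.Series(1), pd.Series(1))
diff = compare_series_via_frames(pd.Series([1, 2, 3]), pd.Series([1, 5, 3]))

print(same)
print(diff)
```

The guard keeps the external labels (`self`, `other`) consistent with the data actually stored, which is the mismatch described above. Whether an equivalent check fits Modin's internals is an open question for the maintainers.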