Modin version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest released version of Modin.
[X] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
```python
import modin.pandas as pd

abbreviations = pd.Series(['Major League Baseball', 'National Basketball Association'], index=['MLB', 'NBA'])
teams = pd.DataFrame({'name': ['Mariners', 'Lakers'] * 500, 'league_abbreviation': ['MLB', 'NBA'] * 500})

# This all works correctly (or seems to -- the sort_values breaks things below the surface)
joined = teams.set_index('league_abbreviation').join(abbreviations.rename('league_name'))
print(joined)
sort_values_result = joined.sort_values('league_name')
print(sort_values_result)

# This breaks!
print(joined)
```
Issue Description
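Calling `sort_values` has a destructive effect on the dataframe it is called on: after the call, printing `joined` raises an internal error. Some minimal debugging shows that the underlying partitions have lost the sorted column (`league_name`), while the external column index still lists it: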
```
> ~/src/modin/modin/core/dataframe/pandas/dataframe/dataframe.py(4028)to_pandas()
-> ErrorMessage.catch_bugs_and_request_email(
(Pdb) ll
4006        @lazy_metadata_decorator(apply_axis="both")
4007        def to_pandas(self):
4008            """
4009            Convert this Modin DataFrame to a pandas DataFrame.
4010
4011            Returns
4012            -------
4013            pandas.DataFrame
4014            """
4015            df = self._partition_mgr_cls.to_pandas(self._partitions)
4016            if df.empty:
4017                df = pandas.DataFrame(columns=self.columns, index=self.index)
4018                if len(df.columns) and self.has_materialized_dtypes:
4019                    df = df.astype(self.dtypes)
4020            else:
4021                for axis, has_external_index in enumerate(
4022                    ["has_materialized_index", "has_materialized_columns"]
4023                ):
4024                    # no need to check external and internal axes since in that case
4025                    # external axes will be computed from internal partitions
4026                    if getattr(self, has_external_index):
4027                        external_index = self.columns if axis else self.index
4028 ->                     ErrorMessage.catch_bugs_and_request_email(
4029                            not df.axes[axis].equals(external_index),
4030                            f"Internal and external indices on axis {axis} do not match.",
4031                        )
4032                        # have to do this in order to assign some potentially missing metadata,
4033                        # the ones that were set to the external index but were never propagated
4034                        # into the internal ones
4035                        df = df.set_axis(axis=axis, labels=external_index, copy=False)
4036
4037            return df
(Pdb) df.axes[axis]
Index(['name'], dtype='object')
(Pdb) external_index
Index(['name', 'league_name'], dtype='object')
```
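To make the mismatch explicit, the two values from the pdb session can be compared label by label. This is only an illustrative sketch with plain Python lists; `missing_labels` is a hypothetical helper and not part of Modin:

```python
def missing_labels(internal, external):
    """Return labels listed in the external index but absent from the partitions."""
    internal_set = set(internal)
    return [label for label in external if label not in internal_set]


# Values copied from the pdb session for axis 1 (columns).
internal_columns = ["name"]                 # df.axes[axis]
external_columns = ["name", "league_name"]  # external_index

# The only label the partitions no longer carry is the sorted column.
lost = missing_labels(internal_columns, external_columns)
print(lost)
```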
Note that simply changing `joined.sort_values` to `joined.copy().sort_values` fixes the problem in the example above.
I am guessing this does not actually have to do with joining, but is probably a result of these columns being in different partitions?
Expected Behavior
The `joined` dataframe is unaffected by the `sort_values` call.
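As a rough sketch of that expectation, the check below runs the same steps with plain pandas as the reference implementation and asserts that the printed form of `joined` is identical before and after sorting. Using pandas here is an assumption made so the snippet runs standalone; swapping the import for `modin.pandas` should exercise the failing path described in this report:

```python
import pandas as pd  # reference behavior; swap in `modin.pandas` to exercise the bug

abbreviations = pd.Series(
    ["Major League Baseball", "National Basketball Association"],
    index=["MLB", "NBA"],
)
teams = pd.DataFrame(
    {
        "name": ["Mariners", "Lakers"] * 500,
        "league_abbreviation": ["MLB", "NBA"] * 500,
    }
)
joined = teams.set_index("league_abbreviation").join(
    abbreviations.rename("league_name")
)

# Capture the printed form before sorting (this is what print(joined) shows).
before = str(joined)

sort_values_result = joined.sort_values("league_name")

# Printing the original frame again must neither raise nor change.
after = str(joined)
assert before == after
```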
Error Logs
```python-traceback
Traceback (most recent call last):
  File "~/mambaforge/envs/modin/lib/python3.10/pdb.py", line 1723, in main
    pdb._runscript(mainpyfile)
  File "~/mambaforge/envs/modin/lib/python3.10/pdb.py", line 1583, in _runscript
    self.run(statement)
  File "~/mambaforge/envs/modin/lib/python3.10/bdb.py", line 598, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "~/src/modin/test_sort_values_join.py", line 18, in <module>
    print(joined)
  File "~/mambaforge/envs/modin/lib/python3.10/site-packages/ray/experimental/tqdm_ray.py", line 48, in safe_print
    _print(*args, **kwargs)
  File "~/src/modin/modin/logging/logger_decorator.py", line 129, in run_and_log
    return obj(*args, **kwargs)
  File "~/src/modin/modin/pandas/base.py", line 3997, in __str__
    return repr(self)
  File "~/src/modin/modin/logging/logger_decorator.py", line 129, in run_and_log
    return obj(*args, **kwargs)
  File "~/src/modin/modin/pandas/dataframe.py", line 246, in __repr__
    result = repr(self._build_repr_df(num_rows, num_cols))
  File "~/src/modin/modin/logging/logger_decorator.py", line 129, in run_and_log
    return obj(*args, **kwargs)
  File "~/src/modin/modin/pandas/base.py", line 261, in _build_repr_df
    return self.iloc[indexer]._query_compiler.to_pandas()
  File "~/src/modin/modin/logging/logger_decorator.py", line 129, in run_and_log
    return obj(*args, **kwargs)
  File "~/src/modin/modin/core/storage_formats/pandas/query_compiler.py", line 282, in to_pandas
    return self._modin_frame.to_pandas()
  File "~/src/modin/modin/logging/logger_decorator.py", line 129, in run_and_log
    return obj(*args, **kwargs)
  File "~/src/modin/modin/core/dataframe/pandas/dataframe/utils.py", line 501, in run_f_on_minimally_updated_metadata
    result = f(self, *args, **kwargs)
  File "~/src/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 4028, in to_pandas
    ErrorMessage.catch_bugs_and_request_email(
  File "~/src/modin/modin/error_message.py", line 81, in catch_bugs_and_request_email
    raise Exception(
Exception: Internal Error. Please visit https://github.com/modin-project/modin/issues to file an issue with the traceback and the command that caused this error. If you can't file a GitHub issue, please email bug_reports@modin.org.
Internal and external indices on axis 1 do not match.
```
Installed Versions