modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

BUG: incorrect iloc behavior in modin when assigning index values based on row indices #7405

Open SchwurbeI opened 1 month ago

SchwurbeI commented 1 month ago

Reproducible Example

import pandas as pd
import modin.pandas as mpd

dict1 = {
    'index_test': [-1, -1, -1]
}
df1 = pd.DataFrame(dict1)
mdf1 = mpd.DataFrame(dict1)

row_indices = [2, 0]
df1.iloc[row_indices, 0] = df1.iloc[row_indices].index
mdf1.iloc[row_indices, 0] = mdf1.iloc[row_indices].index

print(df1)  # as expected: 0, -1, 2
print('-------------')
print(mdf1)  # NOT as expected: 2, -1, 0

#    index_test
# 0           0
# 1          -1
# 2           2
# -------------
#    index_test
# 0           2
# 1          -1
# 2           0

Issue Description

When assigning values using iloc in modin, the behavior deviates from pandas. Specifically, assigning index values to a subset of rows works correctly in pandas, but modin assigns the values in the wrong order.

Expected Behavior

Modin should mirror pandas: each value on the right-hand side is written to the row at the matching position in row_indices, so row 2 receives 2 and row 0 receives 0. Instead, modin consistently assigns the values in a different order whenever iloc is used this way.

expected output produced with pandas:

    index_test
 0           0
 1          -1
 2           2

actual output produced with modin:

    index_test
 0           2
 1          -1
 2           0
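
The pandas result follows from positional pairing, which can be illustrated with plain NumPy. The second half of this sketch is only a guess at what could produce the modin output (row positions sorted, values left in their original order); it is not taken from modin's source, and the names `column`, `mismatched`, `expected`, and `swapped` are made up for illustration.

```python
import numpy as np

row_indices = [2, 0]
# With a default RangeIndex, the index labels of the selected rows
# equal their positions, so the right-hand side is simply [2, 0].
rhs = np.array(row_indices)

# Positional pairing: the i-th value goes to the i-th target position.
column = np.array([-1, -1, -1])
column[row_indices] = rhs
expected = column.tolist()  # [0, -1, 2], matches the pandas output

# Hypothetical failure mode (an assumption, not confirmed): the target
# positions get sorted to [0, 2] while the values stay as [2, 0].
mismatched = np.array([-1, -1, -1])
mismatched[sorted(row_indices)] = rhs
swapped = mismatched.tolist()  # [2, -1, 0], matches the reported modin output

print(expected)
print(swapped)
```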

Error Logs

No response

Installed Versions

PyDev console: using IPython 8.23.0

INSTALLED VERSIONS
------------------
commit : 3e951a63084a9cbfd5e73f6f36653ee12d2a2bfa
python : 3.11.8
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Austria.1252

Modin dependencies
------------------
modin : 0.32.0
ray : 2.20.0
dask : 2024.5.2
distributed : 2024.5.2

pandas dependencies
-------------------
pandas : 2.2.3
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 24.2
Cython : 3.0.10
sphinx : None
IPython : 8.23.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.10.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.3
lxml.etree : None
matplotlib : 3.8.2
numba : None
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 14.0.2
pyreadstat : None
pytest : 8.1.1
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : 2.0.25
tables : 3.9.2
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : None
zstandard : 0.23.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None

Liquidmasl commented 1 month ago

This seems to happen only if the row indices are not in sorted order. Taking the same row indices and sorting them before doing the indexing operation gives the expected result.
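
A minimal sketch of that workaround, assuming the goal is the one from the original example (write each selected row's index label into a column). The helper name `assign_index_sorted` is hypothetical, and the demo runs on plain pandas so it works without a modin engine; the idea is to pass a `modin.pandas` DataFrame the same way.

```python
import pandas as pd


def assign_index_sorted(df, row_indices, col):
    """Write each selected row's index label into column `col`.

    The positions are sorted first, so the selected rows and the values
    taken from their index are in the same ascending order.
    """
    positions = sorted(row_indices)
    df.iloc[positions, col] = df.iloc[positions].index
    return df


# Demo on pandas; a modin.pandas DataFrame would be passed identically.
frame = pd.DataFrame({'index_test': [-1, -1, -1]})
assign_index_sorted(frame, [2, 0], 0)
result = frame['index_test'].tolist()
print(result)
```

Sorting is safe here only because the values are re-derived from the sorted positions. If the right-hand side were an independent array, it would have to be reordered with the same permutation.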

Finding this issue was a huge pain in the ass