modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

BUG: extra row gets inserted when using `.iloc` on Series #7392

Open MarcoGorelli opened 2 months ago

Modin version checks

Reproducible Example

$ MODIN_ENGINE=python ipython
Python 3.12.5 (main, Aug 14 2024, 05:08:31) [Clang 18.1.8 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.27.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import modin.pandas as mpd

In [2]: df = mpd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7,8,9]}).convert_dtypes(dtype_backend='pyarrow')

In [3]: df.loc[:, 'a'].iloc[[1, 0]] = [999, 888]

In [4]: df
Out[4]: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.

      a     b     c
0     1     4     7
1     2     5     8
2     3     6     9
a  <NA>  <NA>  <NA>

Issue Description

After the chained `.loc[:, 'a'].iloc[[1, 0]]` assignment, `df` gains an extra row labelled `a` whose values are all `<NA>`.

Expected Behavior

`df` should keep its original three rows; no extra row should be added.

pandas:

In [5]: import pandas as pd

In [6]: df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7,8,9]}).convert_dtypes(dtype_backend='pyarrow')

In [7]: df.loc[:, 'a'].iloc[[1, 0]] = [999, 888]
FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [8]: df
Out[8]:
     a  b  c
0  888  4  7
1  999  5  8
2    3  6  9

I'm not too bothered about whether the [888, 999] change carries through, but the fact that df got an extra row inserted looks wrong.
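
For reference, the single-step form that the pandas warning above recommends can be sketched as follows. This is a minimal sketch in plain pandas, not verified against Modin. It assumes default integer dtypes (the `convert_dtypes(dtype_backend='pyarrow')` call from the report is left out), and the names `index_before` and `rows_unchanged` are only illustrative.

```python
import pandas as pd

# Same data as in the report, but without the pyarrow dtype backend (assumption).
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
index_before = df.index.copy()

# Select rows and column in one .loc call instead of chaining
# .loc[...] and .iloc[...]. With the default RangeIndex, the labels
# 1 and 0 coincide with the positions used in the report.
df.loc[[1, 0], 'a'] = [999, 888]

# The frame should still have exactly its original three rows.
rows_unchanged = df.index.equals(index_before)
print(df)
print(rows_unchanged)
```

Comparing the index before and after the assignment is one way to check for the extra row described above without depending on whether the values themselves were updated.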

Installed Versions

INSTALLED VERSIONS
------------------
commit : c8bbca8e4e00c681370e3736b2f73bb0352408c3
python : 3.12.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.153.1-microsoft-standard-WSL2
Version : #1 SMP Fri Mar 29 23:14:13 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : C.UTF-8

Modin dependencies
------------------
modin : 0.31.0
ray : None
dask : 2024.8.2
distributed : None

pandas dependencies
-------------------
pandas : 2.2.2
numpy : 2.1.1
pytz : 2024.2
dateutil : 2.9.0.post0
setuptools : 74.1.2
pip : None
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.27.0
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.5.0
fsspec : 2024.9.0
gcsfs : None
matplotlib : 3.9.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None