pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `DataFrame.loc` is not consistent with `DataFrame.__setitem__` when used with 2D numpy array #46544

Open anmyachev opened 2 years ago

anmyachev commented 2 years ago

Reproducible Example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.zeros((256, 10)))
array_2d = np.zeros((256, 2))

df[0] = array_2d  # works
df["col33"] = array_2d  # raises an exception, see #46545
df.loc[:, 0] = array_2d  # raises ValueError: Must have equal len keys and value when setting with an ndarray
# Note: if the `df[0] = ...` and `df.loc[:, 0] = ...` assignments are swapped, the code runs without error!

Issue Description

Inconsistent behavior of functions that should behave the same.

Expected Behavior

Either an exception should be thrown for both cases, or it should not, but also in both cases.
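Until the two setters agree, a caller who wants the "raise in both cases" behaviour on any pandas version can check the shape before assigning. This is only a minimal sketch: `assign_single_column` is a hypothetical helper written for illustration, not part of the pandas API.

```python
import numpy as np
import pandas as pd


def assign_single_column(df, col, values):
    """Assign `values` to one column, rejecting 2D input wider than one column.

    Hypothetical helper for illustration only; not a pandas function.
    """
    arr = np.asarray(values)
    if arr.ndim == 2:
        if arr.shape[1] != 1:
            raise ValueError(
                f"cannot set one column from an array of shape {arr.shape}"
            )
        arr = arr[:, 0]  # a (n, 1) array is unambiguous, so flatten it
    df[col] = arr
    return df


df = pd.DataFrame(np.zeros((256, 10)))
assign_single_column(df, 0, np.ones((256, 1)))    # accepted
# assign_single_column(df, 0, np.zeros((256, 2)))  # would raise ValueError
```

The check runs before pandas sees the value, so the outcome does not depend on whether `__setitem__` or `loc` is used afterwards.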

Installed Versions

INSTALLED VERSIONS
------------------
commit : 06d230151e6f18fdb8139d09abf539867a8cd481
python : 3.8.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.3
setuptools : 59.8.0
Cython : None
pytest : 7.0.0
hypothesis : None
sphinx : 4.4.0
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : 2022.01.0
gcsfs : None
matplotlib : 3.2.2
numba : None
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : 0.17.0
pyarrow : 6.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2022.01.0
scipy : 1.8.0
sqlalchemy : 1.4.31
tables : 3.7.0
tabulate : None
xarray : 0.21.1
xlrd : 2.0.1
xlwt : None
zstandard : None

phofl commented 2 years ago

I think the loc case is correct

simonjayhawkins commented 2 years ago

> I think the loc case is correct

For the case in the OP (without the edit), for the swapped case ("Note: if you swap 2 lines from above, then the code will work!"), or both?

Note this is a change in behavior from 1.3.5 when it worked for both cases and ordering did not matter.

I'll label this as a regression for now, pending further investigation.

phofl commented 2 years ago

All of them should raise

simonjayhawkins commented 2 years ago

to be clear, on main...

df = pd.DataFrame(np.zeros((256, 10)))
array_2d = np.zeros((256, 2))

df.loc[:, 0] = array_2d
df[0] = array_2d

works

df = pd.DataFrame(np.zeros((256, 10)))
array_2d = np.zeros((256, 2))

df[0] = array_2d
df.loc[:, 0] = array_2d

raises

> All of them should raise

I'm ignoring the df["col33"] = array_2d case, as that was added to the OP later.

So both of the above code samples (in this comment) should raise, and there is a bug on master?

> Note this is a change in behavior from 1.3.5 when it worked for both cases and ordering did not matter.

I'll do a bisect shortly to get more insight.
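To compare the two orderings on whatever pandas version is installed, something like the following could be run. It is a rough sketch that only records whether each assignment raised, and it makes no claim about what any particular version prints. The helper names `outcome` and `run_in_order` are made up for this example.

```python
import numpy as np
import pandas as pd


def outcome(action):
    """Run action(); return 'ok' or the name of the exception type raised."""
    try:
        action()
    except Exception as exc:  # broad on purpose: we only record what happened
        return type(exc).__name__
    return "ok"


def run_in_order(order):
    """Apply the two assignments in the given order on a fresh frame."""
    df = pd.DataFrame(np.zeros((256, 10)))
    array_2d = np.zeros((256, 2))

    def via_setitem():
        df[0] = array_2d

    def via_loc():
        df.loc[:, 0] = array_2d

    steps = {"setitem": via_setitem, "loc": via_loc}
    return {name: outcome(steps[name]) for name in order}


results = {
    "loc then setitem": run_in_order(["loc", "setitem"]),
    "setitem then loc": run_in_order(["setitem", "loc"]),
}
for label, result in results.items():
    print(pd.__version__, label, result)
```

Each ordering starts from a fresh frame, so a side effect of the first assignment on the second one shows up as a difference between the two rows of output.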

phofl commented 2 years ago

Yes, I think so. If you use a list as the indexer, e.g. [0], they already raise.

Also, if your initial DataFrame has multiple dtypes and we go through the split path, they raise as well.

You can test this by adding df[100] = "a" before doing the 2D assignment.
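The variants mentioned above (scalar key vs. list key, single dtype vs. mixed dtypes) can be tabulated with a sketch along these lines. The helpers `outcome` and `make_frame` are hypothetical names, and the script only reports what happens on the installed pandas version rather than asserting the expected result.

```python
import numpy as np
import pandas as pd

array_2d = np.zeros((256, 2))


def outcome(action):
    """Run action(); return 'ok' or the name of the exception type raised."""
    try:
        action()
    except Exception as exc:  # broad on purpose: we only record what happened
        return type(exc).__name__
    return "ok"


def make_frame(mixed):
    """Build the frame from the OP, optionally adding an object column."""
    df = pd.DataFrame(np.zeros((256, 10)))
    if mixed:
        df[100] = "a"  # mixed dtypes, as suggested in the comment above
    return df


def setitem_scalar(df):
    df[0] = array_2d


def setitem_list(df):
    df[[0]] = array_2d


def loc_scalar(df):
    df.loc[:, 0] = array_2d


def loc_list(df):
    df.loc[:, [0]] = array_2d


cases = [setitem_scalar, setitem_list, loc_scalar, loc_list]

report = {}
for mixed in (False, True):
    for case in cases:
        df = make_frame(mixed)  # fresh frame so cases cannot affect each other
        report[(case.__name__, mixed)] = outcome(lambda: case(df))

for (name, mixed), result in report.items():
    print(f"{name:15s} mixed_dtypes={mixed!s:5s} -> {result}")
```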

simonjayhawkins commented 2 years ago

first bad commit: [03dd698bc1e84c35aba8b51bdd45c472860b9ec3] BUG: DataFrame.__setitem__ sometimes operating inplace (#43406)

Yet it is the loc case that has changed behavior, and only after a __setitem__ operation.

phofl commented 2 years ago

I haven't checked the behavior change, but I think the fact that this works at all should be considered a bug.

simonjayhawkins commented 2 years ago

removing from 1.4.x milestone.