pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Incompatible dtype warning when assigning boolean series with logical indexer #57338

Open m0nzderr opened 9 months ago

m0nzderr commented 9 months ago

Reproducible Example

import pandas as pd

data1 = pd.Series([True, True, True], dtype=bool)
data2 = pd.Series([False, False, False], dtype=bool)
condition = pd.Series([False, True, False], dtype=bool)

data1[condition] = data2[condition] # > FutureWarning: Setting an item of incompatible dtype...

Issue Description

The assignment data1[condition] = data2[condition] results in a warning claiming incompatible dtypes:

FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[False]' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
  data1[condition] = data2[condition]

This is clearly not true. The bug is somewhat related to the one reported in #56600, although the use case is different. Interestingly, the problem disappears with another dtype, such as int (the code below works as expected, with no warnings):


data1 = pd.Series([1, 2, 3], dtype=int)
data2 = pd.Series([4, 5, 6], dtype=int)
condition = pd.Series([False, True, False], dtype=bool)

data1[condition] = data2[condition]



Note: Reproduced on pandas==2.2.0 and 3.0.0.dev0+292.g70f367194a
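
To compare the bool and int cases programmatically rather than by reading console output, the outcome of each assignment can be recorded. The sketch below is only illustrative: `setitem_outcome` is a hypothetical helper, not part of pandas, and it assumes that versions which enforce the deprecation raise `TypeError` instead of warning.

```python
import warnings

import pandas as pd


def setitem_outcome(target, mask, value):
    """Return "ok", "warning" or "error" for target[mask] = value.

    Hypothetical helper for this report; works on a copy so the
    caller's Series is left untouched.
    """
    target = target.copy()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning
        try:
            target[mask] = value
        except TypeError:
            # Assumption: versions enforcing the deprecation raise TypeError.
            return "error"
    if any(issubclass(w.category, FutureWarning) for w in caught):
        return "warning"
    return "ok"


condition = pd.Series([False, True, False], dtype=bool)

bool_outcome = setitem_outcome(
    pd.Series([True, True, True], dtype=bool),
    condition,
    pd.Series([False, False, False], dtype=bool)[condition],
)
int_outcome = setitem_outcome(
    pd.Series([1, 2, 3], dtype=int),
    condition,
    pd.Series([4, 5, 6], dtype=int)[condition],
)

print("bool:", bool_outcome)
print("int: ", int_outcome)
```

On the versions listed in the note above, the bool case would be expected to report "warning" while the int case reports "ok"; other versions may differ.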

### Expected Behavior

There should be no warning.

### Installed Versions

<details>
INSTALLED VERSIONS
------------------
commit                : 70f367194aff043f4f9ca549003df8a2a97f286a
python                : 3.10.12.final.0
python-bits           : 64
OS                    : Linux
OS-release            : 5.15.133.1-microsoft-standard-WSL2
Version               : #1 SMP Thu Oct 5 21:02:42 UTC 2023
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : en_US.UTF-8

pandas                : 3.0.0.dev0+292.g70f367194a
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.8.2
setuptools            : 59.6.0
pip                   : 24.0
Cython                : None
pytest                : None
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.3
IPython               : 8.21.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : 3.8.2
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pyarrow               : None
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.12.0
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2023.4
qtpy                  : None
pyqt5                 : None

</details>
phofl commented 9 months ago

This actually upcasts to object, which means that the warning is kind of correct, but this should definitely work and continue to work.

I suspect that we do an alignment under the hood that causes this. cc @MarcoGorelli
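
A minimal sketch of how alignment could produce an object upcast, assuming the right-hand side is reindexed against the target's full index (this is a guess at the mechanism, not a confirmed trace of the pandas internals). It also shows that passing a plain NumPy array, which carries no index to align, keeps the result as bool; whether that avoids the warning on every version is untested here.

```python
import pandas as pd

data1 = pd.Series([True, True, True], dtype=bool)
data2 = pd.Series([False, False, False], dtype=bool)
condition = pd.Series([False, True, False], dtype=bool)

# The masked right-hand side only has the label 1.
rhs = data2[condition]

# Reindexing it to the target's index introduces missing values,
# which a bool Series cannot hold, so the dtype becomes object.
aligned = rhs.reindex(data1.index)
print(aligned.dtype)  # object

# Possible workaround: assign the raw values so there is nothing to align.
result = data1.copy()
result[condition] = rhs.to_numpy()
print(result.tolist(), result.dtype)
```

The same reindex on the int example yields float64 instead of object, which may explain why only the bool case is affected.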

m0nzderr commented 9 months ago

@phofl I'm not familiar with the implementation and have no idea about the reasons for the upcast, but, intuitively, I see no reason for that to happen. Both series already have the same dtype, so any conversion would be unnecessary. It is also very counter-intuitive to have different behaviors among primitive types (i.e., no upcast in case of ints or floats, but an upcast in case of bools...).

SpoopyPillow commented 2 weeks ago

I think what's happening is that when we mask the series, we eventually try to set the item at index array([1]) to array([False]) (the second element of data2). However, this array has object dtype. When we try to cast it to bool, pandas raises a LossySetitemError, which may be the problem.
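
The object-to-bool step can be illustrated with NumPy alone. This is only an analogy under the assumption that pandas applies a similarly conservative check; it does not call the pandas code path that raises LossySetitemError.

```python
import numpy as np

# An object-dtype array that happens to contain only a bool.
values = np.array([False], dtype=object)

# NumPy's "safe" casting rule looks at dtypes, not contents, so
# object -> bool is rejected even though this value would fit.
safe = np.can_cast(values.dtype, np.dtype(bool), casting="safe")
print(safe)  # False

# An explicit cast succeeds and preserves the value.
as_bool = values.astype(bool)
lossless = as_bool.tolist() == values.tolist()
print(as_bool.dtype, lossless)
```

A dtype-level check of this kind cannot tell a lossless object array from a lossy one, which would be consistent with the warning firing even though the value is a valid bool.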