pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: CONTAINS_OP run on pd.NA results in pd.NAType.__bool__ call #57989

Open filip-komarzyniec opened 3 months ago

filip-komarzyniec commented 3 months ago

Reproducible Example

>>> import pandas as pd
>>> pd.NA in [1, 2, 3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "missing.pyx", line 392, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous

Issue Description

Checking whether pd.NA is in a list raises TypeError: boolean value of NA is ambiguous.
Why does performing an in operation call the __bool__ method of the pd.NAType class?

Seems a bit similar to the issue regarding incorrect implementation of some operators: https://github.com/pandas-dev/pandas/issues/49828

Expected Behavior

Checking for existence of pd.NA type in any container should correctly return either True or False

Installed Versions

INSTALLED VERSIONS
------------------
commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.10.13.final.0
python-bits : 64
OS : Darwin
OS-release : 23.2.0
Version : Darwin Kernel Version 23.2.0: Wed Nov 15 21:55:06 PST 2023; root:xnu-10002.61.3~2/RELEASE_ARM64_T6020
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 2.2.1
numpy : 1.26.3
pytz : 2024.1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : 8.0.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

rhshadrach commented 3 months ago

Thanks for the report - this is a consequence of having comparisons return pd.NA:

print(pd.NA == 1)
# <NA>

When Python evaluates pd.NA == 1 as part of the membership test, the result is pd.NA. Python then evaluates the truthiness of that result, which raises the TypeError as reported. As long as we are returning pd.NA on comparisons, I do not believe anything can be done here.
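The mechanism described above can be sketched in pure Python. The AmbiguousNA class below is a hypothetical stand-in that mimics only the two relevant behaviors of pd.NA (comparisons return the NA object, and truth-testing raises); it is not pandas' implementation. The contains helper is likewise an approximation of how list membership behaves in CPython: an identity check first, then the truth value of ==.

```python
class AmbiguousNA:
    """Hypothetical stand-in for pd.NA: == propagates NA, bool() raises."""

    def __eq__(self, other):
        # The result of the comparison is itself "missing".
        return self

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")

    def __repr__(self):
        return "<NA>"


NA = AmbiguousNA()


def contains(container, value):
    """Approximate `value in container` for a list: identity, then truthiness of ==."""
    for item in container:
        if item is value or bool(item == value):
            return True
    return False


def raises_type_error(func):
    """Return True if calling func() raises TypeError, False otherwise."""
    try:
        func()
    except TypeError:
        return True
    return False


# The comparison 1 == NA yields NA, and truth-testing that result raises.
print(raises_type_error(lambda: NA in [1, 2, 3]))
print(raises_type_error(lambda: contains([1, 2, 3], NA)))

# The identity shortcut means == is never called for the very same object.
print(NA in [NA])
```

Under these assumptions, the error comes from truth-testing the comparison result rather than from the membership test itself, which is why a list whose first element is the same NA object does not raise.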

cc @jorisvandenbossche @phofl

phofl commented 3 months ago

We intend to change this to return False (discussed in Basel); we should probably get this into 3.0.

20revsined commented 3 months ago

take

asishm commented 2 months ago

We intend to change this to return False (discussed in Basel); we should probably get this into 3.0.

@phofl Would this change only apply to boolean ops, or do you anticipate changing the behavior of numerical ops like 1 + pd.NA as well?

phofl commented 2 months ago

No, it's only bool(pd.NA) that we want to change.

@20revsined this is probably not a good issue for a beginner in pandas

julia-pfarr commented 1 week ago

I don't know if my issue is related to this, please remove my comment if not!

I have a function which gives me the following output (a pandas DataFrame):

timestamp  duration  trial_type  blink  message
9199380    <NA>      NaN         <NA>   RECORD_START
9199345    392       fixation    0      NaN
etc...

Column dtypes are:

timestamp      Int64
duration       Int64
trial_type    object
blink          Int64
message       object
dtype: object

To be precise: timestamp and duration hold numerics plus nans, trial_type holds strings plus nans, blink holds numerics (0 and 1) plus nans, and message holds strings plus nans.

Now I wrote a unit test to test the output for the first row:

import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "folder, expected",
    [
        ("emg", [9199380, pd.NA, np.nan, pd.NA, "RECORD_START"]),
        # + other folders, removed for simplicity
    ],
)
def test_physioevents_value(folder, expected, eyelink_test_data_dir):
    # The helper functions and the fixture come from the project under test.
    input_dir = eyelink_test_data_dir / folder
    asc_file = asc_test_files(input_dir=input_dir, suffix="*_events")[0]
    events = _load_asc_file(asc_file)
    events_after_start = _df_events_after_start(events)
    physioevents_reordered = _df_physioevents(events_after_start)
    physioevents_eye1 = _physioevents_eye1(physioevents_reordered)
    # Compare the first row of the output with the expected values.
    assert physioevents_eye1.iloc[0].tolist() == expected

And the list obviously looks like this: [9199380, <NA>, nan, <NA>, 'RECORD_START']

I get the following error when running the test:

E   AssertionError: assert [9199380, <NA>...CORD_START'] == [9199380, <NA>...CORD_START']
E
E   (pytest_assertion plugin: representation of details failed: missing.pyx:392: TypeError: boolean value of NA is ambiguous.
E   Probably an object has a faulty repr.)

tests/test_edf2bids.py:670: AssertionError

So I guess I cannot use pd.NA to check if the value in that field is <NA>. However, I also cannot check it using "<NA>", i.e. encoding it as a string.

How can I check whether pd.NA values exist in the dataframe?

I tried changing the dtypes so that every column has the dtype 'object'. However, that's not really what I want.
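One possible way to compare such a row without truth-testing pd.NA is to check for missingness with pd.isna before comparing the values themselves. This is only a sketch: rows_equal is a hypothetical helper name, not a pandas API, and it assumes that pd.NA and NaN may be treated as interchangeable "missing" markers, which may not be what every test needs.

```python
import numpy as np
import pandas as pd


def rows_equal(actual, expected):
    """Compare two lists of scalars, treating any missing value as equal to any other.

    Hypothetical helper for illustration; pd.NA and NaN are not distinguished.
    """
    if len(actual) != len(expected):
        return False
    for a, e in zip(actual, expected):
        a_missing = bool(pd.isna(a))
        e_missing = bool(pd.isna(e))
        if a_missing or e_missing:
            # Never call == on a missing value; both sides must be missing.
            if not (a_missing and e_missing):
                return False
        elif a != e:
            return False
    return True


expected = [9199380, pd.NA, np.nan, pd.NA, "RECORD_START"]
actual = [9199380, pd.NA, float("nan"), pd.NA, "RECORD_START"]
print(rows_equal(actual, expected))
```

In a test, the final assertion could then read assert rows_equal(row.tolist(), expected), where row stands for the first row of the output. If pd.NA and NaN must be told apart, an identity check such as a is pd.NA could be added to the helper.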

rhshadrach commented 1 week ago

While somewhat related, this:

How can I check whether pd.NA values exist in the dataframe?

is more of a usage question. Please try asking on StackOverflow first - if you don't get your question resolved in a few days, open a new issue here and link to your SO post. We do this as otherwise we fear our issue tracker would be flooded with usage questions.

julia-pfarr commented 1 week ago

Great, thank you for your reply! I already asked on SO a couple of days ago. I'll wait a bit more and then do as you asked if I don't get it resolved otherwise :-)