pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Categorical equality with NaN behaves unexpectedly #28384

Closed ageorgou closed 4 years ago

ageorgou commented 5 years ago

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

cat_type = pd.CategoricalDtype(categories=["a", "b", "c"])
s1_cat = pd.Series(["a", "b", "c"], dtype=cat_type)
s2_cat = pd.Series([np.nan, "a", "b"], dtype=cat_type)
s1_cat != s2_cat  # expected all True

Output:

0    False
1     True
2     True
dtype: bool

Problem description

I would expect anything to compare != to NaN.

Comparing the categorical series with == works as expected:

s1_cat == s2_cat  # expect all False

Output:

0    False
1    False
2    False
dtype: bool

Element-wise comparison seems to work fine:

for left, right in zip(s1_cat, s2_cat):
    print(left != right)  # expect all True

Output:

True
True
True

This also works as expected with other data types, e.g.:

s1_int = pd.Series([1, 2, 3])
s2_int = pd.Series([np.nan, 1, 2])
s1_int != s2_int  # expect all True

gives

0    True
1    True
2    True
dtype: bool

Apologies if this is a duplicate; I have found various issues about categorical types and missing values, but not about comparing them in this way.

Expected Output

s1_cat != s2_cat  # expected all True
0    True
1    True
2    True
dtype: bool

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 16.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

TomAugspurger commented 5 years ago

cc @jorisvandenbossche (gets into the NA vs. NaN discussion as part of https://github.com/pandas-dev/pandas/issues/28095).

WillAyd commented 5 years ago

This is standard behavior per IEEE 754

https://stackoverflow.com/questions/1565164/what-is-the-rationale-for-all-comparisons-returning-false-for-ieee754-nan-values

So I don't think there is anything to be done directly on this issue. If you have feedback on how missing values should be handled in general, I would suggest adding it to the issue linked above.
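
For reference, the IEEE 754 comparison rules can be checked with plain Python floats, without pandas. The sketch below is only an illustration (the `results` dictionary is a name introduced here): equality and ordered comparisons involving NaN evaluate to False, while `!=` evaluates to True.

```python
nan = float("nan")

# Collect each comparison with a readable label.
results = {
    # Equality and ordered comparisons involving NaN are False ...
    "nan == nan": nan == nan,
    "nan < 1.0": nan < 1.0,
    "nan > 1.0": nan > 1.0,
    # ... but inequality is True, even against NaN itself.
    "nan != nan": nan != nan,
    "nan != 1.0": nan != 1.0,
}

for expression, value in results.items():
    print(f"{expression}: {value}")
```

So under IEEE 754 an element-wise `!=` against a missing value would be expected to give True, not False.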

jorisvandenbossche commented 5 years ago

@WillAyd as noted in the issue, for other dtypes it does evaluate to True, as well as for scalars:

In [9]: np.nan != np.nan  
Out[9]: True

jorisvandenbossche commented 5 years ago

To make the difference in behaviour between categorical and non-categorical very explicit:

In [14]: pd.Series(['a', 'b', np.nan]) != pd.Series(['b', 'a', 'a'])
Out[14]: 
0    True
1    True
2    True
dtype: bool

In [15]: pd.Series(['a', 'b', np.nan], dtype='category') != pd.Series(['b', 'a', 'a'], dtype='category') 
Out[15]: 
0     True
1     True
2    False
dtype: bool

I suppose this has to do with np.nan being represented in a categorical by the code -1, which might not be special-cased in the comparison.
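
The -1 code can be inspected through the `.cat.codes` accessor. The second half of the sketch below is a guess at the mechanism, not a reading of the pandas source: it assumes the comparison runs on the integer codes and then forces every position with a missing value to False regardless of the operator. The names `missing` and `emulated_ne` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

x = pd.Series(["a", "b", np.nan], dtype="category")
y = pd.Series(["b", "a", "a"], dtype="category")

# Each value is stored as an integer index into the categories;
# a missing value has no category and gets the sentinel code -1.
x_codes = x.cat.codes.to_numpy()
y_codes = y.cat.codes.to_numpy()
print(x_codes.tolist())  # [0, 1, -1]
print(y_codes.tolist())  # [1, 0, 0]

# Hypothetical emulation (an assumption, not pandas internals):
# compare the codes, then set every missing position to False.
missing = (x_codes == -1) | (y_codes == -1)
emulated_ne = x_codes != y_codes
emulated_ne[missing] = False
print(emulated_ne.tolist())  # [True, True, False]
```

Under that assumption the emulated result matches the categorical output shown above, which would explain why only `!=` looks wrong while `==` looks right.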

quale1 commented 4 years ago

Just to further emphasize the problems this causes,

In [3]: x = pd.Series(['a', 'b', np.nan], dtype='category')

In [4]: y = pd.Series(['b', 'a', 'a'], dtype='category')

In [5]: x[2] != y[2]
Out[5]: True

In [6]: (x != y)[2]
Out[6]: False

Result 5 makes sense, but 6 doesn't. It's impossible to successfully reason about a Pandas program when x[2] != y[2] and (x != y)[2] give different results.
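
A possible workaround on affected versions, sketched here under the assumption that losing the categorical dtype for the duration of the comparison is acceptable, is to cast both sides to object first. The names `elementwise` and `via_object` are introduced for illustration.

```python
import numpy as np
import pandas as pd

x = pd.Series(["a", "b", np.nan], dtype="category")
y = pd.Series(["b", "a", "a"], dtype="category")

# Scalar comparison, one pair at a time (the behaviour of result 5).
elementwise = [left != right for left, right in zip(x, y)]

# Casting to object sidesteps the categorical comparison path, so the
# vectorised result agrees with the scalar one.
via_object = x.astype(object) != y.astype(object)

print(elementwise)          # [True, True, True]
print(via_object.tolist())  # [True, True, True]
```

The direct comparison `x != y` is deliberately left out of the sketch because its result at index 2 depends on the pandas version.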

simonjayhawkins commented 4 years ago

looks to be fixed on master (@jbrockmendel maybe #36237?)

>>> pd.__version__
'1.2.0.dev0+378.g8df0218a4b'
>>> pd.Series(['a', 'b', np.nan], dtype='category') != pd.Series(['b', 'a', 'a'], dtype='category')
0    True
1    True
2    True
dtype: bool

junjunjunk commented 4 years ago

take