Closed ageorgou closed 4 years ago
cc @jorisvandenbossche (gets into the NA vs. NaN discussion as part of https://github.com/pandas-dev/pandas/issues/28095).
This is standard behavior per IEEE 754.
So I don't think there is anything to be done directly on this issue. If you have feedback on how missing values should be handled in general, I would suggest rolling it into the linked issue above.
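For reference, a minimal sketch of what IEEE 754 comparison semantics look like for plain Python floats, using only the standard library. The `results` dictionary is just an illustrative name. Every ordered or equality comparison involving NaN is false, and only `!=` is true:

```python
import math

nan = math.nan

# Under IEEE 754, NaN is unordered with respect to every value, itself included.
# Equality and ordering comparisons are False; only "not equal" is True.
results = {
    "nan == nan": nan == nan,
    "nan != nan": nan != nan,
    "nan < 1.0": nan < 1.0,
    "nan > 1.0": nan > 1.0,
    "nan != 1.0": nan != 1.0,
}

for expr, value in results.items():
    print(f"{expr}: {value}")
```

By this rule, a != comparison against a missing value would be expected to yield True rather than False.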
@WillAyd as noted in the issue, for other dtypes it does equate to True, as well as for scalars:
In [9]: np.nan != np.nan
Out[9]: True
To make the difference in behaviour between categorical and non-categorical very explicit:
In [14]: pd.Series(['a', 'b', np.nan]) != pd.Series(['b', 'a', 'a'])
Out[14]:
0 True
1 True
2 True
dtype: bool
In [15]: pd.Series(['a', 'b', np.nan], dtype='category') != pd.Series(['b', 'a', 'a'], dtype='category')
Out[15]:
0 True
1 True
2 False
dtype: bool
I suppose this has to do with np.nan being represented in categorical as -1, and this might not be special cased.
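To illustrate the -1 representation, here is a rough sketch that inspects the integer codes via `.cat.codes` and then shows how the missing positions would need to be patched per operator. `compare_codes` is a hypothetical helper written for this illustration, not pandas's actual implementation, and it assumes both series share the same categories so that their codes are directly comparable:

```python
import numpy as np
import pandas as pd

x = pd.Series(['a', 'b', np.nan], dtype='category')
y = pd.Series(['b', 'a', 'a'], dtype='category')

# Categories are ['a', 'b'] for both series; a missing value gets code -1.
x_codes = x.cat.codes.to_numpy()
y_codes = y.cat.codes.to_numpy()
print(x_codes, y_codes)

# Positions where either side is missing.
missing = (x_codes == -1) | (y_codes == -1)


def compare_codes(op_name):
    """Hypothetical helper: compare the codes, then patch missing slots.

    Missing slots must become True for "ne" and False for "eq"; filling
    them with False for every operator would give the result reported here.
    """
    if op_name == "ne":
        result = x_codes != y_codes
        result[missing] = True
    elif op_name == "eq":
        result = x_codes == y_codes
        result[missing] = False
    else:
        raise ValueError(f"unsupported operator: {op_name}")
    return result


print(compare_codes("ne"))
print(compare_codes("eq"))
```

The point of the sketch is only that the fill value for missing positions has to depend on the operator.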
Just to further emphasize the problems this causes,
In [3]: x = pd.Series(['a', 'b', np.nan], dtype='category')
In [4]: y = pd.Series(['b', 'a', 'a'], dtype='category')
In [5]: x[2] != y[2]
Out[5]: True
In [6]: (x != y)[2]
Out[6]: False
Result 5 makes sense, but 6 doesn't. It's impossible to successfully reason about a pandas program when x[2] != y[2] and (x != y)[2] give different results.
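A small consistency check along these lines can make the discrepancy easy to spot. This is only a sketch: `elementwise_ne` is a hypothetical helper name, and it assumes both series have the same length and index. It compares the values one pair at a time, then lists any positions where the vectorized result disagrees:

```python
import numpy as np
import pandas as pd


def elementwise_ne(left, right):
    """Hypothetical helper: apply != to each pair of scalar values."""
    return pd.Series([a != b for a, b in zip(left, right)], index=left.index)


x = pd.Series(['a', 'b', np.nan], dtype='category')
y = pd.Series(['b', 'a', 'a'], dtype='category')

scalar_result = elementwise_ne(x, y)  # one scalar comparison per position
vector_result = x != y                # the vectorized comparison

# Positions where the two approaches disagree; empty when they are consistent.
mismatches = scalar_result[scalar_result != vector_result]
print(mismatches)
```

On a pandas version with the behaviour reported here, position 2 would be expected to show up in `mismatches`.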
looks to be fixed on master (@jbrockmendel maybe #36237?)
>>> pd.__version__
'1.2.0.dev0+378.g8df0218a4b'
>>> pd.Series(['a', 'b', np.nan], dtype='category') != pd.Series(['b', 'a', 'a'], dtype='category')
0 True
1 True
2 True
dtype: bool
take
Code Sample, a copy-pastable example if possible
Output:
Problem description
I would expect anything to compare != to NaN.
Comparing the categorical series with == works as expected:
Output:
Element-wise comparison seems to work fine:
Output:
This also works as expected with other data types, e.g.:
gives
Apologies if this is a duplicate; I have found various issues about categorical types and missing values, but not about comparing them in this way.
Expected Output
Output of pd.show_versions()