API: handling of missing values in Index.__contains__

jorisvandenbossche commented 1 week ago

The table below gives an overview of the result of `missing_value in idx`, i.e. how `Index.__contains__` handles the various missing-value sentinels as input for the different data types.

| dtype          | None  | nan   | \<NA> | NaT   |
|:---------------|:------|:------|:------|:------|
| object-none    | True  | False | False | False |
| object-nan     | False | True  | False | False |
| object-NA      | False | False | True  | False |
| datetime       | True  | True  | True  | True  |
| period         | True  | True  | True  | True  |
| timedelta      | True  | True  | True  | True  |
| float64        | False | True  | False | False |
| categorical    | True  | True  | True  | True  |
| interval       | True  | True  | True  | False |
| nullable_int   | False | False | True  | False |
| nullable_float | False | False | True  | False |
| string-python  | False | False | False | False |
| string-pyarrow | False | False | False | False |
| str-python     | False | False | False | False |

The last three rows, which do not contain a single True, are especially problematic; this looks like a bug in the StringDtype handling.

But more generally, this is quite inconsistent:

- For object dtype, we require an exact match
- For datetimelike and categorical, we match any missing-like value
- For interval, we match any missing-like value except NaT (also not in the case of a datetimelike interval dtype)
- For float, we only match NaN
- For nullable dtypes (int/float), we only match NA
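For comparison, the "match any missing-like" behaviour from the list above can be sketched for any dtype using only public API. `contains_missing` is a hypothetical helper written for illustration; it is not something pandas provides, nor a proposal made in this thread.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_scalar


def contains_missing(idx: pd.Index, value) -> bool:
    """Hypothetical helper: treat every missing-like scalar as equivalent."""
    if is_scalar(value) and pd.isna(value):
        # Any NA-like input matches as soon as the index holds a missing value.
        return bool(idx.hasnans)
    # Non-missing values fall back to the regular containment check.
    return value in idx


idx = pd.Index(["a", None], dtype=object)
for sentinel in [None, np.nan, pd.NA, pd.NaT]:
    # Compare the built-in result with the "any missing-like" policy.
    print(repr(sentinel), sentinel in idx, contains_missing(idx, sentinel))
```

Whether this is the right policy for every dtype is exactly the open question here; the sketch only makes the "any missing-like" variant concrete.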

The code to generate the table above:

```python
import numpy as np
import pandas as pd

# from conftest.py
indices_dict = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-nan": pd.Index(["a", np.nan], dtype=object),
    "object-NA": pd.Index(["a", pd.NA], dtype=object),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "period": pd.PeriodIndex(["2024-01-01", None], freq="D"),
    "timedelta": pd.TimedeltaIndex(["1 days", "NaT"]),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "categorical": pd.CategoricalIndex(["a", None]),
    "interval": pd.IntervalIndex.from_tuples([(1, 2), np.nan]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "nullable_float": pd.Index([2.0, None], dtype="Float32"),
    "string-python": pd.Index(["a", None], dtype="string[python]"),
    "string-pyarrow": pd.Index(["a", None], dtype="string[pyarrow]"),
    # note: the label says "python", but the storage passed here is "pyarrow"
    "str-python": pd.Index(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan)),
}

# check every missing-value sentinel against every index
results = []
for dtype, data in indices_dict.items():
    for val in [None, np.nan, pd.NA, pd.NaT]:
        res = val in data
        results.append((dtype, str(val), res))

# pivot to one row per dtype and one column per sentinel, keeping insertion order
df = pd.DataFrame(results, columns=["dtype", "val", "result"])
df_overview = df.pivot(columns="val", index="dtype", values="result").reindex(
    columns=df["val"].unique(), index=df["dtype"].unique()
)
print(df_overview.astype(str).to_markdown())
```

cc @jbrockmendel I would have expected we had issues about this, but didn't directly find anything

jbrockmendel commented 1 week ago

I'm not aware of a dedicated issue for this either. I think at one point I made a PR trying to make more of the EA subclasses use `is_valid_na_for`, but that got tabled pending the nan-vs-na topic.

For the datetimelike cases I think/hope that mismatched NaTs will return False (i.e. `np.timedelta64("NaT") in my_datetimeindex` should always be False). Also `Decimal("NaN")` should be handled correctly.

jorisvandenbossche commented 1 week ago

> For the datetimelike cases I think/hope that mismatched NaTs will return False (i.e. `np.timedelta64("NaT") in my_datetimeindex` should always be False).

Indeed, `np.timedelta64("NaT")` and `np.datetime64("NaT")` only give True for the timedelta and datetime index, respectively; all other index dtypes return False for those, with one exception: categorical.
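The two NumPy NaT sentinels are distinct scalar types, which is what makes a "mismatched NaT" check possible in the first place. A minimal sketch of telling them apart (the `nat_kind` helper is hypothetical, only for illustration, and not how pandas implements this):

```python
import numpy as np


def nat_kind(value):
    """Hypothetical helper: classify a NumPy NaT sentinel by its type."""
    if isinstance(value, np.datetime64) and np.isnat(value):
        return "datetime"
    if isinstance(value, np.timedelta64) and np.isnat(value):
        return "timedelta"
    # Anything else (including valid datetimes and None) is not a NumPy NaT.
    return None


print(nat_kind(np.datetime64("NaT")))
print(nat_kind(np.timedelta64("NaT")))
print(nat_kind(np.datetime64("2024-01-01")))
```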

> Also `Decimal("NaN")` should be handled correctly.

In the sense that it is not matched in general (again, except for categorical ...). But it also seems not to be matched for an object-dtype index that contains such a decimal: `Decimal("NaN") in pd.Index([Decimal("2.0"), Decimal("NaN")], dtype=object)` gives False.
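As background, in plain Python a `Decimal("NaN")` never compares equal to anything, including itself, so any lookup that relies on equality alone cannot find it. Whether that is the actual reason for the pandas result above is an assumption; it has not been checked against the pandas internals here.

```python
from decimal import Decimal

d = Decimal("NaN")

# A quiet Decimal NaN is not equal to itself.
self_equal = d == d

# The explicit check is the reliable way to detect it.
detected = d.is_nan()

# List containment tries identity first, so the very same object is found ...
same_object_found = d in [d]

# ... but a separately constructed NaN is not.
other_object_found = Decimal("NaN") in [d]

print(self_equal, detected, same_object_found, other_object_found)
```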

Expanded table:

| dtype              | None  | nan   | \<NA> | NaT   | np.datetime64('NaT') | np.timedelta64('NaT') | Decimal('NaN') |
|:-------------------|:------|:------|:------|:------|:---------------------|:----------------------|:---------------|
| object-none        | True  | False | False | False | False                | False                 | False          |
| object-nan         | False | True  | False | False | False                | False                 | False          |
| object-NA          | False | False | True  | False | False                | False                 | False          |
| object-decimal-NaN | False | False | False | False | False                | False                 | False          |
| datetime           | True  | True  | True  | True  | True                 | False                 | False          |
| period             | True  | True  | True  | True  | False                | False                 | False          |
| timedelta          | True  | True  | True  | True  | False                | True                  | False          |
| float64            | False | True  | False | False | False                | False                 | False          |
| categorical        | True  | True  | True  | True  | True                 | True                  | True           |
| interval           | True  | True  | True  | False | False                | False                 | False          |
| nullable_int       | False | False | True  | False | False                | False                 | False          |
| nullable_float     | False | False | True  | False | False                | False                 | False          |
| string-python      | False | False | False | False | False                | False                 | False          |
| string-pyarrow     | False | False | False | False | False                | False                 | False          |
| str-python         | False | False | False | False | False                | False                 | False          |

```python
import numpy as np
import pandas as pd
from decimal import Decimal

# from conftest.py
indices_dict = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-nan": pd.Index(["a", np.nan], dtype=object),
    "object-NA": pd.Index(["a", pd.NA], dtype=object),
    "object-decimal-NaN": pd.Index(["a", Decimal("NaN")], dtype=object),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "period": pd.PeriodIndex(["2024-01-01", None], freq="D"),
    "timedelta": pd.TimedeltaIndex(["1 days", "NaT"]),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "categorical": pd.CategoricalIndex(["a", None]),
    "interval": pd.IntervalIndex.from_tuples([(1, 2), np.nan]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "nullable_float": pd.Index([2.0, None], dtype="Float32"),
    "string-python": pd.Index(["a", None], dtype="string[python]"),
    "string-pyarrow": pd.Index(["a", None], dtype="string[pyarrow]"),
    # note: the label says "python", but the storage passed here is "pyarrow"
    "str-python": pd.Index(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan)),
}

# sentinels to test, now including the NumPy NaTs and Decimal NaN
missing_values = [
    None,
    np.nan,
    pd.NA,
    pd.NaT,
    np.datetime64("NaT"),
    np.timedelta64("NaT"),
    Decimal("NaN"),
]

results = []
for dtype, data in indices_dict.items():
    for val in missing_values:
        res = val in data
        # repr() keeps the two NumPy NaTs distinguishable in the column labels
        results.append((dtype, repr(val), res))

# pivot to one row per dtype and one column per sentinel, keeping insertion order
df = pd.DataFrame(results, columns=["dtype", "val", "result"])
df_overview = df.pivot(columns="val", index="dtype", values="result").reindex(
    columns=df["val"].unique(), index=df["dtype"].unique()
)
print(df_overview.astype(str).to_markdown())
```

pandas-dev / pandas

API: handling of missing values in Index.__contains__ #59765