pandas-dev / pandas


API: handling of missing values in Index.__contains__ #59765

Open jorisvandenbossche opened 1 week ago

jorisvandenbossche commented 1 week ago

The table below gives an overview of the result of:

missing_value in idx

i.e. how Index.__contains__ handles various missing value sentinels as input for the different data types.

| dtype          | None  | nan   | \<NA> | NaT   |
|:---------------|:------|:------|:------|:------|
| object-none    | True  | False | False | False |
| object-nan     | False | True  | False | False |
| object-NA      | False | False | True  | False |
| datetime       | True  | True  | True  | True  |
| period         | True  | True  | True  | True  |
| timedelta      | True  | True  | True  | True  |
| float64        | False | True  | False | False |
| categorical    | True  | True  | True  | True  |
| interval       | True  | True  | True  | False |
| nullable_int   | False | False | True  | False |
| nullable_float | False | False | True  | False |
| string-python  | False | False | False | False |
| string-pyarrow | False | False | False | False |
| str-python     | False | False | False | False |

The last three rows, without a single True, are specifically problematic; this seems to be a bug in StringDtype.

But more generally, the behavior is quite inconsistent across dtypes.

The code to generate the table above:

```python
import numpy as np
import pandas as pd

# from conftest.py
indices_dict = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-nan": pd.Index(["a", np.nan], dtype=object),
    "object-NA": pd.Index(["a", pd.NA], dtype=object),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "period": pd.PeriodIndex(["2024-01-01", None], freq="D"),
    "timedelta": pd.TimedeltaIndex(["1 days", "NaT"]),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "categorical": pd.CategoricalIndex(["a", None]),
    "interval": pd.IntervalIndex.from_tuples([(1, 2), np.nan]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "nullable_float": pd.Index([2.0, None], dtype="Float32"),
    "string-python": pd.Index(["a", None], dtype="string[python]"),
    "string-pyarrow": pd.Index(["a", None], dtype="string[pyarrow]"),
    "str-python": pd.Index(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan)),
}

# Test each missing-value sentinel for membership in each index
results = []
for dtype, data in indices_dict.items():
    for val in [None, np.nan, pd.NA, pd.NaT]:
        res = val in data
        results.append((dtype, str(val), res))

# Pivot into a dtype x sentinel overview, keeping the original ordering
df = pd.DataFrame(results, columns=["dtype", "val", "result"])
df_overview = df.pivot(columns="val", index="dtype", values="result").reindex(
    columns=df["val"].unique(), index=df["dtype"].unique()
)
print(df_overview.astype(str).to_markdown())  # to_markdown() requires tabulate
```
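Until the membership behavior is made consistent, one possible workaround is to ask whether an index holds any missing value at all, without naming a specific sentinel. The sketch below uses `Index.hasnans` and `Index.isna()`, which are existing pandas APIs; the variable names are only illustrative, and the membership results are printed rather than annotated because they may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# One of the string-dtype cases from the table above
idx = pd.Index(["a", None], dtype="string[python]")

# Sentinel-specific membership: the result depends on dtype and pandas version
membership = {repr(val): (val in idx) for val in [None, np.nan, pd.NA]}

# Sentinel-agnostic checks: "does this index contain any missing value?"
has_missing = idx.hasnans
has_missing_alt = bool(idx.isna().any())

print(membership)
print(has_missing, has_missing_alt)
```

This answers a different question than `missing_value in idx` (it does not distinguish between sentinels), so it is a stopgap rather than a replacement for a consistent `__contains__`.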

cc @jbrockmendel I would have expected we had issues about this, but didn't directly find anything

jbrockmendel commented 1 week ago

I'm not aware of a dedicated issue for this either. I think at one point I made a PR trying to make more of the EA subclasses use is_valid_na_for, but that got tabled pending the nan-vs-na topic.

For the datetimelike cases I think/hope that mismatched NaTs will return False (i.e. np.timedelta64("NaT") in my_datetimeindex should always be False). Also, Decimal("NaN") should be handled correctly.

jorisvandenbossche commented 1 week ago

> For the datetimelike cases I think/hope that mismatched NaTs will return False (i.e. np.timedelta64("NaT") in my_datetimeindex should always be False).

Indeed, np.timedelta64("NaT") and np.datetime64("NaT") only give True for a timedelta and datetime index, respectively; all other index dtypes return False for those, with one exception: categorical.
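A minimal sketch of the matched versus mismatched NaT check described here, reusing the index constructions from the script in this issue. The dictionary names are illustrative; the behavior shown is what is reported in this thread, and other pandas versions may differ.

```python
import numpy as np
import pandas as pd

dt_idx = pd.DatetimeIndex(["2024-01-01", "NaT"])
td_idx = pd.TimedeltaIndex(["1 days", "NaT"])

dt_nat = np.datetime64("NaT")
td_nat = np.timedelta64("NaT")

# NaT of the same kind as the index
matching = {
    "datetime64 NaT in DatetimeIndex": dt_nat in dt_idx,
    "timedelta64 NaT in TimedeltaIndex": td_nat in td_idx,
}

# NaT of the other kind (the "mismatched" case)
mismatched = {
    "timedelta64 NaT in DatetimeIndex": td_nat in dt_idx,
    "datetime64 NaT in TimedeltaIndex": dt_nat in td_idx,
}

print(matching)
print(mismatched)
```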

> Also Decimal("NaN") should be handled correctly.

In the sense that it is not matched in general (again, except for categorical ..). But it also does not seem to be matched for an object-dtype index that contains such a decimal: Decimal("NaN") in pd.Index([Decimal("2.0"), Decimal("NaN")], dtype=object) gives False.
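For context, plain Python already treats two separately created Decimal("NaN") objects as unequal, so an equality-based lookup cannot find one via the other. The standard-library sketch below shows only that Python-level behavior; whether this is the actual reason for the pandas result above is an assumption, since the pandas lookup internals are not shown here.

```python
from decimal import Decimal

nan_a = Decimal("NaN")
nan_b = Decimal("NaN")  # a second, distinct NaN object

# Quiet NaNs never compare equal, not even to another NaN
equal_by_value = nan_a == nan_b

# list.__contains__ checks identity before equality,
# so the very same object is found ...
found_same_object = nan_a in [nan_a]
# ... but a different NaN object is not
found_other_object = nan_b in [nan_a]

# An explicit, sentinel-aware check works regardless of identity
values = [Decimal("2.0"), nan_a]
has_decimal_nan = any(isinstance(v, Decimal) and v.is_nan() for v in values)

print(equal_by_value, found_same_object, found_other_object, has_decimal_nan)
```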


Expanded table:

| dtype              | None  | nan   | \<NA> | NaT   | np.datetime64('NaT') | np.timedelta64('NaT') | Decimal('NaN') |
|:-------------------|:------|:------|:------|:------|:---------------------|:----------------------|:---------------|
| object-none        | True  | False | False | False | False                | False                 | False          |
| object-nan         | False | True  | False | False | False                | False                 | False          |
| object-NA          | False | False | True  | False | False                | False                 | False          |
| object-decimal-NaN | False | False | False | False | False                | False                 | False          |
| datetime           | True  | True  | True  | True  | True                 | False                 | False          |
| period             | True  | True  | True  | True  | False                | False                 | False          |
| timedelta          | True  | True  | True  | True  | False                | True                  | False          |
| float64            | False | True  | False | False | False                | False                 | False          |
| categorical        | True  | True  | True  | True  | True                 | True                  | True           |
| interval           | True  | True  | True  | False | False                | False                 | False          |
| nullable_int       | False | False | True  | False | False                | False                 | False          |
| nullable_float     | False | False | True  | False | False                | False                 | False          |
| string-python      | False | False | False | False | False                | False                 | False          |
| string-pyarrow     | False | False | False | False | False                | False                 | False          |
| str-python         | False | False | False | False | False                | False                 | False          |

```python
import numpy as np
import pandas as pd
from decimal import Decimal

# from conftest.py
indices_dict = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-nan": pd.Index(["a", np.nan], dtype=object),
    "object-NA": pd.Index(["a", pd.NA], dtype=object),
    "object-decimal-NaN": pd.Index(["a", Decimal("NaN")], dtype=object),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "period": pd.PeriodIndex(["2024-01-01", None], freq="D"),
    "timedelta": pd.TimedeltaIndex(["1 days", "NaT"]),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "categorical": pd.CategoricalIndex(["a", None]),
    "interval": pd.IntervalIndex.from_tuples([(1, 2), np.nan]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "nullable_float": pd.Index([2.0, None], dtype="Float32"),
    "string-python": pd.Index(["a", None], dtype="string[python]"),
    "string-pyarrow": pd.Index(["a", None], dtype="string[pyarrow]"),
    "str-python": pd.Index(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan)),
}

# Sentinels to test, now including typed NaTs and Decimal NaN
missing_values = [
    None,
    np.nan,
    pd.NA,
    pd.NaT,
    np.datetime64("NaT"),
    np.timedelta64("NaT"),
    Decimal("NaN"),
]

results = []
for dtype, data in indices_dict.items():
    for val in missing_values:
        res = val in data
        # repr() keeps the three NaT variants distinguishable as column labels
        results.append((dtype, repr(val), res))

# Pivot into a dtype x sentinel overview, keeping the original ordering
df = pd.DataFrame(results, columns=["dtype", "val", "result"])
df_overview = df.pivot(columns="val", index="dtype", values="result").reindex(
    columns=df["val"].unique(), index=df["dtype"].unique()
)
print(df_overview.astype(str).to_markdown())  # to_markdown() requires tabulate
```