pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

INT: include NA values in Categorical.categories #37930

Open jbrockmendel opened 3 years ago

jbrockmendel commented 3 years ago

There are a bunch of places in CategoricalIndex where we check something like:

codes = self.categories.get_indexer(target)
if (codes == -1).any():
    do_something()

This leads to ambiguity, as a -1 code can indicate an NA value that is present, or a non-NA value that is not among self.categories. Having to sort out which we're looking at is a hassle which we should try to avoid.
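
As a rough illustration of the two cases a -1 can hide, here is a minimal sketch that separates them by also checking the target for NA values. The variable names (`target`, `is_na`, `not_found`) are made up for this example and are not taken from the pandas code base; it only shows the kind of extra bookkeeping described above.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["A", "B", np.nan, "C"])

# Hypothetical target mixing a known category, an NA value, and an unknown value.
target = np.array(["A", np.nan, "D"], dtype=object)

codes = cat.categories.get_indexer(target)

# A -1 alone does not say why the lookup failed, so check for NA separately.
is_na = pd.isna(target)
not_found = (codes == -1) & ~is_na  # non-NA values missing from the categories
```

Here `not_found` flags only `"D"`, while the NA entry is identified through `is_na` rather than through its code.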

jreback commented 3 years ago

we don't allow na values in the constructor by definition so this should be unambiguous

jbrockmendel commented 3 years ago

> we don't allow na values in the constructor by definition so this should be unambiguous

Which constructor doesn't allow NA values? I can do:

import numpy as np
import pandas as pd

cat = pd.Categorical(["A", "B", np.nan, "C"])

target1 = [np.nan]
target2 = ["D"]

>>> cat.categories.get_indexer(target1)
array([-1])
>>> cat.categories.get_indexer(target2)
array([-1])

jreback commented 3 years ago

check the categories

jorisvandenbossche commented 3 years ago

When Categorical was originally added, it did support missing values in its categories (which means there were basically two ways to have missing values: a -1 in the codes, or a missing value in the categories). But shortly after, we changed that to only allow a single way, i.e. -1 in the codes, and thus disallow missing values in the categories.

General constructors like pd.Categorical(["A", "B", np.nan, "C"]) will convert the missing values to -1 in the codes, and specialized constructors check that the categories don't have NaNs:

In [15]: pd.Categorical.from_codes([0, 1, 2], categories=["a", "b", None])
...
~/scipy/pandas/pandas/core/dtypes/dtypes.py in validate_categories(categories, fastpath)
    502 
    503             if categories.hasnans:
--> 504                 raise ValueError("Categorical categories cannot be null")
    505 
    506             if not categories.is_unique:

ValueError: Categorical categories cannot be null
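
Both behaviours described above can be checked in a short, self-contained sketch. It assumes a reasonably recent pandas version; the exact error text may differ between releases, so the message is only captured here, not compared in full.

```python
import numpy as np
import pandas as pd

# General constructor: the missing value becomes a -1 code
# and is not stored among the categories.
cat = pd.Categorical(["A", "B", np.nan, "C"])
print(cat.codes)
print(cat.categories)

# Specialized constructor: a null entry in the categories is rejected.
message = None
try:
    pd.Categorical.from_codes([0, 1, 2], categories=["a", "b", None])
except ValueError as err:
    message = str(err)
print(message)
```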

TomAugspurger commented 3 years ago

There's another issue somewhere discussing this, but I can't find it right now. At the time, we (I?) decided against including NA-like values in the categories.

jbrockmendel commented 3 years ago

> There's another issue somewhere discussing this, but I can't find it right now. At the time, we (I?) decided against including NA-like values in the categories.

I'm not finding it either, but I think there was something about trying to support a specific NA value or possibly multiple distinct NA types.