Open jbrockmendel opened 3 years ago
we don't allow na values in the constructor by definition so this should be unambiguous
which constructor doesn't allow NA values? I can do
cat = pd.Categorical(["A", "B", np.nan, "C"])
target1 = [np.nan]
target2 = ["D"]
>>> cat.categories.get_indexer(target1)
array([-1])
>>> cat.categories.get_indexer(target2)
array([-1])
check the categories
When Categorical was originally added, it did support missing values in its categories, which meant there were basically two ways to represent a missing value: a -1 in the codes, or a missing value in the categories. Shortly after, we changed that to allow only a single way, i.e. -1 in the codes, and thus disallowed missing values in the categories.
General constructors like pd.Categorical(["A", "B", np.nan, "C"])
will convert the missing values to -1 in the codes, and specialized constructors check that the categories don't contain NaNs:
In [15]: pd.Categorical.from_codes([0, 1, 2], categories=["a", "b", None])
...
~/scipy/pandas/pandas/core/dtypes/dtypes.py in validate_categories(categories, fastpath)
502
503 if categories.hasnans:
--> 504 raise ValueError("Categorical categories cannot be null")
505
506 if not categories.is_unique:
ValueError: Categorical categories cannot be null
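To make the two behaviours described above concrete, here is a minimal sketch, assuming a reasonably recent pandas. The variable names (`cat`, `error_message`) are only for illustration and are not taken from the pandas code base.

```python
import numpy as np
import pandas as pd

# General constructor: the NaN is dropped from the categories
# and stored as a -1 code instead.
cat = pd.Categorical(["A", "B", np.nan, "C"])
print(cat.codes)       # the NaN position holds -1
print(cat.categories)  # only "A", "B", "C"; no NaN entry

# Specialized constructor: a null entry in `categories` is rejected.
error_message = None
try:
    pd.Categorical.from_codes([0, 1, 2], categories=["a", "b", None])
except ValueError as err:
    error_message = str(err)
print(error_message)
```

So the only way a missing value ends up in a Categorical is through a -1 code, never through the categories themselves.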
There's another issue somewhere discussing this, but I can't find it right now. At the time, we (I?) decided against including NA-like values in the categories.
I'm not finding it either, but I think there was something about trying to support a specific NA value or possibly multiple distinct NA types.
There are a bunch of places in CategoricalIndex where we check something like:
This leads to ambiguity, as a -1 code can indicate an NA value that is present, or a non-NA value that is not among self.categories. Having to sort out which we're looking at is a hassle which we should try to avoid.
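As a rough illustration of the ambiguity, the sketch below shows one way a caller could tell the two -1 cases apart after the fact. It is an assumption about how such a check might look, not the code CategoricalIndex actually uses; the names `target`, `na_mask`, `is_missing_value` and `is_unknown_label` are hypothetical.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["A", "B", np.nan, "C"])

# One known label, one NA value, one label that is not a category.
target = pd.Index(["A", np.nan, "D"], dtype=object)

# get_indexer returns -1 for both the NaN and the unknown label "D".
indexer = cat.categories.get_indexer(target)

# Hypothetical disambiguation step: split the -1 entries by whether
# the looked-up value was itself NA.
na_mask = target.isna()
is_missing_value = (indexer == -1) & na_mask    # NA being looked up
is_unknown_label = (indexer == -1) & ~na_mask   # non-NA, not in categories

print(indexer)
print(is_missing_value)
print(is_unknown_label)
```

Every call site that cares about the difference needs an extra mask like this, which is the hassle described above.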