Open moi90 opened 11 months ago
Thanks for the report, this is related to #48476. value_counts
is utilizing groupby, which converts the None
to np.nan
in the result. Unfortunately, using np.nan
to get the row doesn't work either. This is because np.nan != np.nan
and often you get a view on np.nan
:
df = pd.DataFrame(
[[False, None, False], [True, None, False], [True, True, False]],
columns=["a", "b", "c"],
)
counts = df.value_counts(dropna=False)
nan_from_index = counts.index[1][1]
print(nan_from_index)
# nan
print(nan_from_index is np.nan)
# False
Because of this behavior, the lookup fails.
@jorisvandenbossche @phofl - were lookups involving NA part of the discussion in the ice cream agreement?
Thanks for the explanation. I worked around it by changing NaN to another value.
First of all, @moi90 example has a mistake, which does not invalidate it, but still. Instead of
counts.loc[(pd.NA, True, False)] # KeyError: (<NA>, True, False)
it should be
counts.loc[(True, pd.NA, False)] # KeyError: (True, <NA>, False)
Note that
(<NA>, True, False)
is genuinely not in the MultiIndex, so we expect an error. But(True, <NA>, False)
is in the MultiIndex and we expect no error!
Second, there is a closely related behaviour:
counts.loc[(slice(None), pd.NA, slice(None))] # KeyError: <NA>
The error stack points to raise KeyError(key)
in here:
https://github.com/pandas-dev/pandas/blob/58461fef2315d228d08f65f8a9430e9294d65b31/pandas/core/indexes/multi.py#L3331-L3335
I believe that the pd.NA
in the user's .loc query is encoded as idx == -1
, while level_codes
encodes NaNs as 0
, which is evident from:
counts.index.get_level_values(1) # Index([nan, nan, True], dtype='object', name='b')
counts.index.codes[1] # array([0, 0, 1], dtype=int8)
which is why I doubt that the real reason for these bugs is the one @rhshadrach mentions, namely:
This is because
np.nan != np.nan
(but I am not 100% sure and will leave this to experts)
P.S. What is "Ice Cream Agreement"? Some bug reports just make you salivate.
The issue here is in None
in MultiIndex
used with GroupBy
.
For example, if to remove GroupBy
following @rhshadrach 's example above :
tmp = df.set_index(['a', 'b'])
tmp.index[0][1] is numpy.nan
# True
tmp.loc[(False, numpy.nan)] # works as expected
c False
Name: (False, nan), dtype: bool
It's a suprise to see in @rhshadrach 's example that:
print(nan_from_index is np.nan)
# False
even though:
counts.index[0][1] is counts.index[1][1]
# True
Unfortunately, using np.nan to get the row doesn't work either. This is because np.nan != np.nan
Is it possible to get this fixed in GroupBy
, for np.nan is np.nan
? I stopped at pandas.core.sorting.compress_group_index
reading that:
Group_index is offsets into cartesian product of all possible labels. This space can be huge, so this function compresses it, by computing offsets (comp_ids) into the list of unique labels (obs_group_ids).
Okay, this indeed seems to be related to GroupBy
and not just to MultiIndex
. Here is another illustrative example:
df = pd.DataFrame(
[[False, None, False], [True, None, False], [True, True, False]],
columns=["a", "b", "c"],
)
counts1 = df.value_counts(dropna=False)
mi = pd.MultiIndex.from_tuples([*df.values], names=['a', 'b', 'c'])
counts2 = pd.Series(index=mi, data=[1,1,1], name='count')
# counts1 and counts2 appear to have same content:
# a b c
# False NaN False 1
# True NaN False 1
# True False 1
# Name: count, dtype: int64
# However, the content is different
counts1.index.codes # FrozenList([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
counts2.index.codes # FrozenList([[0, 1, 1], [-1, -1, 0], [0, 0, 0]])
counts2[counts2.index[[0]]] # Works
counts2[counts2.index[0]] # Works
counts1[counts1.index[[0]]] # Works
counts1[counts1.index[0]] # KeyError: (False, nan, False)
I think I solved it. I can open a PR soon, in a week or two, meanwhile I'll describe my findings.
The problem is in BaseMultiIndexCodesEngine.get_loc()
method":
The mistake is the use of checknull(v)
to determine the appropriate code for an element of a tuple-key. Thus, BaseMultiIndexCodesEngine.get_loc
makes this decision based on the value of the key, as opposed to the code of the value. This makes the get_loc
's value-based logic incompatible with BaseMultiIndexCodesEngine._extract_level_codes()
, which uses the code-based rule: https://github.com/pandas-dev/pandas/blob/b162331554d7c7f6fd46ddde1ff3908f2dc8bcce/pandas/_libs/index.pyx#L779-L788
I suppose, this can be fixed by checking level_has_nans
in get_loc()
like this:
[1 if (checknull(v) and has_nan) else lev.get_loc(v) + multiindex_nulls_shift
for lev, v, has_nan in zip(self.levels, key, self.level_has_nans)]
It remains to check if this fix breaks anything, but I guess it does not.
In any case, a PR should add the following test for MultiIndex
:
mi = MultiIndex(...) # random codes, values, including NA
for i in range(len(mi)):
assert i == mi.get_loc(mi[i])
@rhshadrach Please let me know if all this makes sense.
Thanks for the investigation @avm19. Your proposal makes sense, but I think would not work in the pathological case where a MultiIndex has NA values both in the levels and -1 for codes, e.g.
pd.MultiIndex(
levels=[[np.nan], [0]],
codes=[[0, -1], [0, 0]],
)
I think this is a consequence of groupby requiring nonnegative codes for NA values when dropna=True
(the code is used to determine the index of the result, and so must be nonnegative). If we change the codes to be negative for NA values, then it works. E.g.
df = pd.DataFrame(
[[False, None], [True, None]],
columns=["a", "b"],
)
counts = df.value_counts(dropna=False)
print(counts)
# a b
# False NaN 1
# True NaN 1
# Name: count, dtype: int64
print(counts.index.codes)
# [[0, 1], [0, 0]]
print(counts.loc[(False, np.nan)])
# Raises KeyError
mi = pd.MultiIndex(
[[False, True], []],
[[0, 1], [-1, -1]],
names=["a", "b"],
)
counts.index = mi
print(counts.loc[(False, np.nan)])
# 1
So to me the question is: do we support MultiIndex with NA in the levels as well as -1 for codes, or should we force NA values correspond to -1 codes. I can't say I have a good understanding of the consequences.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When selecting values from a MultiIndex where one (or multiple) level contain None / NaN, I get a KeyError.
Expected Behavior
I need to be able to use None in a MultiIndex.
A workaround would be to replace None with something else. Is there a better option?)
Installed Versions