pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Enlarging multilevel index fails if one or more level keys are None #59153

Open micky-gee opened 4 days ago

micky-gee commented 4 days ago

Reproducible Example

import pandas as pd

#Create simple multilevel index with two levels (note one entry on level 1 is None):
index = pd.MultiIndex.from_tuples([('A', 'a1'), ('A', 'a2'), ('B', 'b1'), ('B', None)])

#Create dataframe with said index:
df = pd.DataFrame([(0, 6), (1, 5), (2, 4), (3, 7)], index=index)
df

#       0  1
#A a1   0  6
#  a2   1  5
#B b1   2  4
#  NaN  3  7

#Now it is possible to enlarge this dataframe with a new index entry provided none of the keys are None:
df.loc[('B', 'b2'),:] = [10, 11]

#           0     1
# A a1    0.0   6.0
#   a2    1.0   5.0
# B b1    2.0   4.0
#   NaN   3.0   7.0
#   b2   10.0  11.0

#However this will throw a KeyError:
df.loc[('A', None),:] = [12, 13]

#Also doesn't work with an index slice:
idx = pd.IndexSlice

#this will throw a KeyError:
df.loc[idx['A', None],:] = [12, 13]

Issue Description

It is possible to enlarge a dataframe with a multilevel index by passing the new key to df.loc[...].

It is also possible to create entries in multilevel indices that have None as a key, i.e. df.loc[('A', None), ...].

However, it is not possible to enlarge a dataframe with a multilevel index if one or more of the keys is None.

Expected Behavior

Building on the example above, df.loc[('A', None),:] = [12, 13]

should result in the following:

# A a1    0.0   6.0
#   a2    1.0   5.0
#   NaN  12.0  13.0
# B b1    2.0   4.0
#   NaN   3.0   7.0
#   b2   10.0  11.0
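Until enlargement through .loc accepts None keys, one possible workaround is to build the new row as its own one-row dataframe and concatenate it. The sketch below is only an illustration under that assumption: the names new_row and result are made up for the example, and the new row lands at the end of the frame rather than inside the 'A' group as in the expected output above.

```python
import pandas as pd

# Rebuild the frame from the reproducible example.
index = pd.MultiIndex.from_tuples(
    [("A", "a1"), ("A", "a2"), ("B", "b1"), ("B", None)]
)
df = pd.DataFrame([(0, 6), (1, 5), (2, 4), (3, 7)], index=index)

# Build the new row as a separate one-row frame. Constructing a MultiIndex
# that contains None already works, as the example above shows.
new_row = pd.DataFrame(
    [(12, 13)],
    index=pd.MultiIndex.from_tuples([("A", None)]),
    columns=df.columns,
)

# Concatenate instead of enlarging through .loc.
result = pd.concat([df, new_row])
print(result)
```

This avoids the failing key lookup entirely, at the cost of building a new dataframe instead of modifying df in place.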

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.10.6.final.0
python-bits : 64
OS : Darwin
OS-release : 23.5.0
Version : Darwin Kernel Version 23.5.0: Wed May 1 20:19:05 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8112
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 63.2.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : 0.59.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
micky-gee commented 3 days ago

After some more digging, I've found the call within the multilevel index that is failing:

>>> index._engine.get_loc(('A', None))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "index.pyx", line 776, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 2152, in pandas._libs.hashtable.UInt64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 2176, in pandas._libs.hashtable.UInt64HashTable.get_item
KeyError: 17

I suspect this has to do with how None is hashed and how that hash is mapped to a position in the underlying data structure.

When I give a valid tuple to the multilevel index, I get an integer corresponding to an entry in an underlying data structure:

>>> index._engine.get_loc(('A', 'a2'))
1
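
One detail that may be relevant here, shown through the public levels and codes attributes of the MultiIndex: a None key is not stored as a regular label in its level but as the sentinel code -1. The sketch below only illustrates that storage layout. It does not by itself explain the KeyError, and the variable names level_labels and level_codes are made up for the example.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("A", "a1"), ("A", "a2"), ("B", "b1"), ("B", None)]
)

# Each level stores only its distinct, non-missing labels ...
level_labels = list(index.levels[1])

# ... and each row stores an integer code pointing into that level.
# A missing key (None/NaN) is represented by the sentinel code -1.
level_codes = [int(code) for code in index.codes[1]]

print(level_labels)  # ['a1', 'a2', 'b1']
print(level_codes)   # [0, 1, 2, -1]
```

So ('B', None) has no label of its own in level 1, which is at least consistent with the engine lookup treating None keys differently from ordinary ones.
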
micky-gee commented 3 days ago

To understand this problem more broadly, I've been investigating hashable types (None and NaN are both hashable) and how usable they are as keys in pandas indices.
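
For reference, here is how the two values behave as keys in a plain Python dict, with no pandas involved. This is only a sketch of standard Python semantics and makes no claim about how the pandas hash tables treat them. The names nan and lookup are just for the example.

```python
nan = float("nan")

# Both values are hashable, so both can be used as dictionary keys.
lookup = {None: "none-entry", nan: "nan-entry"}

# None is a singleton and compares equal to itself.
print(lookup[None])

# NaN never compares equal to itself ...
print(nan == nan)

# ... but a dict still finds it when the very same object is used,
# because dict lookup checks identity before equality.
print(lookup[nan])

# A different NaN object is not found.
print(float("nan") in lookup)
```

So hashability alone does not make a value a well-behaved key: NaN lookups depend on object identity, while None lookups do not.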

Here is an MWE using a single-level index (as opposed to a multilevel index) that demonstrates these inconsistencies:

>>> import pandas as pd
>>> import numpy as np
>>> index2 = pd.Index([1, 2, 3, None])
>>> df2 = pd.DataFrame([4, 5, 6, 9], index=index2)
>>> df2
     0
1.0  4
2.0  5
3.0  6
NaN  9

Now addressing the index entry with None raises a KeyError:

>>> df2.loc[None]
Traceback (most recent call last):
  File "/Users/michaelgrant/.pyenv/versions/s7s_strategy_private/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 175, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index_class_helper.pxi", line 19, in pandas._libs.index.Float64Engine._check_type
KeyError: None

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/michaelgrant/.pyenv/versions/s7s_strategy_private/lib/python3.10/site-packages/pandas/core/indexing.py", line 1191, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/Users/michaelgrant/.pyenv/versions/s7s_strategy_private/lib/python3.10/site-packages/pandas/core/indexing.py", line 1431, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/Users/michaelgrant/.pyenv/versions/s7s_strategy_private/lib/python3.10/site-packages/pandas/core/indexing.py", line 1381, in _get_label
    return self.obj.xs(label, axis=axis)
  File "/Users/michaelgrant/.pyenv/versions/s7s_strategy_private/lib/python3.10/site-packages/pandas/core/generic.py", line 4301, in xs
    loc = index.get_loc(key)
  File "/Users/michaelgrant/.pyenv/versions/s7s_strategy_private/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: None

However, replacing None with np.nan works just fine:

>>> df2.loc[np.nan]
0    9
Name: nan, dtype: int64
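
Based on the behaviour above, a user-side stopgap for the single-level case could be to translate None to np.nan before the lookup. This is only a sketch: normalize_key is a hypothetical helper, not a pandas function, and it relies on df2.loc[np.nan] working as it does in the session above.

```python
import numpy as np
import pandas as pd

index2 = pd.Index([1, 2, 3, None])
df2 = pd.DataFrame([4, 5, 6, 9], index=index2)

# None was converted to NaN when the index was built,
# so the index holds floats rather than Python objects.
print(index2.dtype)


def normalize_key(key):
    """Hypothetical helper: map None to np.nan, leave other keys untouched."""
    return np.nan if key is None else key


# Equivalent to df2.loc[np.nan], which works in the session above.
row = df2.loc[normalize_key(None)]
print(row)
```

This does not address the multilevel enlargement case from the original report; it only papers over the None/NaN mismatch for lookups on a float index.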