pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Index and MultiIndex `KeyError` cases and discussion #39775

Open attack68 opened 3 years ago

attack68 commented 3 years ago

Since the introduction of KeyError for missing keys in an index there have been quite a few use cases from different issues. I will try and link some of the issues if I see them.

My view is that KeyErrors for Index are fine, but MultiIndexes should be treated differently: you cannot always raise a KeyError for a single key in a MultiIndex slice, since a MultiIndex cannot always be reindexed.

Index

indexes = [
    pd.Index(['a','b','c','e','d'], name='Unique Non-Monotonic'),
    pd.Index(['a','b','c','e','d'], name='Unique Monotonic').sort_values(),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Non-Monotonic'),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Monotonic').sort_values(),
]

Screen Shot 2021-02-12 at 13 23 16

Code generator

```
import numpy as np
import pandas as pd

# ix is assumed to be pd.IndexSlice; it is not defined in the original snippet.
ix = pd.IndexSlice
ret = None

def do(command):
    # Evaluate the indexing expression and classify the outcome.
    try:
        exec(f'global ret; ret={command}', globals())
    except KeyError:
        return 'KeyError'
    else:
        if isinstance(ret, (np.int64)):
            return 'int64'
        elif isinstance(ret, (pd.Series)):
            return 'Series'
        elif isinstance(ret, (pd.DataFrame)):
            return 'DataFrame'
        return 'OtherType'

cases = [
    "'a'",        # single valid key
    "'!'",        # single invalid key
    "['a']",      # single valid key as pseudo multiple valid keys
    "['!']",      # single invalid key as pseudo multiple valid keys
    "['a','e']",  # multiple valid keys
    "['a','!']",  # at least one invalid key
    "'a':'e'",    # valid key slice
    "'a':'!'",    # at least one invalid slice key
    "'!':",       # at least one invalid slice key
    "'b'",        # single valid non-unique key
    "['b']",      # single valid non-unique key as pseudo multiple keys
    "'b':'d'",    # slice with non-unique key
]
base = [
    [f's.loc[{case}]',       # use regular s.loc[]
     f's.loc[ix[{case}]]']   # and with index slice as comparison s.loc[ix[{}]]
    for case in cases
]
commands = [command for sublist in base for command in sublist]

indexes = [
    pd.Index(['a','b','c','e','d'], name='Unique Non-Monotonic'),
    pd.Index(['a','b','c','e','d'], name='Unique Monotonic').sort_values(),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Non-Monotonic'),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Monotonic').sort_values(),
]

results = pd.DataFrame('', index=commands,
                       columns=['Uq Non-Mono', 'Uq Mono', 'Non-Uq Non-Mono', 'Non-Uq Mono'])
for j, index in enumerate(indexes):
    s = pd.Series([1,2,3,4,5], index=index)
    for i, command in enumerate(commands):
        results.iloc[i, j] = do(command)
```

This seems to be pretty consistent. The only inconsistency is perhaps highlighted in red, and a minor niggle for dynamic coding might be the different return types in the case of non-unique indexes.
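The return-type difference for non-unique indexes can be seen in a minimal sketch. The series names `unique` and `dup` are hypothetical and only mirror the indexes used above. Wrapping the key in a list is one way to get a Series back in both cases, which may be easier for dynamic code.

```python
import numpy as np
import pandas as pd

# Hypothetical series for illustration: same values, different indexes.
unique = pd.Series([1, 2, 3, 4, 5], index=pd.Index(['a', 'b', 'c', 'e', 'd']))
dup = pd.Series([1, 2, 3, 4, 5], index=pd.Index(['a', 'b', 'b', 'e', 'd']))

scalar_result = unique.loc['b']  # unique label: a single scalar value
series_result = dup.loc['b']     # duplicated label: a Series with both matches

# A list key returns a Series regardless of whether the label is unique.
unique_as_series = unique.loc[['b']]
dup_as_series = dup.loc[['b']]
```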

Obviously, the solution for any case where you need to index by pre-defined levels that may have been filtered out is to reindex with your pre-defined keys. And this is quite easy to do in RAM.
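A minimal sketch of the reindexing approach for a flat Index follows. The names `filtered` and `expected_keys` are hypothetical; the idea is only that reindexing against the full set of pre-defined keys makes every later lookup valid, with missing entries filled by NaN.

```python
import pandas as pd

# Hypothetical data: 'c' and 'e' were filtered out upstream.
filtered = pd.Series([1, 2, 4], index=pd.Index(['a', 'b', 'd'], name='key'))

# The pre-defined keys the downstream code expects to be able to look up.
expected_keys = ['a', 'b', 'c', 'd', 'e']

# Reindex once; labels absent from `filtered` appear with NaN values.
full = filtered.reindex(expected_keys)

# Lookups by any pre-defined key now succeed instead of raising KeyError.
missing_value = full.loc['c']
subset = full.loc[['a', 'c', 'e']]
```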

MultiIndex

indexes = [
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]),
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]).sortlevel()[0],
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]),
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]).sortlevel()[0],
]

MultiIndexing is different. You cannot always reindex for one of two reasons:

For example, consider the MultiIndex levels: (a,b), (x,y,z). There are a maximum of 6 index tuples, but in practice one works with indexes containing far fewer than the maximum number of combinations (since the combinations scale exponentially with the number of levels). Your MultiIndex might thus be [(a,x), (a,z), (b,x), (b,y)].
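To make the size argument concrete, here is a small sketch comparing the sparse MultiIndex from the example with the full product of its levels. The variable names are hypothetical, and reindexing to the full product is shown only to illustrate the cost, not as a recommendation.

```python
import pandas as pd

levels = [['a', 'b'], ['x', 'y', 'z']]

# Every possible combination of the two levels: 2 * 3 = 6 tuples.
full_index = pd.MultiIndex.from_product(levels)

# The index actually in use holds only 4 of those combinations.
sparse_index = pd.MultiIndex.from_tuples(
    [('a', 'x'), ('a', 'z'), ('b', 'x'), ('b', 'y')]
)
s = pd.Series([1, 2, 3, 4], index=sparse_index)

# Reindexing to the full product materialises the absent tuples as NaN rows.
dense = s.reindex(full_index)
n_missing = int(dense.isna().sum())
```

With more levels, `len(full_index)` grows as the product of the level sizes, so the dense form can become impractical even when the sparse one is small.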

I think you need to be able to index MultiIndexes with keys that are missing. As a rule, I would suggest that slice arguments which are iterables (e.g. lists) do not yield KeyErrors. Here is a summary of some observations of the current behaviour:

[a, y] : KeyError
[a, [y]] : KeyError but should return empty (a in level0)
[[a], y] : KeyError but should return empty (y in level1)
[[a], [y]] : KeyError but should return empty 
[a, !] : KeyError 
[a, [!]] : returns empty
[[a], !] : KeyError (maybe OK since ! not in level1)
[[!], x] : returns empty (x in level1)
[[!], [!]] : returns empty
[!, !] : KeyError
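Until behaviour like the suggested rule exists in `.loc`, list-like selection that never raises can be approximated with boolean masks. The sketch below is an assumption about one possible workaround, not pandas' own implementation; `select_lenient` is a hypothetical helper name.

```python
import pandas as pd

def select_lenient(s, level0_keys, level1_keys):
    """Return rows whose level-0 and level-1 labels appear in the given lists.

    Hypothetical helper: labels that are absent simply match nothing,
    so the result may be empty but no KeyError is raised.
    """
    lvl0 = s.index.get_level_values(0)
    lvl1 = s.index.get_level_values(1)
    mask = lvl0.isin(level0_keys) & lvl1.isin(level1_keys)
    return s[mask]

index = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'z'), ('b', 'x'), ('b', 'y')])
s = pd.Series([1, 2, 3, 4], index=index)

empty = select_lenient(s, ['a'], ['y'])              # like [[a], [y]]
also_empty = select_lenient(s, ['!'], ['!'])         # like [[!], [!]]
partial = select_lenient(s, ['a', '!'], ['x', 'y'])  # only ('a', 'x') matches
```

This only covers the list-with-list cases; scalar keys, which the summary above suggests should keep raising, would still go through `.loc`.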

multiindex_slice

Code generator

```
from itertools import product

import pandas as pd

# ix is assumed to be pd.IndexSlice; do() is the helper from the Index code generator above.
ix = pd.IndexSlice

cases_level0 = [
    "'a'",         # single valid key on level0
    "'!'",         # single invalid key on level0
    "['a']",       # single valid key on level0 as pseudo multiple valid keys
    "['!']",       # single invalid key on level0 as pseudo multiple valid keys
    "['a', 'b']",  # multiple valid keys on level0
    "['a', '!']",  # at least one invalid key on level0
    "'a':'b'",     # valid level0 index slice
    "'a':'!'",     # invalid level0 index slice
    "'!':",        # fully invalid level0 index slice
]
comments_level0 = [
    '0: valid single, ',
    '0: invalid single, ',
    '0: valid single as multiple, ',
    '0: invalid single as multiple, ',
    '0: multiple valid, ',
    '0: one invalid in multiple, ',
    '0: valid slice, ',
    '0: semi-invalid slice, ',
    '0: invalid slice, ',
]
base = [
    [f's.loc[{case}]', f's.loc[ix[{case}, :]]']
    for case in cases_level0
]
commands = [command for sublist in base for command in sublist]

indexes = [
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]),
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]).sortlevel()[0],
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]),
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]).sortlevel()[0],
]

columns = ['Uq Non-Mono', 'Uq Mono', 'Non-Uq Non-Mono', 'Non-Uq Mono']

# Level-0-only indexing.
results = pd.DataFrame('', index=commands, columns=columns)
for j, index in enumerate(indexes):
    s = pd.Series([1,2,3,4,5], index=index)
    for i, command in enumerate(commands):
        results.iloc[i,j] = do(command)
base = [[com, com] for com in comments_level0]
comments = [com for sublist in base for com in sublist]
results['Comment'] = comments
results.style  # displays the table in a notebook

cases_level1 = [
    "'x'",         # single valid key on level1
    "'y'",         # single sometimes-valid key on level1
    "'!'",         # single invalid key on level1
    "['x']",       # single valid key on level1 as pseudo multiple valid keys
    "['y']",       # single sometimes-valid key on level1 as pseudo multiple valid keys
    "['!']",       # single invalid key on level1 as pseudo multiple valid keys
    "['x', 'y']",  # multiple sometimes-valid keys on level1
    "['x', '!']",  # at least one invalid key on level1
    "'x':'y'",     # sometimes-valid level1 index slice
    "'x':'!'",     # invalid level1 index slice
    "'!':",        # fully invalid level1 index slice
]
comments_level1 = [
    '1: valid single',
    '1: semi-valid single',
    '1: invalid single',
    '1: valid single as multiple',
    '1: semi-valid single as multiple',
    '1: invalid single as multiple',
    '1: multiple semi-valid',
    '1: one invalid in multiple',
    '1: semi-valid slice',
    '1: semi-invalid slice',
    '1: invalid slice',
]

# Every combination of a level-0 case with a level-1 case.
multi_cases = list(product(cases_level0, cases_level1))
multi_comments = list(product(comments_level0, comments_level1))
base = [
    [f's.loc[{case[0]}, {case[1]}]', f's.loc[ix[{case[0]}, {case[1]}]]']
    for case in multi_cases
]
commands = [command for sublist in base for command in sublist]

results = pd.DataFrame('', index=commands, columns=columns)
for j, index in enumerate(indexes):
    s = pd.Series([1,2,3,4,5], index=index)
    for i, command in enumerate(commands):
        results.iloc[i,j] = do(command)
base = [[com, com] for com in multi_comments]
comments = [com for sublist in base for com in sublist]
results['comment'] = comments

# Styler.applymap is deprecated in favour of Styler.map from pandas 2.1.
results.style\
    .applymap(lambda v: 'background-color:red;',
              subset=ix[["s.loc['a', ['!']]", "s.loc['a', ['y']]", "s.loc[['!'], ['!']]", "s.loc[['a'], ['y']]"], :])\
    .applymap(lambda v: 'background-color:LemonChiffon;',
              subset=ix[["s.loc['a', 'x':'y']"], :])
```

kthyng commented 1 year ago

I also ran into this problem.