Since the introduction of KeyError for missing keys in an index, quite a few use cases have come up in different issues. I will try to link some of those issues as I come across them.
My view is that raising KeyError for a plain Index is fine, but MultiIndexes should be treated differently: you cannot always raise a KeyError for a single missing key in a MultiIndex slice, since a MultiIndex cannot always be reindexed.
Code generator
```
import numpy as np
import pandas as pd

ix = pd.IndexSlice  # shorthand used inside the generated commands
ret = None


def do(command):
    """Run an indexing command and classify its outcome as a string."""
    try:
        # Evaluate in the module namespace and store the value in ret
        exec(f'global ret; ret={command}', globals())
    except KeyError:
        return 'KeyError'
    else:
        if isinstance(ret, np.int64):
            return 'int64'
        elif isinstance(ret, pd.Series):
            return 'Series'
        elif isinstance(ret, pd.DataFrame):
            return 'DataFrame'
    return 'OtherType'


cases = [
    "'a'",        # single valid key
    "'!'",        # single invalid key
    "['a']",      # single valid key as pseudo multiple valid keys
    "['!']",      # single invalid key as pseudo multiple valid keys
    "['a','e']",  # multiple valid keys
    "['a','!']",  # at least one invalid key
    "'a':'e'",    # valid key slice
    "'a':'!'",    # at least one invalid slice key
    "'!':",       # at least one invalid slice key
    "'b'",        # single valid non-unique key
    "['b']",      # single valid non-unique key as pseudo multiple keys
    "'b':'d'",    # slice with non-unique key
]
base = [
    [f's.loc[{case}]',       # use regular s.loc[]
     f's.loc[ix[{case}]]']   # and with index slice as comparison s.loc[ix[{}]]
    for case in cases
]
# Flatten the pairs into one list of commands
commands = [
    command for sublist in base for command in sublist
]
indexes = [
    pd.Index(['a','b','c','e','d'], name='Unique Non-Monotonic'),
    pd.Index(['a','b','c','e','d'], name='Unique Monotonic').sort_values(),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Non-Monotonic'),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Monotonic').sort_values(),
]
results = pd.DataFrame('', index=commands, columns=['Uq Non-Mono', 'Uq Mono', 'Non-Uq Non-Mono', 'Non-Uq Mono'])
# One column per index type, one row per command
for j, index in enumerate(indexes):
    s = pd.Series([1,2,3,4,5], index=index)
    for i, command in enumerate(commands):
        results.iloc[i, j] = do(command)
```
This seems to be pretty consistent. The only inconsistency is perhaps highlighted in red, and a minor niggle for dynamic coding might be the different return types in the case of non-unique indexes.
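To make the return-type niggle concrete, here is a minimal sketch. The series names `unique` and `dupes` are illustrative only and are not part of the test matrix above. The same scalar lookup gives a scalar on a unique index and a Series on a non-unique one, while wrapping the key in a list gives a Series in both cases.

```python
import numpy as np
import pandas as pd

# Illustrative data: same values, unique vs. non-unique labels
unique = pd.Series([1, 2, 3], index=pd.Index(['a', 'b', 'c']))
dupes = pd.Series([1, 2, 3], index=pd.Index(['a', 'b', 'b']))

scalar_result = unique.loc['b']   # one matching label -> a scalar
series_result = dupes.loc['b']    # two matching labels -> a Series

# A list key returns a Series regardless of uniqueness
uniform_unique = unique.loc[['b']]
uniform_dupes = dupes.loc[['b']]
```

For dynamic code, passing list keys is one way to keep the return type predictable.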
Obviously, whenever you need to index by pre-defined levels that may have been filtered out, the solution is to reindex with your pre-defined keys, and this is quite easy to do in RAM.
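As a rough sketch of that reindexing approach for a flat Index (the names `predefined_keys`, `filtered` and `restored` are made up for this example):

```python
import pandas as pd

predefined_keys = ['a', 'b', 'c', 'd', 'e']
s = pd.Series([1, 2, 3, 4, 5], index=predefined_keys)

# Filtering drops 'a' and 'b', so filtered.loc['a'] would raise KeyError
filtered = s[s > 2]

# Reindexing restores every pre-defined key; dropped ones become NaN
restored = filtered.reindex(predefined_keys)

value = restored.loc['a']  # no KeyError, the value is NaN
```

After the reindex every pre-defined key is safe to look up, at the cost of holding the full set of keys in memory.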
MultiIndexing is different. You cannot always reindex, for one of two reasons:
1. The number of possible combinations of the index level values exceeds RAM, and expanding them is computationally slow.
2. If you add a value or set of values to a MultiIndex level, the process is ambiguous, and expanding all combinations leads to the problem above.
For example, consider the MultiIndex levels (a, b) and (x, y, z). There are at most 6 index tuples, but in practice one works with indexes containing far fewer than the maximum number of combinations (since the combinations scale exponentially with the number of levels). Your MultiIndex might thus be [(a,x), (a,z), (b,x), (b,y)].
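The counting in this example can be checked with a small sketch. The variable names are illustrative, and `from_product` is used here only to enumerate the full set of combinations for comparison:

```python
import pandas as pd

level0 = ['a', 'b']
level1 = ['x', 'y', 'z']

# Every possible combination of the level values
full = pd.MultiIndex.from_product([level0, level1])

# The tuples actually present in the data
actual = pd.MultiIndex.from_tuples(
    [('a', 'x'), ('a', 'z'), ('b', 'x'), ('b', 'y')]
)

n_full = len(full)      # 2 * 3 possible tuples
n_actual = len(actual)  # tuples really in use

# Combinations that a full reindex would have to materialise
missing = full.difference(actual)
```

With only two small levels the gap is two tuples, but the size of `full` is the product of the level sizes, so it grows quickly as levels are added.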
I think you need to be able to index MultiIndexes with keys that are missing. As a rule, I would suggest that index keys given as an iterable (e.g. a list) do not raise KeyErrors. Here is a summary of some of the observations below for current behaviour:
[a, y] : KeyError
[a, [y]] : KeyError but should return empty (a in level0)
[[a], y] : KeyError but should return empty (y in level1)
[[a], [y]] : KeyError but should return empty
[a, !] : KeyError
[a, [!]] : returns empty
[[a], !] : KeyError (maybe OK since ! not in level1)
[[!], x] : returns empty (x in level1)
[[!], [!]] : returns empty
[!, !] : KeyError
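For comparison, one way to get the "return empty" behaviour today is a boolean mask over the level values. This is only a workaround sketch, not what `.loc` does internally, and `select` is a hypothetical helper written for this example:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('a', 'x'), ('a', 'z'), ('b', 'x'), ('b', 'y')]
)
s = pd.Series([1, 2, 3, 4], index=idx)


def select(series, level0_keys, level1_keys):
    """Hypothetical helper: keep rows whose level values are in the given keys."""
    mask = (
        series.index.get_level_values(0).isin(level0_keys)
        & series.index.get_level_values(1).isin(level1_keys)
    )
    return series[mask]


empty = select(s, ['a'], ['y'])     # ('a', 'y') is absent -> empty Series
hit = select(s, ['a'], ['x', 'y'])  # only ('a', 'x') is present
```

Missing keys simply match nothing here, which is the behaviour suggested above for list-like keys.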
Code generator
```
import pandas as pd

ix = pd.IndexSlice  # do() is reused from the Index code generator above

cases_level0 = [
    "'a'",         # single valid key on level0
    "'!'",         # single invalid key on level0
    "['a']",       # single valid key on level0 as pseudo multiple valid keys
    "['!']",       # single invalid key on level0 as pseudo multiple valid keys
    "['a', 'b']",  # multiple valid keys on level0
    "['a', '!']",  # at least one invalid key on level0
    "'a':'b'",     # valid level0 index slice
    "'a':'!'",     # invalid level0 index slice
    "'!':",        # fully invalid level0 index slice
]
comments_level0 = [
    '0: valid single, ',
    '0: invalid single, ',
    '0: valid single as multiple, ',
    '0: invalid single as multiple, ',
    '0: multiple valid, ',
    '0: one invalid in multiple, ',
    '0: valid slice, ',
    '0: semi-invalid slice, ',
    '0: invalid slice, ',
]
base = [
    [f's.loc[{case}]', f's.loc[ix[{case}, :]]'] for case in cases_level0
]
commands = [
    command for sublist in base for command in sublist
]
indexes = [
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]),
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]).sortlevel()[0],
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]),
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]).sortlevel()[0],
]
results = pd.DataFrame('', index=commands, columns=['Uq Non-Mono', 'Uq Mono', 'Non-Uq Non-Mono', 'Non-Uq Mono'])
for j, index in enumerate(indexes):
    s = pd.Series([1,2,3,4,5], index=index)
    for i, command in enumerate(commands):
        results.iloc[i,j] = do(command)
# Each case produces two commands, so every comment is repeated twice
base = [
    [com, com] for com in comments_level0
]
comments = [
    com for sublist in base for com in sublist
]
results['Comment'] = comments
results.style
from itertools import product

cases_level1 = [
    "'x'",         # single valid key on level1
    "'y'",         # single sometimes-valid key on level1
    "'!'",         # single invalid key on level1
    "['x']",       # single valid key on level1 as pseudo multiple valid keys
    "['y']",       # single sometimes-valid key on level1 as pseudo multiple valid keys
    "['!']",       # single invalid key on level1 as pseudo multiple valid keys
    "['x', 'y']",  # multiple sometimes-valid keys on level1
    "['x', '!']",  # at least one invalid key on level1
    "'x':'y'",     # sometimes-valid level1 index slice
    "'x':'!'",     # invalid level1 index slice
    "'!':",        # fully invalid level1 index slice
]
comments_level1 = [
    '1: valid single',
    '1: semi-valid single',
    '1: invalid single',
    '1: valid single as multiple',
    '1: semi-valid single as multiple',
    '1: invalid single as multiple',
    '1: multiple semi-valid',
    '1: one invalid in multiple',
    '1: semi-valid slice',
    '1: semi-invalid slice',
    '1: invalid slice',
]
# Cross every level0 case with every level1 case
multi_cases = list(product(cases_level0, cases_level1))
# Join the two comment halves into one label per combination
multi_comments = [c0 + c1 for c0, c1 in product(comments_level0, comments_level1)]
base = [
    [f's.loc[{case[0]}, {case[1]}]', f's.loc[ix[{case[0]}, {case[1]}]]'] for case in multi_cases
]
commands = [
    command for sublist in base for command in sublist
]
results = pd.DataFrame('', index=commands, columns=['Uq Non-Mono', 'Uq Mono', 'Non-Uq Non-Mono', 'Non-Uq Mono'])
for j, index in enumerate(indexes):
    s = pd.Series([1,2,3,4,5], index=index)
    for i, command in enumerate(commands):
        results.iloc[i,j] = do(command)
# Each case produces two commands, so every comment is repeated twice
base = [
    [com, com] for com in multi_comments
]
comments = [
    com for sublist in base for com in sublist
]
results['comment'] = comments
# Note: on pandas >= 2.1, Styler.map is the non-deprecated name for Styler.applymap
(
    results.style
    .applymap(lambda v: 'background-color:red;', subset=ix[["s.loc['a', ['!']]", "s.loc['a', ['y']]", "s.loc[['!'], ['!']]", "s.loc[['a'], ['y']]"], :])
    .applymap(lambda v: 'background-color:LemonChiffon;', subset=ix[["s.loc['a', 'x':'y']"], :])
)
```