pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

weird behaviours with tuples in index #37190

Open johny-b opened 4 years ago

johny-b commented 4 years ago

I'm not submitting this strictly as a bug: the behaviour is inconsistent enough that I seriously suspect tuples in indexes are simply not supported, but I don't think the docs are clear about it.

I'm using Python 3.6.9 and pandas==1.1.3.

Example 1:

import pandas as pd
data = {
    'first':  [[(0,), 0, 0], [(1,), 1, 1], [(2,), 2, 2]],
    'second': [[(0,), 0, 0], [(1,), 1, 1], [(1,), 2, 2]],
}
for name, d in data.items():
    df = pd.DataFrame(d, columns=['a', 'b', 'c']).set_index('a')
    print(name)
    print(df)
    print(df.index.get_loc((0,)))

This works for first but not for second, whose index contains the duplicate label (1,):

first
      b  c
a         
(0,)  0  0
(1,)  1  1
(2,)  2  2
0
second
      b  c
a         
(0,)  0  0
(1,)  1  1
(1,)  2  2
Traceback (most recent call last):
  File "pandas/_libs/index.pyx", line 112, in pandas._libs.index.IndexEngine._get_loc_duplicates
TypeError: '<' not supported between instances of 'tuple' and 'int'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2895, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 96, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 116, in pandas._libs.index.IndexEngine._get_loc_duplicates
KeyError: (0,)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "t3.py", line 10, in <module>
    print(df.index.get_loc((0,)))
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: (0,)
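One possible way around the failing lookup is to compare the labels in plain Python and then select rows by position. The following is only a minimal sketch under that assumption, not pandas' own lookup logic; the helper name tuple_positions is hypothetical.

```python
import pandas as pd


def tuple_positions(index, key):
    """Return the integer positions whose label equals `key`.

    Hypothetical helper: it compares labels one by one in Python,
    so it does not depend on the index being unique or sortable.
    """
    return [i for i, label in enumerate(index) if label == key]


# Same data as 'second' in Example 1: the label (1,) appears twice.
rows = [[(0,), 0, 0], [(1,), 1, 1], [(1,), 2, 2]]
df = pd.DataFrame(rows, columns=['a', 'b', 'c']).set_index('a')

positions = tuple_positions(df.index, (1,))
subset = df.iloc[positions]  # positional selection avoids get_loc entirely
print(positions)
print(subset)
```

This is linear in the length of the index, so it is probably only reasonable for small frames or for debugging.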

Example 2:

import pandas as pd
data = [[0, (1,2), (3,)], [1, (2,3), (4,5)]]
a = pd.DataFrame(data, columns=['a', 'b', 'c']).set_index(['b', 'c'])

print(a)

print(a.loc[[a.index.values[0]]])
print(a.loc[[a.index.values[1]]])

This works for the first row only:

               a
b      c        
(1, 2) (3,)    0
(2, 3) (4, 5)  1
             a
b      c      
(1, 2) (3,)  0
Traceback (most recent call last):
  File "t2.py", line 8, in <module>
    print(a.loc[[a.index.values[1]]])
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 879, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1099, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1037, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1240, in _get_listlike_indexer
    indexer, keyarr = ax._convert_listlike_indexer(key)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/multi.py", line 2388, in _convert_listlike_indexer
    _, indexer = self.reindex(keyarr, level=level)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/multi.py", line 2302, in reindex
    target = ensure_index(target)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 5618, in ensure_index
    return Index(index_like)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 385, in __new__
    return Int64Index(data, copy=copy, dtype=dtype, name=name)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/numeric.py", line 76, in __new__
    raise ValueError("Index data must be 1-dimensional")
ValueError: Index data must be 1-dimensional
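If the tuples are only needed as identifiers, one workaround that might avoid this code path is to turn them into scalar labels before building the MultiIndex. The sketch below uses repr for that, which is an assumption made for illustration and not something the pandas docs recommend; the variable names are hypothetical.

```python
import pandas as pd

data = [[0, (1, 2), (3,)], [1, (2, 3), (4, 5)]]
frame = pd.DataFrame(data, columns=['a', 'b', 'c'])

# Replace each tuple with its string form so every level holds scalars.
for col in ['b', 'c']:
    frame[col] = frame[col].map(repr)

indexed = frame.set_index(['b', 'c'])

# The lookup key is now a tuple of two strings, one per index level.
key = (repr((2, 3)), repr((4, 5)))
row = indexed.loc[[key]]
print(row)
```

The cost is that the index no longer holds real tuples, so they would have to be parsed back (for example with ast.literal_eval) if the original values are needed.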

briangalindoherbert commented 4 years ago

A KeyError I encountered tonight looks very similar. As I write this, I'm thinking I'll first try to work around it with some exception handling, or perhaps my design for aggregating this historical data just needs a rethink :). But maybe this post can provide helpful details for future improvements.

Context: I am analyzing COVID data by U.S. county, keyed on the 5-digit FIPS code and a date timestamp. I start from a current list of counties with their cases and fatalities, then do .loc lookups against a historical dataset (which also has a MultiIndex on FIPS code and date timestamp) in order to add columns for cases and fatalities in the same county one month ago, two months ago, etc.

pandas raises a KeyError for the few counties whose records are missing from my historical dataset. Here is my code snippet, followed by the output of print statements I added for debugging, and finally the error output:

fipslist = list(dfstats.fips.unique())
asof = df.date.max()
for x in iter(fipslist):
    prior_dt: dt.date = asof - dt.timedelta(days=30)
    prior_row = df.loc[(str(x), priormth)]  # df = historical pd.df keyed on fips and date
    dfstats.at[dfstats['fips']==x, 'cases_30'] = prior_row['cases']
    dfstats.at[dfstats['fips']==x, 'deaths_30'] = prior_row['deaths']

[print statements I added to my code: it was running fine until it tried to do a .loc for a county fips code in Alaska which did not have an entry:]

[processing correctly for this record:]

Name: (02198, 2020-09-19 00:00:00), Length: 7, dtype: object
value priormth =2020-09-19 00:00:00
value fips =02220
PRIOR_ROW =fips 02220
date 2020-09-19 00:00:00
county Sitka City and Borough
state Alaska
cases 55
deaths 0
pop NaN

[it choked: could not locate the tuple for multi-index fips, date: ( '02230', '2020-09-19' ) ]

Name: (02220, 2020-09-19 00:00:00), Length: 7, dtype: object
value priormth =2020-09-19 00:00:00
value fips =02230
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 4, in
    prior_row = df_nyt.loc[(str(x), priormth)]
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1762, in __getitem__
    return self._getitem_tuple(key)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1272, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1421, in _getitem_lowerdim
    return getattr(section, self.name)[new_key]
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1762, in __getitem__
    return self._getitem_tuple(key)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1272, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1389, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1965, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexing.py", line 625, in _get_label
    return self.obj._xs(label, axis=axis)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/generic.py", line 3529, in xs
    return self[key]
  File "/usr/local/lib/python3.8/site-packages/pandas/core/frame.py", line 2800, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: Timestamp('2020-09-19 00:00:00')
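For the missing-county case, one option is to test whether the full (fips, date) key is present in the MultiIndex before calling .loc, and to skip or fill the row when it is not. The sketch below assumes a small made-up frame in place of the real historical dataset; the names history and lookup_or_none are hypothetical.

```python
import pandas as pd

# Made-up stand-in for the historical dataset, keyed on fips and date.
history = pd.DataFrame({
    'fips': ['02220', '02220', '02230'],
    'date': pd.to_datetime(['2020-09-19', '2020-10-19', '2020-10-19']),
    'cases': [55, 60, 3],
    'deaths': [0, 0, 0],
}).set_index(['fips', 'date'])


def lookup_or_none(frame, key):
    """Return the row for a full MultiIndex key, or None if it is absent."""
    if key in frame.index:
        # The explicit ':' marks the tuple as a row key, not (row, column).
        return frame.loc[key, :]
    return None


prior = pd.Timestamp('2020-09-19')
found = lookup_or_none(history, ('02220', prior))
missing = lookup_or_none(history, ('02230', prior))
print(found)
print(missing)
```

In the loop from the snippet above, a None result could be used to leave cases_30 and deaths_30 as NaN for that county instead of letting the KeyError propagate.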