pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

DataFrame MultiIndex column access (and pop) #4145

Closed: hayd closed this issue 11 years ago

hayd commented 11 years ago

Suppose I want to access a column in df2 (perhaps there is a neat way, but I also expect these to work):

In [11]: df
Out[11]:
  h1 main  h3 sub  h5
0  a    A   1  A1   1
1  b    B   2  B1   2
2  c    B   3  A1   3
3  d    A   4  B2   4
4  e    A   5  B2   5
5  f    B   6  A2   6

In [12]: df2 = df.set_index(['main', 'sub']).T.sort_index(1)

In [13]: df2
Out[13]:
main  A        B
sub  A1 B2 B2 A1 A2 B1
h1    a  d  e  c  f  b
h3    1  4  5  3  6  2
h5    1  4  5  3  6  2
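For anyone wanting to reproduce this, here is a minimal sketch that rebuilds df and df2 from the output above. The literal values are copied from Out[11]; the only change from In [12] is that the axis is passed as a keyword (`axis=1`), which I believe newer pandas releases expect, so treat that as an assumption about your installed version.

```python
import pandas as pd

# Rebuild the frame shown in Out[11].
df = pd.DataFrame({
    "h1": list("abcdef"),
    "main": ["A", "B", "B", "A", "A", "B"],
    "h3": [1, 2, 3, 4, 5, 6],
    "sub": ["A1", "B1", "A1", "B2", "B2", "A2"],
    "h5": [1, 2, 3, 4, 5, 6],
})

# Same pipeline as In [12], with the axis given by keyword.
df2 = df.set_index(["main", "sub"]).T.sort_index(axis=1)

# ('A', 'B2') appears twice, so the column MultiIndex is not unique.
columns_unique = df2.columns.is_unique

# Positional access to the first column, as in In [14].
first = df2.iloc[:, 0]
```

The non-unique column index is the relevant detail here: the key ('A', 'A1') is itself unique, but it lives in an index that contains duplicates.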

I want to access the column ('A', 'A1'):

In [14]: df2.iloc[:, 0]  # cheating with iloc
In [15]: df2.T.loc[('A', 'A1'), :].iloc[0]  # hacky!
In [16]: df2.iloc[:, df2.columns.get_loc(('A', 'A1')).start]  # very hacky!

In [17]: df2[df2.columns[:1]]  # returns DataFrame

I had assumed/hoped this would work:

In [18]: df2[('A', 'A1')]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-18-c307c7bb3bb8> in <module>()
----> 1 df2[('A', 'A1')]

/Users/234BroadWalk/pandas/pandas/core/frame.pyc in __getitem__(self, key)
   1997             return self._getitem_frame(key)
   1998         elif isinstance(self.columns, MultiIndex):
-> 1999             return self._getitem_multilevel(key)
   2000         else:
   2001             # get column

/Users/234BroadWalk/pandas/pandas/core/frame.pyc in _getitem_multilevel(self, key)
   2036         if isinstance(loc, (slice, np.ndarray)):
   2037             new_columns = self.columns[loc]
-> 2038             result_columns = _maybe_droplevels(new_columns, key)
   2039             if self._is_mixed_type:
   2040                 result = self.reindex(columns=new_columns)

/Users/234BroadWalk/pandas/pandas/core/indexing.pyc in _maybe_droplevels(index, key)
   1103     if isinstance(key, tuple):
   1104         for _ in key:
-> 1105             index = index.droplevel(0)
   1106     else:
   1107         index = index.droplevel(0)

AttributeError: 'Index' object has no attribute 'droplevel'

In [19]: df2[['A', 'A1']]  # interestingly, slightly different error here
KeyError: "['A1'] not in index"

This way is also buggy (it loses the index), which is weird; I've separated this part of the issue as #4146:

In [21]: df2['A']['A1']  # in master but not in 0.11.0
Out[21]:
   0
0  a
1  1
2  1

pop uses this in its implementation, so at the moment it's not possible to pop a column from a MultiIndex.
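To illustrate why pop depends on the lookup, here is a rough sketch of pop as "get the column, then delete it". `pop_column` is a hypothetical helper written for this comment, not the actual pandas source, and the small frame uses a unique column MultiIndex (unlike df2) so that it behaves the same regardless of the bug discussed here.

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [("A", "A1"), ("A", "B2"), ("B", "A1")], names=["main", "sub"]
)
frame = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=cols)


def pop_column(df, key):
    """Hypothetical stand-in for pop: look the column up, then drop it in place."""
    column = df[key]  # this is the __getitem__ call that fails in In [18]
    del df[key]
    return column


popped = pop_column(frame, ("A", "A1"))
remaining = list(frame.columns)
```

If the `df[key]` step raises, the delete never runs, which is why a broken tuple lookup also breaks pop.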

jtratner commented 11 years ago

I don't think df2[['A', 'A1']] should work with this, right? Because that's selecting for two individual columns 'A' and 'A1'. The tuple version ought to. Have you tried bisecting through earlier commits to see where this behavior started occurring?

hayd commented 11 years ago

@jreback No that shouldn't work (just was surprised it was doing something different to the tuple - as I thought it meant the tuple was being captured somehow, presumably as part of a MultiIndex...).

I haven't tried bisecting (ever); will have a go now.

jtratner commented 11 years ago

@hayd Tuple is treated differently than list - tuples are considered a single element for the purposes of indexing, whereas a list is the (only?) way to have your input treated as a list of elements/lookups. So if you want to index into a MultiIndex, you ultimately have to use a tuple to get to it.
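A small sketch of the tuple/list distinction described above, on a frame with a unique column MultiIndex (an assumption made to keep the example independent of the non-unique case in this issue). The names `wide`, `as_series`, `as_frame`, and `partial` are made up for illustration.

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [("A", "A1"), ("A", "B2"), ("B", "A1")], names=["main", "sub"]
)
wide = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=cols)

# A tuple is one key spanning both levels, so it selects a single column.
as_series = wide[("A", "A1")]

# A list is a collection of keys; here it holds one tuple key.
as_frame = wide[[("A", "A1")]]

# A bare label selects on the first level only and drops that level.
partial = wide["A"]

# A list of bare labels asks for two separate top-level columns;
# 'A1' is not a top-level label, so this is expected to raise KeyError.
try:
    wide[["A", "A1"]]
    list_lookup_raised = False
except KeyError:
    list_lookup_raised = True
```

So `wide[("A", "A1")]` and `wide[[("A", "A1")]]` address the same column but differ in return type, while `wide[["A", "A1"]]` is a different request altogether.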

jreback commented 11 years ago

this is all pretty simple, PR coming shortly...basically an oversight

jtratner commented 11 years ago

@jreback wow...that's pretty amazing that you can tell what's wrong so quickly.

jreback commented 11 years ago

nah...it was obvious once the test case hit it that the code was returning an incomplete answer (just the resulting values and not a full block manager); I had written the code so knew where it was (but I guess never explicitly made a test to validate that case)....it's kind of deep...have to have a non-unique column index, and selecting a unique value from it (the original case was to handle selecting the non-unique values)

jreback commented 11 years ago

see pr #4148, df2[['A','A1']] raising is correct, and now df2[[('A','A1')]] works as well