pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: at with non-unique multi-index buggy #7965

Closed ojdo closed 10 years ago

ojdo commented 10 years ago

I fear that I somehow created a DataFrame with a numeric MultiIndex that triggers (un?)intended behaviour in the .loc function. I failed to create a reproducible example with anything but a pickled dump of a DataFrame.

Question: Is this a bug or a user error? I'm confused...

Steps to reproduce

  1. Get http://ojdo.de/tmp/df.pickle (17 kB)
  2. Execute
import pandas as pd
df = pd.read_pickle('df.pickle')
df.loc[(1,199), 'Elec']

Resulting output

Vertex1  Vertex2
1        199        7.602552
Name: Elec, dtype: float64

Expected output

The value on its own: 7.602552

It gets weirder

There must be something between rows 80 and 90 in this DataFrame, because the following snippet yields a single value, while returning a Series if executed with .head(90).

In [20]: df2 = df.head(80)

In [21]: df2.loc[(1,199), 'Elec']
Out[21]: 7.6025524199999994

Installed versions

commit: None
python: 2.7.0.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.14.1
nose: 1.3.3
Cython: None
numpy: 1.8.1
scipy: 0.12.0
statsmodels: None
IPython: 0.13.2
sphinx: None
patsy: None
scikits.timeseries: None
dateutil: 1.5-mpl
pytz: 2012d
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.2.1
openpyxl: 2.0.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.2
html5lib: 1.0b3
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

jreback commented 10 years ago

this has to do with some pretty low-level code that determines whether the label you supply is an indexer. it then returns a scalar or a slice depending on the result.

For some reason in the first example it is returning a slice, but in the second a scalar. Not really sure why. I'd like to reproduce this; do you have the code that generated the frame before it was pickled? I know it's weird, but I'm not sure if this is an implementation issue, a bug, or just not guaranteed.

you can guarantee the result by doing this:

df.loc[[(1,199)],'Elec'] which will always return a Series (never a bare scalar)
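
A minimal sketch of this list-indexer workaround. The frame below is a made-up stand-in for the pickled data (the duplicated label `(1, 200)` and all values except 7.602552 are assumptions for illustration). Wrapping the row label in a list asks `.loc` for a selection of rows, so with a single column label the result stays a Series whether or not the index is unique:

```python
import pandas as pd

# Hypothetical stand-in for the pickled frame: (1, 200) is duplicated,
# (1, 199) is not.
index = pd.MultiIndex.from_tuples(
    [(1, 199), (1, 200), (1, 200)], names=["Vertex1", "Vertex2"]
)
df = pd.DataFrame({"Elec": [7.602552, 1.0, 2.0]}, index=index)

# A list of row labels selects rows, so the result is a container.
selected = df.loc[[(1, 199)], "Elec"]
print(type(selected).__name__, len(selected))

# Pull the scalar out explicitly once the length is known to be 1.
value = selected.iloc[0]
print(value)
```

Taking `.iloc[0]` afterwards makes the "exactly one row" assumption explicit instead of relying on the return type of `.loc`.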

jreback commented 10 years ago

actually this is a 'user' issue.

df.index.is_unique is False
df2.index.is_unique is True
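
To find which labels make an index non-unique, something along these lines may help. It is only a sketch on a small made-up frame, not the pickled data; `Index.duplicated` flags repeated labels, and `keep=False` flags every occurrence rather than only the later ones:

```python
import pandas as pd

# Hypothetical frame with one duplicated MultiIndex entry, (1, 2).
df = pd.DataFrame(
    {"value": [0, 1, 2]},
    index=pd.MultiIndex.from_tuples([(1, 1), (1, 2), (1, 2)]),
)

print(df.index.is_unique)  # False, because (1, 2) appears twice

# keep=False marks every occurrence of a duplicated label, not just the repeats.
dupes = df[df.index.duplicated(keep=False)]
print(dupes)

# Truncating before the second occurrence leaves a unique index,
# which is the same effect .head(80) had on the original frame.
print(df.head(2).index.is_unique)  # True
```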

that said still might be a bug

jreback commented 10 years ago

So this reproduces.

The question is: if you select from a multi-index with a label that matches ONLY a single row, even though the index as a whole is non-unique, should that be treated like selecting from a unique multi-index? (Currently this is NOT true in general; if you have a DataFrame with non-unique columns, selecting a single column gets you back a DataFrame, and not a Series.)

In [3]: df = DataFrame(dict(value = [0,1,2]),index=MultiIndex.from_tuples([(1,1),(1,2),(1,2)]))

In [4]: df2 = DataFrame(dict(value = [0,1,2]),index=MultiIndex.from_tuples([(1,1),(1,2),(1,3)]))

In [5]: df
Out[5]: 
     value
1 1      0
  2      1
  2      2

In [6]: df2
Out[6]: 
     value
1 1      0
  2      1
  3      2

In [7]: df.loc[(1,1),'value']
Out[7]: 
1  1    0
Name: value, dtype: int64

In [8]: df2.loc[(1,1),'value']
Out[8]: 0

In [9]: df.loc[(1,2),'value']
Out[9]: 
1  2    1
   2    2
Name: value, dtype: int64
ojdo commented 10 years ago

Ouch, thank you for spotting this. In that case I think the .loc function is not to blame, but is actually being helpful by ensuring consistent return types throughout a DataFrame.

What I would consider a bug, though, is that for your reproducing example, df.at fails to access row (1,1), even though this row is unique in both DataFrames:

In [10]: df.at[(1,1),'value']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-15c883613e3d> in <module>()
----> 1 df.at[(1,1),'value']

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
   1264
   1265         key = self._convert_key(key)
-> 1266         return self.obj.get_value(*key)
   1267
   1268     def __setitem__(self, key, value):

C:\Python27\lib\site-packages\pandas\core\frame.pyc in get_value(self, index, col)
   1526         series = self._get_item_cache(col)
   1527         engine = self.index._engine
-> 1528         return engine.get_value(series.values, index)
   1529
   1530     def set_value(self, index, col, value):

C:\Python27\lib\site-packages\pandas\index.pyd in pandas.index.IndexEngine.get_value (pandas\index.c:2957)()

C:\Python27\lib\site-packages\pandas\index.pyd in pandas.index.IndexEngine.get_value (pandas\index.c:2772)()

C:\Python27\lib\site-packages\pandas\index.pyd in pandas.index.IndexEngine.get_loc (pandas\index.c:3451)()

C:\Python27\lib\site-packages\pandas\index.pyd in pandas.index.IndexEngine._get_loc_duplicates (pandas\index.c:3747)()

TypeError: only integer arrays with one element can be converted to an index

Maybe the error message could hint at something along the lines of "df.index.is_unique == True is required for successful single-element access"?
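
Until `.at` handles this case, a user-side guard could look roughly like the sketch below. `scalar_at` is a hypothetical helper, not part of pandas; it only relies on `.loc` with a list of row labels and on `.iloc`, and it raises a message of the kind suggested above when the label does not match exactly one row:

```python
import pandas as pd


def scalar_at(frame, row, col):
    """Hypothetical helper: return one value, or fail with a clear message."""
    # Selecting with a list keeps the result a Series even for unique labels.
    matches = frame.loc[[row], col]
    if len(matches) != 1:
        raise ValueError(
            f"label {row!r} matches {len(matches)} rows; "
            "single-element access needs exactly one"
        )
    return matches.iloc[0]


df = pd.DataFrame(
    {"value": [0, 1, 2]},
    index=pd.MultiIndex.from_tuples([(1, 1), (1, 2), (1, 2)]),
)

print(scalar_at(df, (1, 1), "value"))  # the single matching value, 0
```

Calling `scalar_at(df, (1, 2), "value")` on the same frame raises the `ValueError`, since that label matches two rows.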

jreback commented 10 years ago

yes that last is prob a bug

ojdo commented 10 years ago

Just to be sure: The bug seems to be "hidden" in a low-level part that I cannot reach with the Python debugger, right? When I trigger post-mortem %debug in IPython, I cannot step into engine.get_value(series.values, index) to find out what's going on down there. (Bonus question: would 'debugging c extensions' be the right keywords to search for to learn how to debug these parts as well?)

Other than that, would it help if I prepare the reproducing example as a new test case pull request?

jreback commented 10 years ago

it's tricky to debug the cython, I generally just insert print statements as needed.

But that's not really the issue; it was calling a different routine depending on whether the index is unique or not. This is correct (I mean you can argue that a lookup that only returns a single value needs special treatment when the index is non-unique, but that's a different issue, and an API one).

you can certainly do a pull-request to fix the .at issue (which is what this issue now represents). Put in the test cases, see where it fails and fix.

would be great. the indexing code has a lot of paths, but debugging is actually straightforward once you do it a few times.

theandygross commented 10 years ago

I think I may have a similar issue... not sure if it's the same bug or a different one, let me know if it belongs in a different issue.

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df.columns = pd.MultiIndex.from_tuples([(0,1),(1,1),(2,1)])
df.groupby(axis=1, level=[0,1]).first()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-5240a9c3bdf4> in <module>()
      1 df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
      2 df.index = pd.MultiIndex.from_tuples([(0,1),(1,1),(2,1)])
----> 3 df.T.groupby(axis=1, level=[0,1]).first()

/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas-0.14.1.dev-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in f(self)
    109             raise SpecificationError(str(e))
    110         except Exception:
--> 111             result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
    112             if _convert:
    113                 result = result.convert_objects()

/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas-0.14.1.dev-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in aggregate(self, arg, *args, **kwargs)
   2528 
   2529             if self.grouper.nkeys > 1:
-> 2530                 return self._python_agg_general(arg, *args, **kwargs)
   2531             else:
   2532 

/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas-0.14.1.dev-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in _python_agg_general(self, func, *args, **kwargs)
   1081                 output[name] = self._try_cast(values[mask], result)
   1082 
-> 1083         return self._wrap_aggregated_output(output)
   1084 
   1085     def _wrap_applied_output(self, *args, **kwargs):

/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas-0.14.1.dev-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in _wrap_aggregated_output(self, output, names)
   3087             result = result.T
   3088 
-> 3089         return self._reindex_output(result).convert_objects()
   3090 
   3091     def _wrap_agged_blocks(self, items, blocks):

/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas-0.14.1.dev-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in _reindex_output(self, result)
   3129         levels_list = [ ping._group_index for ping in groupings ]
   3130         index = MultiIndex.from_product(levels_list, names=self.grouper.names)
-> 3131         return result.reindex(**{ self.obj._get_axis_name(self.axis) : index, 'copy' : False }).sortlevel()
   3132 
   3133     def _iterate_column_groupbys(self):

/cellar/users/agross/anaconda2/lib/python2.7/site-packages/pandas-0.14.1.dev-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in sortlevel(self, level, axis, ascending, inplace, sort_remaining)
   2811         the_axis = self._get_axis(axis)
   2812         if not isinstance(the_axis, MultiIndex):
-> 2813             raise TypeError('can only sort by level with a hierarchical index')
   2814 
   2815         new_axis, indexer = the_axis.sortlevel(level, ascending=ascending,

TypeError: can only sort by level with a hierarchical index
jreback commented 10 years ago

works fine in 0.14.1

In [38]: df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])

In [39]: df.columns = pd.MultiIndex.from_tuples([(0,1),(1,1),(2,1)])

In [40]: df.groupby(axis=1, level=[0,1]).first()
Out[40]: 
   0  1  2
   1  1  1
0  1  2  3
1  4  5  6
2  7  8  9
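
The same column-wise grouping can also be written without the `axis=1` argument by transposing first. This is only a sketch of an equivalent formulation on the frame above, not a statement about where the failure on master comes from:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df.columns = pd.MultiIndex.from_tuples([(0, 1), (1, 1), (2, 1)])

# Group the transposed frame by its two row levels, then transpose back.
result = df.T.groupby(level=[0, 1]).first().T
print(result)
```
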
theandygross commented 10 years ago

I was on master... switched to 0.14.1 and it works for me as well. Must be a recent thing.

jreback commented 10 years ago

hmm, that IS broken in master, weird. can you open a separate issue for that? thanks!

jreback commented 10 years ago

bug posted here: https://github.com/pydata/pandas/issues/7997

jreback commented 10 years ago

ok the primary is a usage question, bug reported in #7997