pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Pandas indexing bug raises TypeError when slicing with categorical IntervalIndex #21068

Open antipisa opened 6 years ago

antipisa commented 6 years ago

Pandas indexing should not rely on the behavior of subnormal floats inside categorical data. Please bit cast your floats to integers when computing categorical labels: https://github.com/pandas-dev/pandas/blob/648ca95af696266b18ded6bfc5327d0666e3ad23/pandas/core/indexes/interval.py#L56

The following is an example of integer slicing with floating point interval endpoints that should return the first slice of the table:

import pandas as pd
import numpy as np

t = pd.DataFrame(dict(sym=np.arange(2), y=1., z=-1.))
t.loc[:, 'x'] = pd.Series([pd.Interval(-1., 0.0, closed='right'), pd.Interval(0.0, 1, closed='right')])
t.set_index('x', inplace=True)
t.index = pd.Categorical(t.index)
t.loc[t.index.categories[0], :]

Out:
sym    0.0
y      1.0
z     -1.0
Name: (-1.0, 0.0], dtype: float64

However, this fails:

import daz
daz.set_ftz()
daz.set_daz()

t = pd.DataFrame(dict(sym=np.arange(2), y=1., z=-1.))
t.loc[:, 'x'] = pd.Series([pd.Interval(-1., 0.0, closed='right'), pd.Interval(0.0, 1, closed='right')])
t.set_index('x', inplace=True)
t.index = pd.Categorical(t.index)
t.loc[t.index.categories[0], :]

TypeError                                 Traceback (most recent call last)
<ipython-input-3-3a8fe3a302cf> in <module>()
      8 t.set_index('x', inplace=True)
      9 t.index = pd.Categorical(t.index)
---> 10 t.loc[t.index.categories[0], :]

/Users/bohun/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in __getitem__(self, key)
   1365             except (KeyError, IndexError):
   1366                 pass
-> 1367             return self._getitem_tuple(key)
   1368         else:
   1369             # we by definition only have the 0th axis

/Users/bohun/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_tuple(self, tup)
    856     def _getitem_tuple(self, tup):
    857         try:
--> 858             return self._getitem_lowerdim(tup)
    859         except IndexingError:
    860             pass

/Users/bohun/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_lowerdim(self, tup)
    989         for i, key in enumerate(tup):
    990             if is_label_like(key) or isinstance(key, tuple):
--> 991                 section = self._getitem_axis(key, axis=i)
    992 
    993                 # we have yielded a scalar ?

/Users/bohun/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
   1625         # fall thru to straight lookup
   1626         self._has_valid_type(key, axis)
-> 1627         return self._get_label(key, axis=axis)
   1628 
   1629 

/Users/bohun/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _get_label(self, label, axis)
    143             raise IndexingError('no slices here, handle elsewhere')
    144 
--> 145         return self.obj._xs(label, axis=axis)
    146 
    147     def _get_loc(self, key, axis=None):

/Users/bohun/anaconda2/lib/python2.7/site-packages/pandas/core/generic.pyc in xs(self, key, axis, level, drop_level)
   2342                                                       drop_level=drop_level)
   2343         else:
-> 2344             loc = self.index.get_loc(key)
   2345 
   2346             if isinstance(loc, np.ndarray):

/Users/bohun/anaconda2/lib/python2.7/site-packages/pandas/core/indexes/category.pyc in get_loc(self, key, method)
    410         if (codes == -1):
    411             raise KeyError(key)
--> 412         return self._engine.get_loc(codes)
    413 
    414     def get_value(self, series, key):

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: 'slice(0, 2, None)' is an invalid key

This happens because the default behavior for floating-point endpoints forces the interval index lookup to return an integer slice, which the categorical index engine then rejects as a key. This is not ideal.

chris-b1 commented 6 years ago

Your second example is identical to the first, can you check it?

antipisa commented 6 years ago

It is identical except for the three lines I added at the top.

import daz
daz.set_ftz()
daz.set_daz()

Treating denormals as zero causes pandas categorical indexing to break.

chris-b1 commented 6 years ago

I'm not especially familiar with denormals, or with why treating them as zero is desirable and something we should support. Can you fill in a bit more of what exactly is going on? Additionally, or alternatively, a PR is welcome if you know what is needed.

antipisa commented 6 years ago

Setting the denormals-are-zero (DAZ) and flush-to-zero (FTZ) flags converts subnormal numbers to zero. Because pandas categorical indexing relies on the behavior of subnormal floats, t.loc[t.index.categories[0], :] breaks: it can no longer locate the first float interval. Categorical labels should not behave this way; you should bit cast your floats to integers if the index is a categorical interval. It would also improve the performance of interval slicing. See https://github.com/numpy/numpy/issues/4581
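
As a rough illustration of the bit-cast idea, the sketch below maps each float64 endpoint to an integer key that sorts in the same order as the float, so neighbors can be found with integer arithmetic alone. It uses only the standard library. The helper name float_to_ordered_int is hypothetical, NaN is not handled, and this is not how pandas computes categorical labels.

```python
import struct

SIGN_MASK = 0x7FFFFFFFFFFFFFFF  # keeps the 63 magnitude bits of a 64-bit pattern


def float_to_ordered_int(x: float) -> int:
    """Hypothetical helper: map a float64 to an int that sorts like the float."""
    # Reinterpret the 8 bytes of the float as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Non-negative floats already sort correctly by bit pattern. Negative
    # floats sort in reverse, so negate their magnitude bits instead.
    return bits if bits >= 0 else -(bits & SIGN_MASK)


# Interval endpoints from the example in this issue.
endpoints = [-1.0, 0.0, 1.0]
keys = [float_to_ordered_int(e) for e in endpoints]
print(keys == sorted(keys))

# The float just above 0.0 (the smallest subnormal) has key 1, which is
# reached here by integer addition rather than floating-point arithmetic.
print(float_to_ordered_int(0.0) + 1 == float_to_ordered_int(5e-324))
```

Because the keys are plain integers, comparing or incrementing them does not depend on the FTZ or DAZ flags.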

chris-b1 commented 6 years ago

I'm currently on a Windows machine, which dax evidently won't install on, but am I understanding correctly that, with the flags set, the issue is that np.nextafter(0., np.inf) will return 0?

antipisa commented 6 years ago

*daz not dax. Yes that is correct. It treats subnormals as zero.
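
The effect confirmed above can be imitated without daz. The sketch below is a simulation only: flush_subnormal and next_label are hypothetical helpers, the flush is done in plain Python rather than through the CPU flags, and math.nextafter (Python 3.9+) stands in for np.nextafter. It shows that the float just above 0.0 is subnormal, so once subnormals are treated as zero the "next label" is no longer greater than 0.0.

```python
import math
import sys

SMALLEST_NORMAL = sys.float_info.min  # smallest positive normal float64


def flush_subnormal(x: float) -> float:
    """Hypothetical stand-in for FTZ/DAZ: treat subnormal values as zero."""
    return 0.0 if 0.0 < abs(x) < SMALLEST_NORMAL else x


def next_label(label: float, flush: bool = False) -> float:
    """Next representable float above label, optionally flushed to zero."""
    nxt = math.nextafter(label, math.inf)  # requires Python 3.9+
    return flush_subnormal(nxt) if flush else nxt


# Without flushing, the next label is strictly above 0.0.
print(next_label(0.0) > 0.0)

# With simulated flushing, the next label collapses back onto 0.0.
print(next_label(0.0, flush=True) > 0.0)
```

Endpoints whose neighbors are normal floats, such as 1.0, are unaffected by the simulated flush, which is consistent with only the interval boundary at 0.0 being involved here.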