Closed: wesm closed this 11 years ago
Same idea came up on this mailing list thread.
Yes, I would voice support for a general index that keeps the original index dtype. I used a float index (it was time in seconds) and was delighted when everything, including df.plot(), worked swimmingly. But then I wasted 30 minutes figuring out why pylab.exp(df.index.values) was failing with the mysterious AttributeError: exp. pandas and Python normally make things so pleasant, but unexpected behavior like this reminds me of my dark days debugging C :(
@kghose: I agree. I also use indices to store things like time in seconds (e.g. oscilloscope traces), and am constantly having to write array(df.index.values, dtype=float) in place of a simple df.index.values to get an array that I can use with scipy fitting functions. It's an awkward idiom.
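The workaround being described is just an explicit cast; a minimal sketch (the DataFrame and column name here are made up, and on a modern pandas the index is already float64, so the cast is a no-op):

```python
import numpy as np
import pandas as pd

# hypothetical data with a float "time" index; at the time this thread
# was written, pandas stored such an index with object dtype
df = pd.DataFrame({"y": [1.0, 2.0, 3.0]}, index=[0.1, 0.2, 0.3])

# df.index.values can come back as an object array, which ufuncs like
# np.exp reject; an explicit cast yields an array numpy/scipy accept
x = np.asarray(df.index.values, dtype=float)
print(np.exp(x))
```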
+1 here. The matplotlib issue has tripped me up a number of times when I needed to make custom plots.
See http://pandas.pydata.org/pandas-docs/dev/indexing.html#fallback-indexing; it is rarely necessary to actually use a float index, and you are often better served by using a column. The point of the index is to make looking up individual elements fast, e.g. df[1.0], but with floats this is quite tricky; that is the reason for having an issue about this.
Yes, it's true that whether two floats are the same depends on precision, but it's nice to be able to have that as a time index.
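For anyone wondering why exact float lookups are considered fragile, a quick illustration (plain numpy, nothing pandas-specific):

```python
import numpy as np

# 0.1 + 0.2 is not exactly 0.3 in binary floating point,
# so an exact-match lookup on a float index can silently miss
print(0.1 + 0.2 == 0.3)        # False
print(abs((0.1 + 0.2) - 0.3))  # tiny but nonzero

# tolerance-based matching is the robust alternative
idx = np.array([0.0, 0.1 + 0.2, 0.6])
hit = np.flatnonzero(np.isclose(idx, 0.3))
print(hit)  # position 1 is found despite the rounding error
```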
In my cases I don't really care about being able to select via getitem-style indexing; I usually want to loop over index/series pairs, or I have them in a frame that I want to show as an image with the index in the columns. The object dtype makes matplotlib show the index to full precision, which is really annoying since I then have to go in and format the tick labels by hand. I wholeheartedly agree that float indexes are to be avoided, but sometimes they make sense. My cases are mostly plotting issues, which only matters when I can't use pandas' plotting abilities, which thankfully isn't that often.
@kghose consider using a datetime64[ns] index (if you are dealing with time), or, as I said, use it as a column; you can do nearly everything you need (with an occasional set_index/reset_index). What are you trying to do? As @cpcloud indicates, the only real issue with not having a FloatIndex typed as float is plotting (without manual conversion).
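A sketch of the column idiom @jreback describes (the column names "t" and "v" are made up):

```python
import numpy as np
import pandas as pd

# keep the float axis as an ordinary column...
df = pd.DataFrame({"t": np.arange(5) * 2.5, "v": np.arange(5)})

# ...do range selection with a boolean mask instead of index lookups
sel = df[(df.t >= 2.0) & (df.t <= 5.0)]
print(sel)

# and temporarily promote it to the index when an API needs one
by_t = df.set_index("t")
df_again = by_t.reset_index()
```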
General index dtype retention is probably not worth the amount of complexity and code that it would require to do it right. Datetime indexes are your friend. @jreback what about attempting coercion of object indexes when accessing the values attribute?
Something like this is pretty easy (@cpcloud, we can't change the way values works or everything breaks). Is this useful?
In [1]: idx = pd.Index(np.arange(10).astype('float64'))
In [3]: idx
Out[3]: Index([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], dtype=object)
In [4]: idx.inferred_values
Out[4]: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
Of course for datetimes you get this
In [1]: idx = date_range('20130101',periods=5)
In [2]: idx
Out[2]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-05 00:00:00]
Length: 5, Freq: D, Timezone: None
In [3]: idx.inferred_values
Out[3]:
array([1356998400000000000, 1357084800000000000, 1357171200000000000,
1357257600000000000, 1357344000000000000])
Well, it's consistent... But it looks like it would only be useful in the float case... What would strings return? Shouldn't dates return an array of datetimes?
It could return anything; a datetime64[ns] numpy array, for example, is easy enough, and strings would return the same as now (an object array).
Under numpy 1.7 (this is the same as .values, though):
In [5]: x = date_range('20130101',periods=5)
In [6]: x.inferred_values
Out[6]:
array(['2012-12-31T19:00:00.000000000-0500',
'2013-01-01T19:00:00.000000000-0500',
'2013-01-02T19:00:00.000000000-0500',
'2013-01-03T19:00:00.000000000-0500',
'2013-01-04T19:00:00.000000000-0500'], dtype='datetime64[ns]')
In [7]: x.inferred_values[0]
Out[7]: numpy.datetime64('2012-12-31T19:00:00.000000000-0500')
@cpcloud I think you are right, only float is different...
I mean... I don't feel super strongly about this, since it seems like there are so few use cases for float indices. I do think that it should return the "highest level" dtype that can be represented by numpy, e.g. return dates as dates like you show, if this is going to be done. Again though, inferred_values will be the same as values in every case except float, and maybe you could return a 2D array for MultiIndex...
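If a 2D array for MultiIndex were wanted, something like this stacking would do it (a sketch, not an existing API):

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([[0.1, 0.2], ["a", "b"]])

# a MultiIndex iterates as tuples; stacking them gives the suggested
# 2-D layout (dtype=object, since the levels can mix floats and strings)
arr2d = np.array([list(t) for t in mi], dtype=object)
print(arr2d.shape)  # (4, 2)
```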
I've been using float indices a lot, so I would love inferred_values, or some property that gives you back an array cast to the same dtype as the one originally passed to the index= keyword argument.
A time axis is not the only use case for a float index; sometimes I work with spectral data where the X axis is a floating point value representing frequency or wavelength.
Do this somewhere in your code (before you use it!). This is a monkey patch:

import numpy as np
import pandas as pd

def inferred_values(self):
    # object-dtype indexes that actually hold floats get cast back to float64;
    # everything else passes through unchanged
    if self.inferred_type == 'floating':
        return np.asarray(self, dtype='float64')
    return np.asarray(self)

pd.Index.inferred_values = property(inferred_values)

In [12]: idx = pd.Index(np.arange(5) + 0.1)

In [13]: idx
Out[13]: Index([0.1, 1.1, 2.1, 3.1, 4.1], dtype=object)

In [14]: idx.inferred_values
Out[14]: array([ 0.1,  1.1,  2.1,  3.1,  4.1])
> A time axis is not the only use case for a float index; sometimes I work with spectral data where the X axis is a floating point value representing frequency or wavelength.
You're absolutely right; I also use it for things other than time. It would be great if there were some way to integrate pandas with quantities, but that's probably a long way away...
I (inadvertently) started a thread about this on the pystatsmodels list; thread link: https://groups.google.com/forum/#!topic/pystatsmodels/ua7WpNd-U8Q
My use case is also for time values (and DatetimeIndex is not useful for a variety of reasons, most notably that all I have are deltas against some unknown epoch defined as "whenever someone hit the record button"). My concern isn't so much having a useful .values attribute (though I guess that might be nice too!) as having a reliable way to do time-based indexing, mostly for ad hoc interactive use. The main feature I'm looking for is value-based slicing, e.g. plot(df.loc[:1000, "P4"]). For this kind of usage, no-one cares whether a sample was taken at exactly 1000 milliseconds or not. Currently .ix does interpret floating point slices like this, but .loc does not.
@njsmith I made a couple of minor changes to .loc to get the following behavior, which I believe is still consistent with label-based indexing but does NOT fall back to positions (the end-points of a slice are allowed to simply not be in the index, which is slightly inconsistent, but they select on a label basis, so I think that is ok). Pls review and lmk.
In [1]: s = Series(np.arange(5), index=np.arange(5) * 2.5)
In [2]: s
Out[2]:
0.0 0
2.5 1
5.0 2
7.5 3
10.0 4
dtype: int64
In [3]: # label based slicing
In [4]: s[1.0:3.0]
Out[4]:
2.5 1
dtype: int64
In [5]: s.ix[1.0:3.0]
Out[5]:
2.5 1
dtype: int64
In [6]: s.loc[1.0:3.0]
Out[6]:
2.5 1
dtype: int64
In [7]: # exact indexing when found
In [8]: s[5.0]
Out[8]: 2
In [9]: s.loc[5.0]
Out[9]: 2
In [10]: s.ix[5.0]
Out[10]: 2
In [11]: # non-fallback location based should raise this error (__getitem__,ix fallback here)
In [12]: s.loc[4.0]
KeyError: 'the label [4.0] is not in the [index]'
In [13]: s[4.0] == s[4]
Out[13]: True
In [14]: s[4] == s[4]
Out[14]: True
# confusing slicing patterns in __getitem__/ix, loc is clear
In [15]: s.loc[2.0:5.0]
Out[15]:
2.5 1
5.0 2
dtype: int64
In [16]: s.loc[2.0:5]
Out[16]:
2.5 1
5.0 2
dtype: int64
In [17]: s.loc[2.1:5]
Out[17]:
2.5 1
5.0 2
dtype: int64
In [18]: # these are what __getitem__/ix does
In [19]: s.ix[2.0:5.0]
Out[19]:
2.5 1
5.0 2
dtype: int64
In [20]: s.ix[2.0:5]
Out[20]:
5.0 2
7.5 3
10.0 4
dtype: int64
In [21]: s.ix[2.1:5]
Out[21]:
2.5 1
5.0 2
dtype: int64
In [22]: s[2.0:5.0]
Out[22]:
2.5 1
5.0 2
dtype: int64
In [23]: s[2.0:5]
Out[23]:
5.0 2
7.5 3
10.0 4
dtype: int64
In [24]: s[2.1:5]
Out[24]:
2.5 1
5.0 2
dtype: int64
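The non-fallback .loc semantics above can be restated in terms of searchsorted on a sorted float index; a sketch of the equivalence (label_slice is a made-up helper, not a pandas API):

```python
import numpy as np

idx = np.array([0.0, 2.5, 5.0, 7.5, 10.0])

def label_slice(index, lo, hi):
    # left edge: first position >= lo; right edge: one past the last position <= hi;
    # neither endpoint needs to actually be present in the index
    start = np.searchsorted(index, lo, side="left")
    stop = np.searchsorted(index, hi, side="right")
    return index[start:stop]

print(label_slice(idx, 2.1, 5))    # [2.5 5.] -- endpoints absent, still works
print(label_slice(idx, 2.0, 5.0))  # [2.5 5.] -- matches s.loc[2.0:5.0] above
```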
cc @dragoljub, cc @nehalecky: you guys have had an interest in indexing in the past; not sure if you have any comments wrt this.
BTW, I'd be in favor of a Float64Index class that specifically implemented the limited subset of operations that make sense for floating point indices, and not the stuff that doesn't. So e.g. trying to groupby would just be an error, and I can even see the argument for making scalar indexing an error. This would be much safer than the current situation, making @jreback happy :-). But stuff like slicing and .values could still act the way people want, making users happy too.
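A rough sketch of what such a restricted class could look like (FloatAxis and its methods are hypothetical, not the pandas API):

```python
import numpy as np

class FloatAxis:
    """A float index supporting only what makes sense for floats:
    range slicing and a typed .values, with no exact scalar lookup."""

    def __init__(self, values):
        # always store a real float64 array, so .values needs no coercion
        self.values = np.asarray(values, dtype="float64")

    def slice_locs(self, lo, hi):
        # range queries are well-defined for floats (unlike exact equality)
        start = np.searchsorted(self.values, lo, side="left")
        stop = np.searchsorted(self.values, hi, side="right")
        return start, stop

    def get_loc(self, key):
        # exact float lookup is precision-sensitive; refuse it outright
        raise TypeError("exact scalar indexing not supported on a float axis")

ax = FloatAxis([0.0, 2.5, 5.0])
print(ax.slice_locs(2.0, 5.0))  # (1, 3)
```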
For much of the data I work with I have been OK with using Object/Int64 index types; however, I do also keep a copy of my indexers as data columns to enable easier plotting/slicing in some cases.
IMO, anything that enables a smoother interface to matplotlib, Galry, or scikit-learn gets my :+1:
Idea from conversation with @CRP in #235