Closed: wesm closed this 11 years ago
Same idea came up on this mailing list thread.
Yes, I would voice support for a general index that keeps the original index dtype. I used a float index (it was time in seconds) and was delighted when everything, including df.plot(), worked swimmingly. But then I wasted 30 minutes figuring out why pylab.exp(df.index.values) was failing with the mysterious AttributeError: exp. pandas and Python normally make things so pleasant, but unexpected behavior like this reminds me of my dark days debugging C :(
@kghose: I agree. I also use indices to store things like time in seconds (e.g. oscilloscope traces), and am constantly having to write array(df.index.values, dtype=float) in place of a simple df.index.values to get an array that I can use with scipy fitting functions. It's an awkward idiom.
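The workaround being described is just an explicit cast; a minimal sketch (the DataFrame and column name here are made up, and on a modern pandas the index is already float64, so the cast is a no-op):

```python
import numpy as np
import pandas as pd

# hypothetical data with a float "time" index; at the time this thread
# was written, pandas stored such an index with object dtype
df = pd.DataFrame({"y": [1.0, 2.0, 3.0]}, index=[0.1, 0.2, 0.3])

# df.index.values can come back as an object array, which ufuncs like
# np.exp reject; an explicit cast yields an array numpy/scipy accept
x = np.asarray(df.index.values, dtype=float)
print(np.exp(x))
```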
+1 here. The matplotlib issue has tripped me up a number of times when I needed to make custom plots.
See http://pandas.pydata.org/pandas-docs/dev/indexing.html#fallback-indexing; it is rarely necessary to actually use a float index, and you are often better served by using a column. The point of the index is to make looking up individual elements fast, e.g. df[1.0], but with floats this is quite tricky; that is the reason for having an issue about this.
Yes, it's true that whether two floats are the same depends on precision, but it's nice to be able to have that as a time index.
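For anyone wondering why exact float lookups are considered fragile, a quick illustration (plain numpy, nothing pandas-specific):

```python
import numpy as np

# 0.1 + 0.2 is not exactly 0.3 in binary floating point,
# so an exact-match lookup on a float index can silently miss
print(0.1 + 0.2 == 0.3)        # False
print(abs((0.1 + 0.2) - 0.3))  # tiny but nonzero

# tolerance-based matching is the robust alternative
idx = np.array([0.0, 0.1 + 0.2, 0.6])
hit = np.flatnonzero(np.isclose(idx, 0.3))
print(hit)  # position 1 is found despite the rounding error
```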
In my cases I don't really care about being able to select via getitem-style indexing; I usually want to loop over index/series pairs, or I have them in a frame that I want to show as an image with the index in the columns. The object dtype makes matplotlib show the index to full precision, which is really annoying since I then have to go in and format the tick labels by hand. I wholeheartedly agree that float indexes are to be avoided, but sometimes they make sense. My cases are mostly plotting issues, which only matters when I can't use pandas' plotting abilities, which thankfully isn't that often.
@kghose consider using a datetime64[ns] index (if you are dealing with time), or, as I said, use it as a column; you can do nearly everything you need (with an occasional set_index/reset_index). What are you trying to do? As @cpcloud indicates, the only real issue with not having a FloatIndex typed as float is plotting (without manual conversion).
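A sketch of the column idiom @jreback describes (the column names "t" and "v" are made up):

```python
import numpy as np
import pandas as pd

# keep the float axis as an ordinary column...
df = pd.DataFrame({"t": np.arange(5) * 2.5, "v": np.arange(5)})

# ...do range selection with a boolean mask instead of index lookups
sel = df[(df.t >= 2.0) & (df.t <= 5.0)]
print(sel)

# and temporarily promote it to the index when an API needs one
by_t = df.set_index("t")
df_again = by_t.reset_index()
```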
General index dtype retention is probably not worth the amount of complexity and code that it would require to do it right. Datetime indexes are your friend. @jreback what about attempting coercion of object indexes when accessing the values attribute?
Something like this is pretty easy (@cpcloud, we can't change the way values works or everything breaks). Is this useful?
In [1]: idx = pd.Index(np.arange(10).astype('float64'))
In [3]: idx
Out[3]: Index([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], dtype=object)
In [4]: idx.inferred_values
Out[4]: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
Of course for datetimes you get this
In [1]: idx = date_range('20130101',periods=5)
In [2]: idx
Out[2]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-05 00:00:00]
Length: 5, Freq: D, Timezone: None
In [3]: idx.inferred_values
Out[3]:
array([1356998400000000000, 1357084800000000000, 1357171200000000000,
1357257600000000000, 1357344000000000000])
Well, it's consistent... But it looks like it would only be useful in the float case... What would strings return? Shouldn't dates return an array of datetimes?
It could return anything; a datetime64[ns] numpy array, for example, is easy enough, and strings would return the same as now (an object array).
Under numpy 1.7 (this is the same as .values, though):
In [5]: x = date_range('20130101',periods=5)
In [6]: x.inferred_values
Out[6]:
array(['2012-12-31T19:00:00.000000000-0500',
'2013-01-01T19:00:00.000000000-0500',
'2013-01-02T19:00:00.000000000-0500',
'2013-01-03T19:00:00.000000000-0500',
'2013-01-04T19:00:00.000000000-0500'], dtype='datetime64[ns]')
In [7]: x.inferred_values[0]
Out[7]: numpy.datetime64('2012-12-31T19:00:00.000000000-0500')
@cpcloud I think you are right, only float is different...
I mean... I don't feel super strongly about this, since it seems like there are so few use cases for float indices. I do think that it should return the "highest level" dtype that can be represented by numpy, e.g. return dates as dates like you show, if this is going to be done. Again though, inferred_values will be the same as values in every case except float, and maybe you could return a 2D array for MultiIndex...
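If a 2D array for MultiIndex were wanted, something like this stacking would do it (a sketch, not an existing API):

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([[0.1, 0.2], ["a", "b"]])

# a MultiIndex iterates as tuples; stacking them gives the suggested
# 2-D layout (dtype=object, since the levels can mix floats and strings)
arr2d = np.array([list(t) for t in mi], dtype=object)
print(arr2d.shape)  # (4, 2)
```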
I've been using float indices a lot, so I would love inferred_values, or some property that gives you back an array cast to the same dtype as the one originally passed to the index= keyword argument.
A time axis is not the only use case for a float index; sometimes I work with spectral data where the X axis is a floating point value representing frequency or wavelength.
Do this somewhere in your code (before you use it!). This is a monkey patch:

import numpy as np
import pandas as pd

def inferred_values(self):
    # object-dtype indexes that actually hold floats get cast back to float64;
    # everything else passes through unchanged
    if self.inferred_type == 'floating':
        return np.asarray(self, dtype='float64')
    return np.asarray(self)

pd.Index.inferred_values = property(inferred_values)

In [12]: idx = pd.Index(np.arange(5) + 0.1)

In [13]: idx
Out[13]: Index([0.1, 1.1, 2.1, 3.1, 4.1], dtype=object)

In [14]: idx.inferred_values
Out[14]: array([ 0.1,  1.1,  2.1,  3.1,  4.1])
> A time axis is not the only use case for a float index; sometimes I work with spectral data where the X axis is a floating point value representing frequency or wavelength.
You're absolutely right; I also use it for things other than time. It would be great if there were some way to integrate pandas with quantities, but that's probably a long way away...
I (inadvertently) started a thread about this on the pystatsmodels list; thread link: https://groups.google.com/forum/#!topic/pystatsmodels/ua7WpNd-U8Q
My use case is also for time values (and DatetimeIndex is not useful for a variety of reasons, most notably that all I have are deltas against some unknown epoch defined as "whenever someone hit the record button"). My concern isn't so much having a useful .values attribute (though I guess that might be nice too!) as having a reliable way to do time-based indexing, mostly for ad hoc interactive use. The main feature I'm looking for is value-based slicing, e.g. plot(df.loc[:1000, "P4"]). For this kind of usage, no-one cares whether a sample was taken at exactly 1000 milliseconds or not. Currently .ix does interpret floating point slices like this, but .loc does not.
@njsmith I made a couple of minor changes to .loc to get the following behavior, which I believe is still consistent with label-based indexing but does NOT fall back to positions (the end-points of a slice are allowed to simply not be in the index, which is slightly inconsistent, but they select on a label basis, so I think that is ok). Pls review and lmk.
In [1]: s = Series(np.arange(5), index=np.arange(5) * 2.5)
In [2]: s
Out[2]:
0.0 0
2.5 1
5.0 2
7.5 3
10.0 4
dtype: int64
In [3]: # label based slicing
In [4]: s[1.0:3.0]
Out[4]:
2.5 1
dtype: int64
In [5]: s.ix[1.0:3.0]
Out[5]:
2.5 1
dtype: int64
In [6]: s.loc[1.0:3.0]
Out[6]:
2.5 1
dtype: int64
In [7]: # exact indexing when found
In [8]: s[5.0]
Out[8]: 2
In [9]: s.loc[5.0]
Out[9]: 2
In [10]: s.ix[5.0]
Out[10]: 2
In [11]: # non-fallback location based should raise this error (__getitem__,ix fallback here)
In [12]: s.loc[4.0]
KeyError: 'the label [4.0] is not in the [index]'
In [13]: s[4.0] == s[4]
Out[13]: True
In [14]: s[4] == s[4]
Out[14]: True
# confusing slicing patterns in __getitem__/ix, loc is clear
In [15]: s.loc[2.0:5.0]
Out[15]:
2.5 1
5.0 2
dtype: int64
In [16]: s.loc[2.0:5]
Out[16]:
2.5 1
5.0 2
dtype: int64
In [17]: s.loc[2.1:5]
Out[17]:
2.5 1
5.0 2
dtype: int64
In [18]: # these are what __getitem__/ix does
In [19]: s.ix[2.0:5.0]
Out[19]:
2.5 1
5.0 2
dtype: int64
In [20]: s.ix[2.0:5]
Out[20]:
5.0 2
7.5 3
10.0 4
dtype: int64
In [21]: s.ix[2.1:5]
Out[21]:
2.5 1
5.0 2
dtype: int64
In [22]: s[2.0:5.0]
Out[22]:
2.5 1
5.0 2
dtype: int64
In [23]: s[2.0:5]
Out[23]:
5.0 2
7.5 3
10.0 4
dtype: int64
In [24]: s[2.1:5]
Out[24]:
2.5 1
5.0 2
dtype: int64
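The non-fallback .loc semantics above can be restated in terms of searchsorted on a sorted float index; a sketch of the equivalence (label_slice is a made-up helper, not a pandas API):

```python
import numpy as np

idx = np.array([0.0, 2.5, 5.0, 7.5, 10.0])

def label_slice(index, lo, hi):
    # left edge: first position >= lo; right edge: one past the last position <= hi;
    # neither endpoint needs to actually be present in the index
    start = np.searchsorted(index, lo, side="left")
    stop = np.searchsorted(index, hi, side="right")
    return index[start:stop]

print(label_slice(idx, 2.1, 5))    # [2.5 5.] -- endpoints absent, still works
print(label_slice(idx, 2.0, 5.0))  # [2.5 5.] -- matches s.loc[2.0:5.0] above
```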
cc @dragoljub, cc @nehalecky: you guys have had an interest in indexing in the past; not sure if you have any comments wrt this.
BTW, I'd be in favor of a Float64Index class that specifically implemented the limited subset of operations that make sense for floating point indices, and not the stuff that doesn't. So e.g. trying to groupby would just be an error, and I can even see the argument for making scalar indexing an error. This would be much safer than the current situation, making @jreback happy :-). But stuff like slicing and .values could still act the way people want, making users happy too.
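A rough sketch of what such a restricted class could look like (FloatAxis and its methods are hypothetical, not the pandas API):

```python
import numpy as np

class FloatAxis:
    """A float index supporting only what makes sense for floats:
    range slicing and a typed .values, with no exact scalar lookup."""

    def __init__(self, values):
        # always store a real float64 array, so .values needs no coercion
        self.values = np.asarray(values, dtype="float64")

    def slice_locs(self, lo, hi):
        # range queries are well-defined for floats (unlike exact equality)
        start = np.searchsorted(self.values, lo, side="left")
        stop = np.searchsorted(self.values, hi, side="right")
        return start, stop

    def get_loc(self, key):
        # exact float lookup is precision-sensitive; refuse it outright
        raise TypeError("exact scalar indexing not supported on a float axis")

ax = FloatAxis([0.0, 2.5, 5.0])
print(ax.slice_locs(2.0, 5.0))  # (1, 3)
```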
For much of the data I work with I have been OK with using Object/Int64 index types; however, I do also keep a copy of my indexers as data columns to enable easier plotting/slicing in some cases.
IMO, anything that enables a smoother interface to matplotlib, Galry, or scikit-learn gets my :+1:
Idea from conversation with @CRP in #235