pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API: revisit adding datetime-like ops in Series #7207

Closed (jreback closed this 10 years ago)

jreback commented 10 years ago

related: #7146, #7206

maybe provide a to_index() function and/or a class-like accessor

.values_to_index (too long though)

jreback commented 10 years ago

@cpcloud @jorisvandenbossche

cc @rosnfeld cc @mtkni

continue the discussion here (but we'll consider it for 0.14.1)

cpcloud commented 10 years ago

i think to_index is good. never really use these that much. agree that the Index type dependence could be a bit of a special case and possibly confusing to new users.

jorisvandenbossche commented 10 years ago

I agree with @mtkni that adding a to_index() method does not really solve the issue (but apart from this issue, it can be a good enhancement to add such a method).

The problem in mindset with the datetime accessors, when you have a datetime column in a dataframe, is that you first have to make an index of it before you can access the datetime fields. If you don't know the background, this is not logical: "Why should I make an index of it if I only need the day values?" Whether this is done with DatetimeIndex(s) or with s.to_index() does not really matter that much in this sense (although I agree the latter makes it a bit easier/shorter).

I also agree that you 'expect' at first sight that a series method acts on the values, not on the index (for most methods; there are of course index-specific methods too). So I think it is good for now to disable this (as you did in #7206) until we decide on the API.
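
To make the current pattern concrete (a minimal, illustrative sketch):

import pandas as pd

s = pd.Series(pd.date_range("2013-01-01", periods=3))

# today: build a DatetimeIndex just to get at the datetime fields
pd.DatetimeIndex(s).day    # -> [1, 2, 3], the day part of each value

# what a user would naively expect to be able to write for the values:
# s.day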

jorisvandenbossche commented 10 years ago

What would you actually think of having the attributes act on the Series values (so having both s.year and s.index.year)? Or would that be too easy to confuse?

jreback commented 10 years ago

that's what I had originally; but it IS ambiguous if you have a datetime index AND datetime values on a Series (not that common but common enough I guess).

jorisvandenbossche commented 10 years ago

In that case it is indeed a bit ambiguous, but I think less ambiguous than the other way around (the attributes acting on the index). And it would also be useful functionality to have such attributes on the datetime64 values of a series/column.

jreback commented 10 years ago

ahh.. you want them to act on the VALUES? hmm, don't like the s.to_index().day approach (which IMHO is pretty explicit)? ...maybe we could make a property like s.vindex.day or something

rosnfeld commented 10 years ago

Just a comment on my motivation here (I can't speak for @mtkni, though curious to hear other use cases):

I have worked with various datasets with multiple datetime columns, where none of them should really be the "index", so I find this to be somewhat common. E.g. humanitarian funding data might have various "state transition" dates for each entry (date funding was pledged, date it was transferred, date it was deployed towards a project, etc). Or a big dataset of projects might have start and end dates. So it's not that I'm creating standalone Series with date values, but the columns of a DataFrame often fit the bill.

Maybe I want to extract the year or month from each of these columns, or maybe somebody would want to tz_localize/tz_convert. Writing projects.end_date.year seems natural and yet it doesn't work, and the error is confusing - why would that operation have anything to do with the index? It looks like it's an operation on the column.
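
A hypothetical frame of that shape (column names invented for illustration):

import pandas as pd

projects = pd.DataFrame({
    "pledged":  pd.to_datetime(["2013-01-05", "2013-02-11"]),
    "deployed": pd.to_datetime(["2013-03-01", "2013-04-15"]),
})

# neither column is a natural index, yet getting the year out of one
# currently means wrapping it in a DatetimeIndex first
pd.DatetimeIndex(projects["pledged"]).year    # -> [2013, 2013]

# whereas projects.pledged.year reads naturally but does not work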

Requiring conversion to an index is surprising, as I (and likely others who don't understand the code as well) don't understand what is "special" about an index (or special about datetimes).

I'd be curious to know how others view this.

jorisvandenbossche commented 10 years ago

@jreback yes indeed the values :-)

That having to convert it to an index is strange is what I wanted to say in my comment above (https://github.com/pydata/pandas/issues/7207#issuecomment-43934559), and it is also what @rosnfeld argues.

For me it is also rather common to have datetime values in columns, eg start and end times of measurements.

I agree that having a datelike index and non-datelike values is probably by far the more common use case; for that case it is not that much more work (and more explicit) to type s.index.day

To make it concrete, this would give something like this:

In [1]: s = pd.Series(pd.date_range("2013-01-01 09:00:00", periods=3))

In [2]: s
Out[2]:
0   2013-01-01 09:00:00
1   2013-01-02 09:00:00
2   2013-01-03 09:00:00
dtype: datetime64[ns]

In [3]: s.day
0    1
1    2
2    3
dtype: int64
jreback commented 10 years ago

@rosnfeld ok, so you like the idea of using datetime-like ops (e.g. year,month,seconds...etc)

on the VALUES of a series (e.g. s.year)

rather than having them act on the index (which s.index.year will accomplish in any event)

so going back to common methods/properties for index/series (so they act similarly, a good thing).

Note that there are currently several methods that DO NOT do this, e.g. at_time/between_time/tz_localize/tz_convert, which act on the index! (and the removed Series.weekday).

note also that tshift/asfreq acts on the index (but is more of a reshaping operation so that seems ok)

TomAugspurger commented 10 years ago

Just to confirm, with a Series with datetime values and a datetime Index, the ops will still operate on the values right?

If so then +1 from me on operating on the values.

shoyer commented 10 years ago

Instead of adding to_index() just so you can write s.to_index().day, what about creating some sort of "DatetimeArray" which separates out the part of DatetimeIndex which fixes np.datetime64 quirks from the Index part? This seems like the cleaner, more modular solution. DatetimeArray would implement properties like .day without the need to create an Index.

Not having looked at DatetimeIndex internals very carefully, I'm not sure how painful this would be. Of course, it would be better still to fix datetime64 upstream in numpy...
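
Very roughly, to make that idea concrete (DatetimeArray here is a hypothetical class; it leans on the existing DatetimeIndex machinery internally):

import numpy as np
import pandas as pd

class DatetimeArray(object):
    """Hypothetical index-free container for datetime64 values."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype="datetime64[ns]")

    @property
    def day(self):
        # delegate to DatetimeIndex for the actual field extraction
        return np.asarray(pd.DatetimeIndex(self.values).day)

arr = DatetimeArray(pd.date_range("2013-01-01", periods=3))
arr.day    # -> array([1, 2, 3])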

jreback commented 10 years ago

@shoyer not sure what u mean by datetime64 quirks? DatetimeIndex already fixes everything (the numpy quirks) and provides quite a lot of functionality. The issue is that Series and Index are really quite similar (and share a hierarchy to some extent in 0.14.0).

the only issue is how to unambiguously do things with the Series values (eg Series.day) or the index (Series.index.day), or have the user be more explicit by doing something like Series.to_index().day or Series.values_as_index.day (as Series.values returns the raw numpy data and cannot be used for this in a nice clean way), and relying on numpy for that functionality is not possible ATM

shoyer commented 10 years ago

I know DatetimeIndex fixes the numpy quirks (comparison, casting, NaT, etc.) and provides some nice extra functionality (like the .day properties). My point is that most of these features are actually entirely distinct from the index part, so you can imagine exposing those in something like a "DatetimeArray" which makes the index part optional. DatetimeIndex would then be a relatively shallow wrapper around DatetimeArray.

I guess this would be API breaking, but if s.values could be an nd-array-like DatetimeArray (possibly even an ndarray subclass), then it would simplify this design issue because it would be obvious that Series would just pass on the "day" property from "s.values", like how it already does for ndarray properties.

Maybe this is not terribly useful for pandas, because an index larger than a certain size only bothers to create its lookup table on demand anyway.

jreback commented 10 years ago

it's possible that I could simply have Series.values return a DatetimeIndex if the dtype is correct (in fact other dtypes, e.g. Categorical and Sparse, do this). This might work and is not really a breaking API change - no need to create another class for that reason

that said it IS natural to simply use Series.day which applies to the values of the Series just like virtually any other function (with a couple of exceptions as noted above)

the only issue I have is that doing Series.day on, say, a float dtype series will have to raise a TypeError, but I think that's ok

shoyer commented 10 years ago

I guess the reason to (possibly) make another class would be so this same sort of thing works even with multi-dimensional datetime arrays, e.g., on a DataFrame with datetime values (df.day). That said, I'm :+1: on Series.values returning a DatetimeIndex as a step in the right direction. Bare np.datetime64 arrays are not terribly useful.

jorisvandenbossche commented 10 years ago

I am not too fond of returning a DatetimeIndex for s.values, as this is well known to return a numpy array with the values. It would also be strange, I think, if this were different for datetimes.

And if you want to have these things for a whole dataframe of datetimes, you can always use apply together with the (possible future) series attribute.
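
For example (a sketch, assuming all columns are datetime64):

import pandas as pd

df = pd.DataFrame({
    "start": pd.date_range("2013-01-01", periods=3),
    "end":   pd.date_range("2013-02-01", periods=3),
})

# apply the index-based accessor column by column
df.apply(lambda col: pd.DatetimeIndex(col).day)    # -> the day part of each column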

jonblunt commented 10 years ago

FWIW how about s.values.asdatetime().day.

For the uninitiated the "index" information is not interesting, and confusing. We have a column that is datetime; the values are a not-very-useful datetime64, but we want to use them as a pandas datetime.

I assume that if the underlying array is not datetime64 this throws an error.

shoyer commented 10 years ago

@jonblunt Unfortunately s.values.asdatetime() can't work unless s.values changes from being a raw numpy.ndarray -- it would need to be some sort of custom object for it to be possible to add a new method like asdatetime().

I agree that seeing a DatetimeIndex could be somewhat confusing, which is why I suggested a more generic DatetimeArray. However, pandas.Index objects do act almost exactly like ndarrays with a few extra methods, so from a functionality perspective it would be almost equivalent -- just with a slightly confusing repr when you print the values.

hayd commented 10 years ago

Would it make sense to make a Series with datetime values a subclass of Series (with the DatetimeIndex date-y methods)? Or perhaps an as_datetime property? That way we get tab completion of the methods.

import pandas as pd

@property
def as_datetime(self):
    # only meaningful for datetime64 values
    assert self.dtype == 'datetime64[ns]', "not a datetime64 Series"
    return pd.DatetimeIndex(self)

pd.Series.as_datetime = as_datetime

s.as_datetime.<tab>  # shows DatetimeIndex methods if the dtype is datetime
jreback commented 10 years ago

pushing to 0.15; would be an API change

jankatins commented 10 years ago

Not sure if this is related, but #7217 just introduced Series.cat.<categorical functions> (if Series is of type Categorical).

jreback commented 10 years ago

http://stackoverflow.com/questions/25129144/pandas-return-hour-from-datetime-column-directly?noredirect=1#comment39179371_25129144

I think the easiest / best way is to simply add:

Series.to_index() - nice, clean and consistent.

any takers for a PR?
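
A rough sketch of what such a method might do (hypothetical, just to make the proposal concrete, not the eventual API):

import pandas as pd

def to_index(self):
    # let the Index constructor pick the specialized type;
    # for datetime64 values this yields a DatetimeIndex
    return pd.Index(self.values, name=self.name)

pd.Series.to_index = to_index    # monkey-patched for illustration only

s = pd.Series(pd.date_range("2013-01-01", periods=3))
s.to_index().day    # -> [1, 2, 3]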

jorisvandenbossche commented 10 years ago

Repeating what is said above: although adding a Series.to_index() method could certainly be useful in general, this does not really solve this issue IMO. The issue is: to access datetime attributes of datetime values in a series, you have to do (for year): pd.DatetimeIndex(series).year. This has two issues:

  1. This is rather verbose to write
  2. This is also strange / not logical / surprising to have to write this, because "Uh, I don't need an index, I just want the year-part?"

That you need this index because, for this type, the attributes are only available on an index type is an implementation detail which a lot of users will not know (wanting to access the year part is not inherently tied to an index, but just to datetime values, whether they are in a series or in an index).

So adding Series.to_index() will solve my point 1 a bit (it is less typing), but not point 2 at all.

Possible solutions I see (and mentioned above):

jreback commented 10 years ago

Here's the big problem with adding soln 1) from Joris' list (Series.year), and the reason it was backed out in the first place.

It is ambiguous if BOTH the index and the values are datetime (or periods).

I suppose we could allow the properties to work if it's non-ambiguous, but then raise in the ambiguous case?

idx = date_range('20130101',periods=5)

so this would work: Series(range(5), index=idx).year

and this would raise (an AmbiguousAccessorError?): Series(idx, index=idx).year

or is that too weird / confusing?

jreback commented 10 years ago

as an aside, it's pretty easy to do some filtering on the attributes at run-time to show/not show them depending on the dtype; we just need to override __dir__ and/or _local_dir
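
Something along these lines, i.e. just the __dir__ part (a sketch on a toy subclass; the hook pandas actually uses internally may differ):

import pandas as pd

class DtypeAwareSeries(pd.Series):
    # names that only make sense for datetime64 values
    _datetime_only = {"year", "month", "day", "hour", "minute", "second"}

    def __dir__(self):
        attrs = super().__dir__()
        if self.dtype != "datetime64[ns]":
            # hide the datetime-specific names from tab completion
            attrs = [a for a in attrs if a not in self._datetime_only]
        return attrs

dir(DtypeAwareSeries([1.0, 2.0]))    # datetime-specific names are filtered out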

shoyer commented 10 years ago

@jreback I think there is actually a very consistent rule: series/frame properties act on values, not the index. With dataframes (more common than isolated series) this is particularly obvious: df['time'].year vs df.index.year.

I agree with Joris about to_index() not really solving this problem.

jreback commented 10 years ago

I think we are back to a namespace:

Series.date, Series.dates, Series.dt ?

jorisvandenbossche commented 10 years ago

If it is easy to do some run-time filtering on the attributes shown in tab completion, I am personally in favor of direct attributes acting on the series values, instead of another namespace.

Personally I think this is not that ambiguous if we clearly document in those functions (tz_localize/convert and between_time) that they act on the index and not on the values (this is the special case I think). By the way, I think your example is not fully correct (with idx a DatetimeIndex):

jreback commented 10 years ago

@jorisvandenbossche ok, so you are saying that these attributes ought to work on the values (as do all other actions). That makes sense (rather than trying to be cute and work on the index OR the values).

ok, then this seems straightforward.

cc @rosnfeld cc @mtkni @cpcloud

thoughts on @jorisvandenbossche's last comment?

jorisvandenbossche commented 10 years ago

Yes indeed, Series.attribute would always act on the values of the series; if you need to work on the index, then use Series.index.attribute. I am also not against a namespace (so something like Series.datetime.attribute), just with a preference for the direct attribute, so let's first hear some other people before deciding.

rosnfeld commented 10 years ago

I am pretty much in agreement with @jorisvandenbossche, but I personally have a slight preference for namespaces, and would be curious to hear the arguments against. I know many people believe "flat is better than nested", but there are so many attributes directly accessible already from Series (ipython suggests 216 options to an "empty" tab-complete on current master) that I find them somewhat overwhelming if I am trying to find a method/attribute and don't remember the name. Maybe there are already so many options that it hardly matters either way (i.e. the decision here isn't going to change that situation; re-grouping everything is such a large change that it is unlikely to happen).

I do also like the "grouping" that a namespace provides - you tab-complete on the namespace, looking for "hour" and also see "minute", which might also be relevant to your task, but don't see "unstack", which is likely less relevant.

hayd commented 10 years ago

I guess the argument against namespacing is consistency (there's no real namespacing currently, sad python zen). There's also scope for a .minute attribute being useful for a timedelta Series (without having to make a different api namespace)... so I think this would be my preference.

Saying that, I'm a little skeptical of having these available for Series which won't allow it (e.g. ints must raise a TypeError* rather than attempt to work)... as it feels like poor man's subclassing :p. I'm definitely not saying I'm against it, just healthily skeptical/unsure what the right call is here.

*and I guess the TypeError can say "did you mean .index.minute?" if the index is a DatetimeIndex.

immerrr commented 10 years ago

it feels like poor man's subclassing

I used to think of it not as subclassing but as a trait or a mixin which, incidentally, in some languages is used as a replacement for multiple inheritance.

@hayd, do you mean using .minute to convert time intervals to minute units?

As for flat/nested, it sounds ok to have all those clearly time-related attributes in one category, but I'd object to having something like series.datetime.time.subsecond.micro. So I guess it's more about depth of nesting rather than nesting per se.

jankatins commented 10 years ago

I like s.dt.{minute|year|...}; it's short, and s.date would not be intuitive if you want a time.

There are now at least s.str and s.cat so I don't see why there shouldn't be a s.dt...

jreback commented 10 years ago

it's also easy to make .dt (or other namespaces) raise if it is actually accessed on an incorrect dtype (and not show up in tab completion)

jreback commented 10 years ago

ok, pls take a look at #7953 which implements the .dt namespace

jorisvandenbossche commented 10 years ago

s.cat is not really a namespace, it is just an attribute to access the Categorical (which of course has special attributes/methods). And I don't think that this is exactly what we want in this case, providing a DatetimeIndex via an attribute. Because that way, the "grouping of methods/attributes" argument from above does not apply (you just get all the index methods, not only the specific datetime attributes), and then I don't see an advantage of this nesting over just direct attributes?

@jreback this also doubles as my comment on the PR

rosnfeld commented 10 years ago

I had the same reaction - it would be nice if the tab-complete just showed the timeseries-specific attributes.

jreback commented 10 years ago

not hard to just show only datetimelike attributes....

jankatins commented 10 years ago

s.cat could become a namespace if numpy starts providing a categorical datatype itself; that was at least my understanding when that was discussed. Maybe it should be made clearer in the documentation that only a few methods on that "namespace" are guaranteed to keep working in the future.

jankatins commented 10 years ago

Reading the tab completion in https://github.com/pydata/pandas/pull/7953, it would be nice if there were a way to hide the non-API methods. Kind of like a wrapper object which would be created by passing in a list of attributes which are accessible (see this SO question), plus adding these methods to tab completion (no idea how that works, but someone mentioned it above).
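
A minimal sketch of that kind of whitelist-based wrapper (hypothetical names, not the implementation in #7953):

import pandas as pd

class DatetimeAccessor(object):
    # the only names exposed on the accessor (and on <tab>)
    _ok = ["year", "month", "day", "hour", "minute", "second"]

    def __init__(self, series):
        if series.dtype != "datetime64[ns]":
            raise TypeError("only use this accessor with datetime64 values "
                            "(did you mean .index?)")
        self._index = pd.DatetimeIndex(series)

    def __getattr__(self, name):
        if name not in self._ok:
            raise AttributeError(name)
        return getattr(self._index, name)

    def __dir__(self):
        return list(self._ok)

s = pd.Series(pd.date_range("2013-01-01", periods=3))
DatetimeAccessor(s).day    # -> [1, 2, 3]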

jreback commented 10 years ago

This does look nicer (latest version of #7953)

In [1]: s = Series(date_range('20130101',periods=3))

In [2]: s.dt.
s.dt.date              s.dt.dayofweek         s.dt.hour              s.dt.is_month_start    s.dt.is_quarter_start  s.dt.is_year_start     s.dt.minute            s.dt.nanosecond        s.dt.second            s.dt.week              s.dt.year              
s.dt.day               s.dt.dayofyear         s.dt.is_month_end      s.dt.is_quarter_end    s.dt.is_year_end       s.dt.microsecond       s.dt.month             s.dt.quarter           s.dt.time              s.dt.weekofyear        

In [2]: s.dt.hour
Out[2]: array([0, 0, 0])

In [3]: s.dt.year
Out[3]: array([2013, 2013, 2013])

In [4]: s.dt.day
Out[4]: array([1, 2, 3])

And it's specific to the type of the wrapped delegate

In [5]: p = Series(period_range('20130101',periods=3,freq='D').asobject)

In [6]: p.dt.
p.dt.day         p.dt.dayofweek   p.dt.dayofyear   p.dt.hour        p.dt.minute      p.dt.month       p.dt.quarter     p.dt.qyear       p.dt.second      p.dt.week        p.dt.weekofyear  p.dt.year        
rosnfeld commented 10 years ago

Should the return types be arrays? I kind of thought they would be Series, like with the 'str' methods.

jreback commented 10 years ago

they are arrays now (that is unchanged)

jreback commented 10 years ago

e.g.

generally

s = Series(date_range('20130101',periods=3))

s[s.dt.day==1]

versus currently

s[pd.DatetimeIndex(s.values).day==1]
jreback commented 10 years ago

we could probably return an Index but that's a separate issue

shoyer commented 10 years ago

Why not return a new Series with the same index as the original series, as @rosnfeld suggests? That does seem a little more consistent with how most Series operations work.

jreback commented 10 years ago

hmm. These are actually index attributes, but it's easy enough to return a Series. Sure, that makes sense.
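
i.e. roughly equivalent to wrapping the index-level result back with the original index (a sketch):

import pandas as pd

s = pd.Series(pd.date_range("2013-01-01", periods=3))

# the accessor result, aligned on the original index
pd.Series(pd.DatetimeIndex(s.values).day, index=s.index)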

jreback commented 10 years ago

easy enough

In [3]: s
Out[3]: 
0   2013-01-01
1   2013-01-02
2   2013-01-03
dtype: datetime64[ns]

In [4]: s.dt.day
Out[4]: 
0    1
1    2
2    3
dtype: int64

In [5]: s[s.dt.day==1]
Out[5]: 
0   2013-01-01
dtype: datetime64[ns]