Closed jreback closed 10 years ago
@cpcloud @jorisvandenbossche
cc @rosnfeld cc @mtkni
continue the discussion here (but we'll consider for 0.14.1)
i think to_index
good. never really use these that much. agree that the Index type dependence could be a bit of a special case and possibly confusing to new users.
I agree with @mtkni that adding a to_index() method does not really solve the issue (but apart from this issue, it could be a good enhancement to add such a method).
The conceptual problem with the datetime accessors is that, when you have a datetime column in a dataframe, you first have to make an index of it before you can access datetime fields. If you don't know the background, this is not logical: "Why should I make an index of it if I only need the day values?" Whether this is done with DatetimeIndex(s) or with s.to_index() does not really matter that much in this sense (although I agree the latter makes it a bit easier/shorter).
I also agree that you 'expect' at first sight that a series method acts on the values, not on the index (for most methods, you have of course index-specific methods). So I think it is good for now to disable this (as you did in #7206) until we decide on the API.
What would you actually think of having the attributes act on the Series values (so having s.year and s.index.year)? Or would that be too dangerous to confuse them?
that's what I had originally; but it IS ambiguous if you have a datetime index AND datetime values on a Series (not that common but common enough I guess).
In that case it is indeed a bit ambiguous, but I think less ambiguous than the other way around (the attributes acting on the index). And it would also be useful functionality to have such attributes on the datetime64 values of a series/column.
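To make the ambiguity concrete, here is a minimal sketch of the case being discussed: a Series whose values AND index are both datetimelike. The dates used are arbitrary; being explicit about which side you mean resolves the ambiguity either way.

```python
import pandas as pd

# A Series with datetime64 values AND a DatetimeIndex -- the ambiguous case.
# Index and values are deliberately different dates.
idx = pd.date_range("2013-01-01", periods=3)
vals = pd.date_range("2014-06-10", periods=3)
s = pd.Series(vals, index=idx)

# Being explicit resolves the ambiguity:
print(list(pd.DatetimeIndex(s).year))  # values -> [2014, 2014, 2014]
print(list(s.index.year))              # index  -> [2013, 2013, 2013]
```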
ahh.. you want them to act on the VALUES? hmm, you don't like the s.to_index().day approach? (which IMHO is pretty explicit).... maybe we could make a property like s.vindex.day or something
Just a comment on my motivation here (I can't speak for @mtkni, though curious to hear other use cases):
I have worked with various datasets with multiple datetime columns, where none of them should really be the "index", so I find this to be somewhat common. E.g. humanitarian funding data might have various "state transition" dates for each entry (date funding was pledged, date it was transferred, date it was deployed towards a project, etc). Or a big dataset of projects might have start and end dates. So it's not that I'm creating standalone Series with date values, but the columns of a DataFrame often fit the bill.
Maybe I want to extract the year or month from each of these columns, or maybe somebody would want to tz_localize/tz_convert. Writing projects.end_date.year seems natural and yet it doesn't work, and the error is confusing: why would that operation have anything to do with the index? It looks like an operation on the column.
Requiring conversion to an index is surprising, as I (and likely others who don't understand the code as well) don't understand what is "special" about an index (or special about datetimes).
I'd be curious to know how others view this.
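The use case above can be sketched as follows; the column names are hypothetical, and the DatetimeIndex round-trip is the workaround this thread is discussing (projects.end_date.year was not available at the time):

```python
import pandas as pd

# Hypothetical dataset: several datetime columns, none a natural index.
projects = pd.DataFrame({
    "start_date": pd.to_datetime(["2013-03-01", "2014-07-15"]),
    "end_date": pd.to_datetime(["2013-09-30", "2015-01-10"]),
})

# The workaround under discussion: go through a DatetimeIndex per column.
print(list(pd.DatetimeIndex(projects["end_date"]).year))  # [2013, 2015]
```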
@jreback yes indeed the values :-)
That having to convert it to an index is strange is exactly what I wanted to say in my comment above (https://github.com/pydata/pandas/issues/7207#issuecomment-43934559), as @rosnfeld also argues.
For me it is also rather common to have datetime values in columns, eg start and end times of measurements.
I agree that having a datelike index and non-datelike values is probably by far the more common use case; for that case it is not that much more work (and more explicit) to type s.index.day
To make it concrete, this would give something like this:
In [1]: s = pd.Series(pd.date_range("2013-01-01 09:00:00", periods=3))
In [2]: s
Out[2]:
0 2013-01-01 09:00:00
1 2013-01-02 09:00:00
2 2013-01-03 09:00:00
dtype: datetime64[ns]
In [3]: s.day
Out[3]:
0 1
1 2
2 3
dtype: int64
@rosnfeld ok, so you like the idea of using datetime-like ops (e.g. year, month, seconds, etc.) on the VALUES of a series (e.g. s.year), rather than having them act on the index (which s.index.year will accomplish in any event)
so going back to common methods/properties for index/series (so they act similarly, a good thing).
Note that there are currently several methods that DO NOT do this, e.g. at_time/between_time/tz_localize/tz_convert, which act on the index! (and the removed Series.weekday).
note also that tshift/asfreq act on the index (but they are more of a reshaping operation, so that seems ok)
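For reference, a small sketch of the index-acting behavior mentioned above: at_time/between_time select rows by the INDEX timestamps, not by any datetime values in the Series.

```python
import pandas as pd

# Hourly-indexed Series; the data values are just integers.
s = pd.Series(range(3),
              index=pd.date_range("2013-01-01 09:00", periods=3, freq="h"))

# Selection is driven entirely by the index times:
print(s.at_time("09:00").tolist())                # [0]
print(s.between_time("09:00", "10:00").tolist())  # [0, 1]
```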
Just to confirm: with a Series with datetime values and a datetime Index, the ops will still operate on the values, right? If so then +1 from me on operating on the values.
Instead of adding to_index() just so you can write s.to_index().day, what about creating some sort of "DatetimeArray" which separates out the part of DatetimeIndex that fixes np.datetime64 quirks from the Index part? This seems like the cleaner, more modular solution. DatetimeArray would implement properties like .day without the need to create an Index.
Not having looked at DatetimeIndex internals very carefully, I'm not sure how painful this would be. Of course, better would be to try to fix datetime64 upstream in numpy...
@shoyer not sure what you mean by datetime64 quirks? DatetimeIndex already fixes everything (numpy quirks) and provides quite a lot of functionality. The issue is that Series and Index are really quite similar (and share a hierarchy to some extent in 0.14.0).
The only issue is how to unambiguously do things with the Series values (e.g. Series.day), or the index (Series.index.day), or have the user be more explicit by doing something like Series.to_index().day or Series.values_as_index.day (as Series.values returns the raw numpy data and cannot be used for this in a nice clean way), and relying on numpy for that functionality is not possible ATM.
I know DatetimeIndex fixes the numpy quirks (comparison, casting, NaT, etc.) and provides some nice extra functionality (like the .day properties). My point is that most of these features are actually entirely distinct from the index part, so you can imagine exposing those in something like a "DatetimeArray" which makes the index part optional. DatetimeIndex would then be a relatively shallow wrapper around DatetimeArray.
I guess this would be API breaking, but if s.values could be an ndarray-like DatetimeArray (possibly even an ndarray subclass), then it would simplify this design issue, because it would be obvious that Series would just pass on the "day" property from s.values, like it already does for ndarray properties.
Maybe this is not terribly useful for pandas because creating an index larger than a certain size only bothers to create the index lookup table on demand anyways.
it's possible that I could simply have Series.values return a DatetimeIndex if the dtype is correct (in fact other dtypes, e.g. Categorical and Sparse, do this). This might work and is not really a breaking API change, so no need to create another class for that reason.
that said, it IS natural to simply use Series.day, which applies to the values of the Series just like virtually any other function (with a couple of exceptions as noted above).
the only issue I have is that doing Series.day on, say, a float dtype series will have to raise a TypeError, but I think that's ok.
I guess the reason to (possibly) make another class would be so this same sort of thing works even with multi-dimensional datetime arrays, e.g., on a DataFrame with datetime values (df.day). That said, I'm :+1: on Series.values returning a DatetimeIndex as a step in the right direction. Bare np.datetime64 arrays are not terribly useful.
I am not too fond of returning a DatetimeIndex for s.values, as this is well known to return a numpy array with the values. It would also be strange, I think, for this to be different for datetimes.
And if you want these things for a whole dataframe of datetimes, you can always use apply and use the (possible) series attribute.
FWIW, how about s.values.asdatetime().day?
For the uninitiated, the "index" information is not interesting, and confusing. We have a column that is datetime; the values are a not-very-useful datetime64, but we want to use them as a pandas datetime.
I assume that if the underlying array is not datetime64 this throws an error.
@jonblunt Unfortunately s.values.asdatetime() can't work unless s.values changes from being a raw numpy.ndarray -- it would need to be some sort of custom object for it to be possible to add a new method like asdatetime().
I agree that seeing a DatetimeIndex could be somewhat confusing, which is why I suggested a more generic DatetimeArray. However, pandas.Index objects do act almost exactly like ndarrays with a few extra methods, so from a functionality perspective it would be almost equivalent -- just with a slightly confusing repr when you print the values.
Would it make sense to make a Series with datetime values a subclass of Series (with DatetimeIndex date-y methods?)... Or perhaps a property as datetime? That way we tab complete the methods.
@property
def as_datetime(self):
    # assert it's a datetime dtype
    return pd.DatetimeIndex(self)

pd.Series.as_datetime = as_datetime

s.as_datetime.<tab>  # shows DatetimeIndex methods if datetime
pushing to 0.15; would be an API change
Not sure if this is related, but #7217 just introduced Series.cat.<categorical functions> (if the Series is of type Categorical).
I think the easiest / best way is to simply add Series.to_index() -- nice, clean and consistent.
any takers for a PR?
Repeating what is said above: although adding a Series.to_index() method could certainly be useful in general, this does not really solve this issue IMO.
The issue is: to access datetime attributes of datetime values in a series, you have to do (for year): pd.DatetimeIndex(series).year. This has two issues:
1. It is rather verbose (a lot of typing for something simple).
2. That you need this index, because for this type the attributes are only available on an index-type, is an implementation detail which a lot of users will not know (wanting to access the year part is not inherently more logically tied to an index, but just to datetime values, whether in a series or in an index).
So adding Series.to_index() will solve my point 1 a bit (it is less typing), but point 2 not at all.
Possible solutions I see (and mentioned above):
1. Direct attributes on the series, acting on the values: Series.year and Series.index.year. Downsides: this is inconsistent with tz_localize/convert and at_time/between_time (but for the first, these are not even possible for series values, so not yet a real problem, only an inconsistency) + it adds a lot of attributes which are only relevant for one dtype.
2. A namespace, like str for the StringMethods. Something like Series.timestamp/datetime/dt/... So eg Series.dt.year and Series.index.year.
3. A DatetimeSeries (so a kind of subclass of Series depending on the dtype of the series values, like you have different Index types) that adds these attributes.
Here's the big problem with adding soln 1) from joris' list (Series.year), and the reason it was backed out in the first place:
It is ambiguous if BOTH the index and the values are datetime (or periods).
I suppose we could allow the properties to work if it's non-ambiguous, but then raise in the ambiguous case?
idx = date_range('20130101',periods=5)
so this would work
Series(range(5),index=idx).year
so this would raise (AmbiguousAccessorError)?
Series(idx,index=idx).year
or is that too weird / confusing?
as an aside, it's pretty easy to do some filtering on the attributes at run-time to show/not show depending on the dtype; you just need to override __dir__ and/or _local_dir
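The __dir__ trick mentioned above can be shown in a pandas-independent sketch; the class and attribute names here are purely illustrative, not pandas internals.

```python
# Minimal sketch of run-time attribute filtering via __dir__.
class FakeSeries:
    """Illustrative stand-in: hides datetime-only attributes via __dir__."""
    _DATETIME_ONLY = {"year", "month", "day"}

    def __init__(self, dtype):
        self.dtype = dtype

    def __dir__(self):
        # Hide datetime-only attributes when the dtype does not support them,
        # so tab completion only offers what will actually work.
        names = set(super().__dir__())
        if self.dtype != "datetime64":
            names -= self._DATETIME_ONLY
        return sorted(names)

    @property
    def year(self):
        if self.dtype != "datetime64":
            raise TypeError("year is only defined for datetime64 data")
        return 2013  # placeholder value for the sketch

print("year" in dir(FakeSeries("datetime64")))  # True
print("year" in dir(FakeSeries("float64")))     # False
```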
@jreback I think there is actually a very consistent rule: series/frame properties act on values, not the index. With dataframes (more common than isolated series) this is particularly obvious: df['time'].year vs df.index.year.
I agree with Joris about to_index() not really solving this problem.
I think we are back to a namespace: Series.date, Series.dates, Series.dt?
If it is easy to do some filtering at runtime on the attributes to show with tab completion, I am personally in favor of having direct attributes acting on the series values, instead of another namespace.
Personally I think this is not that ambiguous, if we clearly document in those functions (tz_localize/convert and between_time) that they act on the index and not on the values (this is the special case, I think).
By the way, your example is not fully correct I think (with idx a DatetimeIndex):
- Series(range(5), index=idx).year would not work in any case, but give an AttributeError (or other error) saying that year is only defined for series of dtype datetime64
- Series(idx, index=range(5)).year -- that would work and is not ambiguous
- Series(idx, index=idx).year is the one that is maybe ambiguous, and for which you suggest an error (but an AmbiguousWarning would be more appropriate), although I think this is not needed
@jorisvandenbossche ok, you are saying that you think that these attributes ought to work on the values (as do all other actions). Ok, that makes sense then (rather than trying to be cute and work on index OR values).
ok, then this seems straightforward then.
cc @rosnfeld cc @mtkni @cpcloud
thoughts on @jorisvandenbossche last?
Yes indeed, Series.attribute always acts on the values of the series; if you need to work on the index, then use Series.index.attribute.
But I am also not against a namespace (so something like Series.datetime.attribute), just with a preference for the direct attribute, so let's first hear some other people before deciding.
I am pretty much in agreement with @jorisvandenbossche , but I personally have a slight preference for namespaces, and would be curious to hear the arguments against. I know many people believe "flat is better than nested" but there are so many attributes directly accessible already from Series (ipython suggests 216 options to an "empty" tab-complete on current master) that I find them somewhat overwhelming if I am trying to find a method/attribute and don't remember the name. Maybe there are already so many options that it hardly matters either way. (i.e. the decision here isn't going to change that situation, re-grouping everything is such a large change that it is unlikely to happen)
I do also like the "grouping" that a namespace provides - you tab-complete on the namespace, looking for "hour" and also see "minute", which might also be relevant to your task, but don't see "unstack", which is likely less relevant.
I guess the argument against namespacing is consistency (there's no real namespacing currently, sad python zen). There's also scope for a .minute attribute being useful for a timedelta Series (without having to make a different api namespace)... so I think this would be my preference.
Saying that, I'm a little skeptical of having these available for Series which won't allow it (e.g. ints must raise TypeError* rather than attempt to work)... as it feels like poor man's subclassing :p. I'm definitely not saying I'm against it, just healthily skeptical/unsure what the right call is here.
*and I guess the TypeError can say "did you mean .index.minute?" if it's a DatetimeIndex.
it feels like poor man's subclassing
I used to think of it not as subclassing but as a trait or a mixin which, incidentally, in some languages is used as a replacement for multiple inheritance.
@hayd, do you mean using .minute to convert time intervals to minute units?
As for flat/nested, it sounds ok to have all those clearly time-related attributes in one category, but I'd object to having something like series.datetime.time.subsecond.micro. So I guess it's more about depth of nesting rather than nesting per se.
I like s.dt.{minute|year|...}; it's short, and s.date would not be intuitive if you want a time.
There are now at least s.str and s.cat, so I don't see why there shouldn't be a s.dt...
it's also easy to make .dt (or other namespaces) raise if it's actually accessed on an incorrect dtype (and not show up in tab completion)
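For reference, this is roughly how the accessor behaves in released pandas versions: .dt works on datetimelike values and raises on an incorrect dtype rather than silently doing nothing.

```python
import pandas as pd

# .dt on datetimelike values:
s = pd.Series(pd.date_range("2013-01-01", periods=3))
print(s.dt.day.tolist())  # [1, 2, 3]

# ...and an incorrect dtype raises instead of tab-completing into nonsense.
try:
    pd.Series([1.0, 2.0]).dt
except AttributeError as exc:
    print(type(exc).__name__)  # AttributeError
```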
ok, pls take a look at #7953 which implements the .dt namespace
s.cat is not really a namespace; it is just an attribute to access the Categorical (which has of course special attributes/methods).
And I don't think that this is exactly what we want in this case, providing a DatetimeIndex via an attribute. Because in this way, the argument of "grouping of methods/attributes" from above does not apply (you get all index methods, not just the specific datetime attributes), and then I don't see an advantage of this nesting over just direct attributes.
@jreback this is also my comment on the PR
I had the same reaction - it would be nice if the tab-complete just showed the timeseries-specific attributes.
not hard to just show only datetimelike attributes....
s.cat could become a namespace if numpy starts providing a categorical datatype itself; that was at least my understanding when that was discussed. Maybe that should be made clearer in the documentation: only a few methods on that "namespace" are guaranteed to keep working in the future.
Reading the tab completion in https://github.com/pydata/pandas/pull/7953, it would be nice if there were a way to hide the non-API methods. Kind of like a wrapper object which would be created by passing in a list of attributes which are accessible (see this SO question) + adding these methods to tab completion (no idea how that works, but someone mentioned it above).
This does look nicer (latest version of #7953)
In [1]: s = Series(date_range('20130101',periods=3))
In [2]: s.dt.
s.dt.date s.dt.dayofweek s.dt.hour s.dt.is_month_start s.dt.is_quarter_start s.dt.is_year_start s.dt.minute s.dt.nanosecond s.dt.second s.dt.week s.dt.year
s.dt.day s.dt.dayofyear s.dt.is_month_end s.dt.is_quarter_end s.dt.is_year_end s.dt.microsecond s.dt.month s.dt.quarter s.dt.time s.dt.weekofyear
In [2]: s.dt.hour
Out[2]: array([0, 0, 0])
In [3]: s.dt.year
Out[3]: array([2013, 2013, 2013])
In [4]: s.dt.day
Out[4]: array([1, 2, 3])
And its specific to the type of wrapped delegate
In [5]: p = Series(period_range('20130101',periods=3,freq='D').asobject)
In [6]: p.dt.
p.dt.day p.dt.dayofweek p.dt.dayofyear p.dt.hour p.dt.minute p.dt.month p.dt.quarter p.dt.qyear p.dt.second p.dt.week p.dt.weekofyear p.dt.year
Should the return types be arrays? I kind of thought they would be Series, like with the 'str' methods.
they are arrays now (that is unchanged)
e.g.
generally
s = Series(date_range('20130101',periods=3))
s[s.dt.day==1]
versus currently
s[pd.DatetimeIndex(s.values).day==1]
we could probably return an Index, but that's a separate issue
Why not return a new Series with the same index as the original series, as @rosnfeld suggests? That does seem a little more consistent with how most Series operations work.
hmm. These are actually index attributes, but easy enough to return a Series. sure that makes sense.
easy enough
In [3]: s
Out[3]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
dtype: datetime64[ns]
In [4]: s.dt.day
Out[4]:
0 1
1 2
2 3
dtype: int64
In [5]: s[s.dt.day==1]
Out[5]:
0 2013-01-01
dtype: datetime64[ns]
related is #7146, #7206
maybe provide a to_index() function and/or a class-like accessor .values_to_index (too long though)