pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: return .dt.weekday/isoweekday/month_name/day_name as ordered categoricals #12993

Open jreback opened 8 years ago

jreback commented 8 years ago

#12803 added .dt.weekday_name. I think it's appropriate to return this (and .weekday) as ordered categoricals.

In [1]: s = pd.Series(pd.date_range('20130101', periods=10))

In [2]: s.dt.weekday
Out[2]: 
0    1
1    2
2    3
3    4
4    5
5    6
6    0
7    1
8    2
9    3
dtype: int64

In [3]: s.dt.weekday_name
Out[3]: 
0      Tuesday
1    Wednesday
2     Thursday
3       Friday
4     Saturday
5       Sunday
6       Monday
7      Tuesday
8    Wednesday
9     Thursday
dtype: object

jreback commented 8 years ago

xref #12806

cc @BastiaanBergman

I realized while merging #12803 that we didn't actually have to do this in Cython; it is a trivial map operation instead.

In [7]: s.dt.weekday.map(dict(enumerate(['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])))
Out[7]: 
0      Tuesday
1    Wednesday
2     Thursday
3       Friday
4     Saturday
5       Sunday
6       Monday
7      Tuesday
8    Wednesday
9     Thursday
dtype: object
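A minimal sketch of the same map approach for later pandas releases, where `.dt.weekday_name` was superseded by the `.dt.day_name()` method. `DAY_NAMES` is an illustrative constant, not part of pandas, and the default English locale is assumed. The comparison only checks that the hand-rolled mapping and the built-in accessor agree.

```python
import pandas as pd

# Illustrative constant (not a pandas name): position 0 is Monday,
# matching the integers returned by .dt.weekday.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

s = pd.Series(pd.date_range("2013-01-01", periods=10))

# Map the 0..6 weekday integers to their names with a plain dict lookup.
mapped = s.dt.weekday.map(dict(enumerate(DAY_NAMES)))

# Built-in accessor available in later pandas releases.
builtin = s.dt.day_name()

# Element-wise comparison of the two results.
same = bool((mapped == builtin).all())
print(same)
```

Both results hold plain strings, so neither carries any ordering information.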

jreback commented 8 years ago

And if you categorize, it's even easier (and way more efficient):

In [18]: cats
Out[18]: ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

In [19]: s.dt.weekday.astype('category',ordered=True).cat.rename_categories(cats)
Out[19]: 
0      Tuesday
1    Wednesday
2     Thursday
3       Friday
4     Saturday
5       Sunday
6       Monday
7      Tuesday
8    Wednesday
9     Thursday
dtype: category
Categories (7, object): [Monday < Tuesday < Wednesday < Thursday < Friday < Saturday < Sunday]
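The `astype('category', ordered=True)` spelling above comes from the pandas of that time; later releases expect a `CategoricalDtype` instead. A hedged sketch of two equivalent routes, assuming a recent pandas: building the categorical directly from the integer weekday codes, and casting the day names to an ordered dtype. `DAY_NAMES`, `from_codes_result`, and `via_dtype` are illustrative names.

```python
import pandas as pd

# Illustrative constant (not a pandas name), Monday first.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

s = pd.Series(pd.date_range("2013-01-01", periods=10))

# Route 1: reuse the 0..6 weekday integers as category codes,
# so no string comparison or lookup is needed.
codes = s.dt.weekday.to_numpy()
from_codes_result = pd.Series(
    pd.Categorical.from_codes(codes, categories=DAY_NAMES, ordered=True),
    index=s.index,
)

# Route 2: cast the day-name strings to an ordered categorical dtype.
day_dtype = pd.CategoricalDtype(categories=DAY_NAMES, ordered=True)
via_dtype = s.dt.day_name().astype(day_dtype)

print(from_codes_result.dtype)
```

Route 1 is closer in spirit to the snippet above, since it never materializes the strings for each row. This assumes there are no missing timestamps (NaT) in the input.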

BastiaanBergman commented 8 years ago

I don't know what the speed implications are for big dataframes. In any case, implementing alongside the existing Cython code wasn't exactly un-trivial.


jreback commented 8 years ago

@BastiaanBergman no, what I mean is that THIS impl is trivial. Of course the cython is not :<

kawochen commented 8 years ago

I would think they shouldn't be ordered (because it's cyclic). An order would probably only enable .max(), and .min(), right?

jreback commented 8 years ago

Well, it also allows comparisons, e.g.

In [4]: os = s.dt.weekday.astype('category',ordered=True).cat.rename_categories(cats)

In [5]: os
Out[5]: 
0      Tuesday
1    Wednesday
2     Thursday
3       Friday
4     Saturday
5       Sunday
6       Monday
7      Tuesday
8    Wednesday
9     Thursday
dtype: category
Categories (7, object): [Monday < Tuesday < Wednesday < Thursday < Friday < Saturday < Sunday]

In [10]: os.min()
Out[10]: 'Monday'

In [11]: os<'Wednesday'
Out[11]: 
0     True
1    False
2    False
3    False
4    False
5    False
6     True
7     True
8    False
9    False
dtype: bool
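A short sketch of what the ordering buys in practice, and of the cyclic caveat raised earlier in the thread. It assumes a recent pandas with `CategoricalDtype`; `DAY_NAMES` and the other variable names are illustrative.

```python
import pandas as pd

# Illustrative constant (not a pandas name), Monday first.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
day_dtype = pd.CategoricalDtype(categories=DAY_NAMES, ordered=True)

s = pd.Series(pd.date_range("2013-01-01", periods=10))
days = s.dt.day_name().astype(day_dtype)

# Comparisons follow the category order, not alphabetical order.
is_weekend = days >= "Saturday"

# Sorting also follows the category order (Monday first, Sunday last).
sorted_days = days.sort_values()

# Cyclic caveat: Sunday precedes Monday on the calendar, but in this
# linear order nothing compares as "before Monday".
anything_before_monday = bool((days < "Monday").any())

print(int(is_weekend.sum()), sorted_days.iloc[0], anything_before_monday)
```

So the order is useful for filtering and sorting within a Monday-to-Sunday week, but it cannot express that the week wraps around.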

sivakar12 commented 6 years ago

I'd like to give this a try. Can I work on this?

mroeschke commented 6 years ago

Go for it @sivakar12! Some of the files you may want to edit are in this recent PR https://github.com/pandas-dev/pandas/pull/18164/files

sivakar12 commented 6 years ago

I found that Categorical is not defined in the Cython code, so I focused on the DatetimeIndex class instead. I tried calling astype and returning a CategoricalIndex from the _field_accessor method there, but neither works: I always end up with dtype: object. What am I missing?

mroeschke commented 6 years ago

After the index is created, you can either use the map function or astype with predefined categories as described in these comments: https://github.com/pandas-dev/pandas/issues/12993#issuecomment-214751982 or https://github.com/pandas-dev/pandas/issues/12993#issuecomment-214751314

sivakar12 commented 6 years ago

I made the DatetimeIndex class return a CategoricalIndex when the weekday_name property is accessed. But s.dt.weekday_name goes through a DatetimeProperties object, which seems to convert the result back to object dtype. The code in the comments applies map or astype to an instance of DatetimeProperties, not to a DatetimeIndex, which works fine. I can't figure out what's going on inside DatetimeProperties.
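For reference, a user-level sketch of the index-side conversion described here, done outside pandas internals. `weekday_name_index` is a hypothetical helper, not a pandas function, and this is not how the accessor is implemented in the library; it only shows that an ordered CategoricalIndex can be built from a DatetimeIndex (assuming no NaT values).

```python
import pandas as pd

# Illustrative constant (not a pandas name), Monday first.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekday_name_index(idx: pd.DatetimeIndex) -> pd.CategoricalIndex:
    """Hypothetical helper: weekday names as an ordered CategoricalIndex."""
    # The 0..6 weekday integers double as category codes.
    codes = idx.weekday.to_numpy()
    cat = pd.Categorical.from_codes(codes, categories=DAY_NAMES, ordered=True)
    return pd.CategoricalIndex(cat)


idx = pd.date_range("2013-01-01", periods=10)
result = weekday_name_index(idx)
print(result.dtype)
```

Whether the `.dt` accessor on a Series preserves this dtype is a separate question, which is the part that goes through DatetimeProperties.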

mroeschke commented 6 years ago

Feel free to open a pull request (you can mark it as a work in progress) with your initial changes. It will be easier for us to review and help debug the issue.

jbrockmendel commented 4 years ago

Not wild about making DatetimeArray have a dependency on Categorical (which in turn has a dependency on Index).

jreback commented 4 years ago

this would be an indirect dependency and is for user convenience