pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: dt.day_of_week should return int8 #58185

Open WillAyd opened 6 months ago

WillAyd commented 6 months ago

Feature Type

Problem Description

For NumPy-backed datetime types, dt.day_of_week currently returns int32:

In [3]: pd.Series(["2024-01-01", "2024-01-02", "2024-01-03"], dtype="datetime64[us]").dt.day_of_week
Out[3]: 
0    0
1    1
2    2
dtype: int32

pyarrow dates return int64:

In [18]:   pa_arr = pa.array([
    ...:       datetime.date(2024, 1, 1),
    ...:       datetime.date(2024, 1, 2),
    ...:       datetime.date(2024, 1, 3),
    ...:   ])
    ...:   ser = pd.Series(pa_arr, dtype=pd.ArrowDtype(pa.date32()))

In [19]: ser.dt.day_of_week
Out[19]: 
0    0
1    1
2    2
dtype: int64[pyarrow]

Feature Description

Both could reasonably return int8, or even uint8, since the only possible values are 0-6.
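
For illustration, here is a minimal sketch of what the narrower dtype would buy, using an explicit astype as a stand-in for the proposed behavior. The cast is only a user-side workaround, not the proposed implementation, and the int32 noted in the comment reflects the behavior reported in this issue rather than anything guaranteed for other versions.

```python
import numpy as np
import pandas as pd

# Seven consecutive days starting on a Monday, so every weekday value appears.
ser = pd.Series(pd.date_range("2024-01-01", periods=7, freq="D"))

dow = ser.dt.day_of_week        # int32 at the time this issue was filed
narrow = dow.astype(np.int8)    # explicit downcast; values 0-6 always fit

# Compare the dtype and the raw buffer size of both results.
print(dow.dtype, dow.to_numpy().nbytes)
print(narrow.dtype, narrow.to_numpy().nbytes)
```

The downcast is lossless here because the largest possible value (6) is far below the int8 maximum of 127.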

Alternative Solutions

status quo

Additional Context

No response

mroeschke commented 6 months ago

For pyarrow types, this would probably be better raised as an enhancement request with pyarrow, since pandas just calls pyarrow.compute.day_of_week here.

rmhowe425 commented 3 months ago

take

rmhowe425 commented 3 months ago

Hi @WillAyd I'm trying to work on the implementation for this issue and I'm getting a little lost here.

It looks like when a DatetimeArray is initialized, day_of_week is automatically created as an int32 type, and I'm not seeing any way to change that in the Python code.

I dug a bit deeper and it looks like I may need to make some changes to the underlying ccalendar.pxd and ccalendar.pyx files used for Datetime objects, specifically the dayofweek function.

Would you agree that I'll need to make changes in the underlying Cython code? Or am I going down a rabbit hole?
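
For readers unfamiliar with what a dayofweek helper computes, here is a pure-Python sketch of one common approach (Sakamoto's method), shifted so that Monday is 0. This is an illustration under my own assumptions, not a copy of the pandas Cython source; the function name dayofweek_sketch is hypothetical.

```python
import datetime

# Month offsets used by Sakamoto's method (January first).
_OFFSETS = [0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4]


def dayofweek_sketch(year: int, month: int, day: int) -> int:
    """Return the weekday with Monday=0 ... Sunday=6 (Gregorian calendar)."""
    # January and February are counted as part of the previous year.
    if month < 3:
        year -= 1
    # Sakamoto's formula yields Sunday=0 ... Saturday=6.
    sunday_based = (
        year + year // 4 - year // 100 + year // 400 + _OFFSETS[month - 1] + day
    ) % 7
    # Shift to the Monday=0 convention used by day_of_week.
    return (sunday_based + 6) % 7


# Cross-check against the standard library for one date.
print(dayofweek_sketch(2024, 1, 1), datetime.date(2024, 1, 1).weekday())
```

Whatever the exact implementation, the result is always in the range 0-6, which is why the return type is the only thing that would need to change.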

rmhowe425 commented 3 months ago

@WillAyd From what I can tell, ser.dt.day_of_week is created after cls._simple_new(subarr, freq=inferred_freq, dtype=data_dtype) is executed on line 403 in datetimes.py.

Looking at similar PRs, it looks like all DatetimeArray attributes are treated as int32 type, so I'd only need to modify day_of_week. That leads me to believe the simplest way to do this would be to modify the ccalendar files in pandas._libs.tslibs to return what I assume to be the Cython equivalent of numpy.uint8.

Looking at DatetimeIndex attributes, it looks like a good number of these fields could be numpy.uint8 :man_shrugging:
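
To make that concrete, here is a small sketch that asks NumPy for the smallest dtype able to hold the largest value of several calendar fields. The field list and maximum values are my own assumptions from the usual calendar ranges, not taken from the pandas source.

```python
import numpy as np

# Largest value each field can take (assumed from the standard calendar ranges).
FIELD_MAX = {
    "day_of_week": 6,
    "quarter": 4,
    "month": 12,
    "day": 31,
    "hour": 23,
    "minute": 59,
    "second": 59,
    "day_of_year": 366,
    "microsecond": 999_999,
}

# np.min_scalar_type returns the smallest dtype that can hold the given scalar.
smallest = {name: np.min_scalar_type(mx) for name, mx in FIELD_MAX.items()}

for name, dtype in smallest.items():
    print(f"{name:>12}: {dtype}")
```

Fields such as day_of_year and microsecond exceed 255, so not every attribute could move to an 8-bit type.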

WillAyd commented 3 months ago

Hey @rmhowe425 thanks for taking a look. I'm not sure of all the places that need to be updated, but yes I expect the core of the issue will need to be tackled in Cython.

If you get somewhat close, I would advise just pushing up a draft PR for discussion; it's usually easier to discuss and advise that way.