pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.2k stars 17.77k forks source link

DEPR: DatetimeIndex/TimedeltaIndex constructor keywords #55499

Open jbrockmendel opened 11 months ago

jbrockmendel commented 11 months ago

DatetimeIndex.__new__ has constructor keywords (excluding ones present in the base class and ones already deprecated) freq, tz, ambiguous, dayfirst, yearfirst. dayfirst and yearfirst overlap with the pd.to_datetime keywords.

Could we deprecate some/all of these keywords and point users to pd.to_datetime instead?

(similarly TimedeltaIndex has unit and freq, unit being present in pd.to_timedelta)

cc @MarcoGorelli @mroeschke @jorisvandenbossche

MarcoGorelli commented 10 months ago

sounds good, would favour moving towards to_datetime as being the way to parse into datetime

mroeschke commented 10 months ago

Agreed, would be simpler if DTI/TDI just accepted correctly typed arrays

jorisvandenbossche commented 10 months ago

See some related notes in https://github.com/pandas-dev/pandas/issues/50894 and https://github.com/pandas-dev/pandas/issues/50411#issuecomment-1387389894, where it is also brought to let DatetimeIndex only accept ISO8601 strings, and the first issue also has some notes on differences in keywords between DatetimeIndex and to_datetime.

There is the more general question about what to support as input (only correctly typed arrays, still strings but only ISO formatted, anything that works with the Series/Index constructor when specifying the dtype, ..), although I think that for most of the keywords that's actually not that relevant.

I think dayfirst and yearfirst are solely related to string parsing, and we generally agree that users should use to_datetime for that if they want fine-grained control, so I agree those can certainly be removed.

freq we could either exempt or make a post-construction thing by making _with_freq public.

To me this feels like the one keyword that actually makes sense to keep in the DatetimeIndex constructor, since it is a DatetimeIndex specific attribute? (like you can also set name in the Index constructors)

to_datetime has utc instead of tz. I generally like tz better than utc as a keyword, but deprecating utc might be painful. We could add tz to to_datetime and disallow passing tz-and-utc.

I also like the tz keyword. From what I said in https://github.com/pandas-dev/pandas/issues/50894#issuecomment-1404188720, the reason that we went with utc=True/False in the past for to_datetime might be that a tz keyword can be a ambiguous on how to interpret the input data? (are it localized or utc values? i.e. do you do a tz_localize or tz_convert with the specified timezone?)

Adding ambiguous to to_datetime wouldn't be too bad aside from general keyword fatigue.

I think ambiguous is purely linked to tz (since that is essentially needed for the tz_localize logic), so I think the fate of this keyword can depend on what we decide for tz

jbrockmendel commented 9 months ago

the reason that we went with utc=True/False in the past for to_datetime might be that a tz keyword can be a ambiguous on how to interpret the input data?

I've looked into this and determined the following:

In the DatetimeIndex constructor ambiguous is only used in cases where 1) the data is dt64-naive or 2) the data is object and array_to_datetime does not infer a tz. In both of these cases, DatetimeIndex(data, tz=tz, ambiguous=ambig) is equivalent to DatetimeIndex(data).tz_localize(tz, ambiguous=ambig) and also to pd.to_datetime(data).tz_localize(tz, ambiguous=ambig).

We could push ambiguous into the cython array_to_datetime_with_tz, which would allow us to respect ambiguous in some mixed-type data cases, but those cases would be pretty contrived. Moreover, we could only do this for a subset of the ambiguouss that we support, e.g. not "infer"` (allowing an arraylike would be possible but a PITA).

So i think we are safe:

1) deprecating the ambiguous keyword in DatetimeIndex, telling users to call .tz_localize after construction instead 2) Adding a tz keyword to to_datetime without a ambiguous keyword (mutually exclusive with utc, which we could deprecated) 3) deprecating dayfirst, yearfirst in DatetimeIndex.

jbrockmendel commented 9 months ago

Tried deprecating ambiguous, dayfirst, yearfirst in DatetimeIndex and hit an unexpected snag: when a freq is passed, we currently pass ambiguous to _validate_frequency, which appears to need it in test_resample_origin_with_day_freq_on_dst. Doesn't look insurmountable.