Open jbrockmendel opened 11 months ago
sounds good, would favour moving towards to_datetime as being the way to parse into datetime
Agreed, would be simpler if DTI/TDI just accepted correctly typed arrays
See some related notes in https://github.com/pandas-dev/pandas/issues/50894 and https://github.com/pandas-dev/pandas/issues/50411#issuecomment-1387389894, where it is also brought to let DatetimeIndex only accept ISO8601 strings, and the first issue also has some notes on differences in keywords between DatetimeIndex and to_datetime.
There is the more general question about what to support as input (only correctly typed arrays, still strings but only ISO formatted, anything that works with the Series/Index constructor when specifying the dtype, ..), although I think that for most of the keywords that's actually not that relevant.
I think dayfirst
and yearfirst
are solely related to string parsing, and we generally agree that users should use to_datetime
for that if they want fine-grained control, so I agree those can certainly be removed.
freq
we could either exempt or make a post-construction thing by making_with_freq
public.
To me this feels like the one keyword that actually makes sense to keep in the DatetimeIndex constructor, since it is a DatetimeIndex specific attribute? (like you can also set name
in the Index constructors)
to_datetime
hasutc
instead oftz
. I generally liketz
better thanutc
as a keyword, but deprecatingutc
might be painful. We could addtz
toto_datetime
and disallow passing tz-and-utc.
I also like the tz
keyword. From what I said in https://github.com/pandas-dev/pandas/issues/50894#issuecomment-1404188720, the reason that we went with utc=True/False
in the past for to_datetime
might be that a tz
keyword can be a ambiguous on how to interpret the input data? (are it localized or utc values? i.e. do you do a tz_localize
or tz_convert
with the specified timezone?)
Adding
ambiguous
to to_datetime wouldn't be too bad aside from general keyword fatigue.
I think ambiguous
is purely linked to tz
(since that is essentially needed for the tz_localize
logic), so I think the fate of this keyword can depend on what we decide for tz
the reason that we went with utc=True/False in the past for to_datetime might be that a tz keyword can be a ambiguous on how to interpret the input data?
I've looked into this and determined the following:
In the DatetimeIndex constructor ambiguous
is only used in cases where 1) the data is dt64-naive or 2) the data is object and array_to_datetime does not infer a tz. In both of these cases, DatetimeIndex(data, tz=tz, ambiguous=ambig)
is equivalent to DatetimeIndex(data).tz_localize(tz, ambiguous=ambig)
and also to pd.to_datetime(data).tz_localize(tz, ambiguous=ambig)
.
We could push ambiguous into the cython array_to_datetime_with_tz
, which would allow us to respect ambiguous
in some mixed-type data cases, but those cases would be pretty contrived. Moreover, we could only do this for a subset of the ambiguous
s that we support, e.g. not "infer"` (allowing an arraylike would be possible but a PITA).
So i think we are safe:
1) deprecating the ambiguous keyword in DatetimeIndex, telling users to call .tz_localize after construction instead
2) Adding a tz
keyword to to_datetime without a ambiguous
keyword (mutually exclusive with utc
, which we could deprecated)
3) deprecating dayfirst, yearfirst in DatetimeIndex.
Tried deprecating ambiguous, dayfirst, yearfirst in DatetimeIndex and hit an unexpected snag: when a freq is passed, we currently pass ambiguous to _validate_frequency, which appears to need it in test_resample_origin_with_day_freq_on_dst
. Doesn't look insurmountable.
DatetimeIndex.__new__
has constructor keywords (excluding ones present in the base class and ones already deprecated)freq
,tz
,ambiguous
,dayfirst
,yearfirst
.dayfirst
andyearfirst
overlap with thepd.to_datetime
keywords.Could we deprecate some/all of these keywords and point users to
pd.to_datetime
instead?freq
we could either exempt or make a post-construction thing by making_with_freq
public.to_datetime
hasutc
instead oftz
. I generally liketz
better thanutc
as a keyword, but deprecatingutc
might be painful. We could addtz
toto_datetime
and disallow passing tz-and-utc.ambiguous
to to_datetime wouldn't be too bad aside from general keyword fatigue.(similarly
TimedeltaIndex
hasunit
andfreq
,unit
being present inpd.to_timedelta
)cc @MarcoGorelli @mroeschke @jorisvandenbossche