pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API: resolution for date_range, to_datetime, timedelta_range, to_timedelta #49060

Open jbrockmendel opened 2 years ago

jbrockmendel commented 2 years ago

In 2.0 we'll support non-nanosecond datetime64 and timedelta64. ATM date_range, timedelta_range, to_datetime, and to_timedelta are still nano-only. This issue is about how to support non-nano in these functions.

Two main options: inference or a keyword. A keyword would be something like pd.date_range(start, end, periods=10, reso="ms"), and the default would be "ns". This is the simplest thing to implement, but adds more API surface.

Inference for date_range would look at start and end to determine the correct resolution. This could get messy if, e.g., start and end have different resos. ATM I'm thinking this isn't worth it.

Inference for to_datetime (really in array_to_datetime) is more compelling, in part because I expect to_datetime to be called by library code, e.g. for IO.

mroeschke commented 2 years ago

For the _range methods, what if freq is a lower resolution than reso? e.g. date_range("2022", periods=3, freq="D", reso="ms")

If the to_ methods have inference, would the resolution of each argument be collected and the highest one chosen as the inferred reso? e.g. to_timedelta([timedelta(days=1), timedelta(seconds=1), timedelta(milliseconds=1)])
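For reference, the objects in this example are standard-library timedelta values. A minimal sketch, using only the datetime module, of how they store their value regardless of which keyword built them:

```python
from datetime import timedelta

items = [timedelta(days=1), timedelta(seconds=1), timedelta(milliseconds=1)]

# Every datetime.timedelta is normalized to (days, seconds, microseconds),
# so the smallest difference it can represent is one microsecond.
print(timedelta.resolution == timedelta(microseconds=1))

for td in items:
    print(td.days, td.seconds, td.microseconds)
```

So the three items do not carry distinct resolutions of their own; the keyword used at construction is not remembered.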

jbrockmendel commented 2 years ago

> For the _range methods, what if freq is a lower resolution than reso? e.g. date_range("2022", periods=3, freq="D", reso="ms")

That wouldn't be a problem; it would be identical to date_range("2022", periods=3, freq="D").astype("M8[ms]"). What would be a problem is the reverse, where freq is a higher resolution than reso, e.g. date_range("2022", periods=3, freq="ns", reso="s"). We'd probably need to disallow that.
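A minimal sketch of the astype equivalence described here, assuming pandas 2.0 or later (where non-nanosecond datetime64 indexes are supported). The reso keyword itself is only a proposal at this point in the thread, so the sketch uses astype alone:

```python
import pandas as pd

# Daily frequency, built without specifying any resolution.
daily = pd.date_range("2022-01-01", periods=3, freq="D")

# Casting to milliseconds is lossless because every value falls on a whole day.
milli = daily.astype("M8[ms]")

print(milli.dtype)
print(list(milli.strftime("%Y-%m-%d")))
```

The cast only changes the stored unit; the dates themselves are unchanged.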

> If the to_ methods have inference, would the resolution of each argument be collected and the highest one chosen as the inferred reso? e.g. to_timedelta([timedelta(days=1), timedelta(seconds=1), timedelta(milliseconds=1)])

In that particular case they are all pytimedelta objects which all get microsecond resolution. Suppose instead we have to_timedelta([Timedelta(days=1)._as_unit(unit) for unit in ["s", "ms", "us", "ns"]]). I think the way I would implement this would be something like

def array_to_timedelta(objs):
    try:
        res = array_to_timedelta_with_reso(objs, "ns")
    except OutOfBoundsTimedelta:
        try:
            res = array_to_timedelta_with_reso(objs, "us")
        [...]
    return res

def array_to_timedelta_with_reso(objs, reso):
    for item in objs:
        # will raise if either overflow or casting involves rounding
        td = Timedelta(item)._as_unit(reso)
        [...]

This should avoid a major perf hit or API change for currently-working cases. The downside is it isn't inferring the best reso so much as the highest viable reso. Also wouldn't match scalar behavior.
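The fallback idea above can be sketched as runnable code. This is an illustration only, not the pandas implementation: it assumes pandas 2.0 or later, uses the public Timedelta.as_unit with round_ok=False in place of the private _as_unit, and the helper names highest_viable_unit and to_unit_strict are hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.errors import OutOfBoundsTimedelta

# Candidate units, finest first (hypothetical ordering for this sketch).
UNITS_FINE_TO_COARSE = ("ns", "us", "ms", "s")


def to_unit_strict(objs, unit):
    """Cast every item to `unit`; raises on overflow or on lossy rounding."""
    return [pd.Timedelta(obj).as_unit(unit, round_ok=False) for obj in objs]


def highest_viable_unit(objs):
    """Return (unit, values) for the finest unit that can hold every item."""
    for unit in UNITS_FINE_TO_COARSE:
        try:
            return unit, to_unit_strict(objs, unit)
        except OutOfBoundsTimedelta:
            continue  # too large for this unit, so try a coarser one
    raise OutOfBoundsTimedelta("no supported unit can represent all items")


small = [pd.Timedelta(days=1), pd.Timedelta(seconds=1)]
# 10**12 seconds is far beyond what int64 nanoseconds can represent.
large = [pd.Timedelta(np.timedelta64(10**12, "s")), pd.Timedelta(days=1)]

print(highest_viable_unit(small)[0])
print(highest_viable_unit(large)[0])
```

As in the pseudocode, only overflow triggers a retry with a coarser unit; an error caused by lossy rounding is left to propagate.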

mroeschke commented 2 years ago

> would be identical to date_range("2022", periods=3, freq="D").astype("M8[ms]")

Okay that is reasonable. I think if constructors have arguments that allow multiple ways to specify resolutions (freq, dtype, reso), we should definitely document the "order of operations"

wiedeflo commented 1 year ago

Since I could not find anything on this in the current release notes for 2.0.0, I wanted to ask whether there are any updates on this issue.

jbrockmendel commented 1 year ago

There is now a "unit" keyword in date_range and timedelta_range that specifies resolution. Haven't done to_datetime and to_timedelta yet.
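A short usage sketch of that keyword, assuming pandas 2.0 or later; the start values and frequencies are arbitrary choices for illustration:

```python
import pandas as pd

# Request millisecond resolution for a daily date range.
dates = pd.date_range("2022-01-01", periods=3, freq="D", unit="ms")

# Request second resolution for a daily timedelta range.
deltas = pd.timedelta_range(start="1 day", periods=3, freq="D", unit="s")

print(dates.dtype)
print(deltas.dtype)
```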

satyrmipt commented 5 months ago

> There is now a "unit" keyword in date_range and timedelta_range that specifies resolution. Haven't done to_datetime and to_timedelta yet.

The documentation doesn't say what resolution means or which values are allowed. Please add a link to the possible values on this page: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html

Edit: if you pass an arbitrary string to unit, the resulting ValueError serves as the documentation: ValueError("'unit' must be one of 's', 'ms', 'us', 'ns'")
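A small sketch of that behavior, assuming pandas 2.0 or later; try_unit is a hypothetical helper written for this example, and "D" stands in for any unsupported value:

```python
import pandas as pd


def try_unit(unit):
    """Return the resulting dtype name, or the error text if `unit` is rejected."""
    try:
        index = pd.date_range("2022-01-01", periods=3, freq="D", unit=unit)
        return str(index.dtype)
    except ValueError as err:
        return f"ValueError: {err}"


for unit in ["s", "ms", "us", "ns", "D"]:
    print(unit, "->", try_unit(unit))
```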