pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API: Timestamp and Timedelta .value changing in 2.0 #49076

Closed jbrockmendel closed 1 year ago

jbrockmendel commented 2 years ago
import pandas as pd
import numpy as np

dt = np.datetime64("2016-01-01", "ms")
ts = pd.Timestamp(dt)

>>> ts.value
1451606400000        # <- 2.0/main
1451606400000000000  # <- 1.x

In previous versions .value has always been in nanoseconds. By changing the resolution we infer for some Timestamp/Timedelta inputs, we change the .value, which is technically public API.

One option would be just to document the change in .value behavior along with the other breaking changes docs.

Another would be to use e.g. ._value internally and keep .value as representing nanos.
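The second option can be sketched in a few lines. This is a hypothetical illustration, not pandas internals: `TimestampSketch`, `_value`, and `_unit` are made-up names; it only shows how a public nanosecond `.value` could sit on top of a natively-stored integer.

```python
# Hypothetical sketch -- `TimestampSketch`, `_value`, and `_unit` are
# illustrative names, not actual pandas internals.
NS_PER_UNIT = {"s": 1_000_000_000, "ms": 1_000_000, "us": 1_000, "ns": 1}

class TimestampSketch:
    def __init__(self, value: int, unit: str):
        self._value = value  # raw integer stored in the native resolution
        self._unit = unit

    @property
    def value(self) -> int:
        # Public API: always nanoseconds since the epoch, regardless of unit.
        return self._value * NS_PER_UNIT[self._unit]

ts = TimestampSketch(1451606400000, "ms")  # 2016-01-01 stored in milliseconds
assert ts.value == 1451606400000000000     # .value still reports nanoseconds
```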

mroeschke commented 2 years ago

Probably least disruptive to have .value continue to represent nanos.

Personally, I was never fond of the vague meaning of .value and could use this as an opportunity to deprecate in favor of Timestamp.to_epoch #14772 and Timedelta.to_something
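`Timestamp.to_epoch` is only a proposal (#14772), so the following is purely hypothetical; it sketches the unit conversion such a method might perform, starting from a nanosecond integer.

```python
# Hypothetical sketch of the proposed-but-nonexistent `to_epoch`:
# convert a nanosecond epoch value to a requested unit by floor division.
NS_PER = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}

def to_epoch(value_ns: int, unit: str = "ns") -> int:
    return value_ns // NS_PER[unit]

assert to_epoch(1451606400000000000, "s") == 1451606400
assert to_epoch(1451606400000000000, "ms") == 1451606400000
```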

jorisvandenbossche commented 1 year ago

The Arrow CI was failing because of this (in combination with the changed parsing of strings in Timestamp(..), https://github.com/pandas-dev/pandas/issues/50704). Now, the failure was only because of using .value in a test to construct the expected result, and that is something easy to update. However, I also looked into our actual conversion code, and it seems we do use .value to access the integer value, and we do assume that this is always nanoseconds at the moment (https://github.com/apache/arrow/blob/2b50694c10e09e4a1343b62c6b5f44ad4403d0e1/python/pyarrow/src/arrow/python/python_to_arrow.cc#L360-L365)

It's a bit of a corner case (and so apparently also not covered by a test, since we didn't get a failure for this), but can be triggered by converting a list of Timestamp objects (or object dtype array), and explicitly passing a nanoseconds timestamp type:

# using current pandas main branch
>>> pa.array([pd.Timestamp("2012-01-01 09:01:02")], type=pa.timestamp("ns"))
<pyarrow.lib.TimestampArray object at 0x7f46ce55ec80>
[
  1970-01-01 00:00:01.325408462
]

# triggering the pd.Timestamp to be nanosecond resolution -> correct result
>>> pa.array([pd.Timestamp("2012-01-01 09:01:02.000000000")], type=pa.timestamp("ns"))
<pyarrow.lib.TimestampArray object at 0x7f46ac585480>
[
  2012-01-01 09:01:02.000000000
]

This could be fixed on the pyarrow side by checking the unit of the Timestamp, and only using .value when it is actually nanoseconds, and otherwise falling back to interpreting it as datetime.datetime and getting the components that way.
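As a sketch of that fallback (stdlib only; the `unit` attribute check and the `timestamp_to_ns` helper are assumptions for illustration, not actual pyarrow code):

```python
from datetime import datetime, timezone

def epoch_ns_from_components(dt: datetime) -> int:
    # Slower fallback: rebuild nanoseconds-since-epoch from the
    # datetime.datetime components rather than reading a raw integer.
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    delta = dt - datetime(1970, 1, 1, tzinfo=timezone.utc)
    seconds = delta.days * 86_400 + delta.seconds
    return seconds * 1_000_000_000 + delta.microseconds * 1_000

def timestamp_to_ns(ts) -> int:
    # Only trust `.value` when the scalar's unit is known to be nanoseconds;
    # otherwise fall back to the component-based path above.
    if getattr(ts, "unit", None) == "ns":
        return ts.value
    return epoch_ns_from_components(ts)

# 2012-01-01 09:01:02 UTC -> 1325408462 seconds since the epoch
assert timestamp_to_ns(datetime(2012, 1, 1, 9, 1, 2)) == 1_325_408_462 * 10**9
```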

But it would break current pyarrow releases, so if we want to change .value, ideally we would wait a bit longer with that to give time for pyarrow to update for this.

So to summarize, I agree with the above that the easiest would be to keep .value as returning nanoseconds, always, regardless of the actual unit.

But long term it would be good to have some way to get the raw underlying integer (that is more efficient than building up this value from the components, as we do for datetime.datetime objects). But this could indeed also be a method, as suggested by @mroeschke

MarcoGorelli commented 1 year ago

+1 on @mroeschke 's suggestion

I'll work on .value today

jbrockmendel commented 1 year ago

Two bikeshed thoughts on what to use instead of .value internally: 1) using asi8 would allow for some code-sharing with DTA/DTI/TDA/TDI, 2) having something distinctive would make it easier to grep for places where we use it.

MarcoGorelli commented 1 year ago

For now, in https://github.com/pandas-dev/pandas/pull/50891, I just made it _value

I wasn't quite prepared for how many files it would involve changing

MarcoGorelli commented 1 year ago

@jorisvandenbossche @jbrockmendel @mroeschke are we sure this is a good idea?

Because then the public-facing .value would be unavailable for old timestamps, e.g. Timestamp('300-01-01 00:00:00'), because it would overflow (and this is what motivated the introduction of non-nanosecond resolution)
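The overflow is easy to verify with stdlib arithmetic: the year-300 timestamp fits in int64 seconds but not in int64 nanoseconds (a rough sketch, ignoring pandas entirely):

```python
from datetime import datetime, timezone

# Seconds between 0300-01-01 and the epoch fit in int64 easily...
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
seconds = int((datetime(300, 1, 1, tzinfo=timezone.utc) - epoch).total_seconds())

# ...but the same span in nanoseconds is far outside the int64 range,
# which is why a nanosecond-only `.value` cannot represent it.
nanos = seconds * 10**9
assert -2**63 < seconds < 0      # about -5.27e10 seconds: representable
assert nanos < -2**63            # out of int64 range: would overflow
```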

jbrockmendel commented 1 year ago

i guess we'd document that it is for nanos-only and direct users to use e.g. .asm8.view('i8') more generally
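Since `Timestamp.asm8` is just the underlying `numpy.datetime64`, the `.asm8.view('i8')` trick can be demonstrated with numpy alone; note it returns the raw integer in the scalar's own unit, not necessarily nanoseconds:

```python
import numpy as np

# Viewing a datetime64 scalar as 'i8' exposes its raw integer payload
# in whatever unit the scalar carries (milliseconds here).
dt = np.datetime64("2016-01-01", "ms")
raw = int(dt.view("i8"))
assert raw == 1451606400000  # ms since the epoch, matching the issue's example
```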

MarcoGorelli commented 1 year ago

sure, I'll add that to the error message

Timedelta doesn't support non-nano reso yet, right? If so, perhaps I'll hold off on the Timedelta changes for now; that'll reduce the diff and make review a bit easier

jbrockmendel commented 1 year ago

Timedelta supports non-nano, but doesn't do inference on strings

MarcoGorelli commented 1 year ago

thanks - how do I set it? I'm seeing

In [6]: Timedelta(days=1, unit='s').unit
Out[6]: 'ns'

MarcoGorelli commented 1 year ago

ah got it, nvm

In [2]: Timedelta(days=1, unit='s').as_unit('s').unit
Out[2]: 's'