pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API: Timestamp and Timedelta .value changing in 2.0 #49076

Closed jbrockmendel closed 1 year ago

jbrockmendel commented 2 years ago
import pandas as pd
import numpy as np

dt = np.datetime64("2016-01-01", "ms")
ts = pd.Timestamp(dt)

>>> ts.value
1451606400000        # <- 2.0/main
1451606400000000000  # <- 1.x

In previous versions .value has always been in nanoseconds. By changing the resolution we infer for some Timestamp/Timedelta inputs, we change the .value, which is technically public API.

One option would be just to document the change in .value behavior along with the other breaking changes docs.

Another would be to use e.g. ._value internally and keep .value as representing nanos.
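The second option can be sketched in a few lines. This is a hypothetical illustration, not pandas internals: `TimestampSketch`, `_value`, and `_unit` are made-up names; it only shows how a public nanosecond `.value` could sit on top of a natively-stored integer.

```python
# Hypothetical sketch -- `TimestampSketch`, `_value`, and `_unit` are
# illustrative names, not actual pandas internals.
NS_PER_UNIT = {"s": 1_000_000_000, "ms": 1_000_000, "us": 1_000, "ns": 1}

class TimestampSketch:
    def __init__(self, value: int, unit: str):
        self._value = value  # raw integer stored in the native resolution
        self._unit = unit

    @property
    def value(self) -> int:
        # Public API: always nanoseconds since the epoch, regardless of unit.
        return self._value * NS_PER_UNIT[self._unit]

ts = TimestampSketch(1451606400000, "ms")  # 2016-01-01 stored in milliseconds
assert ts.value == 1451606400000000000     # .value still reports nanoseconds
```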

mroeschke commented 2 years ago

Probably least disruptive to have .value continue to represent nanos.

Personally, I was never fond of the vague meaning of .value and could use this as an opportunity to deprecate in favor of Timestamp.to_epoch #14772 and Timedelta.to_something
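`Timestamp.to_epoch` is only a proposal (#14772), so the following is purely hypothetical; it sketches the unit conversion such a method might perform, starting from a nanosecond integer.

```python
# Hypothetical sketch of the proposed-but-nonexistent `to_epoch`:
# convert a nanosecond epoch value to a requested unit by floor division.
NS_PER = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}

def to_epoch(value_ns: int, unit: str = "ns") -> int:
    return value_ns // NS_PER[unit]

assert to_epoch(1451606400000000000, "s") == 1451606400
assert to_epoch(1451606400000000000, "ms") == 1451606400000
```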

jorisvandenbossche commented 1 year ago

The Arrow CI was failing because of this (in combination with the changed parsing of strings in Timestamp(..), https://github.com/pandas-dev/pandas/issues/50704). Now, the failure was only because of using .value in a test to construct the expected result, and that is something easy to update. However, I also looked into our actual conversion code, and it seems we do use .value to access the integer value, and we do assume that this is always nanoseconds at the moment (https://github.com/apache/arrow/blob/2b50694c10e09e4a1343b62c6b5f44ad4403d0e1/python/pyarrow/src/arrow/python/python_to_arrow.cc#L360-L365)

It's a bit of a corner case (and so apparently also not covered by a test, since we didn't get a failure for this), but can be triggered by converting a list of Timestamp objects (or object dtype array), and explicitly passing a nanoseconds timestamp type:

# using current pandas main branch
>>> pa.array([pd.Timestamp("2012-01-01 09:01:02")], type=pa.timestamp("ns"))
<pyarrow.lib.TimestampArray object at 0x7f46ce55ec80>
[
  1970-01-01 00:00:01.325408462
]

# triggering the pd.Timestamp to be nanosecond resolution -> correct result
>>> pa.array([pd.Timestamp("2012-01-01 09:01:02.000000000")], type=pa.timestamp("ns"))
<pyarrow.lib.TimestampArray object at 0x7f46ac585480>
[
  2012-01-01 09:01:02.000000000
]

This could be fixed on the pyarrow side by checking the unit of the Timestamp, and only using .value when it is actually nanoseconds, and otherwise falling back to interpreting it as datetime.datetime and getting the components that way.
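As a sketch of that fallback (stdlib only; the `unit` attribute check and the `timestamp_to_ns` helper are assumptions for illustration, not actual pyarrow code):

```python
from datetime import datetime, timezone

def epoch_ns_from_components(dt: datetime) -> int:
    # Slower fallback: rebuild nanoseconds-since-epoch from the
    # datetime.datetime components rather than reading a raw integer.
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    delta = dt - datetime(1970, 1, 1, tzinfo=timezone.utc)
    seconds = delta.days * 86_400 + delta.seconds
    return seconds * 1_000_000_000 + delta.microseconds * 1_000

def timestamp_to_ns(ts) -> int:
    # Only trust `.value` when the scalar's unit is known to be nanoseconds;
    # otherwise fall back to the component-based path above.
    if getattr(ts, "unit", None) == "ns":
        return ts.value
    return epoch_ns_from_components(ts)

# 2012-01-01 09:01:02 UTC -> 1325408462 seconds since the epoch
assert timestamp_to_ns(datetime(2012, 1, 1, 9, 1, 2)) == 1_325_408_462 * 10**9
```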

But it would break current pyarrow releases, so if we want to change .value, ideally we would wait a bit longer with that to give time for pyarrow to update for this.

So to summarize, I agree with the above that the easiest would be to keep .value as returning nanoseconds, always, regardless of the actual unit.

But long term it would be good to have some way to get the raw underlying integer (that is more efficient than building up this value from the components, as we do for datetime.datetime objects). But this could indeed also be a method, as suggested by @mroeschke

MarcoGorelli commented 1 year ago

+1 on @mroeschke 's suggestion

I'll work on .value today

jbrockmendel commented 1 year ago

Two bikeshed thoughts on what to use instead of .value internally: 1) using asi8 would allow for some code-sharing with DTA/DTI/TDA/TDI, 2) having something distinctive would make it easier to grep for places where we use it.

MarcoGorelli commented 1 year ago

For now, in https://github.com/pandas-dev/pandas/pull/50891, I just made it _value

I wasn't quite prepared for how many files it would involve changing

MarcoGorelli commented 1 year ago

@jorisvandenbossche @jbrockmendel @mroeschke are we sure this is a good idea?

Because then the public-facing .value would be unavailable for old timestamps, e.g. Timestamp('300-01-01 00:00:00'), because it would overflow (and this is what motivated the introduction of non-nanosecond resolution)
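The overflow is easy to verify with stdlib arithmetic: the year-300 timestamp fits in int64 seconds but not in int64 nanoseconds (a rough sketch, ignoring pandas entirely):

```python
from datetime import datetime, timezone

# Seconds between 0300-01-01 and the epoch fit in int64 easily...
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
seconds = int((datetime(300, 1, 1, tzinfo=timezone.utc) - epoch).total_seconds())

# ...but the same span in nanoseconds is far outside the int64 range,
# which is why a nanosecond-only `.value` cannot represent it.
nanos = seconds * 10**9
assert -2**63 < seconds < 0      # about -5.27e10 seconds: representable
assert nanos < -2**63            # out of int64 range: would overflow
```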

jbrockmendel commented 1 year ago

i guess we'd document that it is for nanos-only and direct users to use e.g. .asm8.view('i8') more generally
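Since `Timestamp.asm8` is just the underlying `numpy.datetime64`, the `.asm8.view('i8')` trick can be demonstrated with numpy alone; note it returns the raw integer in the scalar's own unit, not necessarily nanoseconds:

```python
import numpy as np

# Viewing a datetime64 scalar as 'i8' exposes its raw integer payload
# in whatever unit the scalar carries (milliseconds here).
dt = np.datetime64("2016-01-01", "ms")
raw = int(dt.view("i8"))
assert raw == 1451606400000  # ms since the epoch, matching the issue's example
```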

MarcoGorelli commented 1 year ago

sure, I'll add that to the error message

Timedelta doesn't support non-nano reso yet, right? If so, perhaps I'll hold off on the Timedelta changes for now; that'll reduce the diff and make review a bit easier

jbrockmendel commented 1 year ago

Timedelta supports non-nano, but doesn't do inference on strings

MarcoGorelli commented 1 year ago

thanks - how do I set it? I'm seeing

In [6]: Timedelta(days=1, unit='s').unit
Out[6]: 'ns'

MarcoGorelli commented 1 year ago

ah got it, nvm

In [2]: Timedelta(days=1, unit='s').as_unit('s').unit
Out[2]: 's'