Probably least disruptive to have .value continue to represent nanos.

Personally, I was never fond of the vague meaning of .value and could use this as an opportunity to deprecate it in favor of Timestamp.to_epoch (#14772) and a Timedelta.to_something equivalent.
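For illustration, a rough sketch of what such an API could look like (the name Timestamp.to_epoch comes from #14772; the signature and explicit-unit behavior here are assumptions, not a settled design):

>>> # hypothetical: an explicit-unit accessor instead of the ambiguous .value
>>> pd.Timestamp("2012-01-01").to_epoch("ns")
1325376000000000000
>>> pd.Timestamp("2012-01-01").to_epoch("s")
1325376000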
The Arrow CI was failing because of this (in combination with the changed parsing of strings in Timestamp(..), https://github.com/pandas-dev/pandas/issues/50704). Now, the failure was only because a test used .value to construct the expected result, and that is easy to update.

However, I also looked into our actual conversion code, and it turns out we do use .value to access the integer value, and we currently assume that it is always in nanoseconds (https://github.com/apache/arrow/blob/2b50694c10e09e4a1343b62c6b5f44ad4403d0e1/python/pyarrow/src/arrow/python/python_to_arrow.cc#L360-L365).
It's a bit of a corner case (and apparently also not covered by a test, since we didn't get a failure for it), but it can be triggered by converting a list of Timestamp objects (or an object-dtype array) while explicitly passing a nanosecond timestamp type:
# using current pandas main branch
>>> pa.array([pd.Timestamp("2012-01-01 09:01:02")], type=pa.timestamp("ns"))
<pyarrow.lib.TimestampArray object at 0x7f46ce55ec80>
[
1970-01-01 00:00:01.325408462
]
# in the first example, the string has no fractional seconds, so the Timestamp
# is inferred as second resolution; .value then returns seconds since epoch,
# which pyarrow misreads as nanoseconds. Triggering the pd.Timestamp to be
# nanosecond resolution instead gives the correct result:
>>> pa.array([pd.Timestamp("2012-01-01 09:01:02.000000000")], type=pa.timestamp("ns"))
<pyarrow.lib.TimestampArray object at 0x7f46ac585480>
[
2012-01-01 09:01:02.000000000
]
This could be fixed on the pyarrow side by checking the unit of the Timestamp, only using .value when the unit is actually nanoseconds, and otherwise falling back to interpreting it as a datetime.datetime and getting the components that way. But changing .value would still break current pyarrow releases, so ideally we would wait a bit longer with that to give pyarrow time to update for this.
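A minimal Python sketch of that unit check (the real fix would live in pyarrow's C++ conversion path; timestamp_to_epoch_ns is a hypothetical helper, and it assumes a tz-naive Timestamp):

import datetime
import pandas as pd

def timestamp_to_epoch_ns(ts: pd.Timestamp) -> int:
    # Fast path: the raw integer is already in nanoseconds.
    if ts.unit == "ns":
        return ts.value
    # Fallback: rebuild the value from the datetime.datetime components,
    # the same way pyarrow already handles plain datetime objects.
    delta = ts.to_pydatetime() - datetime.datetime(1970, 1, 1)
    seconds = delta.days * 86_400 + delta.seconds
    return seconds * 1_000_000_000 + delta.microseconds * 1_000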
So to summarize, I agree with the above that the easiest would be to keep .value returning nanoseconds, always, regardless of the actual unit. But long term it would be good to have some way to get the raw underlying integer (one that is more efficient than building up this value from the components, as we do for datetime.datetime objects). This could indeed also be a method, as suggested by @mroeschke.
+1 on @mroeschke's suggestion
I'll work on .value today.
Two bikeshed thoughts on what to use instead of .value internally: 1) using asi8 would allow for some code-sharing with DTA/DTI/TDA/TDI (see the sketch below); 2) having something distinctive would make it easier to grep for the places where we use it.
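For context, asi8 already exists on the array/index classes, e.g. (output assumes the default nanosecond dtype):

>>> pd.DatetimeIndex(["2012-01-01"]).asi8  # int64 view of the underlying values
array([1325376000000000000])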
For now, in https://github.com/pandas-dev/pandas/pull/50891, I just made it _value. I wasn't quite prepared for how many files it would involve changing.
@jorisvandenbossche @jbrockmendel @mroeschke are we sure this is a good idea? The public-facing .value would then be unavailable for old timestamps, e.g. Timestamp('300-01-01 00:00:00'), because the conversion to nanoseconds would overflow (and that overflow is exactly what motivated the introduction of non-nanosecond resolutions).
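Concretely (a sketch assuming the string-inference behavior on main; the exact exception text may differ):

>>> ts = pd.Timestamp('300-01-01 00:00:00')
>>> ts.unit  # inferred from the string format (no fractional seconds)
's'
>>> ts.as_unit('ns')  # ~-5.3e19 ns, outside int64 range: raises OutOfBoundsDatetime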
I guess we'd document that it is for nanos only and direct users to use e.g. .asm8.view('i8') more generally.
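That accessor sidesteps the overflow because it stays in the Timestamp's own unit, e.g. (a sketch, assuming asm8 preserves the non-nano unit):

>>> ts = pd.Timestamp('300-01-01 00:00:00')
>>> ts.asm8  # numpy datetime64 in the Timestamp's own unit
numpy.datetime64('0300-01-01T00:00:00')
>>> ts.asm8.view('i8')  # the raw integer, here in seconds
-52700112000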
Sure, I'll add that to the error message.
Timedelta doesn't support non-nano reso yet, right? If so, perhaps I'll hold off on the timedelta changes for now; that'll reduce the diff and make review a bit easier.
Timedelta supports non-nano, but doesn't do inference on strings
Thanks - how do I set it? I'm seeing:
In [6]: Timedelta(days=1, unit='s').unit
Out[6]: 'ns'
Ah, got it, nvm:
In [2]: Timedelta(days=1, unit='s').as_unit('s').unit
Out[2]: 's'
In previous versions, .value has always been in nanoseconds. By changing the resolution we get for some Timestamp/Timedelta inputs, we change the .value, which is technically public API.

One option would be to just document the change in .value behavior along with the other breaking-changes docs. Another would be to use e.g. ._value internally and keep .value as representing nanos.
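A minimal sketch of that second option (simplified and hypothetical; the real implementation lives in pandas' cython extension types):

# Conversion factors from each supported unit to nanoseconds.
_NS_PER_UNIT = {"s": 1_000_000_000, "ms": 1_000_000, "us": 1_000, "ns": 1}

class Timestamp:
    def __init__(self, raw: int, unit: str):
        self._value = raw  # internal: raw integer in the object's own unit
        self.unit = unit

    @property
    def value(self) -> int:
        # Public API kept as nanoseconds, regardless of the stored unit.
        # (Python ints don't overflow; the real int64-based code would need
        # to raise for far-from-epoch values stored in a coarser unit.)
        return self._value * _NS_PER_UNIT[self.unit]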