pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.58k stars 17.57k forks source link

API: allow casting datetime64 to int64? #45034

Open jorisvandenbossche opened 2 years ago

jorisvandenbossche commented 2 years ago

The original deprecation happened in https://github.com/pandas-dev/pandas/pull/38544

This comment is from https://github.com/pandas-dev/pandas/issues/22384#issuecomment-921024358, moving it here to a separate issue.

But, on the specific datetime -> integer deprecation:

There is then some follow-up discussion in the issue below https://github.com/pandas-dev/pandas/issues/22384#issuecomment-921024358

I would personally propose to keep allowing astype() for datetime64 -> int64, and not steer users to view() for this.

cc @jbrockmendel

jbrockmendel commented 2 years ago

@jorisvandenbossche can you add to the OP the 2-3 relevant responses from that thread.

jreback commented 2 years ago

ok I reread the discussion a bit.

jbrockmendel commented 2 years ago

The other way around (integer -> datetime / timedelta) is not deprecated

This point I find compelling. IIRC deprecating in that direction was too invasive to be feasible.

A thought that didn't come up on the old thread: what happens if/when we have non-nano? Does dt64second.astype(int64) also do a .view(int64), or does it do some division?

Implementation questions, some from the old thread:

1) Do we raise on dt64.astype(int64) when NaTs are present? (analogous to what we do for float->int with nans) 2) Can we at least only allow dt64.astype(int64), i.e. not allow dt64.astype(int32) or dt64.astype(uint64) (which ATM we ignore and just cast to int64) 3) Do we allow dt64.astype(float)? 4) If we're pretending that dt64.astype(int64) is semantically meaningful, do we do the same for dt64tz or Period? Heck even Categorical?

jorisvandenbossche commented 2 years ago

A thought that didn't come up on the old thread: what happens if/when we have non-nano? Does dt64second.astype(int64) also do a .view(int64), or does it do some division?

I would think that it will return the underlying integers (no calculation), so the exact integers you get is dependent on the resolution you have. That's also one of the reasons we can't simply change the default resolution. But that's an issue anyhow, regardless of people using astype vs view for this conversion.

1. Do we raise on dt64.astype(int64) when NaTs are present? (analogous to what we do for float->int with nans)

Long term I would say yes, but that's something we will also need to deprecate first, as currently it returns the integer representation of NaT.

2. Can we at least only allow dt64.astype(int64), i.e. not allow dt64.astype(int32) or dt64.astype(uint64) (which ATM we ignore and just cast to int64)

Casting to int32 already raises an error. Personally, I would allow this in the future (if we error on overflow), but since it raises already there is no urgency on this aspect.

3. Do we allow dt64.astype(float)?

That doesn't work currently, so I think we can defer that question to a later discussion on the more general casting rules (which is now partly happening in https://github.com/pandas-dev/pandas/issues/22384, but I need to open dedicated issue for aspects of that discussion (such as also the idea of safer casting by default detecting overflow etc).

  1. If we're pretending that dt64.astype(int64) is semantically meaningful, do we do the same for dt64tz or Period? Heck even Categorical?

For Categorical, you have the .codes to access the underlying integers using public API, so I don't think it's necessarily needed to support this through casting as well (for categorical, the casting generally happens at the level of the categories, not codes).

For Period it's a bit less clear: the scalar as .ordinal, and the array and index have .asi8, but that's not accessible from Series. But so I see now that the Period -> int64 casting is deprecated similarly as datetime64 (this issue). So if we undeprecate datetime64->int64 casting, we can for now do the same for Period?

jorisvandenbossche commented 2 years ago
  1. Do we raise on dt64.astype(int64) when NaTs are present? (analogous to what we do for float->int with nans)

Long term I would say yes, but that's something we will also need to deprecate first, as currently it returns the integer representation of NaT.

Actually, it's a bit more complicated than that. We did that in the last release (1.3), but did raise an error already before that. But only for tz-naive, and not for tz-aware .. Also casting to uint64 already raised an error before 1.3, while this works in 1.3 / master (and it actually also doesn't trigger a deprecation warning in 1.3). But again only for tz-naive, while for tz-aware it always worked.

An overview of the behaviours:

pandas 1.0 - 1.2

has_nat no NaT with NaT
dtype target_dtype
datetime64[ns, Europe/Brussels] int32 works works
int64 works works
uint64 works works
datetime64[ns] int32 cannot astype a datetimelike from [datetime64[ns]] to [int32] cannot astype a datetimelike from [datetime64[ns]] to [int32]
int64 works Cannot convert NaT values to integer
uint64 cannot astype a datetimelike from [datetime64[ns]] to [uint64] cannot astype a datetimelike from [datetime64[ns]] to [uint64]

pandas 1.3

has_nat no NaT with NaT
dtype target_dtype
datetime64[ns, Europe/Brussels] int32 cannot astype a datetimelike from [datetime64[ns, Europe/Brussels]] to [int32] cannot astype a datetimelike from [datetime64[ns, Europe/Brussels]] to [int32]
int64 works (+warning) works (+warning)
uint64 works (+warning) works (+warning)
datetime64[ns] int32 cannot astype a datetimelike from [datetime64[ns]] to [int32] cannot astype a datetimelike from [datetime64[ns]] to [int32]
int64 works (+warning) works (+warning)
uint64 works (+warning) works (+warning)
Code to generate the table ```python import pandas as pd import warnings warnings.simplefilter("always") print(pd.__version__) results = [] for dt_dtype in ["datetime64[ns]", "datetime64[ns, Europe/Brussels]"]: s = pd.Series(["2012-01-01", "2012-01-02", "NaT"], dtype=dt_dtype) for end in [2, 3]: has_nat = "no NaT" if end == 3: has_nat = "with NaT" for dtype in ["int64", "int32", "uint64"]: try: with warnings.catch_warnings(record=True) as record: s.iloc[:end].astype(dtype) except Exception as err: res = str(err) else: if record: res = "works (+warning)" else: res = "works" results.append((dt_dtype, dtype, has_nat, res)) df = pd.DataFrame(results, columns=["dtype", "target_dtype", "has_nat", "result"]) print(df.pivot(index=["dtype", "target_dtype"], columns=["has_nat"], values="result").to_html()) ```
jbrockmendel commented 2 years ago

It gets more complicated yet: Series.astype dissallows casting to int32, but DatetimeIndex and DatetimeArray treat any numpy integer dtype as i8.

jbrockmendel commented 2 years ago

Apparently we've got relevant astype behavior defined in DatetimeLikeArrayMixin.astype, astype_nansafe, and astype_array.

Thoughts for the eventual behavior: 1) All of these should match. Ideally there should just be one code path. 2) .astype(int_dtype) should raise for any int_dtype other than np.int64. 3) raise if there are NaTs present, like we do for float->int

jorisvandenbossche commented 2 years ago

2. astype(int_dtype) should raise for any int_dtype other than np.int64.

I personally don't see why such restriction is needed (if we make that cast safe, so checked for overflow).

Now, for Series, that is more or less the status quo, so that is certainly OK as the starting point (except for "uint64", which currently is allowed).

jreback commented 2 years ago

deprecation undone.

simonjayhawkins commented 2 years ago

removing milestone

jbrockmendel commented 2 years ago

Now, for Series, that is more or less the status quo, so that is certainly OK as the starting point (except for "uint64", which currently is allowed).

We want internal consistency between Series and DTI/DTA, which means either loosening Series or tightening DTI/DTA. I am averse to loosening Series on the grounds that all of these are semantic gibberish entirely dependent on implementation details. i8 is reasonable to allow only because we allow the conversion in the other direction.

updateNever mind. I'm choosing to care about consistency and forget about the rest.

jorisvandenbossche commented 2 years ago

Sidenote: I also think the "casting to non-int64" dicussion is not that important. From the table above (https://github.com/pandas-dev/pandas/issues/45034#issuecomment-1000998887), it seems that in pandas 1.0-1.2 casting to int32 actually worked for tz-aware data, and it started raising an error in pandas 1.3 (for Series at least). And I can't remember any one complaining about this (of course tz-aware might only be the smaller subset of datetime usage).

jbrockmendel commented 2 years ago

I also think the "casting to non-int64" dicussion is that important

is there a missing negative here? if not and you do think its important, then i propose you open a PR and as long as the behavior is consistent ill give it a thumbs up

jorisvandenbossche commented 2 years ago

is there a missing negative here?

Yes, certainly :) (updated my comment)

jorisvandenbossche commented 2 years ago

It also extends to non-int dtypes. For example, we allow casting from float to datetime64/timedelta64, but then casting back doesn't work:

>>> pd.Series([1.0, 2.0], dtype="float64").astype("timedelta64[ns]")
0   0 days 00:00:00.000000001
1   0 days 00:00:00.000000002
dtype: timedelta64[ns]

>>> pd.Series([1.0, 2.0], dtype="float64").astype("timedelta64[ns]").astype("float64")
...
TypeError: Cannot cast TimedeltaArray to dtype float64

Which creates some inconsistency (why not allow to cast back to float if we allow casting from float?). But of course could also go towards disallowing the float->datetimelike cast to solve that inconsistency.
On the other hand, then you could still always do Series[float].astype("int64").astype("timedelta64[ns]") to achieve exactly the same, so why bother with disallowing the direct cast if a user for some reason wants to do such a cast? (as long as we allow the int->datetimelike cast)

jbrockmendel commented 2 years ago

For example, we allow casting from float to datetime64/timedelta64

side-note: We allow this for Series but not for Float64Index.

On the other hand, then you could still always do Series[float].astype("int64").astype("timedelta64[ns]") to achieve exactly the same, so why bother with disallowing the direct cast if a user for some reason wants to do such a cast?

I find this reasonable in this particular case, but I'm wary of transitivity-based arguments more generally bc it would mean that since we allow Series[dt64].astype(int64).astype(td64) we should then allow Series[dt64].astype(td64)

jorisvandenbossche commented 2 years ago

I think there is still a clear difference between both cases, which can be enough reason to allow the direct cast in the one case but not in the other. In case of float->int->datetime, the first part of float->int doesn't change the interpretation of the values (only potentially some truncation) and only the int->datetime step does (and the actual step from float to datetime therefore still makes sense). While in the datetime->int->timedelta, you have twice a change in interpretation of the values, and the direct step from datetime to timedelta no longer makes much sense.

jbrockmendel commented 2 years ago

This could tie in to the "safe" discussion if coercion behavior is also lumped in to that (which on the last call seemed popular). e.g. the coerce=True (or whatever the keyword/value ends up being) behavior could allow the controversial casting and the coerce=False could disallow it.

_from_sequence_not_strict and maybe_cast_pointwise_result are both a bit kludgy, might be handle-able by such a keyword.

jbrockmendel commented 1 year ago

just found that trying to ArrowExtensionArray with datetime-like dtype can't be cast to int64. Only tangentially relevant, but need to write it down somewhere

jbrockmendel commented 1 year ago

More for the pile:

if we allow Series(dt64).astype(np.int64), does that mean we should allow Series(dt64, dtype=np.int64)? ATM that works but is deprecated if the values are ndarray, just works for EA.

jorisvandenbossche commented 1 year ago

if we allow Series(dt64).astype(np.int64), does that mean we should allow Series(dt64, dtype=np.int64)?

Probably yes? Unless we would move away of the idea that Series(..., dtype=dtype) should be consistent with Series(..).astype(dtype) ?

jbrockmendel commented 1 year ago

Another tangentially related datapoint: we have special-casing in Index.astype:

        elif is_float_dtype(self.dtype) and needs_i8_conversion(dtype):
            # NB: this must come before the ExtensionDtype check below
            # TODO: this differs from Series behavior; can/should we align them?
            raise TypeError(
                f"Cannot convert Float64Index to dtype {dtype}; integer "
                "values are required for conversion"
            )

AFAICT this exists to make test_subtype_datetimelike in the IntervalIndex tests to work:

    @pytest.mark.parametrize("subtype", ["datetime64[ns]", "timedelta64[ns]"])
    def test_subtype_datetimelike(self, index, subtype):
        dtype = IntervalDtype(subtype, "right")
        msg = "Cannot convert .* to .*; subtypes are incompatible"
        with pytest.raises(TypeError, match=msg):
           index.astype(dtype)

I have no problem with disallowing the IntervalIndex.astype here, but it falls into the "allow all of them or none of them" category.

jorisvandenbossche commented 1 year ago

Can you give an actual example? It's a bit hard to interpret those snippets out of context

jbrockmendel commented 1 year ago

Can you give an actual example? It's a bit hard to interpret those snippets out of context

idx = pd.IntervalIndex.from_breaks([1.5, 2.5, 3.5])
dtype = pd.IntervalDtype("datetime64[ns]")

idx.left.astype("datetime64[ns]")  # <- raises, AFAICT only to make sure the next line raises
idx.astype(dtype)  # <- raises

But bc this special-casing is done specifically inside Index.astype, we also have:

pd.Series(idx.left).astype("datetime64[ns]")  # <- doesn't raise
jorisvandenbossche commented 1 year ago

Why do you think this would only have been done for IntervalIndex? Because we only have tests for IntervalIndex that covers this? So for Series we do allow casting float to datetime (https://github.com/pandas-dev/pandas/issues/45034#issuecomment-1020249768). Personally, I don't have a strong opinion about casting to and from float for datetimelike (float->datetime and datetime->float). I think it would also be fine to disallow it altogether, but we should probably do it consistently for both directions?

jbrockmendel commented 1 year ago

Why do you think this would only have been done for IntervalIndex? Because we only have tests for IntervalIndex that covers this?

Most of the tests that hit this are for IntervalIndex. Now that I look at the blame, the check specific to needs_i8_conversion dtypes was added https://github.com/pandas-dev/pandas/pull/18937/files#diff-04d55a40a3293df94601d8b4aff4babebe4c1532d8174692bdef7f5bcb12c33fR315 and before that it would raise but looked like a catch-all.

I think it would also be fine to disallow it altogether, but we should probably do it consistently for both directions?

Agreed.