Open jorisvandenbossche opened 2 years ago
@jorisvandenbossche can you add to the OP the 2-3 relevant responses from that thread.
ok I reread the discussion a bit.
.view
(though common in numpy) is not common in pandas and we should undeprecate here and allow this type of casting (note that we did this in 1.3 so its a change again)The other way around (integer -> datetime / timedelta) is not deprecated
This point I find compelling. IIRC deprecating in that direction was too invasive to be feasible.
A thought that didn't come up on the old thread: what happens if/when we have non-nano? Does dt64second.astype(int64) also do a .view(int64), or does it do some division?
Implementation questions, some from the old thread:
1) Do we raise on dt64.astype(int64) when NaTs are present? (analogous to what we do for float->int with nans) 2) Can we at least only allow dt64.astype(int64), i.e. not allow dt64.astype(int32) or dt64.astype(uint64) (which ATM we ignore and just cast to int64) 3) Do we allow dt64.astype(float)? 4) If we're pretending that dt64.astype(int64) is semantically meaningful, do we do the same for dt64tz or Period? Heck even Categorical?
A thought that didn't come up on the old thread: what happens if/when we have non-nano? Does dt64second.astype(int64) also do a .view(int64), or does it do some division?
I would think that it will return the underlying integers (no calculation), so the exact integers you get is dependent on the resolution you have. That's also one of the reasons we can't simply change the default resolution. But that's an issue anyhow, regardless of people using astype vs view for this conversion.
1. Do we raise on dt64.astype(int64) when NaTs are present? (analogous to what we do for float->int with nans)
Long term I would say yes, but that's something we will also need to deprecate first, as currently it returns the integer representation of NaT.
2. Can we at least only allow dt64.astype(int64), i.e. not allow dt64.astype(int32) or dt64.astype(uint64) (which ATM we ignore and just cast to int64)
Casting to int32 already raises an error. Personally, I would allow this in the future (if we error on overflow), but since it raises already there is no urgency on this aspect.
3. Do we allow dt64.astype(float)?
That doesn't work currently, so I think we can defer that question to a later discussion on the more general casting rules (which is now partly happening in https://github.com/pandas-dev/pandas/issues/22384, but I need to open dedicated issue for aspects of that discussion (such as also the idea of safer casting by default detecting overflow etc).
- If we're pretending that dt64.astype(int64) is semantically meaningful, do we do the same for dt64tz or Period? Heck even Categorical?
For Categorical, you have the .codes
to access the underlying integers using public API, so I don't think it's necessarily needed to support this through casting as well (for categorical, the casting generally happens at the level of the categories, not codes).
For Period it's a bit less clear: the scalar as .ordinal
, and the array and index have .asi8
, but that's not accessible from Series. But so I see now that the Period -> int64 casting is deprecated similarly as datetime64 (this issue). So if we undeprecate datetime64->int64 casting, we can for now do the same for Period?
- Do we raise on dt64.astype(int64) when NaTs are present? (analogous to what we do for float->int with nans)
Long term I would say yes, but that's something we will also need to deprecate first, as currently it returns the integer representation of NaT.
Actually, it's a bit more complicated than that. We did that in the last release (1.3), but did raise an error already before that. But only for tz-naive, and not for tz-aware .. Also casting to uint64 already raised an error before 1.3, while this works in 1.3 / master (and it actually also doesn't trigger a deprecation warning in 1.3). But again only for tz-naive, while for tz-aware it always worked.
An overview of the behaviours:
pandas 1.0 - 1.2
has_nat | no NaT | with NaT | |
---|---|---|---|
dtype | target_dtype | ||
datetime64[ns, Europe/Brussels] | int32 | works | works |
int64 | works | works | |
uint64 | works | works | |
datetime64[ns] | int32 | cannot astype a datetimelike from [datetime64[ns]] to [int32] | cannot astype a datetimelike from [datetime64[ns]] to [int32] |
int64 | works | Cannot convert NaT values to integer | |
uint64 | cannot astype a datetimelike from [datetime64[ns]] to [uint64] | cannot astype a datetimelike from [datetime64[ns]] to [uint64] |
pandas 1.3
has_nat | no NaT | with NaT | |
---|---|---|---|
dtype | target_dtype | ||
datetime64[ns, Europe/Brussels] | int32 | cannot astype a datetimelike from [datetime64[ns, Europe/Brussels]] to [int32] | cannot astype a datetimelike from [datetime64[ns, Europe/Brussels]] to [int32] |
int64 | works (+warning) | works (+warning) | |
uint64 | works (+warning) | works (+warning) | |
datetime64[ns] | int32 | cannot astype a datetimelike from [datetime64[ns]] to [int32] | cannot astype a datetimelike from [datetime64[ns]] to [int32] |
int64 | works (+warning) | works (+warning) | |
uint64 | works (+warning) | works (+warning) |
It gets more complicated yet: Series.astype dissallows casting to int32, but DatetimeIndex and DatetimeArray treat any numpy integer dtype as i8.
Apparently we've got relevant astype behavior defined in DatetimeLikeArrayMixin.astype, astype_nansafe, and astype_array.
Thoughts for the eventual behavior: 1) All of these should match. Ideally there should just be one code path. 2) .astype(int_dtype) should raise for any int_dtype other than np.int64. 3) raise if there are NaTs present, like we do for float->int
2. astype(int_dtype) should raise for any int_dtype other than np.int64.
I personally don't see why such restriction is needed (if we make that cast safe, so checked for overflow).
Now, for Series, that is more or less the status quo, so that is certainly OK as the starting point (except for "uint64", which currently is allowed).
deprecation undone.
removing milestone
Now, for Series, that is more or less the status quo, so that is certainly OK as the starting point (except for "uint64", which currently is allowed).
We want internal consistency between Series and DTI/DTA, which means either loosening Series or tightening DTI/DTA. I am averse to loosening Series on the grounds that all of these are semantic gibberish entirely dependent on implementation details. i8 is reasonable to allow only because we allow the conversion in the other direction.
updateNever mind. I'm choosing to care about consistency and forget about the rest.
Sidenote: I also think the "casting to non-int64" dicussion is not that important. From the table above (https://github.com/pandas-dev/pandas/issues/45034#issuecomment-1000998887), it seems that in pandas 1.0-1.2 casting to int32 actually worked for tz-aware data, and it started raising an error in pandas 1.3 (for Series at least). And I can't remember any one complaining about this (of course tz-aware might only be the smaller subset of datetime usage).
I also think the "casting to non-int64" dicussion is that important
is there a missing negative here? if not and you do think its important, then i propose you open a PR and as long as the behavior is consistent ill give it a thumbs up
is there a missing negative here?
Yes, certainly :) (updated my comment)
It also extends to non-int dtypes. For example, we allow casting from float to datetime64/timedelta64, but then casting back doesn't work:
>>> pd.Series([1.0, 2.0], dtype="float64").astype("timedelta64[ns]")
0 0 days 00:00:00.000000001
1 0 days 00:00:00.000000002
dtype: timedelta64[ns]
>>> pd.Series([1.0, 2.0], dtype="float64").astype("timedelta64[ns]").astype("float64")
...
TypeError: Cannot cast TimedeltaArray to dtype float64
Which creates some inconsistency (why not allow to cast back to float if we allow casting from float?). But of course could also go towards disallowing the float->datetimelike cast to solve that inconsistency.
On the other hand, then you could still always do Series[float].astype("int64").astype("timedelta64[ns]")
to achieve exactly the same, so why bother with disallowing the direct cast if a user for some reason wants to do such a cast? (as long as we allow the int->datetimelike cast)
For example, we allow casting from float to datetime64/timedelta64
side-note: We allow this for Series but not for Float64Index.
On the other hand, then you could still always do Series[float].astype("int64").astype("timedelta64[ns]") to achieve exactly the same, so why bother with disallowing the direct cast if a user for some reason wants to do such a cast?
I find this reasonable in this particular case, but I'm wary of transitivity-based arguments more generally bc it would mean that since we allow Series[dt64].astype(int64).astype(td64) we should then allow Series[dt64].astype(td64)
I think there is still a clear difference between both cases, which can be enough reason to allow the direct cast in the one case but not in the other.
In case of float->int->datetime
, the first part of float->int
doesn't change the interpretation of the values (only potentially some truncation) and only the int->datetime
step does (and the actual step from float to datetime therefore still makes sense). While in the datetime->int->timedelta
, you have twice a change in interpretation of the values, and the direct step from datetime to timedelta no longer makes much sense.
This could tie in to the "safe" discussion if coercion behavior is also lumped in to that (which on the last call seemed popular). e.g. the coerce=True
(or whatever the keyword/value ends up being) behavior could allow the controversial casting and the coerce=False
could disallow it.
_from_sequence_not_strict and maybe_cast_pointwise_result are both a bit kludgy, might be handle-able by such a keyword.
just found that trying to ArrowExtensionArray with datetime-like dtype can't be cast to int64. Only tangentially relevant, but need to write it down somewhere
More for the pile:
if we allow Series(dt64).astype(np.int64)
, does that mean we should allow Series(dt64, dtype=np.int64)
? ATM that works but is deprecated if the values are ndarray, just works for EA.
if we allow
Series(dt64).astype(np.int64)
, does that mean we should allowSeries(dt64, dtype=np.int64)
?
Probably yes? Unless we would move away of the idea that Series(..., dtype=dtype)
should be consistent with Series(..).astype(dtype)
?
Another tangentially related datapoint: we have special-casing in Index.astype
:
elif is_float_dtype(self.dtype) and needs_i8_conversion(dtype):
# NB: this must come before the ExtensionDtype check below
# TODO: this differs from Series behavior; can/should we align them?
raise TypeError(
f"Cannot convert Float64Index to dtype {dtype}; integer "
"values are required for conversion"
)
AFAICT this exists to make test_subtype_datetimelike
in the IntervalIndex tests to work:
@pytest.mark.parametrize("subtype", ["datetime64[ns]", "timedelta64[ns]"])
def test_subtype_datetimelike(self, index, subtype):
dtype = IntervalDtype(subtype, "right")
msg = "Cannot convert .* to .*; subtypes are incompatible"
with pytest.raises(TypeError, match=msg):
index.astype(dtype)
I have no problem with disallowing the IntervalIndex.astype here, but it falls into the "allow all of them or none of them" category.
Can you give an actual example? It's a bit hard to interpret those snippets out of context
Can you give an actual example? It's a bit hard to interpret those snippets out of context
idx = pd.IntervalIndex.from_breaks([1.5, 2.5, 3.5])
dtype = pd.IntervalDtype("datetime64[ns]")
idx.left.astype("datetime64[ns]") # <- raises, AFAICT only to make sure the next line raises
idx.astype(dtype) # <- raises
But bc this special-casing is done specifically inside Index.astype, we also have:
pd.Series(idx.left).astype("datetime64[ns]") # <- doesn't raise
Why do you think this would only have been done for IntervalIndex? Because we only have tests for IntervalIndex that covers this? So for Series we do allow casting float to datetime (https://github.com/pandas-dev/pandas/issues/45034#issuecomment-1020249768). Personally, I don't have a strong opinion about casting to and from float for datetimelike (float->datetime and datetime->float). I think it would also be fine to disallow it altogether, but we should probably do it consistently for both directions?
Why do you think this would only have been done for IntervalIndex? Because we only have tests for IntervalIndex that covers this?
Most of the tests that hit this are for IntervalIndex. Now that I look at the blame, the check specific to needs_i8_conversion dtypes was added https://github.com/pandas-dev/pandas/pull/18937/files#diff-04d55a40a3293df94601d8b4aff4babebe4c1532d8174692bdef7f5bcb12c33fR315 and before that it would raise but looked like a catch-all.
I think it would also be fine to disallow it altogether, but we should probably do it consistently for both directions?
Agreed.
The original deprecation happened in https://github.com/pandas-dev/pandas/pull/38544
This comment is from https://github.com/pandas-dev/pandas/issues/22384#issuecomment-921024358, moving it here to a separate issue.
But, on the specific datetime -> integer deprecation:
astype
, but then point people to theview
instead. There are usecases where you need the integers (eg if you want to do some custom rounding, or need to feed it to a system that requires unix time as integers, ...), and personally I would rather have users go toastype
thanview
(becauseastype
is the more standard method for this, + if we would go with copy-on-write, this gets a bit a strange method ...)In addition, using
view
will actually error for non-equal size bitwidth (astype
actually as well, but that's something we can change, while forview
that is inherent to the method). Andview
can also silently overflow if converting to uint64, while forastype
we could check for that. In general, I seeview
as an advanced method you should only use if you really know what you are doing (and in general you don't really need in pandas, I think)There is then some follow-up discussion in the issue below https://github.com/pandas-dev/pandas/issues/22384#issuecomment-921024358
I would personally propose to keep allowing
astype()
for datetime64 -> int64, and not steer users toview()
for this.cc @jbrockmendel