Open jamesdow21 opened 2 weeks ago
Side note at the beginning: this is somewhat similar to the fix in #53382, but that was handling a `DatetimeTZDtype` being converted to a pyarrow timestamp dtype.
After a bit of further investigation, I've mostly found what changed between 2.0.3 and 2.1.0rc0 to cause this.
Call stack setup (which is just forwarding along the keyword arguments in this case; see the sketch after this list):

- `pd.Series.convert_dtypes` is inherited from `pandas.core.generic.NDFrame.convert_dtypes`
- `NDFrame.convert_dtypes` calls `.apply` with `.convert_dtypes` on each Block, which calls `pandas.core.dtypes.cast.convert_dtypes` with the underlying `.values` array (as the argument `input_array`)
- In this case, `input_array` is an `ArrowExtensionArray`, still with the original `"timestamp[ns, tz=UTC][pyarrow]"` `ArrowDtype`
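For reference, the same chain can be exercised directly. This is a minimal sketch (not from the original report) that goes through the private `pandas.core.dtypes.cast.convert_dtypes` function, whose signature may differ between versions:

```python
import pandas as pd
import pyarrow as pa
from pandas.core.dtypes.cast import convert_dtypes  # private function, used here only to illustrate

# A Series that already has a tz-aware pyarrow timestamp dtype
ser = pd.Series(
    [pd.Timestamp("2023-01-01", tz="UTC")],
    dtype=pd.ArrowDtype(pa.timestamp("ns", tz="UTC")),
)
print(ser.dtype)        # timestamp[ns, tz=UTC][pyarrow]
print(type(ser.array))  # ArrowExtensionArray

# Roughly what the Block-level .convert_dtypes ends up calling
new_dtype = convert_dtypes(ser.array, dtype_backend="pyarrow")
print(new_dtype)        # timestamp[ns][pyarrow] on affected versions, i.e. the timezone is dropped
```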
The first big `if` and `elif` blocks in `pandas.core.dtypes.cast.convert_dtypes` are False, so it goes to the else block, which sets `inferred_dtype = input_array.dtype`.
The remainder of the function is:

```python
if dtype_backend == "pyarrow":
    from pandas.core.arrays.arrow.array import to_pyarrow_type
    from pandas.core.arrays.string_ import StringDtype

    assert not isinstance(inferred_dtype, str)

    if (
        (convert_integer and inferred_dtype.kind in "iu")
        or (convert_floating and inferred_dtype.kind in "fc")
        or (convert_boolean and inferred_dtype.kind == "b")
        or (convert_string and isinstance(inferred_dtype, StringDtype))
        or (
            inferred_dtype.kind not in "iufcb"
            and not isinstance(inferred_dtype, StringDtype)
        )
    ):
        if isinstance(inferred_dtype, PandasExtensionDtype) and not isinstance(
            inferred_dtype, DatetimeTZDtype
        ):
            base_dtype = inferred_dtype.base
        elif isinstance(inferred_dtype, (BaseMaskedDtype, ArrowDtype)):
            base_dtype = inferred_dtype.numpy_dtype
        elif isinstance(inferred_dtype, StringDtype):
            base_dtype = np.dtype(str)
        else:
            base_dtype = inferred_dtype
        if (
            base_dtype.kind == "O"  # type: ignore[union-attr]
            and input_array.size > 0
            and isna(input_array).all()
        ):
            import pyarrow as pa

            pa_type = pa.null()
        else:
            pa_type = to_pyarrow_type(base_dtype)
        if pa_type is not None:
            inferred_dtype = ArrowDtype(pa_type)
elif dtype_backend == "numpy_nullable" and isinstance(inferred_dtype, ArrowDtype):
    # GH 53648
    inferred_dtype = _arrow_dtype_mapping()[inferred_dtype.pyarrow_dtype]

# error: Incompatible return value type (got "Union[str, Union[dtype[Any],
# ExtensionDtype]]", expected "Union[dtype[Any], ExtensionDtype]")
return inferred_dtype  # type: ignore[return-value]
```
`inferred_dtype.kind` is `"M"` and it's not a `StringDtype`, so the last condition of the `if` is True. `inferred_dtype` is an `ArrowDtype`, so `inferred_dtype.numpy_dtype` gets assigned to `base_dtype`.
In pandas 2.0.3, the numpy dtype was `dtype('O')`, so `to_pyarrow_type` returned None and `inferred_dtype` was never modified. But in pandas >= 2.1.0rc0, the numpy dtype is `dtype('<M8[ns]')`, and `to_pyarrow_type` converts it to a tz-naive timestamp dtype because numpy datetimes do not have timezone information.
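To make that concrete, here is a small sketch (not from the original report), using the same private `to_pyarrow_type` helper that the function above imports:

```python
import numpy as np
import pandas as pd
import pyarrow as pa
from pandas.core.arrays.arrow.array import to_pyarrow_type  # private helper used by convert_dtypes

dtype = pd.ArrowDtype(pa.timestamp("ns", tz="UTC"))

# pandas >= 2.1.0rc0: the tz-aware Arrow timestamp maps to a tz-naive numpy dtype
print(dtype.numpy_dtype)  # dtype('<M8[ns]')  (was dtype('O') in 2.0.3)

# Converting that naive numpy dtype back to a pyarrow type loses the timezone
print(to_pyarrow_type(np.dtype("datetime64[ns]")))  # timestamp[ns]
```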
The changes to `ArrowDtype.numpy_dtype` in #51800 are the root cause of this regression:
```diff
     def numpy_dtype(self) -> np.dtype:
         """Return an instance of the related numpy dtype"""
+        if pa.types.is_timestamp(self.pyarrow_dtype):
+            # pa.timestamp(unit).to_pandas_dtype() returns ns units
+            # regardless of the pyarrow timestamp units.
+            # This can be removed if/when pyarrow addresses it:
+            # https://github.com/apache/arrow/issues/34462
+            return np.dtype(f"datetime64[{self.pyarrow_dtype.unit}]")
+        if pa.types.is_duration(self.pyarrow_dtype):
+            # pa.duration(unit).to_pandas_dtype() returns ns units
+            # regardless of the pyarrow duration units
+            # This can be removed if/when pyarrow addresses it:
+            # https://github.com/apache/arrow/issues/34462
+            return np.dtype(f"timedelta64[{self.pyarrow_dtype.unit}]")
         if pa.types.is_string(self.pyarrow_dtype):
             # pa.string().to_pandas_dtype() = object which we don't want
             return np.dtype(str)
         try:
             return np.dtype(self.pyarrow_dtype.to_pandas_dtype())
         except (NotImplementedError, TypeError):
             return np.dtype(object)
```
Without those blocks, `self.pyarrow_dtype.to_pandas_dtype()` returns a `DatetimeTZDtype`, which then raises a `TypeError` when passed to `np.dtype`, so the `object` numpy dtype is returned instead.
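A short sketch of that older fallback path, assuming pyarrow's tz-aware `to_pandas_dtype()` behaviour described above:

```python
import numpy as np
import pyarrow as pa

pa_type = pa.timestamp("ns", tz="UTC")
pandas_dtype = pa_type.to_pandas_dtype()
print(pandas_dtype)  # datetime64[ns, UTC]  (a pandas DatetimeTZDtype)

try:
    np.dtype(pandas_dtype)
except TypeError:
    # DatetimeTZDtype is not a valid numpy dtype, so the except branch
    # in numpy_dtype falls back to np.dtype(object)
    print("TypeError")
```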
The linked issues have been resolved, so those `if` blocks could presumably now just be removed (while ensuring that the minimum pyarrow version includes those fixes) and this would be fixed.
An even simpler fix, though, is to just modify the line in `pandas.core.dtypes.cast.convert_dtypes` to be `if dtype_backend == "pyarrow" and not isinstance(inferred_dtype, ArrowDtype):`
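A rough sketch of that suggested one-line change (not an actual PR diff):

```diff
-if dtype_backend == "pyarrow":
+if dtype_backend == "pyarrow" and not isinstance(inferred_dtype, ArrowDtype):
```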
I'm not sure why it's currently trying to go from arrow dtype -> numpy dtype -> arrow dtype. It seems intentional, though, since there's explicitly

```python
elif isinstance(inferred_dtype, (BaseMaskedDtype, ArrowDtype)):
    base_dtype = inferred_dtype.numpy_dtype
```
Thanks for the report, confirmed on main. The suggested fix of removing the if statements seems to resolve the issue, and I'm seeing no test failures locally. PRs are welcome!
take
take
The table in this section of the data types API docs reads to me as implying that a pyarrow tz-aware timestamp dtype should map to a numpy datetime64 dtype. I would definitely defer to someone else's judgement on whether that is correct, or if there should be a distinction in that linked table between a `pa.timestamp()` type with and without a timezone.
take
take
Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
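The original example was not captured here; the following is a minimal sketch of a reproduction based on the description below:

```python
import pandas as pd
import pyarrow as pa

ser = pd.Series(
    [pd.Timestamp("2023-01-01", tz="UTC")],
    dtype=pd.ArrowDtype(pa.timestamp("ns", tz="UTC")),
)
print(ser.dtype)  # timestamp[ns, tz=UTC][pyarrow]

converted = ser.convert_dtypes(dtype_backend="pyarrow")
print(converted.dtype)  # timestamp[ns][pyarrow] on affected versions (timezone stripped); expected to stay tz-aware
```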
Issue Description
Calling `.convert_dtypes(dtype_backend="pyarrow")` on a Series that is already a timezone-aware pyarrow timestamp dtype strips the timezone information. Testing on older versions, this seems to be a regression introduced sometime between versions 2.0.3 and 2.1.0rc0.
Expected Behavior
No change should be made to the dtype.
Installed Versions