pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `.convert_dtypes(dtype_backend="pyarrow")` strips timezone from tz-aware pyarrow timestamp Series #60237

Open jamesdow21 opened 2 weeks ago

jamesdow21 commented 2 weeks ago

Pandas version checks

Reproducible Example

>>> import pandas as pd
>>> s = pd.Series(pd.to_datetime(range(5), utc=True, unit="h"), dtype="timestamp[ns, tz=UTC][pyarrow]")
>>> s
0    1970-01-01 00:00:00+00:00
1    1970-01-01 01:00:00+00:00
2    1970-01-01 02:00:00+00:00
3    1970-01-01 03:00:00+00:00
4    1970-01-01 04:00:00+00:00
dtype: timestamp[ns, tz=UTC][pyarrow]
>>> s.convert_dtypes(dtype_backend="pyarrow")
0    1970-01-01 00:00:00
1    1970-01-01 01:00:00
2    1970-01-01 02:00:00
3    1970-01-01 03:00:00
4    1970-01-01 04:00:00
dtype: timestamp[ns][pyarrow]

Issue Description

Calling .convert_dtypes(dtype_backend="pyarrow") on a Series that already has a timezone-aware pyarrow timestamp dtype strips the timezone information.

Testing against older versions suggests this is a regression introduced sometime between 2.0.3 and 2.1.0rc0.

Expected Behavior

No change should be made to the dtype; the Series should keep its timestamp[ns, tz=UTC][pyarrow] dtype.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 3f7bc81ae6839803ecc0da073fe83e9194759550
python : 3.12.2
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252
pandas : 3.0.0.dev0+1654.g3f7bc81ae6
numpy : 2.1.3
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : 18.0.0
pyreadstat : None
pytest : None
python-calamine : None
pytz : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None
jamesdow21 commented 2 weeks ago

side note at the beginning: this is somewhat similar to the fix in #53382, but that was handling a DatetimeTZDtype being converted to a pyarrow timestamp dtype

After a touch of further investigation, I've mostly found what changed between 2.0.3 and 2.1.0rc0 to cause this

The call-stack setup (which in this case just forwards the keyword arguments along) leads into pandas.core.dtypes.cast.convert_dtypes.

In this case, input_array is an ArrowExtensionArray still with the original "timestamp[ns, tz=UTC][pyarrow]" ArrowDtype

The first big if and elif blocks in pandas.core.dtypes.cast.convert_dtypes evaluate to False, so execution falls through to the else block, which sets inferred_dtype = input_array.dtype
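In miniature, the dispatch just described looks like this (a toy sketch with string dtype names and hypothetical helpers, not the real pandas code):

```python
# Toy model of the convert_dtypes dispatch: numeric/boolean conversion
# branches are skipped, so the input dtype passes straight through to
# the backend-specific block below.
def infer_convert_dtype(input_dtype: str,
                        convert_integer: bool = True,
                        convert_floating: bool = True) -> str:
    if convert_integer and input_dtype in ("int32", "int64"):
        return "Int64"          # nullable-integer path (not taken here)
    elif convert_floating and input_dtype in ("float32", "float64"):
        return "Float64"        # nullable-float path (not taken here)
    else:
        # tz-aware arrow timestamps fall through unchanged, exactly as
        # inferred_dtype = input_array.dtype does in the real function
        return input_dtype

print(infer_convert_dtype("timestamp[ns, tz=UTC][pyarrow]"))
```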

The remainder of the function is

    if dtype_backend == "pyarrow":
        from pandas.core.arrays.arrow.array import to_pyarrow_type
        from pandas.core.arrays.string_ import StringDtype

        assert not isinstance(inferred_dtype, str)

        if (
            (convert_integer and inferred_dtype.kind in "iu")
            or (convert_floating and inferred_dtype.kind in "fc")
            or (convert_boolean and inferred_dtype.kind == "b")
            or (convert_string and isinstance(inferred_dtype, StringDtype))
            or (
                inferred_dtype.kind not in "iufcb"
                and not isinstance(inferred_dtype, StringDtype)
            )
        ):
            if isinstance(inferred_dtype, PandasExtensionDtype) and not isinstance(
                inferred_dtype, DatetimeTZDtype
            ):
                base_dtype = inferred_dtype.base
            elif isinstance(inferred_dtype, (BaseMaskedDtype, ArrowDtype)):
                base_dtype = inferred_dtype.numpy_dtype
            elif isinstance(inferred_dtype, StringDtype):
                base_dtype = np.dtype(str)
            else:
                base_dtype = inferred_dtype
            if (
                base_dtype.kind == "O"  # type: ignore[union-attr]
                and input_array.size > 0
                and isna(input_array).all()
            ):
                import pyarrow as pa

                pa_type = pa.null()
            else:
                pa_type = to_pyarrow_type(base_dtype)
            if pa_type is not None:
                inferred_dtype = ArrowDtype(pa_type)
    elif dtype_backend == "numpy_nullable" and isinstance(inferred_dtype, ArrowDtype):
        # GH 53648
        inferred_dtype = _arrow_dtype_mapping()[inferred_dtype.pyarrow_dtype]

    # error: Incompatible return value type (got "Union[str, Union[dtype[Any],
    # ExtensionDtype]]", expected "Union[dtype[Any], ExtensionDtype]")
    return inferred_dtype  # type: ignore[return-value]

inferred_dtype.kind is "M" and it's not a StringDtype, so the last condition of the if is True

inferred_dtype is an ArrowDtype, so inferred_dtype.numpy_dtype gets assigned to base_dtype

In pandas 2.0.3, the numpy dtype was dtype('O'), so to_pyarrow_type returned None and inferred_dtype was never modified.

But in pandas >=2.1.0rc0, the numpy dtype is dtype('<M8[ns]'), and to_pyarrow_type converts it to a tz-naive timestamp dtype, because numpy datetimes carry no timezone information.

jamesdow21 commented 2 weeks ago

The changes to ArrowDtype.numpy_dtype in #51800 are the root cause of this regression

    def numpy_dtype(self) -> np.dtype:
        """Return an instance of the related numpy dtype"""
+        if pa.types.is_timestamp(self.pyarrow_dtype):
+            # pa.timestamp(unit).to_pandas_dtype() returns ns units
+            # regardless of the pyarrow timestamp units.
+            # This can be removed if/when pyarrow addresses it:
+            # https://github.com/apache/arrow/issues/34462
+            return np.dtype(f"datetime64[{self.pyarrow_dtype.unit}]")
+        if pa.types.is_duration(self.pyarrow_dtype):
+            # pa.duration(unit).to_pandas_dtype() returns ns units
+            # regardless of the pyarrow duration units
+            # This can be removed if/when pyarrow addresses it:
+            # https://github.com/apache/arrow/issues/34462
+            return np.dtype(f"timedelta64[{self.pyarrow_dtype.unit}]")
        if pa.types.is_string(self.pyarrow_dtype):
            # pa.string().to_pandas_dtype() = object which we don't want
            return np.dtype(str)
        try:
            return np.dtype(self.pyarrow_dtype.to_pandas_dtype())
        except (NotImplementedError, TypeError):
            return np.dtype(object)

Without those blocks, self.pyarrow_dtype.to_pandas_dtype() returns a DatetimeTZDtype, which raises a TypeError when passed to np.dtype, so the object numpy dtype is returned instead.
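That fallback is easy to reproduce with a stand-in object (FakeDatetimeTZDtype is a hypothetical class used only for illustration; numpy is the only dependency):

```python
import numpy as np

class FakeDatetimeTZDtype:
    """Stand-in for DatetimeTZDtype: not something np.dtype understands."""

try:
    base_dtype = np.dtype(FakeDatetimeTZDtype())
except TypeError:
    # the pre-2.1 behaviour described above: fall back to object, which
    # to_pyarrow_type then maps to None, leaving the dtype untouched
    base_dtype = np.dtype(object)

print(base_dtype)
```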

The linked pyarrow issue has been resolved, so those if blocks could presumably now just be removed (while ensuring that the minimum supported pyarrow version includes those fixes), and this would be fixed.

An even simpler fix, though, is to modify the line in pandas.core.dtypes.cast.convert_dtypes to read if dtype_backend == "pyarrow" and not isinstance(inferred_dtype, ArrowDtype):
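As a sketch, that guard would short-circuit arrow-backed inputs before any numpy round trip (toy string dtypes and a hypothetical mapping helper, not the actual pandas implementation):

```python
def map_numpy_to_arrow(dtype: str) -> str:
    # hypothetical stand-in for the numpy -> pyarrow dtype mapping
    return f"{dtype}[pyarrow]"

def convert_for_backend(inferred_dtype: str, dtype_backend: str,
                        is_arrow_dtype: bool) -> str:
    # the suggested extra condition: an ArrowDtype input is returned
    # as-is instead of being round-tripped through a numpy dtype
    if dtype_backend == "pyarrow" and not is_arrow_dtype:
        return map_numpy_to_arrow(inferred_dtype)
    return inferred_dtype

print(convert_for_backend("timestamp[ns, tz=UTC][pyarrow]", "pyarrow", True))
print(convert_for_backend("datetime64[ns]", "pyarrow", False))
```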

I'm not sure why it's currently trying to go from arrow dtype -> numpy dtype -> arrow dtype

It seems intentional though since there's explicitly

            elif isinstance(inferred_dtype, (BaseMaskedDtype, ArrowDtype)):
                base_dtype = inferred_dtype.numpy_dtype
rhshadrach commented 2 weeks ago

Thanks for the report, confirmed on main. The suggested fix of removing the if statements seems to resolve the issue, and I'm seeing no test failures locally. PRs are welcome!

jasonmokk commented 2 weeks ago

take

road-1 commented 2 weeks ago

take

jamesdow21 commented 2 weeks ago

The table in this section of the data types API docs reads to me as implying that a pyarrow tz-aware timestamp dtype should map to a numpy datetime64 dtype.

I would definitely defer to someone else's judgement on whether that is correct, or whether that table should distinguish between a pa.timestamp() type with and without a timezone.

AmeyGaneshMedewar commented 2 weeks ago

take

NOBODIDI commented 1 day ago

take