pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Timestamp.fromisoformat drops ISO 8601 timezone information #56398

Open kohlerjl opened 9 months ago

kohlerjl commented 9 months ago

Reproducible Example

import pandas as pd
timestr = '2023-11-05T08:30:00Z'
ts = pd.Timestamp.fromisoformat(timestr)
assert ts == pd.to_datetime(timestr)

timestr = '2023-11-05T08:30:00+0000'
ts = pd.Timestamp.fromisoformat(timestr)
assert ts == pd.to_datetime(timestr)

ts = pd.Timestamp.utcnow()
assert ts == pd.Timestamp.fromisoformat(ts.isoformat())

Issue Description

Timestamp.fromisoformat seems to ignore timezone offsets in the ISO 8601 formatted string, and returns a timezone-naive Timestamp. This issue does not seem to be present for pd.to_datetime, which does return the correct timezone-aware Timestamp. By consequence, the roundtrip conversion pd.Timestamp.fromisoformat(ts.isoformat()) loses the timezone information.

Expected Behavior

I would expect the Timestamp returned by fromisoformat to be time-zone aware, if the ISO string has UTC offset information. This is the behavior of the datetime.datetime objects from the standard library.
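For comparison, here is a minimal sketch of the standard-library behavior referred to above. It uses only `datetime`; the sample strings are illustrative. Note that offsets written as `+HH:MM` parse on Python 3.7+, while the `Z` suffix and the colon-less `+0000` form used in the reproducible example require Python 3.11 or newer.

```python
from datetime import datetime, timezone

# An ISO 8601 string carrying an explicit UTC offset.
aware = datetime.fromisoformat("2023-11-05T08:30:00+00:00")

# The stdlib keeps the offset, so the result is timezone-aware.
print(aware.tzinfo)
print(aware.utcoffset())

# The isoformat() -> fromisoformat() round trip preserves the offset.
now = datetime.now(timezone.utc)
roundtrip = datetime.fromisoformat(now.isoformat())
assert roundtrip == now
assert roundtrip.tzinfo is not None
```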

Installed Versions

INSTALLED VERSIONS
------------------
commit : 3d492f175077788df89b6fc82608c201164d81e2
python : 3.11.6.final.0
python-bits : 64
OS : Linux
OS-release : 6.6.3-arch1-1
Version : #1 SMP PREEMPT_DYNAMIC Wed, 29 Nov 2023 00:37:40 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.utf8
LOCALE : en_US.UTF-8
pandas : 2.2.0.dev0+831.g3d492f1750
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.17.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

nekkomira commented 9 months ago

Hi, I tried addressing the bug, but it turned out to be a bit out of scope and too complex for me. As you indicated, fromisoformat does seem to drop the timezone information. I added a test case based on your example to serve as a benchmark for any future fix, so that a corrected implementation can be checked against timezone-aware input. The PR is linked above, but here's the link as well: https://github.com/pandas-dev/pandas/pull/56477

lithomas1 commented 6 months ago

Hi, I can reproduce this bug on main.

fromisoformat doesn't appear to be a method that we override ourselves (it comes from the stdlib), but it calls back into the Timestamp constructor, so that's where I'd look to see if something has gone wrong.

(That would be the Cython file pandas/_libs/tslibs/timestamps.pyx.)

bionicles commented 1 month ago

this is a "wtf" issue, we need proper timezone support in pandas, losing offsets is not an option, this is extremely serious because it can cause timestamps to be off by an entire DAY in production for many apps right now (!)

bionicles commented 1 month ago

reproduction

import datetime

import pandas as pd
import pytest

utc_example = "2024-08-02T00:00:00-00:00"
edt_example = "2024-08-02T00:00:00-04:00"
zulu_example = "2024-08-02T00:00:00Z"

@pytest.mark.parametrize("iso8601_str", [zulu_example, utc_example, edt_example])
def test_pandas_does_not_ignore_timezone(iso8601_str: str):
    pd_timestamp = pd.Timestamp.fromisoformat(iso8601_str)
    print("PANDAS pd.Timestamp:")
    print(f"{iso8601_str} -> {pd_timestamp=}")
    tz = pd_timestamp.tz
    print(f"{tz=}")
    assert tz is not None, "pandas pd.Timestamp.fromisoformat lost timezone"
    # No forced failure here: run pytest with -s to see the printed values.

@pytest.mark.parametrize("iso8601_str", [zulu_example, utc_example, edt_example])
def test_python_datetime_does_not_ignore_timezone(iso8601_str: str):
    py_datetime = datetime.datetime.fromisoformat(iso8601_str)
    print("python stdlib datetime.datetime:")
    print(f"{iso8601_str} -> {py_datetime}")
    tzinfo = py_datetime.tzinfo
    print(f"{tzinfo=}")
    assert tzinfo is not None, "python datetime.datetime.fromisoformat lost timezone"
    # No forced failure here: run pytest with -s to see the printed values.

result: yes, python gets it right and pandas gets it wrong

bionicles commented 1 month ago

yo, workaround:

pd.to_datetime works, but converts the "America/New_York" to "UTC-4:00" and fails string equality checks on the timezone :(

however, i can't find the code for pd.Timestamp.fromisoformat in order to look at it. anyone know where that is?
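A possible workaround sketch, pending a fix: avoid `Timestamp.fromisoformat` and either pass the string straight to the `pd.Timestamp` constructor or parse it with the stdlib first and wrap the result. The variable names and the sample string are illustrative only. Since an ISO 8601 string carries just a numeric offset and never a zone name, getting a named zone such as "America/New_York" back requires an explicit `tz_convert`; this assumes time zone data is available to pandas.

```python
from datetime import datetime

import pandas as pd

iso = "2024-08-02T00:00:00-04:00"

# Option 1: let the Timestamp constructor parse the string, offset included.
ts_ctor = pd.Timestamp(iso)

# Option 2: parse with the stdlib, then wrap the aware datetime.
ts_wrapped = pd.Timestamp(datetime.fromisoformat(iso))

# Both results are timezone-aware and agree with pd.to_datetime.
assert ts_ctor.tzinfo is not None
assert ts_ctor == ts_wrapped == pd.to_datetime(iso)

# The parsed value only knows the fixed offset; convert explicitly when a
# named zone is needed for string comparisons on the timezone.
ts_ny = ts_ctor.tz_convert("America/New_York")
print(ts_ny, ts_ny.tz)
```

The conversion does not change the instant in time, only the zone attached to it, so `ts_ny == ts_ctor` still holds.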