pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.83k stars 18k forks

BUG: dataframe.to_hdf function incompatible with timestamps of format "datetime64[us, UTC]" #60353

Open ChristophWiesmeyr opened 3 days ago

ChristophWiesmeyr commented 3 days ago

Pandas version checks

Reproducible Example

import pandas as pd

if __name__ == "__main__":
    dataframe = pd.DataFrame(
        {
            "start_time": [
                pd.to_datetime("2024-08-26 15:13:14.700000+00:00"),
                pd.to_datetime("2024-08-26 15:14:14.700000+00:00"),
            ]
        }
    )
    dataframe["start_time_us"] = dataframe.start_time.astype("datetime64[us, UTC]")

    dataframe.to_hdf("test.hdf", key="Annotations", mode="w")

    recovered_dataframe = pd.read_hdf("test.hdf", key="Annotations")

    pd.testing.assert_frame_equal(dataframe, recovered_dataframe)

Issue Description

Writing a dataframe column of dtype datetime64[us, UTC] to an HDF file seems to store it as datetime64[ns, UTC]. When the data is read back from the HDF file, the dates are wrong, probably because the stored microsecond values are interpreted as datetime64[ns, UTC].
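To make the suspected misinterpretation concrete, here is a minimal NumPy-only sketch. It assumes the failure mode guessed at above, namely that the raw integer is kept but its unit is changed from microseconds to nanoseconds. It does not touch the HDF code path and is not a confirmed diagnosis. The variable names are illustrative only.

```python
import numpy as np

# One of the timestamps from the reproducible example, at microsecond resolution.
stamp_us = np.datetime64("2024-08-26T15:13:14.700000", "us")

# The underlying integer counts microseconds since the Unix epoch.
raw = int(stamp_us.astype("int64"))

# Assumed failure mode: the same integer is read back as nanoseconds,
# which shrinks the offset from the epoch by a factor of 1000.
misread = np.datetime64(raw, "ns")

# Reading the integer with its true unit recovers the original value.
recovered = np.datetime64(raw, "us")

print("original :", stamp_us)
print("misread  :", misread)    # lands in January 1970
print("recovered:", recovered)
```

If the recovered dataframe shows dates in early 1970, that would be consistent with this explanation.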

Expected Behavior

datetime64[us, UTC] values written from a dataframe to an HDF file with the to_hdf function should be recovered unchanged by the pd.read_hdf function.

Installed Versions

INSTALLED VERSIONS
------------------
commit                 : 0691c5cf90477d3503834d983f69350f250a6ff7
python                 : 3.10.12
python-bits            : 64
OS                     : Linux
OS-release             : 5.15.0-118-generic
Version                : #128-Ubuntu SMP Fri Jul 5 09:28:59 UTC 2024
machine                : x86_64
processor              : x86_64
byteorder              : little
LC_ALL                 : None
LANG                   : en_US.UTF-8
LOCALE                 : en_US.UTF-8

pandas                 : 2.2.3
numpy                  : 1.26.4
pytz                   : 2024.2
dateutil               : 2.9.0.post0
pip                    : 24.2
Cython                 : None
sphinx                 : None
IPython                : None
adbc-driver-postgresql : None
adbc-driver-sqlite     : None
bs4                    : None
blosc                  : None
bottleneck             : None
dataframe-api-compat   : None
fastparquet            : 2024.5.0
fsspec                 : 2024.9.0
html5lib               : None
hypothesis             : None
gcsfs                  : None
jinja2                 : 3.1.4
lxml.etree             : 5.3.0
matplotlib             : 3.7.3
numba                  : None
numexpr                : 2.10.1
odfpy                  : None
openpyxl               : 3.1.5
pandas_gbq             : None
psycopg2               : None
pymysql                : None
pyarrow                : 17.0.0
pyreadstat             : None
pytest                 : None
python-calamine        : None
pyxlsb                 : None
s3fs                   : None
scipy                  : 1.14.1
sqlalchemy             : None
tables                 : 3.10.1
tabulate               : None
xarray                 : None
xlrd                   : None
xlsxwriter             : None
zstandard              : None
tzdata                 : 2024.2
qtpy                   : None
pyqt5                  : None

rhshadrach commented 3 days ago

Thanks for the report! I cannot reproduce this on main, but can on 2.2.x. Wondering if it might be due to #59089. Still need to run a git bisect to see what fixed this and if a test needs to be added.

rhshadrach commented 2 days ago

This was inadvertently fixed by #59018. Could use a test.

KevsterAmp commented 22 hours ago

Take