pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.83k stars 18k forks source link

BUG: subtle conversion issue for to_datetime #60371

Open michael72 opened 2 days ago

michael72 commented 2 days ago

Pandas version checks

Reproducible Example

import pandas as pd

dates = ["2024-01-01", "2024-01-02", "2024-01-03"]
df = pd.DataFrame({"date": dates})
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
print(df.dtypes["date"]) # datetime64[ns]

df["date"] = df["date"].astype("datetime64[ms]")
print(df.dtypes["date"]) # datetime64[ms]
print(df) # dates are OK
#         date
# 0 2024-01-01
# 1 2024-01-02
# 2 2024-01-03
# up to now the data was actually read from a parquet file
# where the date column was datetime64[ms]

df["date"] = pd.to_datetime(df["date"], unit="ns")
# it still is datetime64[ms]
assert(df.dtypes["date"] == "datetime64[ns]")

Issue Description

This is a rather constructed case but we had a very subtle bug reading parquet data that had the ds column stored as datetime64[ms] (previously str) and actually needed the result as datetime64[ns]. While to_datetime works for other types it does not change the used unit, when the underlying type is already datetime64 - and it does so silently.

Expected Behavior

I would either expect an exception, that the type already is datetime64 or the result to have the correct unit - here: datetime64[ns] and not using the same as the input (here: datetime64[ms])

Installed Versions

INSTALLED VERSIONS ------------------ commit : 0691c5cf90477d3503834d983f69350f250a6ff7 python : 3.12.2 python-bits : 64 OS : Linux OS-release : 6.8.0-48-generic Version : #48~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 7 11:24:13 UTC 2 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 2.2.3 numpy : 2.1.3 pytz : 2024.2 dateutil : 2.9.0.post0 pip : 24.0 Cython : None sphinx : None IPython : 8.29.0 adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.3 blosc : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : 2024.10.0 html5lib : None hypothesis : None gcsfs : None jinja2 : 3.1.4 lxml.etree : None matplotlib : 3.9.2 numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None psycopg2 : None pymysql : None pyarrow : 18.0.0 pyreadstat : None pytest : None python-calamine : None pyxlsb : None s3fs : 2024.10.0 scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlsxwriter : None zstandard : None tzdata : 2024.2 qtpy : None pyqt5 : None
michael72 commented 2 days ago

issue in prophet that stems from this one: https://github.com/facebook/prophet/issues/2529

jorisvandenbossche commented 1 day ago

@michael72 this is a subtle issue and something that caused confusion before, if I recall correctly.

The main explanation for the behaviour is that the unit keyword denotes the resolution of the input, and not the output (the docstring for it also says that somewhat, although reading it now it's not super clear ..), and so it is only used for numeric input. So essentially in the example above it is just ignored, because the input is not numeric, and since it is already datetime64, the input is returned as is.

Thus, as far as I know, this is the expected behaviour. Though, I agree this is confusing (as long as we only supported ns, that was probably OK, but now that we support multiple units, it's very logical to think that specifying unit means the output unit ..). I think it would be nice if we could for example change the default for unit to None (essentially make it required to specify if you want to use), so we can raise an informative error message when the user passes it in a wrong case.

If you want to converting an existing datetime64 column to a different unit, you can use the general astype or the specific as_unit() (available on DatetimeIndex or on Series through the .dt. accessor)

rhshadrach commented 12 hours ago

Agreed the docs could be clarified here. Perhaps we could also rename unit? Not sure if that'd be worth the churn. Another possibility would be to raise if unit is provided but not used because the input is not numeric.