pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Wrong timestamp resolution when parsing timestamp string with comma separated milliseconds #59256

Open rocodero opened 3 months ago

rocodero commented 3 months ago

Pandas version checks

Reproducible Example

import pandas as pd

# Comma as the fractional-seconds separator
pd.Timestamp('2024-06-17T18:57:43,567')
pd.Timestamp('2024-06-17T18:57:43,567').unit
pd.Timestamp('2024-06-17T18:57:43,567') - pd.Timedelta(0)

# Dot as the fractional-seconds separator, for comparison
pd.Timestamp('2024-06-17T18:57:43.567')
pd.Timestamp('2024-06-17T18:57:43.567').unit
pd.Timestamp('2024-06-17T18:57:43.567') - pd.Timedelta(0)

Issue Description

When parsing a timestamp string that separates the milliseconds with a comma instead of a dot, the timestamp's representation shows the fractional part (as microseconds), but the stored resolution (unit), which is what any calculation with the timestamp uses, is only seconds.

The problem does not occur with more than three digits after the comma, or when using a dot as the separator.

Expected Behavior

The timestamp resolution is parsed correctly, down to the millisecond.

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.4.final.0
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22621
machine : AMD64
processor : Intel64 Family 6 Model 141 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Anurag-Varma commented 3 months ago

take

Anurag-Varma commented 3 months ago

Hi @rocodero

You can use the code below to parse a string that has a comma between the seconds and the milliseconds.

By default, Python's ISO 8601 parsing expects a '.' there. Since your string has a ',', you can parse the datetime with an explicit format:

from datetime import datetime

# Sample datetime string
datetime_str = "2024-06-17T18:57:43,567"

# Define the format
datetime_format = "%Y-%m-%dT%H:%M:%S,%f"

# Parse the datetime string
parsed_datetime = datetime.strptime(datetime_str, datetime_format)

Anurag-Varma commented 3 months ago

You can also follow this link

and use the similar alternative suggested there:

date_string.replace(',', '.')
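
A plain replace would also change any other comma in the string. As a rough sketch of a narrower variant, the helper below (the name normalize_fraction_separator is made up for this example, not part of pandas or the standard library) only swaps a comma that sits directly between HH:MM:SS and the fractional digits, and then parses the result with the standard library:

```python
import re
from datetime import datetime

# Matches only a comma that directly follows HH:MM:SS and precedes a digit.
_FRACTION_COMMA = re.compile(r"(?<=\d{2}:\d{2}:\d{2}),(?=\d)")


def normalize_fraction_separator(text: str) -> str:
    """Hypothetical helper: turn '...:43,567' into '...:43.567'."""
    return _FRACTION_COMMA.sub(".", text, count=1)


normalized = normalize_fraction_separator("2024-06-17T18:57:43,567")
parsed = datetime.fromisoformat(normalized)

print(normalized)
print(parsed.microsecond)
```

The normalized string could then be handed to pd.Timestamp instead, which takes the dot-separated path that the report describes as working.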

rocodero commented 3 months ago

Hi @Anurag-Varma,

thank you for the suggestions, plenty of options there, I know. I was trying to focus on the bug itself and keep it clean, but if workarounds are appreciated I can add them in the future.

What I forgot to add, though: the bug has only been present since pandas 2.2. Version 2.1.4 parses the comma-separated milliseconds without problems.

MarcoGorelli commented 2 months ago

thanks @rocodero for the report

aside from whether it should parse or not, it looks pretty wild to me that

In [6]: t
Out[6]: Timestamp('2024-06-17 18:57:43.567000')

In [7]: t.unit
Out[7]: 's'

In [8]: t.microsecond
Out[8]: 567000

it records non-zero microseconds but has 's' unit
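
To spell out why that combination is inconsistent, here is a small sketch in plain Python. It does not use pandas internals; the function name fraction_fits_unit and the lookup table are invented for this illustration. It checks whether a given sub-second value can be represented at a given unit at all:

```python
# Size of one tick of each unit, in nanoseconds (illustrative table).
_UNIT_STEP_NS = {"s": 1_000_000_000, "ms": 1_000_000, "us": 1_000, "ns": 1}


def fraction_fits_unit(unit: str, microsecond: int, nanosecond: int = 0) -> bool:
    """Return True if the sub-second fields are a whole number of `unit` ticks."""
    fraction_ns = microsecond * 1_000 + nanosecond
    return fraction_ns % _UNIT_STEP_NS[unit] == 0


# Values taken from the session above: unit 's' with microsecond 567000.
print(fraction_fits_unit("s", 567000))
# The same fraction at millisecond resolution.
print(fraction_fits_unit("ms", 567000))
```

With the values from the session above, the check fails for 's' and passes for 'ms', which matches the expectation in the report that the string should parse to millisecond resolution.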


looks like it's going down the dateutil path

In [9]: pd.to_datetime(['2024-06-17T18:57:43,567']*2)
<ipython-input-9-ad2044f68d63>:1: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(['2024-06-17T18:57:43,567']*2)
Out[9]: DatetimeIndex(['2024-06-17 18:57:43', '2024-06-17 18:57:43'], dtype='datetime64[ns]', freq=None)
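
The warning above suggests specifying a format. A minimal sketch of that, assuming the format string below matches the input (a literal comma followed by %f for the fractional seconds), would be the following. It sidesteps the dateutil fallback for pd.to_datetime but is only a workaround, not a fix for the pd.Timestamp behaviour reported here:

```python
import pandas as pd

raw = ["2024-06-17T18:57:43,567"] * 2

# Explicit format: literal comma, then %f for the fractional seconds.
parsed = pd.to_datetime(raw, format="%Y-%m-%dT%H:%M:%S,%f")

print(parsed)
print(parsed[0].microsecond)
```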