pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: pandas.to_datetime reports incorrect index when failing. #59298

Closed. flownt closed this issue 1 week ago.

flownt commented 1 month ago

Pandas version checks

Reproducible Example

import pandas as pd
valid = '2024-01-01'
invalid = 'N'
S = pd.Series([valid]*45_000 + [invalid])
T = pd.to_datetime(S, format='%Y-%m-%d', exact=True) 

################################
# Traceback redacted.
################################
ValueError: time data "N" doesn't match format "%Y-%m-%d", at position 1. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

Issue Description

As you can see, a ValueError is raised. That is fine.

However, the message says the failure happens for the item at position 1, which is not correct: it happens for the item at position 45000.

Expected Behavior

The correct position should be reported to the user. I'd like to be able to do S[reported_position] and see why it raised the ValueError.
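For reference, one way to locate the failing rows today is to coerce parse failures to NaT and inspect the result. This is a minimal sketch assuming the repro above; errors='coerce' is an existing to_datetime option.

import pandas as pd

valid = '2024-01-01'
invalid = 'N'
S = pd.Series([valid] * 45_000 + [invalid])

# Coerce unparseable values to NaT instead of raising, then look up
# which original index labels failed to parse.
parsed = pd.to_datetime(S, format='%Y-%m-%d', errors='coerce')
bad = S[parsed.isna()]
print(bad)  # expected to show the value 'N' at index label 45000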

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.8.final.0
python-bits : 64
OS : Linux
OS-release : 6.9.7-amd64
Version : #1 SMP PREEMPT_DYNAMIC Debian 6.9.7-1 (2024-06-27)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 2.0.1
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 68.1.2
pip : 24.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
Lollitor commented 1 month ago

Hi, I noticed this issue and I'd like to help triage and potentially fix it. Is it okay if I proceed with investigating this bug? Thanks!

flownt commented 1 month ago

Sure! From the outside I would guess the series is processed in batches, and the numeric index into the batch is accidentally returned instead of the index of the series element. If you play with the length of the valid list you'll notice the reported position jumping around but rarely being correct.
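A quick probe of that guess, as a sketch built on the repro from the issue (it only prints whatever message pandas raises for each length, so nothing about the output is assumed):

import pandas as pd

# Vary the number of valid rows and compare the true position of the
# invalid value (n) with whatever position the error message reports.
for n in (5, 50, 500, 45_000):
    s = pd.Series(['2024-01-01'] * n + ['N'])
    try:
        pd.to_datetime(s, format='%Y-%m-%d', exact=True)
    except ValueError as exc:
        print(f'true position {n}: {exc}')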

Lollitor commented 1 month ago

Thank you for the insight! I will look into this, investigate further, and update here afterwards. Appreciate the guidance!

rhshadrach commented 1 month ago

Thanks for the report. It appears to me that duplicates are first removed from the values. It seems easiest to just remove the index from the error message. Further investigations and PRs to fix are welcome!
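A minimal check of that explanation against the repro from the issue: after deduplication the invalid string sits at position 1, which is exactly the position the error message reports.

import pandas as pd

S = pd.Series(['2024-01-01'] * 45_000 + ['N'])

# Unique values of the series: the invalid 'N' ends up at position 1,
# matching the position in the ValueError message above.
print(pd.unique(S))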

KevsterAmp commented 1 month ago

Take

KevsterAmp commented 1 month ago

I'd like to work on this, thanks

flownt commented 1 month ago

Thanks for the report. It appears to me that duplicates are first removed from the values. It seems easiest to just remove the index from the error message. Further investigations and PRs to fix are welcome!

I took some time to look into the code and this seems correct.

The error message comes from:

array_strptime in pandas/pandas/_libs/tslibs/strptime.pyx

which appears to be called indirectly from pandas/pandas/core/tools/datetimes.py:1015

The cause appears to be what @rhshadrach said: the series is first deduplicated by _maybe_cache, and only then is date parsing applied.

The branches at lines 1023 and 1029 appear to be affected as well.

Since a pandas.Series has an index, it seems most consistent to me to report the item causing the error by its index.
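To illustrate what reporting by index could look like, here is a hypothetical helper (not pandas internals; the function name is made up for this sketch) that maps a failing value back to the caller's index labels:

import pandas as pd

def original_labels_for_failure(series: pd.Series, failing_value) -> pd.Index:
    # Return the index labels in `series` whose value is the one that
    # failed to parse in the deduplicated array.
    return series.index[series == failing_value]

S = pd.Series(['2024-01-01'] * 45_000 + ['N'])
print(original_labels_for_failure(S, 'N'))  # expected to print the label 45000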

maushumee commented 4 weeks ago

Thanks for the report. It appears to me that duplicates are first removed from the values. It seems easiest to just remove the index from the error message. Further investigations and PRs to fix are welcome!

I just reproduced the issue, and this does look like the cause. I would like to contribute towards a fix.

maushumee commented 4 weeks ago

take