pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: pd.read_csv date parsing not working with dtype_backend="pyarrow" and missing values #59904

Open mhabets opened 2 weeks ago

mhabets commented 2 weeks ago

Pandas version checks

Reproducible Example

import io

import pandas as pd

data = """date,id
20/12/2025,a
,b
31/12/2020,c
"""
df = pd.read_csv(io.StringIO(data), parse_dates=["date"], dayfirst=True, dtype_backend="pyarrow")
df.dtypes
# date        string[pyarrow_numpy]
# id          large_string[pyarrow]

Issue Description

The dtype of the date column is not a date dtype. Date parsing fails because of the missing value in the second row.

Expected Behavior

dtype for the date column should be timestamp[ns][pyarrow]

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.5.final.0
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 154 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : English_United Kingdom.1252
pandas : 2.2.2
numpy : 2.0.1
pytz : 2024.1
dateutil : 2.9.0
setuptools : 72.1.0
pip : 24.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.3.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.26.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.9.2
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
zstandard : 0.23.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

yuanx749 commented 2 weeks ago

On the main branch, df.dtypes seems to be:

date      datetime64[s]
id      string[pyarrow]

rhshadrach commented 2 weeks ago

Thanks for the report. As @yuanx749 mentioned, it appears this is already fixed on the main branch. We should run a git bisect to see what PR fixed this in order to determine whether tests need to be added.

rhshadrach commented 2 weeks ago

A git bisect points at #58622 as the fix. From what I can tell, that was not intentional, so it could use a test.

cc @mroeschke

wshaoul commented 1 week ago

take

tohfas commented 1 week ago

Hi, I am looking at this issue for my university assignment. Can it be assigned to me? Please let me know. Thanks.

sunlight798 commented 1 week ago

Hello, I am a new contributor to pandas. Can I work on this issue? Thanks.

bcxpro commented 6 days ago

> A git bisect points at #58622 as the fix. From what I can tell, that was not intentional, so it could use a test.
>
> cc @mroeschke

I had not seen your post, and I also ran a bisect that reached the same conclusion.

For tag v2.2.3:

date             object
id      string[pyarrow]

After a17d44943f5be20172715a0929c32b7bf5e45840:

date     datetime64[ns]
id      string[pyarrow]