pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: pd.read_csv error when parse_dates used on index_col for dtype_backend="pyarrow" #57930

Open MCRE-BE opened 6 months ago

MCRE-BE commented 6 months ago

Reproducible Example

import datetime

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [datetime.datetime.now()]})
df.to_csv('test.csv', index=None)

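# Works: pyarrow engine together with the pyarrow dtype backend.
a = pd.read_csv(
    'test.csv',
    parse_dates=['b'],
    index_col="b",
    dtype_backend="pyarrow",
    engine="pyarrow",
)

# Works: default (C) engine with the default dtype backend.
pd.read_csv(
    'test.csv',
    parse_dates=['b'],
    index_col="b",
)

The following fails: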

pd.read_csv(
    'test.csv',
    parse_dates=['b'],
    index_col="b",
    dtype_backend="pyarrow",
)

Issue Description

The call raises ValueError: not all elements from date_cols are numpy arrays

ValueError                                Traceback (most recent call last)
Cell In[78], line 1
----> 1 pd.read_csv(
      2     'test.csv',
      3     parse_dates=['b'],
      4     index_col="b",
      5     dtype_backend="pyarrow",
      6 )

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\readers.py:948, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    944     dtype_backend=dtype_backend,
    945 )
    946 kwds.update(kwds_defaults)
--> 948 return _read(filepath_or_buffer, kwds)

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\readers.py:617, in _read(filepath_or_buffer, kwds)
    614     return parser
    616 with parser:
--> 617     return parser.read(nrows)

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\readers.py:1748, in TextFileReader.read(self, nrows)
   1741 nrows = validate_integer("nrows", nrows)
   1742 try:
   1743     # error: "ParserBase" has no attribute "read"
   1744     (
   1745         index,
   1746         columns,
   1747         col_dict,
-> 1748     ) = self._engine.read(  # type: ignore[attr-defined]
   1749         nrows
   1750     )
   1751 except Exception:
   1752     self.close()

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\c_parser_wrapper.py:333, in CParserWrapper.read(self, nrows)
    330     data = {k: v for k, (i, v) in zip(names, data_tups)}
    332     names, date_data = self._do_date_conversions(names, data)
--> 333     index, column_names = self._make_index(date_data, alldata, names)
    335 return index, column_names, date_data

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\base_parser.py:371, in ParserBase._make_index(self, data, alldata, columns, indexnamerow)
    369 elif not self._has_complex_date_col:
    370     simple_index = self._get_simple_index(alldata, columns)
--> 371     index = self._agg_index(simple_index)
    372 elif self._has_complex_date_col:
    373     if not self._name_processed:

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\base_parser.py:468, in ParserBase._agg_index(self, index, try_parse_dates)
    466 for i, arr in enumerate(index):
    467     if try_parse_dates and self._should_parse_dates(i):
--> 468         arr = self._date_conv(
    469             arr,
    470             col=self.index_names[i] if self.index_names is not None else None,
    471         )
    473     if self.na_filter:
    474         col_na_values = self.na_values

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\base_parser.py:1142, in _make_date_converter.<locals>.converter(col, *date_cols)
   1139     return date_cols[0]
   1141 if date_parser is lib.no_default:
-> 1142     strs = parsing.concat_date_cols(date_cols)
   1143     date_fmt = (
   1144         date_format.get(col) if isinstance(date_format, dict) else date_format
   1145     )
   1147 with warnings.catch_warnings():

File parsing.pyx:1155, in pandas._libs.tslibs.parsing.concat_date_cols()

ValueError: not all elements from date_cols are numpy arrays

Expected Behavior

As with dtype_backend="numpy_nullable", the index column should be parsed correctly.

It works with engine="pyarrow" (the first call in the reproducible example), so this is likely an issue with the C parser (?)

Installed Versions

INSTALLED VERSIONS
------------------
commit : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673
python : 3.11.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Dutch_Belgium.1252

pandas : 2.1.4
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 69.2.0
pip : 24.0
Cython : None
pytest : 8.1.1
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
bs4 : 4.12.3
bottleneck : 1.3.8
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.0
gcsfs : None
matplotlib : 3.8.3
numba : 0.59.0
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.12.0
sqlalchemy : 2.0.28
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None

lithomas1 commented 6 months ago

Hm, this works on main for me.

Can you try a newer version of pandas (e.g. 2.2.1)?

pmhatre1 commented 6 months ago

@lithomas1 I could reproduce this error on pandas version 2.3.0

pmhatre1 commented 6 months ago

@lithomas1 Do you need any help triaging other defects/bugs?

lithomas1 commented 6 months ago

My bad, forgot to run the second command.

Nice catch.

Looks like parsing.concat_date_cols (which concatenates date_cols) is getting passed an ArrowExtensionArray.

Concatenating date columns is deprecated (to be removed in 3.0), so it should be easy to make this work for 3.0.

MCRE-BE commented 6 months ago

> Can you try a newer version of pandas (e.g. 2.2.1)?

My bad. I should have checked instead of assuming that the kernel I was using had the latest version. Thanks for the work.

WillAyd commented 1 month ago

FWIW I was getting the same issue on pandas 2.2 when using "numpy_nullable" as the backend, but this appears to be working on pandas version 3.0.0.dev0+811.gcc427ec305?

MCRE-BE commented 2 weeks ago

Still present in 2.2.2

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Dutch_Belgium.1252

pandas : 2.2.2
numpy : 2.0.1
pytz : 2024.1
dateutil : 2.9.0
setuptools : 72.2.0
pip : 24.2
Cython : None
pytest : 8.3.2
hypothesis : None
sphinx : 7.4.7
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : 5.3.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.21.0
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.4.0
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.1
gcsfs : None
matplotlib : 3.9.1
numba : 0.60.0
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.14.1
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : 0.23.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None