Closed weimenglee closed 1 year ago
Here is a full stack trace for people's reference:
>>> pd.read_csv('flights.csv', usecols=[7,8], use_nullable_dtypes=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zach/.local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 918, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zach/.local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 589, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File "/home/zach/.local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1721, in read
    ) = self._engine.read( # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zach/.local/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 233, in read
    data = _concatenate_chunks(chunks)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zach/.local/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 387, in _concatenate_chunks
    common_type = np.find_common_type(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/zach/.local/lib/python3.11/site-packages/numpy/core/numerictypes.py", line 649, in find_common_type
    array_types = [dtype(x) for x in array_types]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zach/.local/lib/python3.11/site-packages/numpy/core/numerictypes.py", line 649, in <listcomp>
    array_types = [dtype(x) for x in array_types]
                   ^^^^^^^^
TypeError: Cannot interpret 'string[pyarrow]' as a data type
thanks for the report - to expedite resolution, could you make a minimal reproducible example please?
Hi! Here is the code:
import pandas as pd
pd.set_option("mode.dtype_backend", "pyarrow")
df = pd.read_csv('flights.csv', usecols=[7,8], use_nullable_dtypes=True)
df
Thanks!
-- Wei-Meng Lee, Technologist & Founder, Developer Learning Solutions http://www.Learn2Develop.net Tel: (65)-9-692-4065 Twitter: @weimenglee
Is there a StringIO equivalent reproducer of this behavior? I can't access flights.csv without a Kaggle account
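A StringIO-based starting point could look like the sketch below. It is untested against the bug itself: it assumes the failure is tied to the C parser inferring different types for different chunks of the same column, which the `_concatenate_chunks` frame in the traceback suggests but does not prove. The function names, the row count, and the numeric codes `10001` and `10002` are hypothetical placeholders, not values taken from flights.csv.

```python
import io


def make_mixed_csv(rows_per_kind: int = 300_000) -> io.StringIO:
    """Build an in-memory CSV whose two columns switch from letter codes to numeric codes."""
    lines = ["ORIGIN_AIRPORT,DESTINATION_AIRPORT"]
    # First block: three-letter codes, which can only be read as strings.
    lines.extend("ANC,SEA" for _ in range(rows_per_kind))
    # Second block: hypothetical all-numeric codes, which a chunk-wise
    # parser may infer as integers instead of strings.
    lines.extend("10001,10002" for _ in range(rows_per_kind))
    return io.StringIO("\n".join(lines) + "\n")


def try_read(buf: io.StringIO):
    """Attempt the read from the report; needs pandas 2.x and pyarrow installed."""
    import pandas as pd  # imported lazily so make_mixed_csv has no dependencies

    # dtype_backend is the keyword mentioned in this thread for 2.0rc2.
    return pd.read_csv(buf, usecols=[0, 1], dtype_backend="pyarrow")
```

Calling `try_read(make_mixed_csv())` may or may not raise the `TypeError`, depending on where the parser splits its chunks, so `rows_per_kind` may need adjusting.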
Hi: I have a copy of the flights.csv at: https://bit.ly/4048vMR
Hope this helps!
Thanks!
Thanks. For now, reading the entire CSV with the pyarrow engine and then selecting the columns at positions 7 and 8 seems to work:
# dtype_backend is new syntax coming in 2.0rc2
In [6]: pd.read_csv("flights.csv", dtype_backend="pyarrow", engine="pyarrow").iloc[:, [7, 8]]
Out[6]:
        ORIGIN_AIRPORT DESTINATION_AIRPORT
0                  ANC                 SEA
1                  LAX                 PBI
2                  SFO                 CLT
3                  LAX                 MIA
4                  SEA                 ANC
...                ...                 ...
5819074            LAX                 BOS
5819075            JFK                 PSE
5819076            JFK                 SJU
5819077            MCO                 SJU
5819078            JFK                 BQN

[5819079 rows x 2 columns]
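Another option that may be worth trying is to give the parser an explicit dtype so it does not have to infer one per chunk. The sketch below uses the default backend only, where `dtype=str` is known to keep numeric-looking codes as strings. Whether the same argument avoids the error under the pyarrow backend on 2.0rc is an untested assumption. The helper name `read_airport_columns` and the sample rows are hypothetical.

```python
import io


def read_airport_columns(source):
    """Read the two airport columns as strings, skipping type inference."""
    import pandas as pd  # requires pandas to be installed

    return pd.read_csv(
        source,
        usecols=["ORIGIN_AIRPORT", "DESTINATION_AIRPORT"],
        dtype=str,  # every selected column is parsed as text
    )


if __name__ == "__main__":
    # Tiny in-memory sample mixing a letter code row and a numeric code row.
    sample = io.StringIO(
        "ORIGIN_AIRPORT,DESTINATION_AIRPORT\n"
        "ANC,SEA\n"
        "10001,10002\n"
    )
    print(read_airport_columns(sample))
```

With an explicit dtype the numeric-looking codes stay as text, which matches the expected behavior described in the report.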
This needs a minimal reproducible example; I don't see the error locally on main. Please read https://matthewrocklin.com/minimal-bug-reports
Hm, weird: now I am getting the error as well, but not all the time. Can you create a reproducible example anyway?
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I am using the 2015 flight delays dataset. Source: https://www.kaggle.com/datasets/usdot/flight-delays.
When loading columns 7 and 8, which contain mixed types (numeric and strings), using pyarrow as the backend gives this error:
TypeError: Cannot interpret 'string[pyarrow]' as a data type
Using pandas as the backend causes no issues (albeit with a type warning).
Expected Behavior
It should load the two columns as string type.
Installed Versions