pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org

BUG: failed to roundtrip to pyarrow Table with ArrowDtype for DictionaryArray #58207

Open · wenjuno opened this issue 7 months ago

wenjuno commented 7 months ago


Reproducible Example

import pyarrow as pa
import pandas as pd

# Build a table with a dictionary-encoded string column.
tbl = pa.table({'a': pa.array(['a', 'b']).dictionary_encode()})
# Convert to pandas, keeping the Arrow type via ArrowDtype.
df = tbl.to_pandas(types_mapper=pd.ArrowDtype)
# Round-trip back to a pyarrow Table ...
tbl2 = pa.Table.from_pandas(df)
# ... and back to pandas again; this last step raises.
df2 = tbl2.to_pandas(types_mapper=pd.ArrowDtype)

Issue Description

It seems the Arrow-specific pandas dtype is carried over into the pyarrow Table's metadata (tbl2 in the example), and when converting back to pandas (the last line, constructing df2) pyarrow cannot parse that metadata. I get this exception:

Traceback (most recent call last):
  File "test_pandas_roundtrip.py", line 7, in <module>
    df2 = tbl2.to_pandas(types_mapper=pd.ArrowDtype)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 883, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 4251, in pyarrow.lib.Table._to_pandas
  File "/tmp/arrow/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 769, in table_to_dataframe
    ext_columns_dtypes = _get_extension_dtypes(
                         ^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/arrow/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 828, in _get_extension_dtypes
    pandas_dtype = _pandas_api.pandas_dtype(dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/pandas-shim.pxi", line 147, in pyarrow.lib._PandasAPIShim.pandas_dtype
  File "pyarrow/pandas-shim.pxi", line 150, in pyarrow.lib._PandasAPIShim.pandas_dtype
  File "/tmp/arrow/lib/python3.11/site-packages/pandas/core/dtypes/common.py", line 1645, in pandas_dtype
    npdtype = np.dtype(dtype)
              ^^^^^^^^^^^^^^^
  File "/tmp/arrow/lib/python3.11/site-packages/numpy/core/_internal.py", line 176, in _commastring
    raise ValueError(
ValueError: format number 1 of "dictionary<values=string, indices=int32, ordered=0>[pyarrow]" is not recognized
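
For context, the string that fails to parse looks like the stringified ArrowDtype that pandas stored in the table's pandas metadata. A minimal sketch for inspecting that metadata, assuming the standard b'pandas' schema-metadata key that pyarrow uses:

import json

import pandas as pd
import pyarrow as pa

tbl = pa.table({'a': pa.array(['a', 'b']).dictionary_encode()})
df = tbl.to_pandas(types_mapper=pd.ArrowDtype)
tbl2 = pa.Table.from_pandas(df)

# pyarrow stores pandas metadata as JSON under the b'pandas' key.
meta = json.loads(tbl2.schema.metadata[b'pandas'])
for col in meta['columns']:
    # The numpy_type field appears to hold the stringified dtype, e.g.
    # "dictionary<values=string, indices=int32, ordered=0>[pyarrow]",
    # which pandas_dtype() later fails to parse.
    print(col['name'], col.get('numpy_type'))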

Expected Behavior

I expect the last conversion to succeed and df2 to be the same DataFrame as the original df, including its ArrowDtype column.
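
Roughly, a successful roundtrip would be expected to pass checks along these lines (an untested sketch, not from the report itself):

import pandas as pd
import pyarrow as pa

# The roundtripped frame should match the original, including the
# dictionary-typed ArrowDtype column.
pd.testing.assert_frame_equal(df, df2)
assert df2.dtypes['a'] == pd.ArrowDtype(pa.dictionary(pa.int32(), pa.string()))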

Installed Versions

INSTALLED VERSIONS
------------------
commit                : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python                : 3.11.8.final.0
python-bits           : 64
OS                    : Linux
OS-release            : 6.8.4-arch1-1
Version               : #1 SMP PREEMPT_DYNAMIC Fri, 05 Apr 2024 00:14:23 +0000
machine               : x86_64
processor             :
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.1
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.9.0
setuptools            : 69.2.0
pip                   : 24.0
Cython                : None
pytest                : None
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : None
IPython               : None
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : 15.0.2
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None
bretttully commented 7 months ago

This looks the same as https://github.com/pandas-dev/pandas/issues/53011

wenjuno commented 7 months ago

Hi @bretttully, thanks for pointing that out; I hadn't noticed that ticket. Yes, the root cause is probably the same. My example is narrower in scope: no file I/O is involved, and the problem is purely in roundtripping the data type between pandas and pyarrow. I suspect any nontrivial or pyarrow-specific data type triggers the same failure. Thanks.
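
A possible workaround, assuming the failure comes only from the stored pandas metadata rather than the data itself: strip the schema metadata before converting back, so to_pandas falls back to types_mapper instead of the unparseable dtype string. An untested sketch:

import pandas as pd
import pyarrow as pa

tbl = pa.table({'a': pa.array(['a', 'b']).dictionary_encode()})
df = tbl.to_pandas(types_mapper=pd.ArrowDtype)
tbl2 = pa.Table.from_pandas(df)

# Dropping the pandas metadata should make to_pandas rely on
# types_mapper alone rather than the stored dtype string.
tbl3 = tbl2.replace_schema_metadata(None)
df2 = tbl3.to_pandas(types_mapper=pd.ArrowDtype)
print(df2.dtypes)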