pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: pyarrow string not saved for index column when writing parquet #53036

Closed qltwis closed 2 months ago

qltwis commented 1 year ago

Reproducible Example

import pandas as pd
pd.DataFrame({
 "a":["aiwj","odko"],
 "b":["adwl","gfoe"],
 "c":["okdp","mbna"]
}).astype("string[pyarrow]").set_index("a").to_parquet("test.parq")

pd.read_parquet("test.parq").index.dtype

Issue Description

The dtype information for the index column is lost when saving as parquet.

Expected Behavior

pd.read_parquet("test.parq").index.dtype should return string[python], but it returns dtype('O') instead.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 37ea63d540fd27274cad6585082c91b1283f963d
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-41-generic
Version : #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.1
numpy : 1.22.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 65.3.0
pip : 22.2.2
Cython : None
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli :
fastparquet : None
fsspec : 2023.4.0
gcsfs : None
matplotlib : 3.7.1
numba : 0.55.2
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : 1.4.45
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

mroeschke commented 1 year ago

Thanks for the report. This will be fixed in pyarrow 12 so closing as it will be fixed upstream: https://github.com/apache/arrow/pull/34445

qltwis commented 1 year ago

Thank you for the info. With pyarrow 12, pd.read_parquet("test.parq", dtype_backend="pyarrow").index.dtype produces the correct index dtype.

However, with the numpy backend the index type is still read as object. The other columns are correctly identified as string[python]. I imagine that's not intended behavior, right?

qltwis commented 7 months ago

Unfortunately this bug persists despite the fix in pyarrow 12.

mroeschke commented 2 months ago

This is unfortunately the intended behavior as long as pandas' default string type is object. Parquet does not encode which pandas string type implementation was used.

This can change in pandas 3.0, when the default string implementation is pyarrow-backed if pyarrow is installed as a dependency. Otherwise, one will need to specify the dtype_backend argument to recover the implementation, so closing.
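For readers who want to stay on the default backend, one possible workaround is to cast the index back to a string dtype after reading. The sketch below is an assumption, not something proposed in this thread: restore_string_index is a hypothetical helper name, and the frame is built in memory with an object index to stand in for what pd.read_parquet returned in the report.

```python
import pandas as pd
from pandas.api.types import is_object_dtype


def restore_string_index(frame: pd.DataFrame, dtype: str = "string") -> pd.DataFrame:
    """Hypothetical helper: cast an object-dtype index to a pandas string dtype."""
    out = frame.copy()
    if is_object_dtype(out.index.dtype):
        # Index.astype keeps the index name and the values
        out.index = out.index.astype(dtype)
    return out


# Stand-in for the frame read back from parquet: the index is object dtype
loaded = pd.DataFrame(
    {"b": ["adwl", "gfoe"], "c": ["okdp", "mbna"]},
    index=pd.Index(["aiwj", "odko"], dtype=object, name="a"),
)

fixed = restore_string_index(loaded)
print(loaded.index.dtype, "->", fixed.index.dtype)
```

Passing dtype="string[pyarrow]" instead of the default would request the pyarrow-backed implementation, provided pyarrow is installed.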