pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `DataFrame.to_parquet` does not round-trip index dtype #54000

Open randolf-scholz opened 1 year ago

randolf-scholz commented 1 year ago

Reproducible Example

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randint(0, 9, size=(5, 3)),
    columns=["a", "b", "c"],
    dtype="int32[pyarrow]",
).set_index("a")

df.to_parquet("demo.parquet")
df2 = pd.read_parquet("demo.parquet")
pd.testing.assert_frame_equal(df, df2)

Issue Description

The code snippet yields

AssertionError: DataFrame.index are different

Attribute "dtype" are different
[left]:  int32[pyarrow]
[right]: int32

Expected Behavior

The dtypes should round-trip for both the columns and the index.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 0f437949513225922d851e9581723d82120684a6
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-46-generic
Version : #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.3
numpy : 1.25.0
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : 7.3.2
hypothesis : None
sphinx : 7.0.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.0.3
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : 2.0.16
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

randolf-scholz commented 1 year ago

Explicitly passing dtype_backend="pyarrow" to pd.read_parquet makes the index use pyarrow types.

randolf-scholz commented 1 year ago

However, this messes things up when string dtypes are involved:

import pandas as pd

df = (
    pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, None], "c": [None, "x", "y"]})
    .astype({"a": "int64[pyarrow]", "b": "float64[pyarrow]", "c": "string[pyarrow]"})
    .set_index("a")
)
df.to_parquet("demo.parquet")
df2 = pd.read_parquet("demo.parquet", dtype_backend="pyarrow")
pd.testing.assert_frame_equal(df, df2)

results in

Attribute "dtype" are different
[left]:  string[pyarrow]
[right]: string[pyarrow]

lithomas1 commented 1 year ago

I think the behavior in the OP is intentional. When we write to parquet via pyarrow, we convert to an Arrow type before writing, so on top of my head there is no way to tell if the original DF is backed by numpy/Arrow.

Long term, I think #51846 is the right solution (e.g. opting into the arrow engines will use arrow dtypes).

The second example is definitely a bug, though (lol).

I think it stems from

>>> type(df['c'].dtype)
<class 'pandas.core.arrays.string_.StringDtype'>
>>> type(df2['c'].dtype)
<class 'pandas.core.dtypes.dtypes.ArrowDtype'>

cc @phofl if I missed anything here.

randolf-scholz commented 1 year ago

@lithomas1 for the second example I had opened a separate issue: https://github.com/pandas-dev/pandas/issues/54190

So maybe this one can be closed? That said, there is a possible solution to the following point:

so on top of my head there is no way to tell if the original DF is backed by numpy/Arrow.

pandas could write the schema information into metadata when serializing as parquet, and then autocast when deserializing.