randolf-scholz opened this issue 1 year ago (status: Open)
Explicitly setting the option dtype_backend="pyarrow"
makes the index use pyarrow types.
However, this messes things up when string dtypes are involved:
import pandas as pd
df = (
pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, None], "c": [None, "x", "y"]})
.astype({"a": "int64[pyarrow]", "b": "float64[pyarrow]", "c": "string[pyarrow]"})
.set_index("a")
)
df.to_parquet("demo.parquet")
df2 = pd.read_parquet("demo.parquet", dtype_backend="pyarrow")
pd.testing.assert_frame_equal(df, df2)
results in
Attribute "dtype" are different
[left]: string[pyarrow]
[right]: string[pyarrow]
I think the behavior in the OP is intentional. When we write to parquet via pyarrow, we convert to an Arrow dtype before writing, so off the top of my head there is no way to tell whether the original DataFrame was backed by numpy or Arrow.
Long term, I think #51846 is the right solution (e.g. opting into the arrow engines will use arrow dtypes).
The second example is definitely a bug, though (lol).
I think it stems from
>>> type(df['c'].dtype)
<class 'pandas.core.arrays.string_.StringDtype'>
>>> type(df2['c'].dtype)
<class 'pandas.core.dtypes.dtypes.ArrowDtype'>
cc @phofl if I missed anything here.
@lithomas1 for the second example I had opened a separate issue: https://github.com/pandas-dev/pandas/issues/54190
So maybe this one can be closed? Although there is a solution to "so on top of my head there is no way to tell if the original DF is backed by numpy/Arrow": pandas could write the schema information into metadata when serializing to parquet, and then autocast when deserializing.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Expected Behavior
The data types should be round-tripped for both the columns and the index.