pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Assignment of pyarrow arrays yields unexpected dtypes #56994

Open WillAyd opened 8 months ago

WillAyd commented 8 months ago

Pandas version checks

Reproducible Example

import datetime
import pandas as pd
import pyarrow as pa

df = pd.DataFrame([[42]], columns=["col"])
df["int16"] = pa.array([16], type=pa.int16())
df["date"] = pa.array([datetime.date(2024, 1, 1)], type=pa.date32())
df["string"] = pa.array(["foo"], pa.string())

Issue Description

>>> df.dtypes
col               int64
int16             int16
date      datetime64[s]
string           object
dtype: object

I am surprised that the pyarrow type is not preserved during assignment.

Expected Behavior

>>> df.dtypes
col               int64
int16             int16[pyarrow]
date              date32[pyarrow]
string            string[pyarrow]
dtype: object

Installed Versions

on main

droussea2001 commented 5 months ago

take

droussea2001 commented 4 months ago

Hi @WillAyd: in PR https://github.com/pandas-dev/pandas/pull/58601 I propose that, during a column assignment in a DataFrame, sanitize_array be called with a dtype equal to ArrowDtype(value.type) if the assigned value is a pa.lib._PandasConvertible (otherwise the standard behaviour is kept).

Would that be acceptable?

droussea2001 commented 4 months ago

Hello @jorisvandenbossche: if I remember correctly, you had some comments to add to this issue?

rhshadrach commented 4 months ago

We talked about this in the newcomers meeting; if I recall correctly, @jorisvandenbossche wasn't certain that producing pyarrow dtypes would necessarily be the right behavior here.

jorisvandenbossche commented 3 months ago

Apologies for the slow response here. I indeed had the comment that we might not want to use the ArrowDtype in all cases, or at least for me it's not necessarily obvious all the time.

Let's take the example of assigning a pyarrow string array, and assume we are in pandas 3.0, which uses a string dtype by default that is backed by pyarrow under the hood. Should assigning a pyarrow string array at that point create an ArrowDtype(pa.string()) column or a StringDtype() column? The former might be the "closer" dtype for the data, but the latter is a default dtype (and in general, __setitem__ doesn't allow any options, in which case I think we should generally prefer default dtypes).

For Arrow types which don't have an equivalent pandas type, I think preserving the dtype definitely makes sense (eg for nested types, or decimal, etc). But for Arrow types for which we have an equivalent default dtype (and for which the conversion can often be zero-copy), it's not entirely obvious to me that we shouldn't prefer using the default dtypes.

jorisvandenbossche commented 3 months ago

One other remark is that this is not specific to assignment / __setitem__. If we consider it there, I think we should consider it for all places where we accept array-like input data and convert it to a pandas array-like (series/index/dataframe column), to ensure such input is treated consistently?

For example, the Series constructor accepts a pyarrow array as well, but currently also coerces it to the default (numpy) dtype:

>>> pd.Series(pa.array([1,2,3]))
0    1
1    2
2    3
dtype: int64
>>> pd.Series(pa.array([1,2,3])).dtype
dtype('int64')

rhshadrach commented 3 months ago

For Arrow types which don't have an equivalent pandas type, I think preserving the dtype definitely makes sense (eg for nested types, or decimal, etc). But for Arrow types for which we have an equivalent default dtype (and for which the conversion can often be zero-copy), it's not entirely obvious to me that we shouldn't prefer using the default dtypes.

Changing behavior based on whether an alternate dtype exists seems too complex to me. Even if it would be what the user wants in more cases, I think we should value simplicity of "assigning arrow arrays gives you arrow dtypes". This includes string dtypes.

rhshadrach commented 3 months ago

For example, the Series constructor accepts a pyarrow array as well, but currently also coerces it to the default (numpy) dtype

I think this should give arrow-backed dtypes as well.

WillAyd commented 3 months ago

I agree with @rhshadrach for the near term: the rule "you get pyarrow types unless we already have something else" burdens end users with knowing all of the types implemented across pandas/NumPy/pyarrow and the subtle differences they may or may not have. While primitive types might be easy to handle that way, and strings may get easier with PDEP-14, there are also temporal types that I feel have a lot of potential logic / expectation pitfalls if this isn't thought out in advance.

Long term I agree these should all just map to logical types, but I think we need to agree on and solidify PDEP-13 before we start developing towards that.

droussea2001 commented 3 months ago

the rule "you get pyarrow types unless we already have something else" burdens end users with knowing all of the types

IMHO, as a pandas user, I agree with @rhshadrach and @WillAyd: if a Series or a DataFrame is created from or assigned data with dtype 'X', I would expect (or at least it seems clearer to me) that it keeps the same 'X' dtype where possible (even though I equally agree with the long-term approach).

droussea2001 commented 2 months ago

@jorisvandenbossche: Hello Joris, hope you're doing well. Do you have any news about the previous discussion? Could the PR https://github.com/pandas-dev/pandas/pull/58601 answer this problem?