Open WillAyd opened 8 months ago
take
Hi @WillAyd: in PR https://github.com/pandas-dev/pandas/pull/58601 I propose that, during a column assignment in a DataFrame, sanitize_array is called with a dtype equal to ArrowDtype(value.type) if the column value is a pa.lib._PandasConvertible (otherwise the standard behaviour is kept).
Would that be acceptable?
Hello @jorisvandenbossche: if I remember correctly, you had some comments to add to this issue?
We talked about this in the newcomers meeting. If I recall correctly, @jorisvandenbossche wasn't certain that resulting in pyarrow dtypes would necessarily be the right behavior here.
Apologies for the slow response here. I indeed had the comment that we might not want to use the ArrowDtype in all cases, or at least for me it's not necessarily obvious all the time.
Let's take the example of assigning a pyarrow string array, and assume we are in pandas 3.0 that uses a string dtype by default that uses pyarrow under the hood.
Should assigning a pyarrow string array at that point create an ArrowDtype(pa.string()) column or a StringDtype() column? The former might be the "closer" dtype for the data, but the latter is a default dtype (and in general, __setitem__ doesn't allow any options, in which case I think we generally should prefer default dtypes).
For Arrow types which don't have an equivalent pandas type, I think preserving the dtype definitely makes sense (eg for nested types, or decimal, etc). But for Arrow types for which we have an equivalent default dtype (and for which the conversion can often be zero-copy), it's not entirely obvious to me that we shouldn't prefer using the default dtypes.
One other remark is that this is not specific to assignment / __setitem__. If we consider it there, I think we should consider it for all places where we accept array-like input data and convert it to a pandas array-like (series/index/dataframe column), to ensure such input is treated consistently.
For example, the Series constructor accepts a pyarrow array as well, but currently also coerces it to the default (numpy) dtype:
>>> pd.Series(pa.array([1,2,3]))
0    1
1    2
2    3
dtype: int64
>>> pd.Series(pa.array([1,2,3])).dtype
dtype('int64')
For Arrow types which don't have an equivalent pandas type, I think preserving the dtype definitely makes sense (eg for nested types, or decimal, etc). But for Arrow types for which we have an equivalent default dtype (and for which the conversion can often be zero-copy), it's not entirely obvious to me that we shouldn't prefer using the default dtypes.
Changing behavior based on whether an alternate dtype exists seems too complex to me. Even if it would be what the user wants in more cases, I think we should value simplicity of "assigning arrow arrays gives you arrow dtypes". This includes string dtypes.
For example, the Series constructor accepts a pyarrow array as well, but currently also coerces it to the default (numpy) dtype
I think this should give arrow-backed dtypes as well.
I agree with @rhshadrach for the near term - the rule "you get pyarrow types unless we already have something else" burdens end users with knowing all of the types that are implemented in pandas/NumPy/pyarrow and the subtle differences they may or may not have. While primitive types might be easy to do that with and strings may get easier with PDEP-14, there's also temporal types that I feel like have a lot of potential logic / expectation pitfalls if that isn't really thought out in advance.
Long term I agree these should all just map to logical types, but I think we need to agree on and solidify PDEP-13 before we start developing towards that
the rule "you get pyarrow types unless we already have something else" burdens end users with knowing all of the types
IMHO, as a pandas user, I would agree with @rhshadrach and @WillAyd: if a Series or a DataFrame is created or assigned with data of dtype 'X', I would expect (or at least it seems clearer to me) that it keeps the same 'X' dtype where possible (even though I equally agree with the long-term approach).
@jorisvandenbossche: Hello Joris, hope you're doing well. Do you have any news about the previous discussion? Could PR https://github.com/pandas-dev/pandas/pull/58601 address this problem?
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I am surprised that the pyarrow type is not maintained during assignment
Expected Behavior
Installed Versions
on main