Open mroeschke opened 2 months ago
For Categorical the exact categories are part of the dtype. I feel pretty strongly that the dtype object should be immutable.
@jorisvandenbossche do you have an opinion on this one?
I think the simpler rule is that a setitem operation simply never changes the dtype (which is how PDEP 6 describes it). Changing from int64 to category[int64] is clearly changing the dtype IMO (it also changes the meaning of the dtype in this case, as well as the underlying memory representation).
Now, in the context of "logical dtypes", the question is maybe a bit more tricky (should int64 and Int64 be regarded as the same dtype?) However, at that point, we are setting into an existing array, and if you are setting compatible values with a different dtype into an existing array, I would just expect this setitem operation to work?
For example, setting int64 values into an Int64 series still preserves that dtype:
In [16]: s = pd.Series([1, 2, 3], dtype="Int64")
In [17]: s
Out[17]:
0 1
1 2
2 3
dtype: Int64
In [18]: s.iloc[:] = pd.Series([1, 2, 3], dtype="int64")
In [19]: s
Out[19]:
0 1
1 2
2 3
dtype: Int64
Same for setting ArrowDtype into it:
In [20]: s.iloc[:] = pd.Series([1, 2, 3], dtype=pd.ArrowDtype(pa.int64()))
In [21]: s
Out[21]:
0 1
1 2
2 3
dtype: Int64
And the same is also true for category data with integer categories.
Now, the above example uses a Series, and apparently that works differently for a DataFrame ... ? (Specifically in the case of categorical data when the column's dtype is int64 and not Int64; in all other cases the DataFrame variant also preserves the column's dtype.)
For example, this works fine:
In [38]: s = pd.Series([1, 2, 3], dtype="Int64")
In [39]: df = pd.DataFrame({"col" : s})
In [40]: df.iloc[:, 0] = pd.Series([1, 2, 3], dtype="category")
In [41]: df
Out[41]:
col
0 1
1 2
2 3
In [42]: df.dtypes
Out[42]:
col Int64
dtype: object
But when you start with int64 dtype, then it fails. That's maybe rather a bug and we should coerce the categorical data to int64?
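To make the "coerce to int64" idea concrete, here is a minimal sketch of doing that coercion by hand before the setitem. It is only a workaround under the assumption that the categories are integers that fit the column's dtype, not a statement of what pandas does or should do internally; the names `cat_values` and `target_dtype` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col": pd.Series([1, 2, 3], dtype="int64")})
cat_values = pd.Series([3, 2, 1], dtype="category")

# Cast the categorical values back to the column's own dtype first, so the
# setitem only ever sees plain int64 values and has no reason to change
# (or complain about) the dtype.
target_dtype = df.dtypes.iloc[0]
df.iloc[:, 0] = cat_values.astype(target_dtype).to_numpy()

print(df.dtypes)
```

With the explicit cast, the column keeps its int64 dtype regardless of how the categorical case ends up being handled.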
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
In the PDEP 6 discussions (https://github.com/pandas-dev/pandas/issues/39584 https://github.com/pandas-dev/pandas/pull/50424), I can't really find any discussion of whether setting values with a different "representation" of the dtype (e.g. category or sparse) should be disallowed. Technically, casting to e.g. category, sparse, or ArrowDtype of the same underlying type doesn't upcast, so should that be allowed? cc @MarcoGorelli
Expected Behavior
Setitem with category/sparse/ArrowDtype that doesn't change the underlying type should be allowed
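As a rough illustration of what "doesn't change the underlying type" could mean here, the sketch below compares the wrapped value dtype against the target column's dtype. `underlying_dtype` is a hypothetical helper written for this example, not pandas API, and it only covers category and sparse (ArrowDtype is left out to avoid needing pyarrow).

```python
import numpy as np
import pandas as pd


def underlying_dtype(dtype):
    """Hypothetical helper: the dtype of the values a wrapper dtype represents."""
    if isinstance(dtype, pd.CategoricalDtype) and dtype.categories is not None:
        # For category, the "underlying" type is the dtype of the categories.
        return dtype.categories.dtype
    if isinstance(dtype, pd.SparseDtype):
        # For sparse, it is the subtype of the stored values.
        return dtype.subtype
    return dtype


col_dtype = np.dtype("int64")
cat_dtype = pd.Series([1, 2, 3], dtype="category").dtype
sparse_dtype = pd.SparseDtype("int64")

same_for_category = underlying_dtype(cat_dtype) == col_dtype
same_for_sparse = underlying_dtype(sparse_dtype) == col_dtype
same_for_str_category = underlying_dtype(pd.CategoricalDtype(["a", "b"])) == col_dtype

print(same_for_category, same_for_sparse, same_for_str_category)
```

Under this reading, setting category[int64] or Sparse[int64] values into an int64 column would count as compatible, while a category with string categories would not.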
Installed Versions