pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

REGR: Index.astype(<numpy string dtype>) started failing #50127

Open jorisvandenbossche opened 1 year ago

jorisvandenbossche commented 1 year ago

On pandas 1.5:

In [2]: pd.Index(['a', 'b']).astype("S3")
Out[2]: Index([b'a', b'b'], dtype='object')

On the main branch:

In [2]: pd.Index(['a', 'b']).astype("S3")
...
File ~/scipy/pandas/pandas/core/indexes/base.py:589, in Index._dtype_to_subclass(cls, dtype)
    584 elif issubclass(
    585     dtype.type, (str, bool, np.bool_, complex, np.complex64, np.complex128)
    586 ):
    587     return Index
--> 589 raise NotImplementedError(dtype)

NotImplementedError: |S3

This started to fail a while ago on pyarrow's CI (https://issues.apache.org/jira/browse/ARROW-18394). This comes up if you roundtrip a pandas DataFrame with bytes column names to arrow and back to pandas.

I haven't yet investigated which change caused this or whether it was intentional.
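To illustrate why the branch in the traceback falls through, here is a minimal stdlib-only sketch. It assumes the scalar type behind a NumPy `S3` dtype behaves like Python's `bytes` (NumPy's `np.bytes_` subclasses `bytes`, not `str`). The function names are hypothetical and this is not the pandas implementation, only a model of the `issubclass` check shown above.

```python
def dtype_to_subclass_sketch(scalar_type):
    """Hypothetical stand-in for the branch shown in the traceback.

    The tuple of accepted types is reduced to builtins; bytes-like types
    are not in it, so they reach the final raise.
    """
    if issubclass(scalar_type, (str, bool, complex)):
        return "Index"
    raise NotImplementedError(scalar_type.__name__)


def describe(scalar_type):
    """Return the dispatch result, or the error text if dispatch fails."""
    try:
        return dtype_to_subclass_sketch(scalar_type)
    except NotImplementedError as exc:
        return f"NotImplementedError: {exc}"


print(describe(str))    # Index
print(describe(bytes))  # NotImplementedError: bytes
```

Because `bytes` is not a subclass of `str`, a bytes-kind dtype matches none of the listed types, which is consistent with the `NotImplementedError: |S3` above.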

phofl commented 1 year ago

#49393

MarcoGorelli commented 1 year ago

From git bisect I'm getting #49718

https://www.kaggle.com/code/marcogorelli/pandas-regression-example?scriptVersionId=113299049

phofl commented 1 year ago

Weird, sorry for the noise. You are correct

MarcoGorelli commented 1 year ago

No worries - @jbrockmendel

jbrockmendel commented 1 year ago

yah i think the np-str-dtype check needs to be added in Index.__new__ after sanitize_array
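A rough sketch of what such a check could look like, in plain Python rather than pandas internals. Everything here is an assumption for illustration: `sanitize_sketch` and `new_index_sketch` are hypothetical names, dtypes are modelled as NumPy-style strings such as `"S3"`, and falling back to `object` is taken only from the 1.5 output shown at the top of this issue.

```python
def sanitize_sketch(data, dtype):
    """Toy stand-in for a sanitize step: coerce values per a dtype string."""
    if dtype.startswith("S"):
        width = int(dtype[1:])
        # Fixed-width bytes: encode each value, then truncate to the width.
        return [value.encode("ascii")[:width] for value in data], dtype
    return list(data), dtype


def new_index_sketch(data, dtype):
    """Toy constructor: sanitize first, then apply the string-dtype check."""
    values, result_dtype = sanitize_sketch(data, dtype)
    # Hypothetical guard: NumPy string kinds ("S" bytes, "U" unicode) are
    # stored as object instead of being dispatched on their own dtype.
    if result_dtype[:1] in ("S", "U"):
        result_dtype = "object"
    return values, result_dtype


print(new_index_sketch(["a", "b"], "S3"))  # ([b'a', b'b'], 'object')
```

The point of placing the guard after the sanitize step is that the values are already converted to bytes by then, so only the dtype label needs to change before dispatch.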

jorisvandenbossche commented 1 year ago

@mroeschke this is a regression, and if we think it's a valid one, not something to bump to 3.0?

mroeschke commented 1 year ago

Ah okay, fine to keep this on the 2.0 milestone

Daquisu commented 1 year ago

take

Daquisu commented 1 year ago

For what it's worth, pd.Index(["abcd", "1234"], dtype="S3") is also failing.

MarcoGorelli commented 1 year ago

~removing from the 2.0 milestone as this is a regression from 1.5 and shouldn't block 2.0~

EDIT: sorry, this one worked in 1.5 - is it a blocker?