pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API: pd.array(index_or_series[object]) should infer like Series and Index constructors #39117

Closed jbrockmendel closed 1 year ago

TomAugspurger commented 3 years ago

What's the motivation here? The docs in https://pandas.pydata.org/docs/reference/api/pandas.array.html state that dtype is optional, and

If not specified, there are two possibilities:

1. When data is a Series, Index, or ExtensionArray, the dtype will be taken from the data.
2. Otherwise, pandas will attempt to infer the dtype from the data.

I'd prefer to avoid special casing object-dtype here, unless we have a compelling reason to (especially since object-dtype should become less common now that we have more extension types).
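
As a rough illustration of the two documented cases quoted above (a minimal sketch, not taken from the thread; the sample values are made up and the result is assumed to hold on reasonably recent pandas versions):

```python
import pandas as pd

# Case 1: a Series is passed, so its dtype (object here) is taken from the data.
ser = pd.Series([1, 2, None], dtype=object)
from_series = pd.array(ser)

# Case 2: a plain list is passed, so pandas infers a dtype from the values.
from_list = pd.array([1, 2, None])

print(from_series.dtype)  # object
print(from_list.dtype)    # Int64
```

The same values therefore end up with different dtypes depending only on the container they arrive in, which is the asymmetry the rest of the thread discusses.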

jbrockmendel commented 3 years ago

What's the motivation here?

To get pd.array behavior to match Index and Series xref #27460

I'd prefer to avoid special casing object-dtype here

I'm on board with the sentiment, but point 1 that you quoted means we're special-casing ndarray vs EA/Index/Series. Among other things, this means that pd.array(obj) and pd.array(extract_array(obj, extract_numpy=True)) behave differently.

jorisvandenbossche commented 3 years ago

To get pd.array behavior to match Index and Series xref #27460

It's not possible for pd.array to match Index/Series inference behaviour, for now, because it's explicitly meant to infer the nullable dtypes by default. Rather, the goal is that eventually Index/Series behaviour will match pd.array.

jorisvandenbossche commented 3 years ago

we're special-casing ndarray vs EA/Index/Series

Although unfortunate, I think that's inevitable because ndarray cannot hold all the data types that the others can hold. Currently eg nullable integer, or tz-aware datetime64, or period, etc, become an ndarray without proper dtype. So for those I think it makes sense to do more inference on ndarrays.
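
A small sketch of the point about ndarray losing the dtype (illustrative only; the variable names are hypothetical, and the round-trip behaviour is assumed from the inference on object-dtype ndarrays described in this thread):

```python
import numpy as np
import pandas as pd

# Converting to ndarray drops the extension dtype: the values end up as
# Python objects (Timestamp / Period) in an object-dtype array.
tz_values = np.asarray(pd.Series(pd.date_range("2020-01-01", periods=2, tz="UTC")))
period_values = np.asarray(pd.period_range("2020-01", periods=2, freq="M"))

print(tz_values.dtype)      # object
print(period_values.dtype)  # object

# pd.array runs inference on object-dtype ndarrays, which recovers a proper dtype.
tz_roundtrip = pd.array(tz_values)
period_roundtrip = pd.array(period_values)

print(type(tz_roundtrip.dtype).__name__)      # DatetimeTZDtype
print(type(period_roundtrip.dtype).__name__)  # PeriodDtype
```

Without that inference step, data that passed through an ndarray would stay object dtype even though a suitable extension dtype exists.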

jbrockmendel commented 3 years ago

Although unfortunate, I think that's inevitable because ndarray cannot hold all the data types that the others can hold. Currently eg nullable integer, or tz-aware datetime64, or period, etc, become an ndarray without proper dtype. So for those I think it makes sense to do more inference on ndarrays.

There's a step in the logic here I don't understand. EA can hold dtypes that ndarray cannot, but this is about dtypes that both can hold.

jorisvandenbossche commented 3 years ago

@jbrockmendel can you give some practical code examples? I think that would help a lot to clear up the misunderstanding/confusion (eg I don't know which dtypes you are speaking about)

jbrockmendel commented 3 years ago

can you give some practical code examples?

The motivation comes from DatetimeLikeArrayMixin._validate_listlike, which uses pd.array for inference and is under the hood of a bunch of methods.

dti = pd.date_range("2016-01-01", periods=3)
dta = dti._data

dta._validate_listlike(dta.astype(object))  # <- works, as dta.astype(object) is ndarray
dta._validate_listlike(dti.astype(object))  # <- raises TypeError

Of course, "don't use pd.array for this" is also a viable approach.

jbrockmendel commented 3 years ago

eg I don't know which dtypes you are speaking about

I am only talking about object dtype, for which we do lib.infer_dtype with ndarray but not Index/Series/PandasArray
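
To make the inference step concrete, here is a hedged sketch using the public pandas.api.types.infer_dtype, on the assumption that it behaves like the internal lib.infer_dtype mentioned above (the sample arrays are made up):

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

# Object-dtype ndarrays whose elements all share one underlying type.
ints = np.array([1, 2, 3], dtype=object)
stamps = np.array([pd.Timestamp("2020-01-01")], dtype=object)

# infer_dtype inspects the elements and returns a string label for them.
print(infer_dtype(ints))    # integer
print(infer_dtype(stamps))  # datetime
```

A constructor can use such a label to pick a more specific array type, whereas skipping the check simply keeps the input's object dtype.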

jorisvandenbossche commented 3 years ago

Can you make the example even more concrete?

In the title you mention Index or Series of object dtype, if I understand correctly. With a quick test, I don't see different type inference between Series and array constructor for such input:

In [1]: s = pd.Series([1, 2, 3], dtype=object)

In [2]: pd.Series(s)
Out[2]: 
0    1
1    2
2    3
dtype: object

In [3]: pd.array(s)
Out[3]: 
<PandasArray>
[1, 2, 3]
Length: 3, dtype: object

Both preserve the object dtype when passed a Series?

It's actually when passed an ndarray that the two infer differently:

In [4]: pd.Series(np.asarray(s))
Out[4]: 
0    1
1    2
2    3
dtype: object

In [5]: pd.array(np.asarray(s))
Out[5]: 
<IntegerArray>
[1, 2, 3]
Length: 3, dtype: Int64

jorisvandenbossche commented 3 years ago

Your example was using datetimes, and also for that I don't see a difference in behaviour for Series vs array:

In [13]: arr = np.array([pd.Timestamp("2020-01-01")], dtype=object)

In [14]: s = pd.Series(arr, dtype=object)

# both Series and array infer object-dtype np.ndarray
In [15]: pd.Series(arr)
Out[15]: 
0   2020-01-01
dtype: datetime64[ns]

In [16]: pd.array(arr)
Out[16]: 
<DatetimeArray>
['2020-01-01 00:00:00']
Length: 1, dtype: datetime64[ns]

# and both Series and array do not infer object-dtype Series
In [17]: pd.Series(s)
Out[17]: 
0    2020-01-01 00:00:00
dtype: object

In [18]: pd.array(s)
Out[18]: 
<PandasArray>
[Timestamp('2020-01-01 00:00:00')]
Length: 1, dtype: object

but there is actually one for Index (both for the Index constructor and when passing an Index object to the Series constructor):

# the Index constructor infers both for ndarray and Series
In [23]: pd.Index(s)
Out[23]: DatetimeIndex(['2020-01-01'], dtype='datetime64[ns]', freq=None)

In [24]: pd.Index(arr)
Out[24]: DatetimeIndex(['2020-01-01'], dtype='datetime64[ns]', freq=None)

# and passing an Index to the Series constructor also infers (in contrast to passing a Series)
In [19]: idx = pd.Index(arr, dtype=object)

In [20]: idx
Out[20]: Index([2020-01-01 00:00:00], dtype='object')

In [21]: pd.Series(idx)
Out[21]: 
0   2020-01-01
dtype: datetime64[ns]

In [22]: pd.array(idx)
Out[22]: 
<PandasArray>
[Timestamp('2020-01-01 00:00:00')]
Length: 1, dtype: object

jorisvandenbossche commented 3 years ago

Additional example: Series constructor does not infer the type when the object-dtype Index holds integers (so in contrast with the example above of an object-dtype Index with timestamps), but the Index constructor does also infer in that case:

In [6]: s = pd.Series([1, 2, 3], dtype=object)

In [7]: idx = pd.Index(s, dtype=object)

In [8]: pd.Series(idx)
Out[8]: 
0    1
1    2
2    3
dtype: object

In [9]: pd.Index(idx)
Out[9]: Int64Index([1, 2, 3], dtype='int64')

jbrockmendel commented 1 year ago

I can't figure out what past-me had in mind here. Best guess is it involved some of the now-deprecated-and-removed string inference that used to be done in the Series constructor. Closing.