Closed jbrockmendel closed 1 year ago
What's the motivation here?
To get pd.array behavior to match Index and Series xref #27460
I'd prefer to avoid special casing object-dtype here
I'm on board with the sentiment, but point 1 that you quoted means we're special-casing ndarray vs EA/Index/Series. Among other things, this means that pd.array(obj) and pd.array(extract_array(obj, extract_numpy=True)) behave differently.
To get pd.array behavior to match Index and Series xref #27460
It's not possible for pd.array to match Index/Series inference behaviour, for now, because it's explicitly meant to default to inferring the nullable dtypes.
Rather, the goal is that eventually Index/Series behaviour will match pd.array.
we're special-casing ndarray vs EA/Index/Series
Although unfortunate, I think that's inevitable because ndarray cannot hold all the data types that the others can hold. Currently eg nullable integer, or tz-aware datetime64, or period, etc, become an ndarray without proper dtype. So for those I think it makes sense to do more inference on ndarrays.
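As a minimal sketch of that point (variable names are illustrative, and the printed dtypes assume a reasonably recent pandas), converting nullable-integer or period data to an ndarray leaves only object dtype, so pd.array has to re-infer the type from the values:

```python
import numpy as np
import pandas as pd

# Nullable integers with a missing value have no native NumPy dtype,
# so the ndarray conversion falls back to object.
nullable = pd.array([1, 2, None], dtype="Int64")
nullable_nd = np.asarray(nullable)
print(nullable_nd.dtype)

# Period data likewise becomes an object-dtype ndarray of Period scalars.
periods = pd.period_range("2020-01", periods=2, freq="M")
periods_nd = np.asarray(periods)
print(periods_nd.dtype)

# pd.array infers the extension dtype back from the object-dtype ndarray.
print(pd.array(nullable_nd).dtype)
print(pd.array(periods_nd).dtype)
```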
Although unfortunate, I think that's inevitable because ndarray cannot hold all the data types that the others can hold. Currently eg nullable integer, or tz-aware datetime64, or period, etc, become an ndarray without proper dtype. So for those I think it makes sense to do more inference on ndarrays.
There's a step in the logic here I don't understand. EA can hold dtypes that ndarray cannot, but this is about dtypes that both can hold.
@jbrockmendel can you give some practical code examples? I think that would help a lot to clear up the misunderstanding/confusion (eg I don't know which dtypes you are speaking about)
can you give some practical code examples?
The motivation comes from DatetimeLikeArrayMixin._validate_listlike, which uses pd.array for inference and is under the hood of a bunch of methods.
dti = pd.date_range("2016-01-01", periods=3)
dta = dti._data
dta._validate_listlike(dta.astype(object)) # <- works, as dta.astype(object) is ndarray
dta._validate_listlike(dti.astype(object)) # <- raises TypeError
Of course, "don't use pd.array for this" is also a viable approach.
eg I don't know which dtypes you are speaking about
I am only talking about object dtype, for which we do lib.infer_dtype with ndarray but not Index/Series/PandasArray
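For reference, a small sketch of what that inference step sees. It uses the public pandas.api.types.infer_dtype, which exposes the same lib.infer_dtype function; the sample arrays are made up for illustration, and how pd.array consumes the result internally is not shown here:

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

# Object-dtype ndarrays whose elements all share one scalar type.
int_objs = np.array([1, 2, 3], dtype=object)
ts_objs = np.array([pd.Timestamp("2020-01-01")], dtype=object)

# infer_dtype inspects the values rather than the (object) dtype.
int_kind = infer_dtype(int_objs)
ts_kind = infer_dtype(ts_objs)
print(int_kind)  # integer
print(ts_kind)   # datetime
```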
Can you make the example even more concrete?
In the title you mention Index or Series of object dtype, if I understand correctly. With a quick test, I don't see different type inference between Series and array constructor for such input:
In [1]: s = pd.Series([1, 2, 3], dtype=object)
In [2]: pd.Series(s)
Out[2]:
0 1
1 2
2 3
dtype: object
In [3]: pd.array(s)
Out[3]:
<PandasArray>
[1, 2, 3]
Length: 3, dtype: object
Both preserve the object dtype when passed a Series?
It's actually when passed an ndarray that the two infer differently:
In [4]: pd.Series(np.asarray(s))
Out[4]:
0 1
1 2
2 3
dtype: object
In [5]: pd.array(np.asarray(s))
Out[5]:
<IntegerArray>
[1, 2, 3]
Length: 3, dtype: Int64
Your example was using datetimes, and also for that I don't see a difference in behaviour for Series vs array:
In [13]: arr = np.array([pd.Timestamp("2020-01-01")], dtype=object)
In [14]: s = pd.Series(arr, dtype=object)
# both Series and array infer object-dtype np.ndarray
In [15]: pd.Series(arr)
Out[15]:
0 2020-01-01
dtype: datetime64[ns]
In [16]: pd.array(arr)
Out[16]:
<DatetimeArray>
['2020-01-01 00:00:00']
Length: 1, dtype: datetime64[ns]
# and both Series and array do not infer object-dtype Series
In [17]: pd.Series(s)
Out[17]:
0 2020-01-01 00:00:00
dtype: object
In [18]: pd.array(s)
Out[18]:
<PandasArray>
[Timestamp('2020-01-01 00:00:00')]
Length: 1, dtype: object
But there is actually a difference for Index (both for the Index constructor and when passing an Index object to the Series constructor):
# the Index constructor infers both for ndarray and Series
In [23]: pd.Index(s)
Out[23]: DatetimeIndex(['2020-01-01'], dtype='datetime64[ns]', freq=None)
In [24]: pd.Index(arr)
Out[24]: DatetimeIndex(['2020-01-01'], dtype='datetime64[ns]', freq=None)
# and passing an Index to the Series constructor also infers (in contrast to passing a Series)
In [19]: idx = pd.Index(arr, dtype=object)
In [20]: idx
Out[20]: Index([2020-01-01 00:00:00], dtype='object')
In [21]: pd.Series(idx)
Out[21]:
0 2020-01-01
dtype: datetime64[ns]
In [22]: pd.array(idx)
Out[22]:
<PandasArray>
[Timestamp('2020-01-01 00:00:00')]
Length: 1, dtype: object
Additional example: the Series constructor does not infer the type when the object-dtype Index holds integers (in contrast with the example above of an object-dtype Index holding timestamps), but the Index constructor does infer in that case as well:
In [6]: s = pd.Series([1, 2, 3], dtype=object)
In [7]: idx = pd.Index(s, dtype=object)
In [8]: pd.Series(idx)
Out[8]:
0 1
1 2
2 3
dtype: object
In [9]: pd.Index(idx)
Out[9]: Int64Index([1, 2, 3], dtype='int64')
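The comparisons above can be reproduced in one pass with a small helper. This is only a sketch: inferred_dtypes is a hypothetical name, not a pandas function, and no expected output is listed because the resulting dtypes depend on the pandas version (the sessions above come from an older release).

```python
import numpy as np
import pandas as pd

def inferred_dtypes(values):
    """Return the dtype each constructor produces for object-dtype inputs."""
    arr = np.array(values, dtype=object)
    inputs = {
        "ndarray": arr,
        "Series": pd.Series(arr, dtype=object),
        "Index": pd.Index(arr, dtype=object),
    }
    constructors = {"pd.array": pd.array, "pd.Series": pd.Series, "pd.Index": pd.Index}
    # Call every constructor without an explicit dtype so inference applies.
    return {
        (ctor_name, input_name): str(ctor(obj).dtype)
        for ctor_name, ctor in constructors.items()
        for input_name, obj in inputs.items()
    }

for values in ([1, 2, 3], [pd.Timestamp("2020-01-01")]):
    for key, dtype in inferred_dtypes(values).items():
        print(key, dtype)
```

Running it on the installed version shows at a glance which constructor/input pairs still disagree.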
I can't figure out what past-me had in mind here. Best guess is it involved some of the now-deprecated-and-removed string inference that used to be done in the Series constructor. Closing.
What's the motivation here? The docs in https://pandas.pydata.org/docs/reference/api/pandas.array.html state that dtype is optional, and
I'd prefer to avoid special casing object-dtype here, unless we have a compelling reason to (especially since object-dtype should become less common now that we have more extension types).