pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.9k stars 18.03k forks

REF/EA-API: EA constructor without dtype specified #56430

Open jbrockmendel opened 11 months ago

jbrockmendel commented 11 months ago

TLDR: we should make dtype required in EA._from_sequence and implement a new EA constructor for flavor-preserving inference.

ATM dtype is not required in EA._from_sequence. The behavior (and, more importantly, the usage) when it is not specified is not standardized. In many cases it does some kind of inference, but how much inference varies.

Most of the places where we don't pass a dtype are aimed at some type of dtype-flavor-retention. For example, we did some type of operation starting with a pyarrow/masked/sparse dtype and we want the result.dtype to still be pyarrow/masked/sparse, but not necessarily the exact same dtype. The main examples that come to mind are maybe_cast_pointwise_result, MaskedArray._maybe_mask_result.
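To make "flavor retention" concrete, here is a small sketch using only the public API. It assumes a pandas version that ships the masked Float64 dtype (1.2 or later); the variable names are just for illustration.

```python
import pandas as pd

# Start from a masked (nullable) integer array.
ints = pd.array([1, 2, None], dtype="Int64")

# True division cannot stay integer, but the result keeps the masked flavor.
halves = ints / 2

# Comparisons likewise return a masked boolean array, not a numpy bool array.
is_big = ints > 1

print(ints.dtype, halves.dtype, is_big.dtype)  # Int64 Float64 boolean
```

In both cases the result dtype differs from the input dtype, yet stays in the masked family, which is the behavior the internal helpers named above are responsible for.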

The main other place where we call _from_sequence without a dtype is pd.array. With a little bit of effort I'm pretty sure we can start passing dtypes there.
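For reference, this is roughly what the no-dtype path of pd.array looks like from the outside. The sketch assumes pandas 1.2 or later and only shows the public behavior, not how pd.array dispatches internally.

```python
import pandas as pd

# No dtype given: pd.array infers a nullable dtype from the scalars.
inferred_int = pd.array([1, 2, None])
inferred_float = pd.array([1.5, None])
inferred_bool = pd.array([True, False, None])

# dtype given: it is honored rather than inferred.
explicit = pd.array([1, 2, None], dtype="Int32")

print(inferred_int.dtype, inferred_float.dtype, inferred_bool.dtype, explicit.dtype)
# Int64 Float64 boolean Int32
```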

cc @jorisvandenbossche

jorisvandenbossche commented 11 months ago

In my mind, _from_sequence already is the "constructor for flavor-preserving inference".

I understand there are multiple use cases, but that can be served by a single method depending on whether a dtype is passed or not? That feels quite clear to me: when a dtype is passed, this is honored, and otherwise the dtype is inferred from the data (with the constraint that it has to be a dtype supported by the calling class).
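A quick sketch of that reading, using IntegerArray as the calling class. Note that _from_sequence is part of the extension-array interface rather than the everyday public API, so the exact signature may differ between pandas versions; passing dtype by keyword is assumed to work throughout.

```python
import pandas as pd
from pandas.arrays import IntegerArray

# dtype passed: it is honored.
with_dtype = IntegerArray._from_sequence([1, 2, None], dtype="Int8")

# dtype omitted: inferred from the data, within what IntegerArray supports.
without_dtype = IntegerArray._from_sequence([1, 2, None])

print(with_dtype.dtype, without_dtype.dtype)  # Int8 Int64
```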

The main examples that come to mind are maybe_cast_pointwise_result, MaskedArray._maybe_mask_result.

In MaskedArray._maybe_mask_result, we actually don't use _from_sequence, but the main Array class constructors (but also without specifying a dtype)

jbrockmendel commented 11 months ago

In MaskedArray._maybe_mask_result, we actually don't use _from_sequence, but the main Array class constructors (but also without specifying a dtype)

Correct. My point is that MaskedArray subclasses use a different pattern to achieve the same result. The datetimelike EAs have their own special-casing. If it is feasible (which I'm not ready to claim), then it would be preferable to have a single shared pattern for these.

I understand there are multiple use cases, but that can be served by a single method depending on whether a dtype is passed or not?

Certainly possible. On the margin I'd prefer the cases where we intentionally want dtype inference to be more explicit. I'm spending some time this week tracking down just where those cases are.
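As a rough illustration of what "more explicit" could mean, here is a toy sketch in plain Python. Nothing in it is pandas code: ToyMaskedArray, the method name _from_sequence_inferred, and the string dtypes are all hypothetical and exist only to show the shape of the split (a dtype-required constructor plus a separate inferring one).

```python
class ToyMaskedArray:
    """Toy stand-in for a 'masked flavor' array; not a pandas class."""

    def __init__(self, values, dtype):
        self.values = list(values)
        self.dtype = dtype

    @classmethod
    def _from_sequence(cls, scalars, *, dtype):
        # dtype is keyword-only and required: no inference happens here.
        return cls(scalars, dtype)

    @classmethod
    def _from_sequence_inferred(cls, scalars):
        # Hypothetical second constructor: pick the exact dtype from the
        # data, but always stay within this class's flavor.
        present = [x for x in scalars if x is not None]
        if all(isinstance(x, bool) for x in present):
            dtype = "boolean"
        elif all(isinstance(x, int) for x in present):
            dtype = "Int64"
        else:
            dtype = "Float64"
        return cls(scalars, dtype)


explicit = ToyMaskedArray._from_sequence([1, 2, None], dtype="Int32")
inferred = ToyMaskedArray._from_sequence_inferred([1.5, None])
print(explicit.dtype, inferred.dtype)  # Int32 Float64
```

With this shape, a call site that wants inference has to say so by name, and forgetting the dtype in _from_sequence becomes a TypeError instead of silent inference.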

jbrockmendel commented 11 months ago

I've spent some time tracking down the places where we don't pass a dtype to _from_sequence:

Also tracking down the various patterns we use for flavor-preserving-partial-inference:

Other places where we have special-casing for Masked/Arrow dtypes related to flavor-retention:

I expect there are more that I have missed, will update here as I find them.

jbrockmendel commented 7 months ago

In my mind, _from_sequence already is the "constructor for flavor-preserving inference".

Re-reading, I think I missed an important point: a big part of the relevant use case is having a BooleanArray method that returns a FloatingArray/IntegerArray etc. (this example could also be addressed by condensing these classes down to just MaskedArray). xref #58258
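A small public-API example of that cross-class case, assuming a recent pandas on a platform where the default integer is 64-bit. It only shows the observable result, not which internal constructor produces it.

```python
import pandas as pd

flags = pd.array([True, False, None], dtype="boolean")

# Adding an integer to a BooleanArray cannot return a BooleanArray;
# the result comes back as a masked integer array instead.
shifted = flags + 1

print(type(flags).__name__, type(shifted).__name__)
print(shifted.dtype)
```

A classmethod on BooleanArray that is constrained to return its own class could not produce this result, which is the limitation described above.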