EA: revisit interface

jbrockmendel commented 4 years ago

This is as good a time as any to revisit the "experimental" EA interface.

My read of the Issues and recollection of threads suggests there are three main groups of topics:

Clarification of the Interface
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1) _values_for_argsort and _values_for_factorize

- Do we need both? The docs for both say they should be order-preserving.
- Is it safe to return a view? (Categorical._values_for_argsort makes a copy for no obvious reason)
- What else can they be used for internally? e.g. in #32467 _values_for_argsort is used for ExtensionIndex join_non_unique and join_monotonic

2) ~~What characteristics should _ndarray_values have? Is it needed? (#32412)~~ _ndarray_values has been removed

3) What should _from_sequence accept?

- Should it only be sequences that are unambiguously this dtype?
- In particular, should DTA/TDA/PA not accept i8 values?

4) What should fillna accept? (#22954, #32414)

4.5) Require that __iter__ return native types? #29738
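To make the "is it safe to return a view?" question from item 1 concrete, here is a minimal sketch. `ToyOrdinalArray` is a hypothetical stand-in, not pandas' ExtensionArray; the assumption is that the backing data is already an order-preserving int64 ndarray and that consumers such as argsort only read from it.

```python
import numpy as np


class ToyOrdinalArray:
    """Hypothetical toy array backed by int64 ordinals (not a pandas class)."""

    def __init__(self, ordinals):
        self._data = np.asarray(ordinals, dtype=np.int64)

    def _values_for_argsort(self):
        # Return the backing array itself: no copy is made here.
        return self._data

    def argsort(self):
        # np.argsort only reads its input, so handing it a view is harmless.
        return np.argsort(self._values_for_argsort(), kind="stable")


arr = ToyOrdinalArray([30, 10, 20])
values = arr._values_for_argsort()
order = arr.argsort()
```

Under these assumptions the view is only a problem if some internal caller mutates the returned values in place, which is the part that would need to be ruled out.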

Ndarray Compat
^^^^^^^^^^^^^^^^^
5) Headaches have been caused by trivial ndarray methods not being on EA

- #31199 size
- #32342 "T" (just the most recent; this has come up a lot)
- #24583 ravel
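As an illustration of how little these methods need for a 1-D array, here is a rough sketch. `NDArrayCompatMixin` and `ToyArray` are hypothetical names invented for this example, and the sketch assumes the array is always 1-D and defines `__len__`.

```python
class NDArrayCompatMixin:
    """Hypothetical mixin: ndarray-like attributes for a strictly 1-D array."""

    @property
    def shape(self):
        return (len(self),)

    @property
    def ndim(self):
        return 1

    @property
    def size(self):
        # For a 1-D array the number of elements equals its length.
        return len(self)

    @property
    def T(self):
        # Transposing a 1-D array changes nothing.
        return self

    def ravel(self, order="C"):
        # A 1-D array is already flat; `order` is accepted only for signature compat.
        return self


class ToyArray(NDArrayCompatMixin):
    """Hypothetical list-backed array used only to exercise the mixin."""

    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)


toy = ToyArray([1, 2, 3])
```

Note that NumPy's own `.T` and `.ravel()` return new view objects rather than `self`; returning `self` is a simplification in this sketch.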

6) For arithmetic we're going to need something like either tile or broadcast_to
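For reference, this is how the two options differ on a plain ndarray; the snippet uses only NumPy and says nothing about what an EA version would look like.

```python
import numpy as np

row = np.array([1, 2, 3], dtype=np.int64)

# np.tile materializes a new (4, 3) array by copying `row` four times.
tiled = np.tile(row, (4, 1))

# np.broadcast_to returns a read-only (4, 3) view over the same memory;
# the repeated axis has stride 0, so no data is copied.
view = np.broadcast_to(row, (4, 3))
```

The trade-off is that the tiled result is writable but costs memory, while the broadcast view is cheap but read-only.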

Methods Needed/Wanted For Index/Series/DataFrame/Block
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7) Suggested Methods (partial list)

- [ ] #27264 duplicated
- [x] #23437 _empty
- [ ] #28955 apply
- [ ] #23179 map
- [ ] #22680 hasnas
- [x] #27081 equals
- [x] #24144 _where
- [x] _putmask would be helpful for ExtensionIndex
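Some of these could plausibly be built generically on top of pieces an EA already provides. The sketch below assumes integer codes from a factorize-style step and a boolean mask from an isna-style step; `duplicated_from_codes` and `hasnas_from_mask` are hypothetical helper names, not pandas API.

```python
import numpy as np


def duplicated_from_codes(codes):
    """Mark every occurrence of a code after its first one as duplicated."""
    codes = np.asarray(codes)
    # return_index gives the position of the first occurrence of each code.
    _, first_positions = np.unique(codes, return_index=True)
    out = np.ones(len(codes), dtype=bool)
    out[first_positions] = False
    return out


def hasnas_from_mask(mask):
    """True if any element of a boolean missing-value mask is set."""
    return bool(np.asarray(mask).any())


dups = duplicated_from_codes([0, 1, 0, 2, 1])
any_na = hasnas_from_mask([False, True, False])
```

This only covers the keep-first behaviour; an array with a cheaper native path would presumably still want to override it.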

I suggest we discuss these in order. Before jumping in, is there anything vital missing from this list? (this is only a small subset of the issues on the tracker)

cc @pandas-dev/pandas-core @xhochy

jorisvandenbossche commented 4 years ago

I didn't fully understand it on the call, but so my question was: doesn't Categorical.replace still hold custom logic apart from _can_hold_element? (related to the fact that it can work on its categories to be more efficient, instead of on the actual values, but therefore needs to check whether the replacement is present in the categories already or not, etc ...) Or, as you mention, replace could be implemented generally in terms of other methods like putmask, but so then it is still putmask that we might need to add to the interface?

(I am still wondering if a "putmask" can ever be as efficient as the current Categorical-specific "replace", though)
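A rough sketch of what "working on the categories instead of the values" means, using plain codes and a category list. `replace_via_categories` is a hypothetical function written for this illustration and is not how Categorical.replace is actually implemented.

```python
import numpy as np


def replace_via_categories(codes, categories, to_replace, value):
    """Replace one label by touching the categories, not every element.

    codes: integer ndarray, -1 marks a missing value.
    categories: list of unique labels; codes index into it.
    """
    categories = list(categories)
    if to_replace not in categories or to_replace == value:
        return codes.copy(), categories

    old = categories.index(to_replace)
    if value not in categories:
        # Cheap path: rename a single category, codes stay untouched.
        categories[old] = value
        return codes.copy(), categories

    # The replacement already exists: merge the two categories.
    new = categories.index(value)
    new_codes = codes.copy()
    new_codes[codes == old] = new
    # Drop the old category and shift the codes above it down by one.
    new_codes[new_codes > old] -= 1
    del categories[old]
    return new_codes, categories


codes = np.array([0, 1, 2, 0, -1])
merged_codes, merged_cats = replace_via_categories(codes, ["a", "b", "c"], "a", "c")
renamed_codes, renamed_cats = replace_via_categories(codes, ["a", "b", "c"], "a", "z")
```

The rename path never looks at the values at all, which is the efficiency a mask-based putmask would have to match.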

jbrockmendel commented 4 years ago

> I didn't fully understand it on the call

I have a branch that implements what I was describing; I will make a draft PR shortly for exposition.

> doesn't Categorical.replace still hold custom logic apart from _can_hold_element? (related to the fact that it can work on its categories to be more efficient, instead of on the actual values, but therefore needs to check whether the replacement is present in the categories already or not, etc ...)

It does, but the implementation via putmask I have in mind does something similar. I haven't checked perf.

> Or, as you mention, replace could be implemented generally in terms of other methods like putmask, but so then it is still putmask that we might need to add to the interface?

Sort of. The existing ExtensionBlock.putmask is pretty reasonable* as a general case assuming block.values[mask] = other is valid, i.e. if block._can_hold_element(other). But Block.putmask doesn't make that assumption, and has casting logic for when it fails. So the missing thing is not EA.putmask, but "what do we cast to when simple-putmask fails?" (either via a try/except or a can_hold_element check that returns False)

* I'm assuming away intricacies of how np.putmask handles repeating or truncating when other has a mismatched length. For all our internal usages, which is what I really care about, this assumption is benign.
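The shape of the problem can be sketched with plain ndarrays. Both helpers below are hypothetical stand-ins (the real Block and EA code differs), and falling back to object dtype is an assumption made only for illustration, since what to cast to is exactly the open question.

```python
import numpy as np


def can_hold_element(values, other):
    """Hypothetical check: can scalar `other` be stored in values.dtype safely?"""
    other_dtype = np.min_scalar_type(other)
    return np.can_cast(other_dtype, values.dtype, casting="safe")


def putmask_with_fallback(values, mask, other):
    """Set values[mask] = other, casting first when the dtype cannot hold it."""
    if can_hold_element(values, other):
        # Simple case: the assignment is valid for the existing dtype.
        out = values.copy()
        out[mask] = other
        return out
    # Fallback case. Casting to object is an assumption of this sketch.
    out = values.astype(object)
    out[mask] = other
    return out


ints = np.array([1, 2, 3], dtype=np.int64)
mask = np.array([True, False, True])
same_dtype = putmask_with_fallback(ints, mask, 0)
upcast = putmask_with_fallback(ints, mask, 1.5)
```

The first call stays int64; the second cannot hold 1.5 without loss, so it takes the casting branch.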

jbrockmendel commented 2 years ago

AFAICT the only thing discussed here that is really up in the air is the strictness of _from_sequence, for which the discussion has moved to #33254. Closing.
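For readers who have not followed that discussion, the strictness question looks roughly like this when reduced to NumPy datetime64. Both functions are hypothetical illustrations and not the pandas `_from_sequence` implementation.

```python
import numpy as np


def from_sequence_strict(scalars):
    """Hypothetical strict variant: accept only unambiguous datetime input."""
    arr = np.asarray(scalars)
    if arr.dtype.kind != "M":
        raise TypeError(f"expected datetime64 values, got dtype {arr.dtype}")
    return arr.astype("datetime64[ns]")


def from_sequence_lenient(scalars):
    """Hypothetical lenient variant: reinterpret integers as i8 nanoseconds."""
    arr = np.asarray(scalars)
    if arr.dtype.kind == "i":
        return arr.astype(np.int64).view("datetime64[ns]")
    return arr.astype("datetime64[ns]")


lenient = from_sequence_lenient([0, 1])
strict = from_sequence_strict([np.datetime64("2020-01-01")])
```

The lenient variant silently treats `[0, 1]` as nanoseconds since the epoch, while the strict one raises for the same input.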

pandas-dev / pandas

EA: revisit interface #32586
