pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

EA: revisit interface #32586

Closed jbrockmendel closed 2 years ago

jbrockmendel commented 4 years ago

This is as good a time as any to revisit the "experimental" EA interface.

My read of the Issues and recollection of threads suggests there are three main groups of topics:

Clarification of the Interface
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1) _values_for_argsort and _values_for_factorize

Ndarray Compat
^^^^^^^^^^^^^^^^^
5) Headaches have been caused by trivial ndarray methods not being on EA

Methods Needed/Wanted For Index/Series/DataFrame/Block ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 7) Suggested Methods (partial list)

I suggest we discuss these in order. Before jumping in, is there anything vital missing from this list? (this is only a small subset of the issues on the tracker)

cc @pandas-dev/pandas-core @xhochy

jorisvandenbossche commented 4 years ago

I didn't fully understand it on the call, but so my question was: doesn't Categorical.replace still hold custom logic apart from _can_hold_element? (related to the fact that it can work on its categories to be more efficient, instead of on the actual values, but therefore needs to check whether the replacement is present in the categories already or not, etc ...) Or, as you mention, replace could be implemented generally in terms of other methods like putmask, but so then it is still putmask that we might need to add to the interface?

(I am still wondering if a "putmask" can ever be as efficient as the current Categorical-specific "replace", though)

jbrockmendel commented 4 years ago

> I didn't fully understand it on the call

I have a branch that implements what I was describing; I will open a draft PR shortly for exposition.

> doesn't Categorical.replace still hold custom logic apart from _can_hold_element? (related to the fact that it can work on its categories to be more efficient, instead of on the actual values, but therefore needs to check whether the replacement is present in the categories already or not, etc ...)

It does, but the implementation via putmask I have in mind does something similar. I haven't checked perf.
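For readers following along, the category-level shortcut raised in the question can be sketched roughly as below. This is only an illustration on a bare codes/categories pair, not pandas' actual Categorical.replace; the function name `replace_on_categories` and its signature are hypothetical. It shows why the check "is the replacement already a category?" matters: if it is not, only the categories change, and if it is, the codes have to be remapped.

```python
import numpy as np


def replace_on_categories(codes, categories, to_replace, value):
    """Hypothetical helper: replace one category value without touching the values themselves.

    `codes` is an integer array indexing into `categories`; -1 marks missing.
    Returns a (new_codes, new_categories) pair.
    """
    categories = list(categories)
    if to_replace not in categories or to_replace == value:
        # Nothing to do.
        return codes.copy(), categories

    old = categories.index(to_replace)
    if value not in categories:
        # Cheap path: rename the category, codes are untouched.
        categories[old] = value
        return codes.copy(), categories

    # The replacement already exists: merge `old` into `new` and drop `old`.
    new = categories.index(value)
    remap = np.arange(len(categories))
    remap[old] = new
    remap[remap > old] -= 1  # positions after the dropped category shift down
    del categories[old]
    new_codes = np.where(codes >= 0, remap[codes], -1)
    return new_codes, categories


codes = np.array([0, 1, 2, 0, -1])
cats = ["a", "b", "c"]

renamed_codes, renamed_cats = replace_on_categories(codes, cats, "a", "z")
merged_codes, merged_cats = replace_on_categories(codes, cats, "a", "c")
```

In the rename case the work is proportional to the number of categories; in the merge case the codes are remapped once, still without comparing against the underlying values.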

> Or, as you mention, replace could be implemented generally in terms of other methods like putmask, but so then it is still putmask that we might need to add to the interface?

Sort of. The existing ExtensionBlock.putmask is pretty reasonable* as a general case assuming block.values[mask] = other is valid, i.e. if block._can_hold_element(other). But Block.putmask doesn't make that assumption, and has casting logic for when it fails. So the missing thing is not EA.putmask, but "what do we cast to when simple-putmask fails?" (either via a try/except or a can_hold_element check that returns False)
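A minimal sketch of the two cases, using plain NumPy arrays rather than Blocks or EAs. The names `can_hold_element` and `putmask_with_fallback` are hypothetical stand-ins, and the fallback rule shown here (promote numeric dtypes, otherwise go to object) is an assumption for illustration, not what Block.putmask actually does.

```python
import numpy as np


def can_hold_element(values: np.ndarray, other) -> bool:
    """Hypothetical stand-in for a can-hold check: can `other` be stored losslessly?"""
    return np.can_cast(np.min_scalar_type(other), values.dtype)


def putmask_with_fallback(values: np.ndarray, mask: np.ndarray, other):
    """Set values[mask] = other in place when possible; otherwise cast a copy first."""
    if can_hold_element(values, other):
        # Simple case: the assignment is valid as-is.
        values[mask] = other
        return values

    # Fallback: decide what to cast to (assumed rule, for illustration only).
    other_dtype = np.min_scalar_type(other)
    if values.dtype.kind in "iuf" and other_dtype.kind in "iuf":
        new_dtype = np.result_type(values.dtype, other_dtype)
    else:
        new_dtype = np.dtype(object)
    result = values.astype(new_dtype)
    result[mask] = other
    return result


mask = np.array([True, False, True])

ints = np.array([1, 2, 3], dtype=np.int64)
same = putmask_with_fallback(ints, mask, 0)          # fits: modified in place
upcast = putmask_with_fallback(ints.copy(), mask, 1.5)  # does not fit: float64 copy
as_object = putmask_with_fallback(ints.copy(), mask, "x")  # does not fit: object copy
```

The open question in the comment above corresponds to the `new_dtype` decision: an EA author would need some way to say what the fallback dtype is.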

* I'm assuming away the intricacies of how np.putmask handles repeating or truncating of other when it has a mismatched length. For all our internal usages, which is what I really care about, this assumption is benign.
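To make the footnote concrete, here is a small example of the length-mismatch behavior being assumed away. It uses only np.putmask and boolean-mask assignment; the first case mirrors the example in the NumPy documentation.

```python
import numpy as np

# np.putmask repeats `values` when it is shorter than the target array,
# indexing it by position in the target rather than by position in the mask.
x = np.arange(5)
np.putmask(x, x > 1, [-33, -44])
# x is now [0, 1, -33, -44, -33]

# Boolean-mask assignment does not repeat: a length mismatch is an error.
y = np.arange(5)
try:
    y[y > 1] = [-33, -44]
    raised = False
except ValueError:
    raised = True
```

With a scalar or a same-length `other`, the two spellings agree, which is why the assumption is harmless for the internal usages mentioned.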

jbrockmendel commented 2 years ago

AFAICT the only thing discussed here that is really up in the air is the strictness of _from_sequence, for which the discussion has moved to #33254. Closing.