Closed jbrockmendel closed 2 years ago
I didn't fully understand it on the call, but so my question was: doesn't Categorical.replace still hold custom logic apart from _can_hold_element? (related to the fact that it can work on its categories to be more efficient, instead of on the actual values, but therefore needs to check whether the replacement is present in the categories already or not, etc ...)
Or, as you mention, replace could be implemented generally in terms of other methods like putmask, but so then it is still putmask that we might need to add to the interface?
(I am still wondering if a "putmask" can ever be as efficient as the current Categorical-specific "replace", though)
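To make the efficiency question concrete, here is a minimal sketch of the two strategies on a toy codes/categories pair. It is not the pandas implementation: `replace_via_categories`, `replace_via_values` and `decode` are hypothetical helpers, and missing values are assumed to be encoded as code -1.

```python
import numpy as np

def decode(codes, categories):
    # Turn integer codes back into values; -1 is assumed to mean "missing".
    return [categories[c] if c >= 0 else None for c in codes]

def replace_via_categories(codes, categories, to_replace, value):
    """Replace by editing the (small) categories, touching codes only when needed."""
    codes = np.asarray(codes).copy()
    categories = list(categories)
    if to_replace not in categories or to_replace == value:
        return codes, categories  # nothing to replace
    old = categories.index(to_replace)
    if value in categories:
        # Replacement is already a category: merge the old code into it,
        # then drop the old category and shift the codes above it down.
        new = categories.index(value)
        codes[codes == old] = new
        codes[codes > old] -= 1
        del categories[old]
    else:
        # Replacement is new: a rename of one category, codes stay untouched.
        categories[old] = value
    return codes, categories

def replace_via_values(values, to_replace, value):
    """putmask-style replace: visits every element of the decoded values."""
    return [value if v == to_replace else v for v in values]

codes = [0, 1, 2, 1, -1]
categories = ["a", "b", "c"]

renamed = replace_via_categories(codes, categories, "b", "z")
merged = replace_via_categories(codes, categories, "a", "c")
```

In the rename case only the categories change; in the merge case the codes are rewritten, but the comparison is still against integer codes rather than the decoded values.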
> I didn't fully understand it on the call
I have a branch that implements what I was describing; I will make a draft PR shortly for exposition.
> doesn't Categorical.replace still hold custom logic apart from _can_hold_element? (related to the fact that it can work on its categories to be more efficient, instead of on the actual values, but therefore needs to check whether the replacement is present in the categories already or not, etc ...)
It does, but the implementation via putmask I have in mind does something similar. I haven't checked perf.
> Or, as you mention, replace could be implemented generally in terms of other methods like putmask, but so then it is still putmask that we might need to add to the interface?
Sort of. The existing ExtensionBlock.putmask is pretty reasonable* as a general case assuming block.values[mask] = other is valid, i.e. if block._can_hold_element(other). But Block.putmask doesn't make that assumption, and has casting logic for when it fails. So the missing thing is not EA.putmask, but "what do we cast to when simple-putmask fails?" (either via a try/except or a can_hold_element check that returns False)
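A rough sketch of that split on plain NumPy arrays, for illustration only. `putmask_with_fallback` is a hypothetical helper, `other` is assumed to be a scalar, and `np.result_type` stands in for both the can-hold check and the "what do we cast to" answer; pandas' actual casting rules are not reproduced here.

```python
import numpy as np

def putmask_with_fallback(values, mask, other):
    """Set values[mask] = other, casting first if values cannot hold `other`."""
    # Assumption: NumPy promotion decides the common dtype.
    common = np.result_type(values.dtype, other)
    if common == values.dtype:
        # "Can hold" case: the simple putmask is valid as-is.
        out = values.copy()
    else:
        # "Cannot hold" case: this cast is the piece that is actually missing.
        out = values.astype(common)
    out[mask] = other
    return out

ints = np.array([1, 2, 3])
mask = np.array([True, False, True])

kept = putmask_with_fallback(ints, mask, 0)      # stays integer
upcast = putmask_with_fallback(ints, mask, 1.5)  # cast to float64 first
```

The same structure could be written as a try/except around the assignment; either way, the interface question is what the fallback branch should return for an arbitrary EA.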
* I'm assuming away intricacies of how np.putmask handles repeating or truncating of other when it has a mismatched length. For all our internal usages, which is what I really care about, this assumption is benign.
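For reference, the intricacy being assumed away looks like this in plain NumPy (a small illustration, not pandas code): `np.putmask` repeats a too-short `values`, indexing it by position in the target, while boolean-mask assignment with the same mismatched input raises.

```python
import numpy as np

x = np.arange(5)
# Positions 2, 3, 4 are masked; the values are picked as values[n % len(values)].
np.putmask(x, x > 1, [-33, -44])

y = np.arange(5)
try:
    # Three masked positions but only two values: no repetition happens here.
    y[y > 1] = [-33, -44]
    setitem_raised = False
except ValueError:
    setitem_raised = True
```

After this runs, `x` is `[0, 1, -33, -44, -33]` and `y` is unchanged.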
AFAICT the only thing discussed here that is really up in the air is the strictness of _from_sequence, for which the discussion has moved to #33254. Closing.
This is as good a time as any to revisit the "experimental" EA interface.
My read of the Issues and recollection of threads suggests there are three main groups of topics:
Clarification of the Interface
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1) _values_for_argsort and _values_for_factorize
2) What characteristics should _ndarray_values have? Is it needed? (#32412) _ndarray_values has been removed
3) What should _from_sequence accept?
4) Should __iter__ return native types? #29738

Ndarray Compat
^^^^^^^^^^^^^^
5) Headaches have been caused by trivial ndarray methods not being on EA
   - #31199 size
   - #32342 "T" (just the most recent; this has come up a lot)
   - #24583 ravel
6) For arithmetic we're going to need something like either tile or broadcast_to
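
For item 6, the NumPy functions of the same names show the trade-off between the two options. This is only an ndarray illustration; what an EA-level equivalent would look like is not specified in this issue.

```python
import numpy as np

row = np.array([1, 2, 3])

# tile materializes a new (2, 3) array by copying the row.
tiled = np.tile(row, (2, 1))

# broadcast_to returns a read-only view with a zero stride on the new axis.
view = np.broadcast_to(row, (2, 3))

shares_tiled = np.shares_memory(tiled, row)
shares_view = np.shares_memory(view, row)
```

Both results compare equal element-wise, but only the broadcast version avoids the copy, at the cost of not being writeable.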

Methods Needed/Wanted For Index/Series/DataFrame/Block
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7) Suggested Methods (partial list)
   - #27264 duplicated
   - #28955 apply
   - #23179 map
   - #22680 hasnans
I suggest we discuss these in order. Before jumping in, is there anything vital missing from this list? (this is only a small subset of the issues on the tracker)
cc @pandas-dev/pandas-core @xhochy