pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Direct delegation of Series methods to ExtensionArrays #21305

Open xhochy opened 6 years ago

xhochy commented 6 years ago

While implementing ExtensionArrays that are not backed by NumPy, I quite often run into cases where it is simpler to write a complete re-implementation of a method defined on pd.Series than to use the current implementation, which delegates only part of the work. It would probably make sense to introduce some sort of delegation mechanism: either we continue the per-method delegation as in https://github.com/pandas-dev/pandas/blob/4274b840e64374a39a0285c2174968588753ec35/pandas/core/base.py#L1041, or we could add a really general interface like NumPy's __array_ufunc__: https://docs.scipy.org/doc/numpy/reference/arrays.classes.html#numpy.class.__array_ufunc__

My current use case comes from https://github.com/pandas-dev/pandas/issues/21296 and pd.Series.argsort, but I expect there will be many more cases like this as I continue to implement the ExtensionArray interface for Arrow arrays.
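To make the "really general interface" option concrete, here is a minimal pure-Python sketch of a single catch-all hook in the spirit of __array_ufunc__. Everything in it is hypothetical: ToySeries, RunLengthArray and the hook name __toy_series_function__ are invented for illustration and are not part of pandas or NumPy.

```python
class ToySeries:
    """Hypothetical stand-in for pd.Series; not the pandas implementation."""

    def __init__(self, values):
        self.values = values

    def _dispatch(self, name, fallback, *args, **kwargs):
        # Offer the whole operation to the backing array first.
        hook = getattr(self.values, "__toy_series_function__", None)
        if hook is not None:
            result = hook(name, *args, **kwargs)
            if result is not NotImplemented:
                return result
        # The array declined (or has no hook): use the generic path.
        return fallback(*args, **kwargs)

    def total(self):
        return self._dispatch("total", lambda: sum(self.values))

    def unique(self):
        def generic():
            seen = []
            for v in self.values:
                if v not in seen:
                    seen.append(v)
            return seen

        return self._dispatch("unique", generic)


class RunLengthArray:
    """Hypothetical array that stores (value, count) runs, not elements."""

    def __init__(self, runs):
        self.runs = list(runs)

    def __iter__(self):
        for value, count in self.runs:
            for _ in range(count):
                yield value

    def __toy_series_function__(self, name, *args, **kwargs):
        if name == "total":
            # Computed directly from the runs, without expanding them.
            return sum(value * count for value, count in self.runs)
        # Anything else falls back to the generic Series code.
        return NotImplemented


s = ToySeries(RunLengthArray([(2, 3), (5, 1), (2, 2)]))
print(s.total())   # handled by the array's hook
print(s.unique())  # array returned NotImplemented, generic path used
```

The design choice here is that the array opts in per operation by name and can decline with NotImplemented, so the Series-level fallback keeps working for methods the array does not care about.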

jorisvandenbossche commented 6 years ago

Quick comment: for the argsort case, I think this could be solved by changing np.argsort(values, ..) to values.argsort(..) in the Series.argsort implementation? (If this is blocking you, a fix is certainly welcome.)
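A rough illustration of why calling the method on the values matters, written in pure Python so it runs without pandas or NumPy. MiniSeries and DictEncodedArray are hypothetical names; argsort_generic only mimics the effect of np.argsort(values, ..), which needs the materialised elements, while argsort_delegating mimics values.argsort(..).

```python
class DictEncodedArray:
    """Hypothetical dictionary-encoded array (categories assumed sorted)."""

    def __init__(self, categories, codes):
        self.categories = list(categories)
        self.codes = list(codes)
        self.decoded = 0  # counts how many elements were materialised

    def __len__(self):
        return len(self.codes)

    def __iter__(self):
        for code in self.codes:
            self.decoded += 1
            yield self.categories[code]

    def argsort(self):
        # Because categories are sorted, ordering by code is enough;
        # no element has to be decoded.
        return sorted(range(len(self.codes)), key=self.codes.__getitem__)


class MiniSeries:
    """Hypothetical stand-in for pd.Series."""

    def __init__(self, values):
        self.values = values

    def argsort_generic(self):
        # Stand-in for np.argsort(values): materialise, then sort.
        materialised = list(self.values)
        return sorted(range(len(materialised)), key=materialised.__getitem__)

    def argsort_delegating(self):
        # Stand-in for values.argsort(): the array picks the strategy.
        return self.values.argsort()


a = DictEncodedArray(["a", "b", "c"], [2, 0, 1, 0])
b = DictEncodedArray(["a", "b", "c"], [2, 0, 1, 0])
print(MiniSeries(a).argsort_generic(), a.decoded)
print(MiniSeries(b).argsort_delegating(), b.decoded)
```

Both paths give the same ordering, but only the generic one forces every element to be decoded.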

But indeed, we should discuss this more in general.

TomAugspurger commented 6 years ago

In the abstract, I'm also interested in this. The set of methods that are dispatched to currently is pretty ad-hoc (essentially enough to get df.groupby('extension_array').mean() working :)

jbrockmendel commented 2 years ago

2 thoughts here

1) Since the OP we've added many private EA methods that we dispatch to under the hood (EA._where, EA._putmask, EA._quantile). We could address many of these cases by leaning heavily on that pattern.
2) Implementing something like __pandas_ufunc__ or __pandas_priority__ might be helpful, e.g. for #38946.
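The private-method pattern from point 1 could look roughly like the sketch below. It is pure Python and deliberately not pandas code: series_where and MaskedInts are hypothetical, and the _where(mask, other) signature shown is an assumption made for illustration, not the actual signature of EA._where.

```python
def series_where(values, mask, other):
    """Keep entries where mask is True, otherwise use `other`.

    Hypothetical Series-level entry point: it dispatches to a private
    `_where` hook when the array defines one.
    """
    hook = getattr(values, "_where", None)
    if hook is not None:
        return hook(mask, other)
    # Generic fallback for arrays without the hook.
    return [v if keep else other for v, keep in zip(values, mask)]


class MaskedInts:
    """Hypothetical array with separate data and validity lists."""

    def __init__(self, data, valid):
        self.data = list(data)
        self.valid = list(valid)

    def _where(self, mask, other):
        if other is None:
            # Replacing with "missing": only the validity flags change.
            valid = [v and keep for v, keep in zip(self.valid, mask)]
            return MaskedInts(self.data, valid)
        data = [d if keep else other for d, keep in zip(self.data, mask)]
        # Replaced slots become valid; kept slots retain their flag.
        valid = [v or not keep for v, keep in zip(self.valid, mask)]
        return MaskedInts(data, valid)


arr = MaskedInts([1, 2, 3], [True, False, True])
out = series_where(arr, [True, True, False], None)
print(out.data, out.valid)
print(series_where([1, 2, 3], [True, False, True], 0))
```

Compared with one catch-all hook, this keeps each delegated operation explicit, at the cost of adding a new private method for every operation that should be overridable.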

jbrockmendel commented 1 year ago

I don't think there's any appetite for adding an __array_ufunc__-like mechanism, but we are definitely moving in the direction of more methods being defined on the EAs and being directly delegated to.