pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

New accessor API #26710

Open datapythonista opened 5 years ago

datapythonista commented 5 years ago

Currently, to extend pandas Series, DataFrame and Index with user-defined methods, we use accessors in the following way:

import pandas

@pandas.api.extensions.register_series_accessor('emoji')
class Emoji:
    def __init__(self, data):
        self.data = data

    def is_monkey(self):
        """
        This would create `Series().emoji.is_monkey`
        """
        return self.data.isin(['🙈', '🙉', '🙊'])

While this works well, I think there are two problems with this approach:

I think all the projects extending pandas I've seen, simply "inject" the methods (except the ones implemented by pandas maintainers). For example:

What I propose is to have an easier/simpler API for the user. To be specific, this is the syntax I'd like when extending Series...

import pandas

@pandas.Series.extend('emoji')
class Emoji:
    def __init__(self, data):
        self.data = data

    def is_monkey(self):
        """
        This would create `Series().emoji.is_monkey`
        """
        return self.data.isin(['🙈', '🙉', '🙊'])

@pandas.Series.extend(namespace='emoji')
def is_monkey(data):
    """
    This would also create `Series().emoji.is_monkey`
    """
    return data.isin(['🙈', '🙉', '🙊'])

@pandas.Series.extend
class Emoji:
    def __init__(self, data):
        self.data = data

    def is_monkey(self):
        """
        This would directly create `Series().is_monkey`
        """
        return self.data.isin(['🙈', '🙉', '🙊'])

@pandas.Series.extend
def is_monkey(data):
    """
    This would directly create `Series().is_monkey`
    """
    return data.isin(['🙈', '🙉', '🙊'])

This would make things much easier for the user, because:

CC: @pandas-dev/pandas-core

gfyoung commented 5 years ago

@pandas.Series.extend(namespace='emoji')
def is_monkey(data):
    """
    This would also create `Series().emoji.is_monkey`
    """
    return data.isin(['🙈', '🙉', '🙊'])

The second option (replicated above) seems like a logical one IMO. No overhead of OOP.

datapythonista commented 5 years ago

To be clear, what I'm proposing is:

  1. Let users be able to register both, classes (as we do now) and also single functions
  2. The name change pandas.api.extensions.register_series_accessor -> pandas.Series.extend
  3. Make optional the parameter of the decorator (currently named name, and named namespace in my example). If it's not present, register the methods directly on Series (or DataFrame, Index), and not under an accessor (such as str or dt).

gfyoung commented 5 years ago

Let users be able to register both, classes (as we do now) and also single functions

That's fair, though I think we should encourage functional over OOP.

Make optional the parameter of the decorator

Right

datapythonista commented 5 years ago

That's fair, though I think we should encourage functional over OOP.

Agreed, as long as the class doesn't add value we should encourage using a function, but there will be cases where a class is useful, for example:

@pandas.Series.extend
class Emoji:
    def __init__(self, data):
        self.data = data

    def is_monkey(self):
        return self.data.isin(['🙈', '🙉', '🙊'])

    def is_cat(self):
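        # A chained comparison on a Series raises ValueError, so combine two element-wise comparisons
        return (self.data > '😺') & (self.data < '😾')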

    def is_animal(self):
        return self.is_monkey() | self.is_cat()

gfyoung commented 5 years ago

but there will be cases where a class is useful

Hmmm...that's a good point. Not sure right now how we could compose in the functional version, though that would be quite useful.
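
One possible way to compose in the functional style, sketched under the assumption that registered functions stay ordinary module-level functions taking the Series as their first argument, so they can simply call each other. No registration is shown, since the decorator is only a proposal.

```python
import pandas


def is_monkey(data):
    return data.isin(["🙈", "🙉", "🙊"])


def is_cat(data):
    # Strictly between 😺 and 😾 by code point, as in the class example
    return (data > "😺") & (data < "😾")


def is_animal(data):
    # Composition is a plain function call; no shared class is needed
    return is_monkey(data) | is_cat(data)


s = pandas.Series(["🙈", "😻", "x"])
animals = is_animal(s)
print(animals.tolist())  # [True, True, False]
```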

jreback commented 5 years ago

what is the reason for this? is there some notion that things are 'hard' to extend? is that actually a bad thing? these are generally only for other libraries and NOT for users.

I think all the projects extending pandas I've seen, simply "inject" the methods (except the ones implemented by pandas maintainers). For example:

better to actually have these projects use an official api. if they want to do something ad-hoc that is up to them.

jbrockmendel commented 5 years ago

Two nits to pick:

1) Use a name other than "extend". There is already list.extend and Index.extend (and unfortunately these behave slightly differently). A user could be forgiven for expecting Series.extend to behave like the others.

2) IIRC our internally-implemented accessors have standardized on self._parent to avoid (further) overloading self._data. We should encourage this idiom, even if it isn't required.
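
For illustration, the `_parent` idiom from point 2 applied to the emoji example. This uses only the existing `register_series_accessor`; the accessor name `emoji_parent_demo` is made up for the example.

```python
import pandas


@pandas.api.extensions.register_series_accessor("emoji_parent_demo")
class EmojiAccessor:
    def __init__(self, parent):
        # Store the wrapped Series as `_parent` instead of `data`/`_data`
        self._parent = parent

    def is_monkey(self):
        return self._parent.isin(["🙈", "🙉", "🙊"])


flags = pandas.Series(["🙉", "a"]).emoji_parent_demo.is_monkey()
```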

datapythonista commented 5 years ago

Good points @jbrockmendel, I was a bit unsure about extend, but couldn't find anything much better, maybe register?

@jreback I used those libraries as examples, but they are not the point. I think it's about code readability, for third-party libraries, pandas itself, and users of pandas. Adding methods to Series,... is something that applies to all 3 cases.

For pandas, an example where this could be useful is DataFrame.to_stata. Personally I think it'd make more sense for that method definition to live in pandas/io/stata.py, where the rest of the related code is, and to have it registered there in the simplest possible way, so that just importing the module adds it to DataFrame. It would even be cool to be able to deregister. I have personally never used stata and would be happy to have one method less in DataFrame. :)

For third-party packages and user code, I agree that they should use an official API. If they do, we can warn them when they overwrite an attribute, we can keep track of registered methods... We do all that for accessors, but since we don't provide a way to register methods directly, they use DataFrame.attr = whatever and we can't offer them much.

Not sure what the drawback is here. I see a lot of potential for better code organisation in pandas, more modularity, and more scalability. Maybe we'd give up something by implementing this, but I don't see what.

shoyer commented 5 years ago

I don't love encouraging users to monkey-patch methods directly onto pandas.Series. I guess the argument is that people do it anyways, but that feels like an anti-pattern to me.

I like the class method, though. pandas.Series.extension could be a good name.

TomAugspurger commented 5 years ago

Agreed with Stephan. If people want to monkey patch directly, they're welcome to just do that without our help :)

On Fri, Jun 7, 2019 at 4:42 PM Stephan Hoyer notifications@github.com wrote:

I don't love encouraging users to monkey-patch methods directly onto pandas.Series. I guess the argument is that people do it anyways, but that feels like an anti-pattern to me.

I like the class method, though. pandas.Series.extension could be a good name.

shoyer commented 5 years ago

I think all the projects extending pandas I've seen, simply "inject" the methods (except the ones implemented by pandas maintainers). For example:

These examples look like more of a case of not knowing about pandas' accessor API. They already use a prefix for their special methods, so they might as well use a namespace:

datapythonista commented 5 years ago

I like to see the question as the same as Python with the standard library. Python was designed "batteries included", with lots of modules, but also with a standard and easy way for developers to build an ecosystem of modules around it. Those modules, while not included with Python and not maintained by the Python core devs, work exactly the same as the ones in the standard library. Once a module is installed, the difference between a standard library module and a third-party module is minimal. And I guess we are all happy with, and have all benefited from, this design.

In contrast, pandas is designed as a single piece, with increasing integration with the ecosystem, but still with a clear distinction between what we provide and third-party packages. To me, conceptually, pandas.io.stata and pandas-bokeh look like the same thing: an application that is plugged into the pandas core to provide extra functionality. But while conceptually they can be the same, in practice there are some important differences:

Personally, I think the modular design of Python or Django (which follows the same model) worked really well for them. And I think the steps taken with extension arrays, to create a single interface regardless of whether an array is backed by core numpy, is one of the others we provide, or comes from a third party, are also simplifying things for us.

I see this as moving in the same direction for Series, DataFrame and Index methods. And I think there are many immediate advantages:

I think the proposal here is a good first step to move in this direction, and I don't see any drawback.

jreback commented 5 years ago

@datapythonista your points are pretty general, not objectionable but orthogonal to the issues at hand

how does your proposal advance the current state in a meaningful way?

I am also -1 on patching directly to the main namespace as this is very, very confusing

how does a shorter accessor api actually help here? I am -1 if you are attempting to make this user accessible

it is library accessible, a crucial difference

datapythonista commented 5 years ago

For me the key issue of this proposal is being able to register methods directly in Series,... I guess we agree that renaming pandas.api.extensions.register_series_accessor to something shorter, or implementing accessors as classes or functions, is not that relevant, just making the code more beautiful (sorry for mixing the 3 here).

I think being able to register methods directly does advance the current state significantly. For example, I could register plot from pandas.plotting, and nothing in the rest of pandas would need to import it, solving problems with cycles in the imports. Or, as I said in the examples, we could have all the stata functionality in pandas.io.stata (and the same for excel, gbq...), and not split between the pandas core and their modules.

I understand your point about patching the main pandas classes, but Series has currently 204 methods (not counting attributes, accessors,...). I think defining a standard way of patching some of these methods, and using it consistently will make things clearer/easier, and not more confusing.

jorisvandenbossche commented 5 years ago

Something I have been thinking about, not the same but certainly related (quickly going to put it here before I am away for the weekend): that the data type can decide which methods are available on a Series. This could also be a way for an external party to add methods to a Series directly, but specifically when using ExtensionArrays (so certainly not a replacement for Marc's idea, as not every extension of pandas needs an extension dtype).

Like we now have the dt accessor, the Series could also say: OK, I am a datetime dtype, so for getting my methods/attributes, I will also check a list of methods that the dtype/EA listed as methods to be dispatched to the EA (we can also do this in __dir__ so that tab completion on actual objects works).
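
To make the idea a bit more concrete, here is a toy sketch. It deliberately uses a small wrapper class instead of touching Series itself, and both `DTYPE_METHODS` and `DispatchingSeries` are invented for the example; a real design would presumably let the dtype/EA supply the list of methods.

```python
import functools

import pandas

# Hypothetical registry: dtype kind -> {method name: function taking a Series}
DTYPE_METHODS = {
    "M": {"year_of": lambda data: data.dt.year},  # "M" is the datetime64 kind
}


class DispatchingSeries:
    """Toy wrapper (not a pandas class) that exposes extra methods per dtype."""

    def __init__(self, data):
        self._parent = pandas.Series(data)

    def _dtype_methods(self):
        return DTYPE_METHODS.get(self._parent.dtype.kind, {})

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails
        if name == "_parent":
            raise AttributeError(name)
        extra = self._dtype_methods()
        if name in extra:
            return functools.partial(extra[name], self._parent)
        # Everything else is forwarded to the wrapped Series
        return getattr(self._parent, name)

    def __dir__(self):
        # Include the dtype-specific names so tab completion can see them
        return sorted(set(super().__dir__()) | set(self._dtype_methods()))


dates = DispatchingSeries(pandas.to_datetime(["2019-06-07"]))
numbers = DispatchingSeries([1, 2, 3])

years = dates.year_of().tolist()
```

Here `year_of` shows up in `dir(dates)` but not in `dir(numbers)`, because only the datetime dtype lists it.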

jbrockmendel commented 5 years ago

pandas.io.stata is not decoupled from the core of pandas (I personally think our code would be much better if it was)

@datapythonista I think this merits its own discussion. Framing the issue in terms of decoupling will make it appealing.

datapythonista commented 5 years ago

Thanks @jbrockmendel, that makes sense. I thought this would be non-controversial besides naming things, and that once implemented it would allow us to have the discussion over a simple prototype PR, which would make things less abstract.

Will see what I can do, but I really think a more modular code base for pandas would be extremely beneficial, so I will open the discussion again once I can present my ideas more clearly.