pyjanitor-devs / pandas_flavor

The easy way to write your own flavor of Pandas
https://zsailer.github.io/software/pandas-flavor/
MIT License
301 stars 17 forks

How to use #3

Closed ericmjl closed 6 years ago

ericmjl commented 6 years ago

Hey @Zsailer, great to meet you at SciPy 2018!

I think pandas_flavor is what I'd like to switch over to in pyjanitor, where I simply register functions as a pandas accessor rather than subclass the entire dataframe outright.

There is something a bit magical about how pandas_flavor works though. With subclassing, everything is quite transparent - I subclass pandas DataFrames, then have the users wrap their existing dataframe inside a Janitor dataframe, following which, all of the data cleaning methods are available:

import pandas as pd
import janitor as jn

df = pd.DataFrame(...)
df = jn.DataFrame(df).clean_names()...

Say I decorated the janitor functions as pandas accessors. What would things look like for an end user? Would it be like the following?

import pandas as pd

df = pd.DataFrame(...).clean_names().remove_empty()...

I guess I'm just wondering, where and when does a decorated function get exposed up to pandas?

Thanks again for putting this out!

Zsailer commented 6 years ago

Hi @ericmjl, great to meet you too!

There are two ways you could expose pyjanitor methods to users:

1. Add an accessor with methods underneath

The recommended way is to add them underneath an accessor object. This would look like:

import pandas as pd
import janitor

df = pd.DataFrame(...)
df = df.janitor.clean_names()
df = df.janitor.remove_empty()

When you import janitor, it registers/attaches the .janitor accessor to the pandas DataFrame class. All the janitor methods live underneath this accessor, which keeps them self-contained. Because the accessor is attached to the class, every DataFrame in the running session will have the .janitor accessor.

To add an accessor and methods:

import pandas_flavor

@pandas_flavor.register_dataframe_accessor('janitor')
class JanitorAccessor(object):

    def __init__(self, df):
        self.df = df

    def clean_names(self):
        ...
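
As a rough illustration of where and when the accessor gets attached, here is a minimal pure-Python sketch of the general mechanism. It is not pandas_flavor's actual source: `Frame` is a hypothetical stand-in for `pandas.DataFrame`, and `register_frame_accessor` and `_AccessorDescriptor` are made-up names. The point is that the decorator runs when the module is imported and attaches a descriptor to the class, so every instance gains the attribute.

```python
class Frame:
    """Hypothetical stand-in for pandas.DataFrame (illustration only)."""

    def __init__(self, columns):
        self.columns = list(columns)


class _AccessorDescriptor:
    """Builds a fresh accessor object each time the attribute is read."""

    def __init__(self, accessor_cls):
        self._accessor_cls = accessor_cls

    def __get__(self, obj, objtype=None):
        if obj is None:
            # Read on the class itself (Frame.janitor): return the accessor class.
            return self._accessor_cls
        # Read on an instance: wrap that instance in the accessor.
        return self._accessor_cls(obj)


def register_frame_accessor(name):
    """Hypothetical decorator that attaches an accessor class to Frame."""

    def decorator(accessor_cls):
        setattr(Frame, name, _AccessorDescriptor(accessor_cls))
        return accessor_cls  # keep the decorated name bound to the class

    return decorator


@register_frame_accessor("janitor")  # runs at import time
class JanitorAccessor:
    def __init__(self, frame):
        self._frame = frame

    def clean_names(self):
        # Lowercase each name and replace spaces with underscores.
        return Frame(c.lower().replace(" ", "_") for c in self._frame.columns)


frame = Frame(["First Name", "Last Name"])
cleaned = frame.janitor.clean_names()
```

Under these assumptions, nothing has to be called explicitly: defining (or importing) the decorated class is enough for `frame.janitor` to exist.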

2. Add methods directly to the DataFrame

Your second option is to add methods directly to the DataFrame. This would allow you to chain commands like in your example above. The methods are attached to the DataFrame class itself, so they are available on every DataFrame instance.

This would look like:

import pandas as pd
import janitor

df = pd.DataFrame(...).clean_names().remove_empty()

To add methods, simply write them as functions and register them with the DataFrame.


import pandas_flavor

@pandas_flavor.register_dataframe_method
def clean_names(df):
    ...
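
To make the chaining behaviour concrete, here is a small pure-Python sketch of the idea, assuming a hypothetical `Frame` class in place of `pandas.DataFrame` and a made-up `register_frame_method` decorator. It only mimics the general pattern (attach a plain function to the class) and should not be read as how pandas_flavor implements it.

```python
class Frame:
    """Hypothetical stand-in for pandas.DataFrame (illustration only)."""

    def __init__(self, columns):
        self.columns = list(columns)


def register_frame_method(func):
    """Hypothetical decorator: expose `func` as a method on Frame."""
    # A function stored on a class becomes a method, so the instance
    # is passed in as the first argument when it is called.
    setattr(Frame, func.__name__, func)
    return func


@register_frame_method
def clean_names(frame):
    # Lowercase each name and replace spaces with underscores.
    return Frame(c.lower().replace(" ", "_") for c in frame.columns)


@register_frame_method
def remove_empty(frame):
    # Drop columns whose name is an empty string.
    return Frame(c for c in frame.columns if c)


# Each method returns a new Frame, so calls can be chained.
result = Frame(["First Name", "", "Age"]).clean_names().remove_empty()
```

Chaining works here only because every registered function returns a `Frame`; a function that returned nothing would break the chain at that step.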

Does this help answer your question?

ericmjl commented 6 years ago

The part that I was missing was that I just had to import janitor, and do nothing with it afterwards :smile:. Thanks for clarifying!

One thing that does happen with pyjanitor, though, is that upon decoration my functions (which all return a dataframe) now return None, which makes them untestable. I think I know what's going on (there is no return statement when registering a function); is this hypothesis correct? If so, would it make sense to put in a PR to return the original function as well, or would this break the functionality of pandas_flavor?

Zsailer commented 6 years ago

Ah, you're totally right! There should be return statements inside the inner function of the register_dataframe_method and register_series_method decorators. This won't break functionality and should allow you to run tests.
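
The effect of the missing return can be reproduced with a simplified, pure-Python sketch. This is an assumption-laden illustration, not the code in register.py: `Frame`, `register_without_return`, and `register_with_return` are hypothetical names. A decorator that returns nothing rebinds the decorated name to None, even though the method still gets attached.

```python
class Frame:
    """Hypothetical stand-in for pandas.DataFrame (illustration only)."""


def register_without_return(func):
    """Attaches the function to Frame but forgets to return it."""
    setattr(Frame, func.__name__, func)
    # No return statement: the decorated name is rebound to None.


def register_with_return(func):
    """Attaches the function to Frame and hands it back unchanged."""
    setattr(Frame, func.__name__, func)
    return func


@register_without_return
def broken(frame):
    return frame


@register_with_return
def fixed(frame):
    return frame


frame = Frame()

# The method form works in both cases...
via_method_broken = frame.broken()
via_method_fixed = frame.fixed()

# ...but only `fixed` is still usable as a plain function in tests.
broken_is_none = broken is None
via_function_fixed = fixed(frame)
```

Returning the original function leaves the registered method untouched, which is consistent with the expectation that the fix should not break existing functionality.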

We need to add a return statement after these lines: https://github.com/Zsailer/pandas_flavor/blob/bb892346dbe42c04725f0182c79e401496211bda/pandas_flavor/register.py#L31-L32

and

https://github.com/Zsailer/pandas_flavor/blob/bb892346dbe42c04725f0182c79e401496211bda/pandas_flavor/register.py#L51-L52

If you'd like to put in a PR, that would be great! Otherwise, I can do it later today.

Thanks!

ericmjl commented 6 years ago

I'm on it!