pyjanitor-devs / pyjanitor

Clean APIs for data cleaning. Python implementation of R package Janitor
https://pyjanitor-devs.github.io/pyjanitor
MIT License

[ENH] expand_columns should support drop_first #368

Open mattharrison opened 5 years ago

mattharrison commented 5 years ago

Brief Description

In general, when you create dummy variables it is a good idea to drop one of the resultant columns as it is a linear combination of the other columns. See https://datascience.stackexchange.com/questions/27957/why-do-we-need-to-discard-one-dummy-variable

Pandas has the drop_first option in get_dummies
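
To make the linear-dependence point concrete, here is a minimal sketch using only pandas.get_dummies and NumPy. The `colors` data is invented for illustration. It shows that the full set of dummies plus an intercept column is rank deficient, while the drop_first version is not.

```python
import numpy as np
import pandas as pd

# Hypothetical single-label categorical column, for illustration only.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# k dummy columns versus k-1 dummy columns.
full = pd.get_dummies(colors, dtype=int)
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)

# Add an intercept column, as a linear model would.
intercept = np.ones(len(colors))
X_full = np.column_stack([intercept, full.to_numpy()])
X_reduced = np.column_stack([intercept, reduced.to_numpy()])

# The k dummies sum to the intercept, so X_full loses one rank.
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)        # 4 columns, rank 3
print(X_reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```

Note that this argument applies to mutually exclusive levels; with multi-label cells such as 'Fred,George', the dummy columns need not sum to a constant.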

I would like to propose that drop_first be added as a parameter to expand_column

Example API


>>> X_cat2 = pd.DataFrame({'A': [1, None, 3],
...     'names': ['Fred,George', 'George', 'John,Paul']})
>>> jn.expand_column(X_cat2, 'names', sep=',')
     A        names  Fred  George  John  Paul
0  1.0  Fred,George     1       1     0     0
1  NaN       George     0       1     0     0
2  3.0    John,Paul     0       0     1     1

to

>>> X_cat2 = pd.DataFrame({'A': [1, None, 3],
...     'names': ['Fred,George', 'George', 'John,Paul']})
>>> jn.expand_column(X_cat2, 'names', sep=',', drop_first=True)
     A        names  George  John  Paul
0  1.0  Fred,George       1     0     0
1  NaN       George       1     0     0
2  3.0    John,Paul       0     1     1

ericmjl commented 5 years ago

Is this the @mattharrison from Twitter and PyCon? :smile:

Thank you for the proposed API! I like the idea there. In line with good statistical practice, I think it should be enabled by default; would that be correct?

I'm working on getting #361 up and running, which will hopefully make me no longer a blocker to incoming PRs being released ASAP. Apologies if I can't get to it asap, but I will definitely keep this open and tag it as "available for hacking".

mattharrison commented 5 years ago

Hi Eric. Sorry we didn't get to catch up at PyCon. :(

Since you have already released your API, I would think carefully about changing the default behavior (even if the new behavior is desirable).

I'm not in any particular rush. I'm going to include a reference to this feature in pyjanitor in my upcoming book. I also reference the pandas.get_dummies functionality and dropping of a column in the same section and was wondering if there was parity between the two.

jk3587 commented 5 years ago

@mattharrison Would this be like what you're suggesting?

With drop_first=True, expanded_df is replaced by expanded_df.iloc[:, 1:]

import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method
def expand_column(
    df: pd.DataFrame,
    column_name,
    sep: str,
    concat: bool = True,
    drop_first: bool = False
) -> pd.DataFrame:
    """
    :param df: A pandas DataFrame.
    :param column_name: A `str` indicating which column to expand.
    :param sep: The delimiter. Example delimiters include `|`, `, `, `,` etc.
    :param bool concat: Whether to return the expanded column concatenated to
        the original dataframe (`concat=True`), or to return it standalone
        (`concat=False`).
    :param drop_first: Whether to get k-1 dummies out of k categorical
        levels by removing the first level.

    :returns: A pandas DataFrame with an expanded column.    
    """

    expanded_df = df[column_name].str.get_dummies(sep=sep)
    if drop_first:
        # Keep k-1 of the k dummy columns by dropping the first one.
        expanded_df = expanded_df.iloc[:, 1:]
    if concat:
        return df.join(expanded_df)
    return expanded_df

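
As a quick check of what the proposal would produce on the data from the issue description, the following sketch applies the same two steps (Series.str.get_dummies, then .iloc[:, 1:]) directly, without the pyjanitor wrapper. It assumes the dummy columns come back in sorted order, which is how Series.str.get_dummies orders them, so the first column dropped is Fred.

```python
import pandas as pd

# Data from the issue description.
X_cat2 = pd.DataFrame(
    {"A": [1, None, 3], "names": ["Fred,George", "George", "John,Paul"]}
)

# Step 1: one indicator column per label found in the delimited strings.
expanded = X_cat2["names"].str.get_dummies(sep=",")

# Step 2: the proposed drop_first behavior, dropping the first dummy column.
trimmed = expanded.iloc[:, 1:]

# Step 3: the concat=True path, joining the dummies back onto the frame.
result = X_cat2.join(trimmed)
print(result)
```
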
jk3587 commented 5 years ago

@sallyhong I would like to add a drop_first trigger to expand_column

jk3587 commented 5 years ago

After some research, I found that there has already been some work on this: link

Series.str.get_dummies() has the sep parameter but no others, whereas pandas.get_dummies() has drop_first and a variety of other parameters but not sep.

There is a PR on pandas-dev in progress that would defer Series.str.get_dummies() to pandas.get_dummies().

Overall, we can wait for this to be merged and then change our code to use pandas.get_dummies().
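
In the meantime, one possible way to get both sep-style splitting and drop_first from pandas.get_dummies() is sketched below. This is only an illustration, not what the pandas PR does: it splits each cell, explodes to one label per row (requires a pandas version with Series.explode), builds the dummies, and then collapses back to one row per original index. Missing values in the column are not handled here.

```python
import pandas as pd

names = pd.Series(["Fred,George", "George", "John,Paul"], name="names")

# One row per individual label; the original index is repeated as needed.
exploded = names.str.split(",").explode()

# pandas.get_dummies supports drop_first, unlike Series.str.get_dummies.
dummies = pd.get_dummies(exploded, drop_first=True, dtype=int)

# Collapse back to one row per original index: a label is present (1)
# if any of the exploded rows for that index has it.
per_row = dummies.groupby(level=0).max()
print(per_row)
```
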