Open mattharrison opened 5 years ago
Is this the @mattharrison from Twitter and PyCon? :smile:
Thank you for the proposed API! I like the idea there. In line with good statistical practice, I think it should be enabled by default. Would that be correct?
I'm working on getting #361 up and running, which will hopefully make me no longer a blocker to incoming PRs being released ASAP. Apologies if I can't get to it asap, but I will definitely keep this open and tag it as "available for hacking".
Hi Eric. Sorry we didn't get to catch up at PyCon. :(
Since you have already released your API, I would think carefully about changing the default behavior (even if it is desired).
I'm not in any particular rush. I'm going to include a reference to this feature in pyjanitor in my upcoming book. I also reference the pandas.get_dummies functionality and dropping of a column in the same section and was wondering if there was parity between the two.
@mattharrison Would this be like what you're suggesting?
With a drop_first trigger, this modifies expanded_df to expanded_df.iloc[:, 1:]:
import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method
def expand_column(
    df: pd.DataFrame,
    column_name,
    sep: str,
    concat: bool = True,
    drop_first: bool = False,
) -> pd.DataFrame:
    """
    :param df: A pandas DataFrame.
    :param column_name: A `str` indicating which column to expand.
    :param sep: The delimiter. Example delimiters include `|`, `, `, `,` etc.
    :param concat: Whether to return the expanded column concatenated to
        the original dataframe (`concat=True`), or to return it standalone
        (`concat=False`).
    :param drop_first: Whether to get k-1 dummies out of k categorical
        levels by removing the first level.
    :returns: A pandas DataFrame with an expanded column.
    """
    expanded_df = df[column_name].str.get_dummies(sep=sep)
    if drop_first:
        # Keep k-1 of the k dummy columns by dropping the first one.
        expanded_df = expanded_df.iloc[:, 1:]
    if concat:
        # Join returns a new frame; the caller's df is not mutated.
        return df.join(expanded_df)
    return expanded_df
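To make the effect of the drop_first branch concrete, here is a small sketch that runs the same two steps (str.get_dummies, then iloc[:, 1:]) on a toy frame. The data and the column names "id" and "tags" are made up for illustration and are not part of pyjanitor.

```python
import pandas as pd

# Hypothetical toy data; "id" and "tags" are assumed names for this sketch.
df = pd.DataFrame({"id": [1, 2, 3], "tags": ["a|b", "b|c", "a"]})

# Same call the proposed expand_column makes internally.
expanded = df["tags"].str.get_dummies(sep="|")

# What drop_first=True would do: drop the first dummy column by position.
dropped = expanded.iloc[:, 1:]

# concat=True joins the remaining dummies back onto the original frame.
joined = df.join(dropped)

print(list(expanded.columns))
print(list(joined.columns))
```

Because the first column is dropped by position, which level disappears depends on the column order that str.get_dummies produces.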
@sallyhong I would like to add a drop_first trigger to expand_column.
After some research, there has been some work on link
Series.str.get_dummies() has the sep parameter but no other parameters, whereas pandas.get_dummies() has drop_first and a variety of other parameters but not sep.
There is a PR on pandas-dev in progress that would defer Series.str.get_dummies() to pandas.get_dummies().
Overall, we can wait for this to be merged and then change our code to use pandas.get_dummies().
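For reference, a short sketch of the two calls being compared. It only uses the parameters mentioned in this thread (sep on Series.str.get_dummies, drop_first on pandas.get_dummies); the sample values are invented, and newer pandas releases may accept additional arguments.

```python
import pandas as pd

# Delimited values: handled by the string accessor, which accepts sep.
multi_valued = pd.Series(["a|b", "b|c", "a"])
from_str_accessor = multi_valued.str.get_dummies(sep="|")

# Single-valued categories: handled by the top-level function,
# which accepts drop_first but has no sep parameter.
single_valued = pd.Series(["x", "y", "z", "x"])
full = pd.get_dummies(single_valued)
reduced = pd.get_dummies(single_valued, drop_first=True)

print(list(from_str_accessor.columns))
print(list(full.columns))
print(list(reduced.columns))
```

Until the two are unified upstream, slicing the result of str.get_dummies by position is one way to approximate drop_first for delimited columns.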
Brief Description
In general, when you create dummy variables it is a good idea to drop one of the resultant columns as it is a linear combination of the other columns. See https://datascience.stackexchange.com/questions/27957/why-do-we-need-to-discard-one-dummy-variable
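A rough numerical illustration of that redundancy, assuming a single-valued categorical column and a model with an intercept: the full set of dummies sums to 1 in every row, so together with a constant column the design matrix loses a rank. The "color" data is invented for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical single-valued categorical data.
color = pd.Series(["red", "green", "blue", "green", "red"])
dummies = pd.get_dummies(color, dtype=int)

# Each row has exactly one 1, so the dummy columns always sum to 1.
row_sums = dummies.sum(axis=1)

# Design matrices with an intercept column, with and without the first dummy.
intercept = np.ones((len(dummies), 1))
X_full = np.hstack([intercept, dummies.to_numpy()])
X_reduced = np.hstack([intercept, dummies.iloc[:, 1:].to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # more columns than rank
print(X_reduced.shape[1], rank_reduced)  # columns equal rank
```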
Pandas has the drop_first option in get_dummies.
I would like to propose that drop_first be added as a parameter to expand_column.
Example API