pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Allow selection of reference values in get_dummies #33702

Open telferm57 opened 4 years ago

telferm57 commented 4 years ago

When one-hot encoding a pandas categorical column, with drop_first = True, there is no control over which value is dropped. So if I need to specify the reference value to drop, I can't use drop_first. I have to manually drop the columns that have been unnecessarily created.

I would like to enhance the get_dummies method to be able to specify for each column in 'columns' a reference value to be used as the dropped column. For example:


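import numpy as np
import pandas as pd

df = pd.DataFrame(data={'ageband': np.random.choice(range(20, 80, 10), 100),
                        'country': np.random.choice(['uk', 'nl', 'be', 'fr'], 100)}).astype('category')
ref_values = {'ageband': 40, 'country': 'be'}  # integer 40, matching the ageband categories

# proposed API: ref_values is not an existing get_dummies parameter
df_encoded = pd.get_dummies(df, columns=['ageband', 'country'], ref_values=ref_values)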

For each categorical column specified in the new ref_values parameter:

if the value exists, use it as the reference value

if the value does not exist, proceed with the normal behaviour, i.e. drop the (lexical) first value (or ignore with a warning?)
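The behaviour proposed above can be approximated today with a small wrapper around the existing function. What follows is a minimal sketch, not pandas API: `get_dummies_with_ref` is a hypothetical name, and it assumes that columns absent from `ref_values` keep all their levels and that a missing reference value produces a warning before falling back to the first category.

```python
import warnings

import pandas as pd


def get_dummies_with_ref(df, columns, ref_values, prefix_sep="_"):
    """Hypothetical helper (not part of pandas): one-hot encode `columns`
    and drop one chosen reference level per column."""
    encoded = pd.get_dummies(df, columns=columns, prefix_sep=prefix_sep)
    to_drop = []
    for col in columns:
        if col not in ref_values:
            continue  # assumption: no reference requested, keep every level
        levels = list(df[col].astype("category").cat.categories)
        ref = ref_values[col]
        if ref not in levels:
            # Fallback from the proposal: behave like drop_first=True.
            warnings.warn(f"{ref!r} not found in {col!r}; dropping the first level")
            ref = levels[0]
        # get_dummies names each column <column><prefix_sep><level>.
        to_drop.append(f"{col}{prefix_sep}{ref}")
    return encoded.drop(columns=to_drop)


demo = pd.DataFrame({"ageband": [20, 30, 40, 40],
                     "country": ["uk", "nl", "be", "fr"]}).astype("category")
encoded = get_dummies_with_ref(demo, ["ageband", "country"],
                               {"ageband": 40, "country": "be"})
print(list(encoded.columns))
```

The sketch builds on column-name matching, so it inherits the limitation that two levels with the same string form cannot be told apart.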

I don't think there are any API breaking implications?

To achieve this now, without the enhancement, I have to do something like:

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'ageband': np.random.choice(range(20, 80, 10), 100),
                        'country': np.random.choice(['uk', 'nl', 'be', 'fr'], 100)}).astype('category')

prefix_sep = ':'  # say
df = pd.get_dummies(df, columns=['ageband', 'country'], prefix_sep=prefix_sep)
ref_values = {'ageband': 40, 'country': 'be'}

# get_dummies names each new column <column><prefix_sep><value>
columns_to_drop = [col + prefix_sep + str(val) for col, val in ref_values.items()]
df.drop(columns=columns_to_drop, inplace=True, errors='ignore')
# additional code would be required to handle errors such as reference value not present
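One way to get both the chosen reference level and an explicit error for a missing value, without building column names by hand, is to move the reference value to the front of each column's categories and then rely on `drop_first=True`. This is a sketch under stated assumptions: `move_ref_first` is a hypothetical helper, the columns are already of categorical dtype, and raising `ValueError` is just one possible policy for a missing reference value.

```python
import pandas as pd


def move_ref_first(series, ref):
    """Hypothetical helper: return `series` with `ref` as its first category."""
    cats = list(series.cat.categories)
    if ref not in cats:
        raise ValueError(f"reference value {ref!r} is not a category of {series.name!r}")
    cats.remove(ref)
    return series.cat.reorder_categories([ref] + cats)


df = pd.DataFrame({"ageband": [20, 30, 40, 50],
                   "country": ["uk", "nl", "be", "fr"]}).astype("category")
ref_values = {"ageband": 40, "country": "be"}

for col, ref in ref_values.items():
    df[col] = move_ref_first(df[col], ref)

# For categorical columns get_dummies follows the category order,
# so drop_first=True now drops the chosen reference level.
df_encoded = pd.get_dummies(df, columns=list(ref_values),
                            prefix_sep=":", drop_first=True)
print(list(df_encoded.columns))
```

Reordering changes only the category order, not the data, but it does affect anything else that depends on that order, such as sorting.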

I am willing to have a go at this if it is accepted as an enhancement.

guust commented 4 years ago

This would be a very useful feature and would make the setting of reference categories in e.g. regression models much easier and more transparent.

Shirin636 commented 4 years ago

This would be very useful and would make managing data processing much easier. I have often wondered why it wasn't a feature, but hopefully this will be added as an enhancement very soon!

den4uk commented 4 years ago

This looks pretty useful and neat. A while ago I had to achieve something similar, and ended up with some horrible looking code. Definitely an enhancement worth adding. 👍

telferm57 commented 4 years ago

I suggest further:

If ref_vals is not a list, or the length of ref_vals differs from the length of 'columns', an exception will be raised.

If drop_first=True and ref_vals is supplied, raise an exception (could just warn?).

If the object is a Series, allow ref_vals='string' as well as ref_vals=['string'].
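These three rules could be checked up front, before any encoding happens. The sketch below is illustrative only: `normalize_ref_vals` is a hypothetical name, it follows the list form of `ref_vals` used in this comment rather than the dict from the original post, it raises rather than warns when `drop_first=True` is combined with `ref_vals`, and for a Series it accepts any bare value, not only a string.

```python
import pandas as pd


def normalize_ref_vals(data, columns, ref_vals, drop_first=False):
    """Hypothetical validator for the three rules above; not pandas API."""
    if ref_vals is None:
        return None
    if drop_first:
        raise ValueError("pass either drop_first=True or ref_vals, not both")
    if isinstance(data, pd.Series):
        # A Series is a single column, so a bare value is wrapped in a list.
        if not isinstance(ref_vals, list):
            ref_vals = [ref_vals]
        expected = 1
    else:
        if not isinstance(ref_vals, list):
            raise TypeError("ref_vals must be a list")
        expected = len(columns)
    if len(ref_vals) != expected:
        raise ValueError(f"ref_vals has {len(ref_vals)} entries, expected {expected}")
    return ref_vals


frame = pd.DataFrame({"ageband": [20, 40], "country": ["uk", "be"]})
print(normalize_ref_vals(frame, ["ageband", "country"], [40, "be"]))
print(normalize_ref_vals(frame["country"], None, "be"))
```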

telferm57 commented 4 years ago

I have started coding this and should have it complete soon.

iliya-malecki commented 8 months ago

A very useful feature! I would love it if it got implemented