pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Easy function for making dummy variable matrices #955

Closed wesm closed 10 years ago

wesm commented 12 years ago

There are already a few implementations floating around, but having something more structured, with more options, and in the pandas namespace would be nice.

from an e-mail on the statsmodels mailing list

Here's a quick hack at it (not too dissimilar to Aman's code it looks
like)-- I should find a place in the library to put this:

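import numpy as np
from pandas import Categorical, DataFrame

def make_dummies(data, cat_variables):
    # Keep the non-categorical columns, then join one dummy block per variable
    result = data.drop(cat_variables, axis=1)

    for variable in cat_variables:
        dummies = _get_dummy_frame(data, variable)
        result = result.join(dummies)
    return result

def _get_dummy_frame(data, column):
    # pandas.Factor was later renamed Categorical
    # (levels -> categories, labels -> codes)
    factor = Categorical(data[column])
    # Row i of the identity matrix is the indicator vector for category i,
    # so taking rows by integer code gives one indicator row per observation
    dummy_mat = np.eye(len(factor.categories)).take(factor.codes, axis=0)
    dummy_cols = ['%s.%s' % (column, v) for v in factor.categories]
    dummies = DataFrame(dummy_mat, index=data.index,
                        columns=dummy_cols)

    return dummies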

In [29]: df
Out[29]:
  gender  hand   color  height  age
0  male    right  green  5.75    23
1  female  right  brown  5.42    27
2  female  left   green  5.58    31
3  male    right  brown  5.92    39
4  male    right  blue   5.83    33

In [30]: make_dummies(df, ['gender', 'hand', 'color']).T
Out[30]:
              0     1     2     3     4
height         5.75  5.42  5.58  5.92  5.83
age            23    27    31    39    33
gender.female  0     1     1     0     0
gender.male    1     0     0     1     1
hand.left      0     0     1     0     0
hand.right     1     1     0     1     1
color.blue     0     0     0     0     1
color.brown    0     1     0     1     0
color.green    1     0     1     0     0

(BTW I read in that data using df = read_clipboard(sep=','))

changhiskhan commented 12 years ago

Not sure what people would want but in the absence of a strong reason to do otherwise, I would prefer to not transpose the axes.

wesm commented 12 years ago

I only transposed there to make it output to the console (lot of long-ish columns)

changhiskhan commented 12 years ago

got it.

wesm commented 12 years ago

i mean, you see the example above, right? You have multiple columns and you want to produce dummy columns for each combination of a set of factors

wesm commented 12 years ago

related: http://scipy-central.org/item/35/1/convert-categorical-data-in-a-structure-numpy-array-to-boolean-fields

cpcloud commented 11 years ago

i think this machinery might already be in patsy...might be possible to lift it from there

jreback commented 10 years ago

looks pretty covered by get_dummies
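
For reference, a minimal sketch of how one block of the output from the first comment could be reproduced with `pd.get_dummies` on a single column. The frame is retyped from the example above; `prefix_sep='.'` is chosen only to match the `color.green` naming used there, and `dtype=int` assumes a pandas version recent enough to accept that argument.

```python
import pandas as pd

# Example frame retyped from the first comment in this thread
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "male"],
    "hand": ["right", "right", "left", "right", "right"],
    "color": ["green", "brown", "green", "brown", "blue"],
    "height": [5.75, 5.42, 5.58, 5.92, 5.83],
    "age": [23, 27, 31, 39, 33],
})

# One Series in, one block of 0/1 indicator columns out
color_dummies = pd.get_dummies(df["color"], prefix="color",
                               prefix_sep=".", dtype=int)

# Join the block back onto the remaining columns
result = df.drop(columns=["color"]).join(color_dummies)
print(result.T)
```

Doing this for several columns still means one call per column plus a join or concat, which is the verbosity raised later in the thread.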

TomAugspurger commented 10 years ago

@jreback any opinion on reopening this so get_dummies can handle DataFrames?

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C

We could replace this

features = pd.concat([data.get(['Fare', 'Age']),
                      pd.get_dummies(data.Sex, prefix='Sex'),
                      pd.get_dummies(data.Pclass, prefix='Pclass'),
                      pd.get_dummies(data.Embarked, prefix='Embarked')],
                     axis=1)

with this

features = pd.get_dummies(data, include=['Sex', 'Pclass', 'Embarked'], exclude=['Fare', 'Age'])

Or we can check the dtypes on the DataFrame to see that [Fare, Age] are numeric and skip creating dummies for them automatically, so you can leave off the exclude parameter. The current way seems a bit verbose, especially when you have a mixture of categorical columns that need dummies and numerical columns that don't.
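
A minimal sketch of what the one-call version could look like. It relies on the `columns` argument that `pd.get_dummies` accepts for DataFrame input, not the `include`/`exclude` names floated above, which are only a proposal. The two rows are retyped from the sample data (a subset of its columns), and `dtype=int` is there just to get 0/1 output.

```python
import pandas as pd

# Two rows retyped from the sample above, keeping only the relevant columns
data = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.2833],
    "Embarked": ["S", "C"],
})

# Explicit list: Pclass is stored as an integer, so it has to be named
# in order to be encoded; Age and Fare pass through untouched
features = pd.get_dummies(data, columns=["Sex", "Pclass", "Embarked"],
                          dtype=int)

# With `columns` omitted, only object/string/category columns are encoded,
# so Pclass, Age and Fare are all left as they are
auto = pd.get_dummies(data, dtype=int)

print(features.columns.tolist())
print(auto.columns.tolist())
```

The second call is roughly the dtype-based behaviour described above: numeric columns are left alone without an exclude list, at the cost of having to name integer-coded categoricals such as Pclass explicitly.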

jorisvandenbossche commented 10 years ago

+1

jreback commented 10 years ago

@TomAugspurger nice idea. pls open a new issue for this though.

enmanuelsg commented 7 years ago

Here is another technique for creating dummies automatically: http://python-apuntes.blogspot.com.ar/2017/04/creacion-de-variables-de-grupo.html