Well, how about a pseudo-code example with inputs and outputs from a sample frame? That would be useful.
@chrish42, an example would be great.
FYI scikit-learn has the OneHotEncoder class which fits into their pipeline.
Something like this should work?
import pandas as pd
from sklearn.base import TransformerMixin  # TransformerMixin lives in sklearn.base, not sklearn.pipeline

class DummyEncoder(TransformerMixin):

    def __init__(self, columns=None):
        self.columns = columns

    def transform(self, X, y=None, **kwargs):
        return pd.get_dummies(X, columns=self.columns)

    def fit(self, X, y=None, **kwargs):
        return self
Giving
In [15]: df
Out[15]:
A B C
0 1 a a
1 2 b a
In [16]: DummyEncoder().transform(df)
Out[16]:
A B_a B_b C_a
0 1 1 0 1
1 2 0 1 1
Be careful with the ordering of the columns.
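For instance, one way to guard against both missing columns and ordering differences (a sketch on hypothetical df_train/df_test frames, not part of the class above) is to reindex the test dummies against the training dummies:

# Remember the dummy columns produced at training time, then force the
# same columns, in the same order, on the test side. Columns missing in
# the test data are filled with 0; test-only columns are dropped.
train_d = pd.get_dummies(df_train)
test_d = pd.get_dummies(df_test).reindex(columns=train_d.columns, fill_value=0)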
@TomAugspurger, actually the compatibility with the sklearn processing pipeline itself is not the part that interests me. What I would like is the ability to save the transformation done by get_dummies() to a dataset, and then apply said transformation as-is (creating the exact same columns) to another dataset, even if that second dataset has only a subset of the values of the first one in some column, etc. That's actually what I meant by "usable in a train/test framework". Is this explanation clearer? (I can add an example if someone thinks that's still needed.)
I'm aware of the OneHotEncoder class in sklearn, but it has other limitations.
I stumbled upon the same problem as @chrish42, and I found get_dummies giving me some headaches.
Let us assume we work with data from the following df_train DataFrame:
df_train = pandas.DataFrame({"car":["seat","bmw"], "color":["red","green"]})
pandas.get_dummies(df_train)
car_bmw car_seat color_green color_red
0 0 1 0 1
1 1 0 1 0
Then we are provided with:
df_test = pandas.DataFrame({"car":["seat","mercedes"], "color":["red","green"]})
pandas.get_dummies(df_test)
car_mercedes car_seat color_green color_red
0 0 1 0 1
1 1 0 1 0
Since I have never observed a "mercedes" value for the variable "car" in df_train, I would like to be able to get the following one-hot encoding:
car_bmw car_seat color_green color_red
0 0 1 0 1
1 0 0 1 0
Where the column car_mercedes actually never appears.
This could be solved by allowing get_dummies to receive an input dictionary stating the accepted values that we allow for each column.
Returning to the previous example, we could give as input to get_dummies the following dict of sets
accepted_values_per_column = {'car': {'bmw', 'seat'}, 'color': {'green', 'red'}}
and we would expect get_dummies to return
get_dummies(df_test, accepted_values_per_column=accepted_values_per_column)
car_bmw car_seat color_green color_red
0 0 1 0 1
1 0 0 1 0
and we would expect get_dummies(df_test) to return what it already returns.
You simply need to make your variables Categorical if you want to specify possibly unobserved categories. This can be done at creation time or afterwards; see the docs.
In [5]: df_train = pd.DataFrame({"car": pd.Series(["seat","bmw"]).astype('category', categories=['seat','bmw','mercedes']), "color": ["red","green"]})
In [6]: df_train
Out[6]:
car color
0 seat red
1 bmw green
In [7]: pd.get_dummies(df_train)
Out[7]:
car_seat car_bmw car_mercedes color_green color_red
0 1 0 0 0 1
1 0 1 0 1 0
The original question is not well specified, so closing.
And when you're going the other way, from the encoding back to a Categorical, you'll use Categorical.from_codes.
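A minimal sketch of that inverse step (the dummies frame and category list here are illustrative):

import pandas as pd

# dummies: hypothetical one-hot frame for a single original column
dummies = pd.DataFrame({'car_bmw': [0, 1], 'car_seat': [1, 0]})
codes = dummies.values.argmax(1)  # column index of the 1 in each row
cat = pd.Categorical.from_codes(codes, ['bmw', 'seat'])  # back to categories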
One more bit of unsolicited advice: if you care at all about accurate estimates of the coefficients on the categoricals, drop one of the encoded columns, or else you'll have multicollinearity with the intercept (if you have one).
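Newer pandas versions can do that dropping for you via drop_first (a quick illustration; assumes a pandas version that supports drop_first):

import pandas as pd

colors = pd.Series(['red', 'green', 'red'])
pd.get_dummies(colors, drop_first=True)
# Only the 'red' indicator remains; 'green' becomes the baseline level.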
@TomAugspurger @jreback I think I have run into the same problem lately and I would like to state an example.
train_a = pd.DataFrame({"IsBadBuy": [0, 1, 0], "Make": ['Toyota', 'Mazda', 'BMW']})
print(pd.get_dummies(train_a, columns=['Make']))

   IsBadBuy  Make_BMW  Make_Mazda  Make_Toyota
0         0         0           0            1
1         1         0           1            0
2         0         1           0            0

test_a = pd.DataFrame({"Make": ['Toyota', 'BMW']})
print(pd.get_dummies(test_a, columns=['Make']))

   Make_BMW  Make_Toyota
0         0            1
1         1            0
Ideally, the Make_Mazda column should be preserved here, as the ML algorithm expects the same number of features, and the values we get in test will be a subset of those in train.
Use a Categorical. That will expand to the correct number of columns. I gave a talk about this, if you're interested: https://m.youtube.com/watch?v=KLPtEBokqQ0
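Applied to the Make example above, that would look something like this (a sketch; the exact spelling varies across pandas versions, this one uses CategoricalDtype from pandas >= 0.21):

import pandas as pd

# Give the test column every category seen in training, then encode.
make_dtype = pd.api.types.CategoricalDtype(categories=['BMW', 'Mazda', 'Toyota'])
test_a = pd.DataFrame({"Make": ['Toyota', 'BMW']})
test_a['Make'] = test_a['Make'].astype(make_dtype)
print(pd.get_dummies(test_a, columns=['Make']))
#    Make_BMW  Make_Mazda  Make_Toyota
# 0         0           0            1
# 1         1           0            0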
Thanks @TomAugspurger
The PyData Chicago 2016 talk given by @TomAugspurger was really well done. He did a fantastic job of illustrating all the reasons why this issue/request should not be closed. IMHO, either his DummyEncoder class or some reasonable equivalent should be included in pandas proper. Yes, I can go to his GitHub and copy/emulate his class, but it would be much nicer to have it supported within the library.
I think there's a need for a library that goes early in the data-modeling pipeline and works well with pandas and scikit-learn. But pandas doesn't depend on scikit-learn and vice-versa. I think there's room for another library built on top of both.
Here's a little solution some of us worked on that may be helpful for some here. Dummy variables with fit/transform capabilities.
https://github.com/joeddav/get_smarties
Feedback and contributions would be helpful!
This appears related to #14017
I have created a solution that may be helpful for exactly this problem: one-hot encoding a categorical variable in a train/test framework. It can also handle cases where the dataset is too large to fit in machine memory.
https://github.com/yashu-seth/dummyPy
You can also find a small tutorial on this here.
People subscribed to this may be interested in dask-ml's implementation.
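If I remember the dask-ml API correctly, the relevant pieces are Categorizer and DummyEncoder in dask_ml.preprocessing (a hedged sketch on hypothetical df_train/df_test frames; check the dask-ml docs for the exact interface):

from dask_ml.preprocessing import Categorizer, DummyEncoder
from sklearn.pipeline import make_pipeline

# Categorizer records the categories seen during fit; DummyEncoder then
# expands every later frame to that same fixed set of dummy columns.
pipe = make_pipeline(Categorizer(), DummyEncoder())
pipe.fit(df_train)
encoded = pipe.transform(df_test)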
@TomAugspurger This code doesn't work. When I go to transform my production single-record data, it only gives me the one-hot encoded column for the single value that is present. What am I missing?
import pyodbc
import pickle
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin  # TransformerMixin lives in sklearn.base
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

class DummyEncoder(TransformerMixin):
    def fit(self, X, y=None):
        self.index_ = X.index
        self.columns_ = X.columns
        self.cat_columns_ = X.select_dtypes(include=['category']).columns
        self.non_cat_columns_ = X.columns.drop(self.cat_columns_)

        self.cat_map_ = {col: X[col].cat for col in self.cat_columns_}

        left = len(self.non_cat_columns_)
        self.cat_blocks_ = {}
        for col in self.cat_columns_:
            right = left + len(X[col].cat.categories)
            self.cat_blocks_[col], left = slice(left, right), right
        return self

    def transform(self, X, y=None):
        return np.asarray(pd.get_dummies(X))

    def inverse_transform(self, X):
        non_cat = pd.DataFrame(X[:, :len(self.non_cat_columns_)],
                               columns=self.non_cat_columns_)
        cats = []
        for col, cat in self.cat_map_.items():
            slice_ = self.cat_blocks_[col]
            codes = X[:, slice_].argmax(1)
            series = pd.Series(pd.Categorical.from_codes(
                codes, cat.categories, ordered=cat.ordered
            ), name=col)
            cats.append(series)
        df = pd.concat([non_cat] + cats, axis=1)[self.columns_]
        return df
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER={XXXXX};DATABASE={ML_Learn_Taxi};UID={XXXX};PWD={XXXX}')

sql = """
SELECT top 1 CONVERT(int, [order_key]) order_key
      ,CONVERT(int, [service_date_key]) service_date_key
      ,[order_source_desc]
      ,1 as 'return_flag'
FROM [ML_Return_Customer].[dbo].[return_customers_test_set]
WHERE [order_source_desc] = 'Online'
UNION
SELECT top 2 CONVERT(int, [order_key])
      ,CONVERT(int, [service_date_key])
      ,[order_source_desc]
      ,2
FROM [ML_Return_Customer].[dbo].[return_customers_test_set]
WHERE [order_source_desc] = 'Inbound Call'
UNION
SELECT top 1 CONVERT(int, [order_key])
      ,CONVERT(int, [service_date_key])
      ,[order_source_desc]
      ,1
FROM [ML_Return_Customer].[dbo].[return_customers_test_set]
WHERE [order_source_desc] = 'Outbound Call'
"""

prod_sql = """
SELECT top 1 CONVERT(int, [order_key]) order_key
      ,CONVERT(int, [service_date_key]) service_date_key
      ,[order_source_desc]
      ,1 as 'return_flag'
FROM [ML_Return_Customer].[dbo].[return_customers_test_set]
WHERE [order_source_desc] = 'Online'
"""
InputDataSet = pd.read_sql(sql, cnxn)
ProdDataSet = pd.read_sql(prod_sql, cnxn)

print("*** Data ***")
print(InputDataSet)

print("*** Category Columns Info ***")
columns = ['order_source_desc']
InputDataSet[columns] = InputDataSet[columns].apply(lambda x: x.astype('category'))
InputDataSet.info()

print("*** Linear Regression ***")

X = InputDataSet.drop('return_flag', axis=1)
y = InputDataSet['return_flag']

A = ProdDataSet.drop('return_flag', axis=1)
B = ProdDataSet['return_flag']

enc = DummyEncoder()
enc.fit(X)

Prod = enc.transform(A)
print(Prod)
OUTPUT:

*** Data ***
   order_key  service_date_key order_source_desc  return_flag
0   10087937          20151214            Online            1
1   10088174          20151201      Inbound Call            2
2   10088553          20151217      Inbound Call            2
3     663478          20160806     Outbound Call            1

*** Category Columns Info ***
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
order_key            4 non-null int64
service_date_key     4 non-null int64
order_source_desc    4 non-null category
return_flag          4 non-null int64
dtypes: category(1), int64(3)
memory usage: 284.0 bytes

*** Linear Regression ***
[[10087937 20151214 1]]
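(For what it's worth, this looks like the situation described earlier in the thread: the production frame's order_source_desc column was never given the full set of training categories, so get_dummies only expands the one value it sees. A hedged sketch of the fix, reusing the categories learned from X:)

# Give the production column the categories observed in training before
# transforming, so the dummy expansion covers the full set.
cats = X['order_source_desc'].cat.categories
A['order_source_desc'] = A['order_source_desc'].astype(
    pd.api.types.CategoricalDtype(categories=cats))
print(enc.transform(A))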
This thread is a bit messy, so I will try to summarize a simple solution here and show how this is already possible. I will demonstrate on one column, but you can generalize it to many.

In the "fit" call you just do:

categories = sorted(training_data.iloc[:, column_index].value_counts(dropna=True).index)

You store categories in the state you are learning during fitting.
And then in "transform" you do:
from pandas.api import types as pandas_types
categorical_data = testing_data.iloc[:, [column_index]].astype(
pandas_types.CategoricalDtype(categories=categories),
)
one_hot_encoded = pandas.get_dummies(categorical_data)
This will always do one-hot encoding with the same mapping of values to columns. If some categorical value was not present during training, it will be seen as NaN during testing; if some training value is not seen during testing, its column will simply contain all zeros.
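Putting the two snippets together into a minimal runnable sketch (the frame names and column index are illustrative):

import pandas
from pandas.api import types as pandas_types

training_data = pandas.DataFrame({"car": ["seat", "bmw"]})
testing_data = pandas.DataFrame({"car": ["seat", "mercedes"]})
column_index = 0

# "fit": learn the sorted set of categories from the training column
categories = sorted(training_data.iloc[:, column_index].value_counts(dropna=True).index)

# "transform": project the testing column onto exactly those categories
categorical_data = testing_data.iloc[:, [column_index]].astype(
    pandas_types.CategoricalDtype(categories=categories),
)
one_hot_encoded = pandas.get_dummies(categorical_data)
# columns are always car_bmw, car_seat; the unseen "mercedes" row is all zeros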
That's very nice. I just wish everyone who wants to do this didn't have to discover it anew. ;-)
The approach suggested by @mitar is a nice, short example. For a longer exploration of this issue here's a notebook that might be useful/helpful: https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/master/notebooks/Categorical_Encoding_Dangers.ipynb
Saw the code below in an exercise from the Kaggle XGBoost tutorial. This does the trick.
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
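One caveat: with join='left', any column present in X_train but absent from the other frame comes back as NaN rather than 0, so you typically want align's fill_value (or a fillna afterwards):

X_train, X_valid = X_train.align(X_valid, join='left', axis=1, fill_value=0)
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)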
I have also faced the same issue multiple times. I have written a class below (taking ideas from this discussion) that made things easier for me.
import pandas
from sklearn.preprocessing import LabelEncoder

class CategoryEncoder:
    '''
    labelEncoding : boolean -> True if the categorical columns are to be label encoded
    oneHotEncoding : boolean -> True if the categorical columns are to be one-hot encoded (using the pandas.get_dummies method)
    dropFirst : boolean -> True if the first column is to be dropped post one-hot encoding (usually to avoid multicollinearity);
                           doesn't matter if oneHotEncoding = False
    df : pandas.DataFrame -> dataframe object that needs to be encoded
    catCols : list -> list of the categorical columns that need to be encoded
    '''
    def __init__(self, labelEncoding=True, oneHotEncoding=False, dropFirst=False):
        self.labelEncoding = labelEncoding
        self.oneHotEncoding = oneHotEncoding
        self.dropFirst = dropFirst
        self.labelEncoder = {}
        self.oneHotEncoder = {}

    def fit(self, df, catCols=[]):
        df1 = df.copy()
        if self.labelEncoding:
            for col in catCols:
                labelEncoder = LabelEncoder()
                labelEncoder.fit(df1.loc[:, col].astype(str))
                df1.loc[:, col] = labelEncoder.transform(df1.loc[:, col].astype(str))
                self.labelEncoder[col] = labelEncoder.classes_
        if self.oneHotEncoding:
            for col in catCols:
                # categories are collected after any label encoding,
                # so fit and transform stay consistent with each other
                cats = sorted(df1.loc[:, col].value_counts(dropna=True).index)
                self.oneHotEncoder[col] = cats

    def transform(self, df, catCols=[]):
        df1 = df.copy()
        if self.labelEncoding:
            for col in catCols:
                mapping = {v: i for i, v in enumerate(self.labelEncoder[col].tolist())}
                df1.loc[:, col] = df1.loc[:, col].astype(str).map(mapping)
        if self.oneHotEncoding:
            for col in catCols:
                cats = self.oneHotEncoder[col]
                df1.loc[:, col] = df1.loc[:, col].astype(pandas.CategoricalDtype(categories=cats))
            df1 = pandas.get_dummies(df1, columns=catCols, drop_first=self.dropFirst)
        return df1
Easy to initiate and use an instance of the encoder as well.
enc1 = CategoryEncoder(True,False) # Will label encode but not one-hot encode
enc2 = CategoryEncoder(False,True,True) # Will one-hot encode but not label encode
enc3 = CategoryEncoder(True,True,True) # Will label encode first and then one-hot encode
# List of categorical columns you want to encode
categorical_columns = ['col_1', 'col_2']
enc1.fit(train_df, categorical_columns)
enc1.transform(test_df, categorical_columns) # Returns the dataframe encoded columns
NOTE: This will not take care of any exceptions, e.g. passing column names that are not available in the dataframe.
Having get_dummies() in Pandas is really nice, but to be useful for machine learning, it would need to be usable in a train/test framework (or "fit_transform" and "transform", in sklearn terminology). Let me know if this needs more explanation.
So, I guess this is a wishlist bug report to add that functionality to Pandas. I can even create a pull request, if people agree this would be something useful to have in Pandas (and are willing to coach a bit and do code review for what would be my first contribution to this project).