Closed christophe-rannou closed 7 years ago
If you want to apply one-hot encoding to string columns, simply use the sklearn.preprocessing.LabelBinarizer transformer class. It has exactly the same effect as a sequence of LabelEncoder followed by OneHotEncoder.
mapper = DataFrameMapper([
("country_name", LabelBinarizer())
])
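As a quick sanity check of that equivalence, here is a minimal sketch using plain scikit-learn, without DataFrameMapper. The country values are made up for illustration. Note that the equivalence assumes at least three distinct categories: with exactly two classes, LabelBinarizer returns a single 0/1 column rather than two.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

# Illustrative string column (hypothetical values, three distinct categories).
countries = np.array(["France", "Spain", "Italy", "France"])

# One step: LabelBinarizer accepts the strings directly.
binarized = LabelBinarizer().fit_transform(countries)

# Two steps: integer codes first, then one-hot on a 2-D column.
codes = LabelEncoder().fit_transform(countries)                         # shape (4,)
one_hot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray() # shape (4, 3)

# Both routes sort the categories, so the indicator columns line up.
same = np.array_equal(binarized, one_hot)
```

The explicit reshape(-1, 1) in the two-step route is the piece that DataFrameMapper does not do on its own.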
The OneHotEncoder transformation makes sense if your input data contains categorical integer columns.
Currently, sklearn_pandas.DataFrameMapper is unable to apply [LabelEncoder(), OneHotEncoder()] to a string column due to the "matrix transpose" problem: LabelEncoder returns a 1-D array of shape [n_samples], whereas OneHotEncoder expects a 2-D array of shape (n_samples, 1). You could additionally open an issue with the sklearn_pandas project and ask for their opinion about it.
It would be possible to make [LabelEncoder(), OneHotEncoder()] work by developing a custom Scikit-Learn transformer that handles the "matrix transpose", for example [LabelEncoder(), MatrixTransposer(), OneHotEncoder()]. This MatrixTransposer operation would be a no-op from the PMML perspective.
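A minimal sketch of what such a transformer might look like is shown below. MatrixTransposer is the hypothetical name from the suggestion above, not an existing class in scikit-learn or sklearn_pandas, and this sketch says nothing about whether a PMML converter would recognize it. The steps are chained by hand here rather than through DataFrameMapper.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


class MatrixTransposer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: turn a 1-D array into a single 2-D column."""

    def fit(self, X, y=None):
        # Nothing to learn; the reshape is stateless.
        return self

    def transform(self, X):
        # (n_samples,) -> (n_samples, 1)
        return np.asarray(X).reshape(-1, 1)


# Manual chaining of the three steps on an illustrative string column.
labels = LabelEncoder().fit_transform(["a", "b", "c", "a"])  # shape (4,)
column = MatrixTransposer().fit_transform(labels)            # shape (4, 1)
encoded = OneHotEncoder().fit_transform(column).toarray()    # shape (4, 3)
```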
Thanks, I clearly did not understand LabelBinarizer, which indeed fits my use case perfectly.
I am trying to use both a LabelEncoder() and a OneHotEncoder() within the same pipeline (as OneHotEncoder does not support string values) and I cannot find the right way to do so.
I found examples such as
But in my case it is the same column that is LabelEncoded and then OneHotEncoded. I tried the following:
Which results in an error:
ValueError: Number of labels=16677 does not match number of samples=1
It seems that the problem is that the output of LabelEncoder has shape [n_samples], while OneHotEncoder expects an array of shape (n_samples, 1) when there is a single feature, as in the current case.
Is there any way to properly integrate a LabelEncoder prior to a OneHotEncoder?
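The shape mismatch can be illustrated with NumPy alone. This is only a sketch of one plausible reading of the error: if a 1-D array of n labels is promoted to 2-D as a single row, it looks like one sample with n features, which would match "number of samples=1". The integer codes below are made-up stand-ins for LabelEncoder output.

```python
import numpy as np

# Stand-in for LabelEncoder output: one integer code per sample, 1-D.
codes = np.array([2, 0, 1, 0, 2])

# Promoting the 1-D array to 2-D naively yields ONE row (one "sample").
as_row = np.atleast_2d(codes)

# What OneHotEncoder expects for a single feature: one column, n rows.
as_column = codes.reshape(-1, 1)

print(codes.shape, as_row.shape, as_column.shape)
```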
EDIT: I found a workaround. Instead of using one mapper, I use two mappers and set the 'df_out' parameter of the first mapper to True, so that the output of the first DataFrameMapper is still a DataFrame rather than a plain array, which allows the second mapper to refer to columns by label ("cat_col_1"). Is this the right way to do it?
When parsing a pipeline with two mappers the following error is raised: