jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Perform OneHotEncoding and LabelEncoding within the same pipeline #38

Closed christophe-rannou closed 7 years ago

christophe-rannou commented 7 years ago

I am trying to use both a LabelEncoder() and a OneHotEncoder() within the same pipeline (as OneHotEncoder does not support string values) and I cannot find the right way to do so.

I found examples such as

my_mapper = DataFrameMapper([
  ("cat_col_1", OneHotEncoder()),
  ("bin_col_2", LabelBinarizer()),
  ("target", None)
])

But in my case it is the same column that is LabelEncoded and then OneHotEncoded. I tried the following:

mapper = DataFrameMapper([
    ("cat_col_1", [LabelEncoder(), OneHotEncoder()])
])
classifier = RandomForestClassifier()

pipeline = PMMLPipeline([
  ("mapper", mapper),
  ("classifier", classifier)
])
pipeline.fit(df, df["target"])

Which results in an error: ValueError: Number of labels=16677 does not match number of samples=1

It seems that the problem is that the output of LabelEncoder is a 1-D array of shape (n_samples,), while OneHotEncoder expects a 2-D array of shape (n_samples, 1) when there is a single feature, as in the current case.
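The shape mismatch can be reproduced outside of any mapper. The following is a minimal sketch, assuming a made-up string column (`values` is not from my data) and only the scikit-learn preprocessing classes named above:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical string column, used only for illustration.
values = np.array(["FR", "DE", "FR", "US"])

# LabelEncoder returns a 1-D array of integer codes: shape (n_samples,).
codes = LabelEncoder().fit_transform(values)

# OneHotEncoder wants a 2-D array, one column per feature: shape (n_samples, 1).
column = codes.reshape(-1, 1)

# The default output is a sparse matrix; toarray() makes it dense for inspection.
onehot = OneHotEncoder().fit_transform(column).toarray()

print(codes.shape, column.shape, onehot.shape)
```

The explicit reshape is the step that is missing when the two encoders are simply listed one after the other.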

Is there any way to properly integrate a LabelEncoder prior to a OneHotEncoder?

EDIT: I found a workaround. Instead of one mapper I use two, and set the 'df_out' parameter of the first mapper to True so that its output is still a DataFrame rather than a plain array, which allows the second mapper to refer to the column by its label ("cat_col_1"). Is this the right way to do it?

When parsing a pipeline with two mappers, the following error is raised:

Exception in thread "main" java.lang.UnsupportedOperationException
    at sklearn_pandas.DataFrameMapper.getOpType(DataFrameMapper.java:47)
    at org.jpmml.sklearn.SkLearnEncoder.updateFeatures(SkLearnEncoder.java:42)
    at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:93)
    at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:118)
    at org.jpmml.sklearn.Main.run(Main.java:146)
    at org.jpmml.sklearn.Main.main(Main.java:93)

vruusmann commented 7 years ago

If you want to apply one-hot-encoding to string columns, then you should simply use the sklearn.preprocessing.LabelBinarizer transformer class for that. It has exactly the same effect as a sequence of LabelEncoder followed by OneHotEncoder.

mapper = DataFrameMapper([
  ("country_name", LabelBinarizer())
])

The OneHotEncoder transformation makes sense if your input data contains categorical integer columns.
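To check the equivalence on a small example, here is a minimal sketch in plain scikit-learn, assuming a made-up column with three distinct values (`countries` is illustrative). With exactly two distinct values LabelBinarizer emits a single 0/1 column, so the comparison below is only meant for three or more classes:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

# Hypothetical string column with three distinct values.
countries = np.array(["FR", "DE", "FR", "US"])

# One step: LabelBinarizer accepts the strings directly.
one_step = LabelBinarizer().fit_transform(countries)

# Two steps: integer codes, reshaped to a column, then one-hot encoded.
codes = LabelEncoder().fit_transform(countries).reshape(-1, 1)
two_step = OneHotEncoder().fit_transform(codes).toarray()

# Both encoders order their columns by sorted class value.
same = np.array_equal(one_step, two_step)
print(same)
```

The values agree, although LabelBinarizer returns a dense integer array while OneHotEncoder returns a sparse float matrix by default.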

Currently, sklearn_pandas.DataFrameMapper is unable to apply [LabelEncoder(), OneHotEncoder()] on a string column due to the above "matrix transpose" problem. You could additionally open an issue with the sklearn_pandas project, and ask for their opinion about it.

It would be possible to make [LabelEncoder(), OneHotEncoder()] work by developing a custom Scikit-Learn transformer that handles the "matrix transpose". For example, [LabelEncoder(), MatrixTransposer(), OneHotEncoder()]. This MatrixTransposer operation would be a no-op from the PMML perspective.
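A rough sketch of what such a transformer might look like on the Python side follows. MatrixTransposer is a hypothetical class, not something that exists in Scikit-Learn or sklearn_pandas, and the sketch says nothing about whether the PMML converter would recognize it without matching support on the Java side:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MatrixTransposer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: turns a 1-D array into a single 2-D column."""

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        # (n_samples,) -> (n_samples, 1); the values themselves are unchanged.
        return np.asarray(X).reshape(-1, 1)


# Usage on LabelEncoder-style output (made-up integer codes).
codes = np.array([1, 0, 1, 2])
column = MatrixTransposer().fit_transform(codes)
print(column.shape)
```

Because the transformer only changes the array layout and not the values, it would have nothing to contribute to the PMML document itself.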

christophe-rannou commented 7 years ago

Thanks, I clearly had not understood LabelBinarizer, which indeed fits my use case perfectly.