sklearn2pmml 0.18 pipeline with SMOTE oversampling

geofizx commented 7 years ago

I have recently been working with the new sklearn2pmml version, which requires pipelines, and have run into an issue when converting a pipeline to PMML that includes a label-encoding mapper together with SMOTE oversampling of the input classes. The versions of all my packages are listed at the bottom for reference.

It appears that sklearn2pmml requires all steps in a pipeline to be fit at the same time. This is a problem for me, because my pipeline has a mapper (including a LabelEncoder) and a random forest classifier. The system that will eventually consume my output PMML will provide raw (categorical) features, so I believe I have to pass raw categorical data to pipeline.fit() for the PMML to reflect the label encoding in its data mapping. However, I also want to oversample the training data for the classifier in the pipeline. Oversampling with SMOTE produces a non-categorical ndarray of floats. The classifier can be fit on these floats, but the mapper does not transform them, since they are no longer categorical features. Is there any way to preserve the DataFrameMapper (categorical --> label encoding in this case) while also providing oversampled data (an ndarray of floats) to the classifier in the pipeline?

Thanks

Versions:
python: 2.7.12
sklearn: 0.18
sklearn.externals.joblib: 0.10.2
pandas: 0.19.0
sklearn_pandas: 1.2.0
sklearn2pmml: 0.15.0
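The float problem described above can be reproduced without SMOTE itself. The sketch below is only an illustration under stated assumptions: the `colors` column is made up, and the single interpolation step stands in for what SMOTE does when it synthesizes a sample between two existing ones.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column, for illustration only
colors = np.array(["red", "blue", "green", "blue"])

# LabelEncoder sorts the classes: blue -> 0, green -> 1, red -> 2
encoder = LabelEncoder()
codes = encoder.fit_transform(colors).astype(float)

# SMOTE-style step: a synthetic sample is a random point on the line
# segment between two existing samples (here "red" and "blue")
rng = np.random.RandomState(0)
lam = rng.uniform(0, 1)
synthetic = codes[0] + lam * (codes[1] - codes[0])

# The synthetic value falls between integer codes, so it no longer
# corresponds to any category known to the encoder
is_valid_code = float(synthetic).is_integer()
print(synthetic, is_valid_code)
```

Because the synthetic values are no longer valid category codes, they cannot be pushed back through the mapper, which is why the oversampled array only makes sense as direct input to the classifier.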

vruusmann commented 7 years ago

I'm not that familiar with SkLearn's over-/undersampling machinery. Could you please draft an example pipeline (using SkLearn's normal Pipeline class, instead of sklearn2pmml's "fixed layout" PMMLPipeline class) that does the right thing? You could base the example on the Audit dataset (that has both categorical and continuous features), as shown here: https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L115

> The issue is that it appears as though sklearn2pmml requires all steps in a pipeline to be fit at the same time.

The converter expects the pipeline argument to be an instance of sklearn2pmml.PMMLPipeline class. However, it does not care if this pipeline object was fitted by calling PMMLPipeline.fit(X, y), or if it was assembled from already fitted pieces:

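from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import PMMLPipeline, sklearn2pmml

iris_X, iris_y = load_iris(return_X_y=True)

# Fit the estimator on its own, outside of any pipeline
iris_clf = DecisionTreeClassifier()
iris_clf.fit(iris_X, iris_y)

# Wrap the already fitted estimator in a PMMLPipeline
iris_pipeline = PMMLPipeline([
  ("estimator", iris_clf)
])
# All pipeline steps contain fitted objects, so there's no need to fit it again
#iris_pipeline.fit(iris_X, iris_y)
sklearn2pmml(iris_pipeline, ...)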

geofizx commented 7 years ago

OK, I didn't realize that already fitted objects were valid. It all works as you suggest: I can now fit my classifier on the oversampled data and then pass the categorical data to the pipeline just for the mapping.

Thanks!
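A minimal sketch of the approach described above, not the poster's actual code. To keep it runnable with scikit-learn and pandas alone, it makes several assumptions: the toy data is invented, `OrdinalEncoder` stands in for the sklearn_pandas mapper with LabelEncoder, plain random oversampling via `sklearn.utils.resample` stands in for SMOTE, and scikit-learn's own `Pipeline` stands in for `PMMLPipeline`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.utils import resample

# Hypothetical raw categorical features with an imbalanced target
raw = pd.DataFrame({
    "color": ["red", "blue", "blue", "green", "red", "blue", "green", "red"],
    "size": ["S", "M", "L", "M", "S", "L", "M", "S"],
})
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# 1. Fit the encoder on the raw categorical data
#    (stand-in for the mapper with LabelEncoder)
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(raw)

# 2. Oversample the minority class in the encoded space
#    (random oversampling as a stand-in for SMOTE)
minority = y == 1
X_min, y_min = resample(
    X_encoded[minority], y[minority],
    replace=True, n_samples=int((~minority).sum()), random_state=0,
)
X_bal = np.vstack([X_encoded[~minority], X_min])
y_bal = np.concatenate([y[~minority], y_min])

# 3. Fit the classifier on the oversampled float array only
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_bal, y_bal)

# 4. Assemble the pipeline from the fitted pieces; do not call fit() again
pipeline = Pipeline([("encoder", encoder), ("classifier", clf)])
predictions = pipeline.predict(raw)  # accepts raw categorical input
```

With sklearn2pmml, the same fitted pieces would presumably go into a `PMMLPipeline` in place of `Pipeline`, following the iris example earlier in the thread. The step names and layout that the converter accepts may differ between versions, so treat that substitution as untested.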

mohitbadwal commented 7 years ago

@geofizx Could I get some sample code showing your approach to solving this SMOTE problem? I tried the approach given above, but without much success. Please help!

kiran90429 commented 7 years ago

@geofizx Could you please share a sample that shows how to overcome this issue? I have been trying day and night to fix it, but couldn't. Please help.

jpmml / sklearn2pmml

sklearn2pmml 0.18 pipeline with SMOTE oversampling #23