jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

May I use DataFrameMapper twice in a pipeline? #85

Closed oaksharks closed 6 years ago

oaksharks commented 6 years ago

Here is my code:

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import Imputer
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import sklearn2pmml
from sklearn_pandas import DataFrameMapper
df = pd.read_csv('iris.csv')
df.head()
x0 x1 x2 x3 y
0 5.1 3.5 1.4 0.2 Iris-setosa
1 5.0 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
dfm = DataFrameMapper([
    (["x0"], None),
    (["x1"], None),
    (["x2"], Imputer()),
    (["x3"], Imputer())
], df_out=True)

dfm1 = DataFrameMapper([
    (["x0"], Imputer()),
    (["x1"], Imputer()),
    (["x2"], None),
    (["x3"], None)
])
lr_estimator = LogisticRegression()
# lr_estimator.fit(df[['x0', 'x1', 'x2', 'x3']], df['y'])
X=df[['x0', 'x1', 'x2', 'x3']]
Y=df['y']
iris_pipeline = PMMLPipeline([
    ('mapper', dfm),
    ('mapper1', dfm1),
    ("classifier", lr_estimator)
])
iris_pipeline.fit(X, Y)
PMMLPipeline(steps=[('mapper', DataFrameMapper(default=False, df_out=True,
        features=[(['x0'], None), (['x1'], None), (['x2'], Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), (['x3'], Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0))],
        input_df=False, sparse=False)),
       ('mapper1', DataFrameMapper(default=False, df_out=False,
        features=[(['x0'], Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), (['x1'], Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), (['x2'], None), (['x3'], None)],
        input_df=False, sparse=False)),
       ('classifier', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
sklearn2pmml(iris_pipeline, "pmml_imputer_lr.pmml", with_repr = True, debug = True)
python: 3.6.3
sklearn: 0.19.2
sklearn.externals.joblib: 0.11
pandas: 0.23.4
sklearn_pandas: 1.7.0
sklearn2pmml: 0.38.0
java: 1.8.0_144
Executing command:
java -cp /home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/guava-26.0-jre.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/h2o-genmodel-3.20.0.6.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/istack-commons-runtime-3.0.5.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/javax.activation-api-1.2.0.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jaxb-api-2.3.0.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jaxb-core-2.3.0.1.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jaxb-runtime-2.3.0.1.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jcommander-1.72.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jpmml-converter-1.3.3.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jpmml-h2o-1.0.0.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jpmml-lightgbm-1.2.2.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.5.6.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.3.3.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/pmml-agent-1.4.5.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/pmml-model-1.4.5.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/pmml-model-metro-1.4.5.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/pyrolite-4.21.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/serpent-1.23.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/slf4j-api-1.7.25.jar:/home/aps/.local/lib/python3.6/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.25.jar org.jpmml.sklearn.Main --pkl-pipeline-input /tmp/pipeline-dzvhgj25.pkl.z --pmml-output pmml_imputer_lr.pmml
Standard output is empty
Standard error:
Oct 09, 2018 10:55:35 AM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Oct 09, 2018 10:55:35 AM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 23 ms.
Oct 09, 2018 10:55:35 AM org.jpmml.sklearn.Main run
INFO: Converting..
Oct 09, 2018 10:55:35 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Expected 0 element(s), got 4 element(s)
    at org.jpmml.sklearn.ClassDictUtil.checkSize(ClassDictUtil.java:63)
    at sklearn.Initializer.encodeFeatures(Initializer.java:39)
    at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:81)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:193)
    at org.jpmml.sklearn.Main.run(Main.java:145)
    at org.jpmml.sklearn.Main.main(Main.java:94)

Exception in thread "main" java.lang.IllegalArgumentException: Expected 0 element(s), got 4 element(s)
    at org.jpmml.sklearn.ClassDictUtil.checkSize(ClassDictUtil.java:63)
    at sklearn.Initializer.encodeFeatures(Initializer.java:39)
    at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:81)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:193)
    at org.jpmml.sklearn.Main.run(Main.java:145)
    at org.jpmml.sklearn.Main.main(Main.java:94)

Preserved joblib dump file(s): /tmp/pipeline-dzvhgj25.pkl.z

---------------------------------------------------------------------------

RuntimeError                              Traceback (most recent call last)

<ipython-input-99-a4d6ed9a8f52> in <module>()
----> 1 sklearn2pmml(iris_pipeline, "pmml_imputer_lr.pmml", with_repr = True, debug = True)

~/.local/lib/python3.6/site-packages/sklearn2pmml/__init__.py in sklearn2pmml(pipeline, pmml, user_classpath, with_repr, debug)
    241                                 print("Standard error is empty")
    242                 if retcode:
--> 243                         raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams")
    244         finally:
    245                 if debug:

RuntimeError: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams
from sklearn.externals import joblib
joblib.dump(iris_pipeline,'/tmp/model.pkl.z')

If I use only one DataFrameMapper, it works well, but with two it does not. Is there any advice?

oaksharks commented 6 years ago

I found that DataFrameMapper is treated as an initializer transformer, so no features may exist before it. This is verified in sklearn.Initializer.encodeFeatures:

@Override
    public List<Feature> encodeFeatures(List<Feature> features, SkLearnEncoder encoder){
        ClassDictUtil.checkSize(0, features); // ensure no features before
        return initializeFeatures(encoder);
    }
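
To make the failure concrete, here is a toy Python model of that check. It is only a sketch: `check_size`, `ToyInitializer`, and `encode_pipeline` are hypothetical names, not part of JPMML-SkLearn, and the assumption that a pipeline threads the feature list through its steps in order is inferred from the stack trace above.

```python
# Toy model of the converter's feature flow.
# All names here are hypothetical; this is not the JPMML-SkLearn API.


def check_size(expected, features):
    """Mimic the size check: raise if the feature list has an unexpected length."""
    if len(features) != expected:
        raise ValueError(
            "Expected {} element(s), got {} element(s)".format(expected, len(features))
        )


class ToyInitializer:
    """Stand-in for an initializer step such as a DataFrameMapper."""

    def __init__(self, columns):
        self.columns = columns

    def encode_features(self, features):
        # An initializer defines features from scratch, so nothing may precede it.
        check_size(0, features)
        return list(self.columns)


def encode_pipeline(steps):
    """Thread the feature list through the steps in order (assumed behavior)."""
    features = []
    for step in steps:
        features = step.encode_features(features)
    return features


columns = ["x0", "x1", "x2", "x3"]

# One initializer: it starts from an empty list, so the check passes.
single = encode_pipeline([ToyInitializer(columns)])

# Two initializers in a row: the second one receives four features and fails.
try:
    encode_pipeline([ToyInitializer(columns), ToyInitializer(columns)])
    chained_error = None
except ValueError as exc:
    chained_error = str(exc)

print(single)
print(chained_error)
```

In this model the second step fails with the same kind of message as the log above, because it is handed the four features that the first step produced.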

So I tried overriding the method in sklearn_pandas.DataFrameMapper to remove the check:

@Override
    public List<Feature> encodeFeatures(List<Feature> features, SkLearnEncoder encoder){
        return initializeFeatures(encoder);
    }

It works now, but is it possible? @vruusmann Looking forward to your help.

vruusmann commented 6 years ago

> It works now, but is it possible

You propose removing a "sanity check". The code will execute, but it will most likely produce insane/nonsensical results.

The solution is to use a FeatureUnion step to combine two DataFrameMapper steps together:

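from sklearn.pipeline import FeatureUnion

# Each mapper is applied to the original input; their outputs are concatenated
mapper_union = FeatureUnion([
  ("first", dfm),
  ("second", dfm1)
])
pipeline = PMMLPipeline([
  ("preprocessing", mapper_union),
  ("model", lr_estimator)
])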

I haven't executed the above code (just typing based on my memory), but this is the pattern that you should be following.
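
One thing worth checking before adopting this pattern: a FeatureUnion applies every branch to the same original input and concatenates the results side by side, so it is not equivalent to chaining the two mappers. The sketch below imitates that data flow in plain Python, with no scikit-learn involved. The helper names (`impute_mean`, `apply_mapper`, `union`) and the three-row table are made up for illustration.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def apply_mapper(table, spec):
    """Apply a list of (column, transform-or-None) pairs; return (name, values) pairs."""
    out = []
    for name, transform in spec:
        values = table[name]
        out.append((name, transform(values) if transform else list(values)))
    return out


def union(table, specs):
    """Run every spec on the same input table and concatenate the output columns."""
    out = []
    for spec in specs:
        out.extend(apply_mapper(table, spec))
    return out


# Made-up data with some missing values (None).
table = {
    "x0": [5.0, None, 4.0],
    "x1": [3.5, 3.0, None],
    "x2": [1.5, None, 1.0],
    "x3": [None, 0.2, 0.2],
}

# Same layout as dfm and dfm1 in the question.
first = [("x0", None), ("x1", None), ("x2", impute_mean), ("x3", impute_mean)]
second = [("x0", impute_mean), ("x1", impute_mean), ("x2", None), ("x3", None)]

combined = union(table, [first, second])
names = [name for name, _ in combined]

print(names)
for name, values in combined:
    print(name, values)
```

Under these assumptions the model receives eight columns instead of four, and the pass-through copies of each column still contain their missing values. If the intent is for every column to be imputed exactly once, a single DataFrameMapper that imputes all four columns may be the simpler choice.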