jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Failure to convert when using StandardScaler and MLPRegressor #66

Closed kb3wmh closed 6 years ago

kb3wmh commented 6 years ago

My Python code using sklearn2pmml

 data_scaler = Pipeline(steps=[('transformer', scaler)])
 mlp_pipeline = PMMLPipeline(steps=[("scaler", data_scaler), ("mlpregressor", EstimatorProxy(mlp))])
 mlp_pipeline.fit(X_train, y_train)

 sklearn2pmml(model, "test.pmml", with_repr = True, debug=True)

The Java Error:

Jan 11, 2018 1:52:42 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Attribute 'sklearn2pmml.EstimatorProxy.estimator_' has an unsupported value (Python class sklearn2pmml.PMMLPipeline)
        at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
        at sklearn2pmml.EstimatorProxy.getEstimator(EstimatorProxy.java:126)
        at sklearn2pmml.EstimatorProxy.isSupervised(EstimatorProxy.java:67)
        at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:96)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast sklearn2pmml.PMMLPipeline to sklearn.Estimator
        at java.lang.Class.cast(Class.java:3369)
        at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
        ... 5 more

Exception in thread "main" java.lang.IllegalArgumentException: Attribute 'sklearn2pmml.EstimatorProxy.estimator_' has an unsupported value (Python class sklearn2pmml.PMMLPipeline)
        at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
        at sklearn2pmml.EstimatorProxy.getEstimator(EstimatorProxy.java:126)
        at sklearn2pmml.EstimatorProxy.isSupervised(EstimatorProxy.java:67)
        at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:96)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast sklearn2pmml.PMMLPipeline to sklearn.Estimator
        at java.lang.Class.cast(Class.java:3369)
        at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
        ... 5 more

vruusmann commented 6 years ago

Is this a complete code example?

You have EstimatorProxy(mlp) in your script, but I can't find the definition of mlp anywhere. The title of this issue suggests that it should be of type MLPRegressor, but the exception message suggests PMMLPipeline instead.
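One quick way to see what is actually being wrapped is to inspect the object before conversion. The helper below is only a minimal sketch and not part of sklearn2pmml: `unwrap_final_estimator` is a hypothetical name, and it relies solely on the `steps` attribute that scikit-learn pipelines (and therefore PMMLPipeline) expose. The stand-in classes exist only so that the snippet runs without a trained model.

```python
def unwrap_final_estimator(obj):
    """Follow nested pipelines down to the last non-pipeline step.

    Hypothetical helper: relies only on the `steps` attribute, a list of
    (name, object) tuples, as found on scikit-learn Pipeline objects.
    """
    while hasattr(obj, "steps"):
        obj = obj.steps[-1][1]
    return obj


# Stand-in classes so the sketch runs without scikit-learn installed.
class FakeRegressor:
    pass


class FakePipeline:
    def __init__(self, steps):
        self.steps = steps


regressor = FakeRegressor()
nested = FakePipeline([("inner", FakePipeline([("mlpregressor", regressor)]))])

print(type(nested).__name__)                          # FakePipeline
print(type(unwrap_final_estimator(nested)).__name__)  # FakeRegressor
```

With real objects, printing `type(mlp)` before wrapping it in EstimatorProxy should show MLPRegressor; if it shows PMMLPipeline, the wrapped object is a whole pipeline rather than the bare estimator.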

You should be able to make the conversion work by simplifying the PMML pipeline:

  1. Remove the usage of EstimatorProxy. It's not needed here, because class MLPRegressor doesn't contain any non-persistent attributes.
  2. Remove the nested pipeline.

This should work without problems:

pipeline = PMMLPipeline([
  ("scaler", StandardScaler()),
  ("regressor", MLPRegressor())
])
pipeline.fit(X_train, y_train)

kb3wmh commented 6 years ago

This still doesn't work:

from sklearn.neural_network import MLPRegressor
from sklearn2pmml import EstimatorProxy
from sklearn2pmml import PMMLPipeline, sklearn2pmml

from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import Pipeline
from sklearn.externals import joblib

if __name__ == "__main__":
    mlp_regressor = MLPRegressor() #I've tried this with and without this line
    scaler = StandardScaler() #Also this one
    mlp_regressor = joblib.load("mlp.pkl") # MLP model previously exported to pkl file
    scaler = joblib.load("scaler.pkl") # StandardScaler()

    pipeline = PMMLPipeline([
        ("scaler", scaler),
        ("regressor", mlp_regressor)
        ])

    sklearn2pmml(pipeline, "pipeline_test.pmml", debug=True)

I have saved the scaler and MLP regressor to pickle files so that I don't have to retrain the model and can more easily apply it to new data. This works: I can load the models back in, fit them into the pipeline, and get the same results. But I keep getting the java errors when I try to convert these models to a PMML.

It works if I don't use pickle files--which I can work around, if need be. But if you have an idea of what is going on, I'd be very grateful.

I'm very much a noob, so thank you so much for replying to me.

vruusmann commented 6 years ago

> But I keep getting the java errors when I try to convert these models to a PMML.

What are those exceptions? They must be different from the one shown above.

vruusmann commented 6 years ago

Training a Scikit-Learn pipeline, and saving its components to Pickle files:

from sklearn.datasets import load_iris

X, y = load_iris(return_X_y = True)

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

scaler = StandardScaler()
classifier = MLPClassifier()

pipeline = Pipeline([
    ("scaler", scaler),
    ("classifier", classifier)
])
pipeline.fit(X, y)

from sklearn.externals import joblib

joblib.dump(scaler, "scaler.pkl")
joblib.dump(classifier, "classifier.pkl")

Loading the components from Pickle files, and converting them to the PMML data format:

from sklearn.externals import joblib

scaler2 = joblib.load("scaler.pkl")
classifier2 = joblib.load("classifier.pkl")

from sklearn2pmml import PMMLPipeline

import numpy

pmml_pipeline = PMMLPipeline([
    ("scaler2", scaler2),
    ("classifier2", classifier2)
])
pmml_pipeline.active_fields = numpy.asarray(["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"])
pmml_pipeline.target_fields = numpy.asarray(["Species"])

from sklearn2pmml import sklearn2pmml

sklearn2pmml(pmml_pipeline, "iris.pmml", with_repr = True)

hardianlawi commented 6 years ago

Hi @vruusmann,

I noticed that I get an error when I run the code below:

from sklearn.datasets import load_iris

X, y = load_iris(return_X_y = True)

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

scaler = StandardScaler()

pipeline = Pipeline([
    ("scaler", scaler)
])
pipeline.fit(X, y)

from sklearn2pmml import PMMLPipeline

pmml_pipeline = PMMLPipeline([
    ("scaler", scaler)
])

from sklearn2pmml import sklearn2pmml

sklearn2pmml(pmml_pipeline, "iris.pmml", with_repr = True)

Do you know why? Do I always have to include the MLPClassifier in the PMMLPipeline? What is the correct way to do it if I only need the StandardScaler?

vruusmann commented 6 years ago

@hardianlawi What kind of error are you getting? I believe it's the same as here: https://github.com/jpmml/sklearn2pmml/issues/78

> Do I always have to include the MLPClassifier in the PMMLPipeline?

A pipeline is defined as a sequence of transformers, followed by an estimator. If the pipeline does not contain the final estimator step, then it is under-specified.

> What is the correct way to do it if I only need the StandardScaler?

Terminate your pipeline with a dummy estimator class such as sklearn.dummy.DummyClassifier or sklearn.dummy.DummyRegressor.
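A minimal sketch of that suggestion, assuming a toy feature matrix and an all-zeros placeholder target (both invented here for illustration). It uses a plain scikit-learn Pipeline so that it runs without sklearn2pmml; the same step list could presumably be passed to PMMLPipeline before calling sklearn2pmml, but that part is not exercised here.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for this example.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.zeros(len(X))  # placeholder target; the dummy estimator ignores X

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    # The final step is an estimator, so the pipeline is fully specified.
    ("estimator", DummyRegressor(strategy="mean")),
])
pipeline.fit(X, y)

# The fitted scaler step remains individually accessible.
X_scaled = pipeline.named_steps["scaler"].transform(X)
predictions = pipeline.predict(X)  # constant: the mean of y
```

The dummy estimator only memorizes a constant, so the scaler remains the only step doing real work.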

hardianlawi commented 6 years ago

@vruusmann Thanks for your reply.

I get the errors below when trying to run the code:

Standard output is empty
Standard error:
Mar 06, 2018 5:57:28 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Mar 06, 2018 5:57:28 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 34 ms.
Mar 06, 2018 5:57:28 PM org.jpmml.sklearn.Main run
INFO: Converting..
Mar 06, 2018 5:57:28 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Tuple contains an unsupported value (Python class sklearn.preprocessing.data.StandardScaler)
    at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
    at org.jpmml.sklearn.TupleUtil.extractElement(TupleUtil.java:48)
    at sklearn2pmml.PMMLPipeline.getEstimator(PMMLPipeline.java:369)
    at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:85)
    at org.jpmml.sklearn.Main.run(Main.java:145)
    at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast sklearn.preprocessing.StandardScaler to sklearn.Estimator
    at java.lang.Class.cast(Class.java:3369)
    at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
    ... 5 more

Exception in thread "main" java.lang.IllegalArgumentException: Tuple contains an unsupported value (Python class sklearn.preprocessing.data.StandardScaler)
    at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
    at org.jpmml.sklearn.TupleUtil.extractElement(TupleUtil.java:48)
    at sklearn2pmml.PMMLPipeline.getEstimator(PMMLPipeline.java:369)
    at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:85)
    at org.jpmml.sklearn.Main.run(Main.java:145)
    at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast sklearn.preprocessing.StandardScaler to sklearn.Estimator
    at java.lang.Class.cast(Class.java:3369)
    at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
    ... 5 more

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-34e2323412a5> in <module>()
     21 from sklearn2pmml import sklearn2pmml
     22 
---> 23 sklearn2pmml(pmml_pipeline, "iris.pmml", with_repr = True)

/usr/local/lib/python3.5/dist-packages/sklearn2pmml/__init__.py in sklearn2pmml(pipeline, pmml, user_classpath, with_repr, debug)
    304                                 print("Standard error is empty")
    305                 if retcode:
--> 306                         raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams")
    307         finally:
    308                 if debug:

RuntimeError: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

> Terminate your pipeline with a dummy estimator class such as sklearn.dummy.DummyClassifier or sklearn.dummy.DummyRegressor

Won't this add an extra inference step when I use the pipeline in Java? I have trained my model in TensorFlow using Python, and I only use StandardScaler to preprocess my data before running inference with the TensorFlow model. Let me know if I am missing something here!

vruusmann commented 6 years ago

> java.lang.IllegalArgumentException: Tuple contains an unsupported value (Python class sklearn.preprocessing.data.StandardScaler)

This exception doesn't make sense - Python class sklearn.preprocessing.data.StandardScaler is always registered with the SkLearn2PMML/JPMML-SkLearn runtime.

Maybe your SkLearn2PMML installation is corrupt or something.

> Won't this add an extra inference step when I use the pipeline in Java?

They are dummy estimators, so they don't take many resources to fit.

hardianlawi commented 6 years ago

Yo dude,

> This exception doesn't make sense - Python class sklearn.preprocessing.data.StandardScaler is always registered with the SkLearn2PMML/JPMML-SkLearn runtime.

Could you try running that on your machine? I tried it on both my remote and local machines, and both of them output the same exception.

> They are dummy estimators, so they don't take many resources to fit.

What I mean is that, with the dummy estimator added, I believe that when I load the saved model iris.pmml in Java, I won't be able to use only the StandardScaler() part. I imagine something like the following:

pipeline = load('iris.pmml');
pipeline.transform(x) -> an inference. What I am interested in is the preprocessing step.
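
For reference, the preprocessing step in question is a per-feature transform: StandardScaler stores the fitted mean and population standard deviation of each feature and computes z = (x - mean) / scale. The sketch below reimplements that arithmetic in plain Python purely to illustrate what the scaler step computes. The function names and the toy rows are hypothetical, and nothing here is part of sklearn2pmml or of any PMML evaluator API.

```python
from statistics import mean, pstdev


def fit_standard_scaler(rows):
    """Return per-column means and population standard deviations."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; use 1.0 to avoid dividing by zero.
    scales = [pstdev(col) or 1.0 for col in columns]
    return means, scales


def standard_scale(row, means, scales):
    """Apply z = (x - mean) / scale to one row of features."""
    return [(x - m) / s for x, m, s in zip(row, means, scales)]


# Toy data, made up for this example; the second column is constant.
rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
means, scales = fit_standard_scaler(rows)
scaled = standard_scale([3.0, 10.0], means, scales)
print(scaled)
```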

vruusmann commented 6 years ago

@hardianlawi Please work on your attitude - I don't owe you anything.

ghost commented 6 years ago

Hi @vruusmann,

I apologize if I sounded rude to you. I didn't mean it that way but thank you for your help anyway! I really appreciate it. Great work for doing everything alone.