jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Add "dummy" estimator classes #34

Open zhangjunqiang opened 6 years ago

zhangjunqiang commented 6 years ago

Hello: When I use the converter like this: val oneHotPMML = ConverterUtil.toPMML(onehotSource.schema, oneHotModel), I got an error like this:

Exception in thread "main" java.lang.IllegalArgumentException: Expected a pipeline with one or more models, got a pipeline with zero models
    at com.netease.mail.yanxuan.rms.utils.ConverterUtil.toPMML(ConverterUtil.java:118)
    at com.netease.mail.yanxuan.rms.scala.nn.feature.FeatureModelExport$.main(FeatureModelExport.scala:29)
    at com.netease.mail.yanxuan.rms.scala.nn.feature.FeatureModelExport.main(FeatureModelExport.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

After debugging, I found the reason: there is no ModelConverter in my pipeline model. Is it necessary to have a ModelConverter in my PipelineModel?

vruusmann commented 6 years ago

Is it necessary to have a ModelConverter in my PipelineModel?

Yes, this requirement is clearly communicated by the exception message.

If you want to export pipelines that are feature transformation-dominant, then you should consider introducing a dummy (i.e. no-op) model into the pipeline. For example, in Scikit-Learn you can use the estimator types DummyRegressor and DummyClassifier for that purpose.

The model object is needed to define the "schema" of the pipeline - what the input features are, and what the output features are. Without a model object, the converter can only generate empty PMML documents.
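
A minimal Scala sketch of that idea in Spark ML terms (the column names, the onehotSource DataFrame, and the use of a trivial LinearRegression as the "dummy" model are assumptions for illustration, not something prescribed by JPMML-SparkML):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
    import org.apache.spark.ml.regression.LinearRegression
    import org.jpmml.sparkml.ConverterUtil

    // Feature transformation stages (column names are placeholders)
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")

    // A trivial regression acts as the "dummy" model. Its predictions are
    // irrelevant; it only anchors the input/output schema of the pipeline.
    val dummy = new LinearRegression()
      .setFeaturesCol("categoryVec")
      .setLabelCol("label")
      .setMaxIter(1)

    val pipelineModel = new Pipeline()
      .setStages(Array(indexer, encoder, dummy))
      .fit(onehotSource)

    // The converter now finds a model stage and can emit a non-empty PMML document
    val pmml = ConverterUtil.toPMML(onehotSource.schema, pipelineModel)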

zhangjunqiang commented 6 years ago

Thank you for your reply. Is there any dummy model in Spark MLlib? I use Spark ML for my training.

vruusmann commented 6 years ago

Is there any dummy model in Spark MLlib?

Depending on your Apache Spark ML version, there may or may not be appropriate technical workarounds available.

For example, a potential solution (a sketch follows the list):

  1. Create a model-less Pipeline and fit it.
  2. Take the fitted PipelineModel and "manually" append an appropriate org.apache.spark.ml.PredictionModel object instance to it. Please note that you would be dealing with a PredictionModel subclass here (representing a model that has already been fitted), not with a Predictor subclass (representing a model that is yet to be fitted).
  3. Design the "schema" of the above model to match the inputs and outputs of your feature transformation workflow.
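A rough Scala sketch of steps 1 and 2 (not from this thread; the PipelineModelUtil helper, the column names, and the use of LinearRegression as the appended model are assumptions, and access to the private[ml] PipelineModel constructor may vary across Apache Spark versions):

    // The PipelineModel(uid, stages) constructor is private[ml] in most Spark
    // versions, so this helper has to be compiled inside the org.apache.spark.ml
    // package in order to gain access to it.
    package org.apache.spark.ml

    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.sql.DataFrame

    object PipelineModelUtil {

      // Appends a trivially fitted "dummy" model to a model-less PipelineModel,
      // so that the resulting pipeline defines a proper input/output schema.
      def appendDummyModel(pipelineModel: PipelineModel, dataset: DataFrame): PipelineModel = {
        // Transform the data with the existing feature stages, so that the dummy
        // model is fitted against the very columns that the pipeline produces
        // (column names below are placeholders).
        val transformed = pipelineModel.transform(dataset)
        val dummyModel = new LinearRegression()
          .setFeaturesCol("categoryVec")
          .setLabelCol("label")
          .setMaxIter(1)
          .fit(transformed)

        new PipelineModel(pipelineModel.uid, pipelineModel.stages :+ (dummyModel: Transformer))
      }
    }

The resulting PipelineModel could then be passed to ConverterUtil.toPMML(...) together with the schema of the original dataset, as per step 3.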

vruusmann commented 6 years ago

Someone might search the Apache Spark JIRA to see whether a feature request for dummy estimator classes already exists.

I wouldn't want to create and maintain these classes myself. But if absolutely necessary, I will do it.

vruusmann commented 6 years ago

Reopening, because I might want to provide some sort of easier workaround in the JPMML-SparkML library.

zhangjunqiang commented 6 years ago

@vruusmann That's very nice of you!