jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

required: org.apache.spark.ml.PipelineModel Found : org.apache.spark.ml.classification.GBTClassificationModel #76

Closed yeikel closed 5 years ago

yeikel commented 5 years ago

Hi,

I am trying to use this library, but I get the following compilation error:

Error:(21, 47) type mismatch;
 found   : org.apache.spark.ml.classification.GBTClassificationModel
 required: org.apache.spark.ml.PipelineModel

I am using the library like this:

    val model = GBTClassificationModel.load("...")
    val dfSchema = spark.read.parquet("..")
    val pmml = new PMMLBuilder(dfSchema.schema, model).build()

Could you please clarify if this library could be used to transform the GBTClassificationModel to a PMML?

vruusmann commented 5 years ago

Apache Spark ML is built around the pipeline concept. The JPMML-SparkML library follows this idea, and uses org.apache.spark.ml.Pipeline(Model) as a conversion unit.
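
In other words (a minimal sketch, with `stages` and `trainingData` standing in for your own pipeline stages and training DataFrame):

    import org.apache.spark.ml.Pipeline
    import org.jpmml.sparkml.PMMLBuilder

    // PMMLBuilder expects a fitted PipelineModel, not an individual model object
    val pipelineModel = new Pipeline().setStages(stages).fit(trainingData)
    val pmml = new PMMLBuilder(trainingData.schema, pipelineModel).build()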

Could you please clarify if this library could be used to transform the GBTClassificationModel to a PMML?

Create a single-step PipelineModel based on your model object.

yeikel commented 5 years ago

@vruusmann Could you please share an example about your suggestion if you have it?

vruusmann commented 5 years ago

Could you please share an example about your suggestion?

In Java pseudocode:

Model<GBTClassificationModel> model = GBTClassificationModel.load("...");
// Wrap the fitted model as the only pipeline stage; the first constructor argument is an arbitrary uid
PipelineModel pipelineModel = new PipelineModel("pipeline", new Transformer[]{model});

During conversion, the pipeline object is also queried for label and feature information. If this single-step pipeline raises further conversion errors, then you might need to insert a (fake) StringIndexerModel (for the label specification) and a VectorAssembler (for the feature specification) into it.
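
For example, a sketch along those lines, assuming `model` is your loaded GBTClassificationModel, `trainingData` is the DataFrame it was trained on, and the column names "myLabel", "f1" and "f2" are placeholders for your actual columns:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
    import org.jpmml.sparkml.PMMLBuilder

    // Re-create the label and feature specifications around the loaded model
    val labelIndexer = new StringIndexer().setInputCol("myLabel").setOutputCol("label")
    val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")

    // Fitting the Pipeline fits the StringIndexer and passes the already-fitted model through unchanged
    val pipelineModel = new Pipeline().setStages(Array(labelIndexer, assembler, model)).fit(trainingData)
    val pmml = new PMMLBuilder(trainingData.schema, pipelineModel).build()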

yeikel commented 5 years ago

@vruusmann Thank you for your help. I am close, but I am missing something. Any help would be appreciated.

Exception in thread "main" java.util.NoSuchElementException: Failed to find a default value for inputCol

    val ml = GBTClassificationModel.load("....")
    val trainingData = spark.read.parquet("...")
    val fields = Array(".....")
    val assembler = new VectorAssembler().setInputCols(fields).setOutputCol("features")
    val sampleSchema = trainingData.select(fields.map(col): _*)
    val str = new StringIndexer().setOutputCol("label")
    val pipelineEstimator = new Pipeline().setStages(Array(str, assembler, ml)).fit(trainingData)
    val pmml = new PMMLBuilder(sampleSchema.schema, pipelineEstimator).build