jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

ClassCastException when export sparkml model to pmml #96

Closed xmas1992 closed 4 years ago

xmas1992 commented 4 years ago

Hi,

When I try to export a Spark ML model as a PMML file, I get the following exception:

scala> val pmml = new PMMLBuilder(trainingData.schema, pipelineModel).build()
java.lang.ClassCastException: org.dmg.pmml.DataField cannot be cast to org.dmg.pmml.HasDiscreteDomain
  at org.jpmml.converter.PMMLUtil.getValues(PMMLUtil.java:109)
  at org.jpmml.converter.PMMLUtil.getValues(PMMLUtil.java:98)
  at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:223)
  at org.jpmml.sparkml.feature.StringIndexerModelConverter.encodeFeatures(StringIndexerModelConverter.java:75)
  at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:50)
  at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:114)
  ... 55 elided

Here is the Scala script I am running:

import org.apache.spark.ml.linalg._
import org.apache.spark.ml.regression._
import org.apache.spark.sql._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.dmg.pmml.PMML
import org.jpmml.model.JAXBUtil
import org.jpmml.sparkml.PMMLBuilder
import java.io.File
import javax.xml.transform.stream.StreamResult

val feature_data = spark.read.json("filepath")
val assembler = new VectorAssembler().setInputCols(Array("col1",..., "col2")).setOutputCol("features")
val feature_data_vec_without_label = assembler.transform(feature_data).select("features", "label")

val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(feature_data_vec_without_label)
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(feature_data_vec_without_label)

val Array(trainingData, testData) = feature_data_vec_without_label.randomSplit(Array(0.7, 0.3))

val gbt = new GBTClassifier().setMaxIter(10).setMaxDepth(5).setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, gbt))
val pipelineModel = pipeline.fit(trainingData)

val pmml = new PMMLBuilder(trainingData.schema, pipelineModel).build()

vruusmann commented 4 years ago

java.lang.ClassCastException: org.dmg.pmml.DataField cannot be cast to org.dmg.pmml.HasDiscreteDomain

Class org.dmg.pmml.DataField DOES implement the org.dmg.pmml.HasDiscreteDomain marker interface, and has done so for quite some time already.

If you're getting this exception, then I must assume that you're trying to pair an outdated JPMML-Model library with a more recent JPMML-SparkML library.

TLDR: Fix your classpath. The version compatibility issue is detailed in the README file.
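
One way to confirm the classpath mismatch described above is to ask the JVM where it loaded the classes from. The following is a minimal sketch, not part of JPMML-SparkML: the helper names `codeSourceOf` and `isSubtype` are hypothetical, and it assumes the snippet is pasted into the same spark-shell session that produced the exception. It uses only standard Java reflection calls.

```scala
import scala.util.Try

// Hypothetical helper: report the JAR (or directory) a class was loaded from,
// or say that it could not be found on the classpath.
def codeSourceOf(className: String): String =
  Try(Class.forName(className)).toOption match {
    case None => s"$className: not found on classpath"
    case Some(cls) =>
      // The code source is null for classes loaded by the bootstrap loader.
      val location = Option(cls.getProtectionDomain)
        .flatMap(pd => Option(pd.getCodeSource))
        .flatMap(cs => Option(cs.getLocation))
        .map(_.toString)
        .getOrElse("(bootstrap or unknown location)")
      s"$className: $location"
  }

// Hypothetical helper: true only if both classes load and the second
// is a supertype (class or interface) of the first.
def isSubtype(implName: String, ifaceName: String): Boolean =
  (for {
    impl  <- Try(Class.forName(implName))
    iface <- Try(Class.forName(ifaceName))
  } yield iface.isAssignableFrom(impl)).getOrElse(false)

// Which JAR provides the JPMML-Model class, and which provides the converter?
println(codeSourceOf("org.dmg.pmml.DataField"))
println(codeSourceOf("org.jpmml.sparkml.PMMLBuilder"))

// Does the DataField class that was actually loaded implement the interface?
println(isSubtype("org.dmg.pmml.DataField", "org.dmg.pmml.HasDiscreteDomain"))
```

If the last line prints `false`, or the first line points at a JAR other than the JPMML-Model version your JPMML-SparkML release expects, the session is picking up an outdated JPMML-Model library, which matches the diagnosis above.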