Closed hjfrank1991 closed 4 years ago
` val schema = StructType(Array(
  StructField("POOD", StringType, nullable = false).withComment("feature"),
  StructField("MYCT", IntegerType, nullable = false).withComment("feature"),
  StructField("MMIN", IntegerType, nullable = false).withComment("feature"),
  StructField("MMAX", IntegerType, nullable = false).withComment("feature"),
  StructField("CACH", IntegerType, nullable = false).withComment("feature"),
  StructField("CHMIN", IntegerType, nullable = false).withComment("feature"),
  StructField("CHMAX", IntegerType, nullable = false).withComment("feature"),
  StructField("class", IntegerType, nullable = false).withComment("label")))
val indexer = new StringIndexer()
  .setInputCol("POOD")
  .setOutputCol("label_index")
  .setHandleInvalid("keep")
val onehot = new OneHotEncoderEstimator()
  .setInputCols(Array(indexer.getOutputCol))
  .setOutputCols(Array(indexer.getOutputCol + "_output"))
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", indexer.getOutputCol + "_output"))
  .setHandleInvalid("keep")
  .setOutputCol("vector_numcol")
val minmax = new MinMaxScaler()
  .setInputCol(vectorAssembler.getOutputCol)
  .setOutputCol("minmax_minm")
val gbt = new GBTRegressor()
  .setFeaturesCol(minmax.getOutputCol)
  .setLabelCol("class")
  .setPredictionCol("predictionCol")
  .setLossType("squared")
  .setStepSize(0.1)
  .setMaxDepth(5)
  .setMaxBins(32)
  .setSubsamplingRate(0.01)
  .setMaxIter(20)
val pipelineModel: PipelineModel = new Pipeline()
  .setStages(Array(indexer, onehot, vectorAssembler, minmax, gbt))
  .fit(df)
val frame = pipelineModel.transform(df)
frame.show(false)
frame.printSchema()
val pmml_1 = new PMMLBuilder(df.schema, pipelineModel).build
saveToLocalFile(pmml_1, "gbtTreeReg")
**But I get this:**
<DataField name="MMIN" optype="continuous" dataType="integer"/>
<DataField name="MMAX" optype="continuous" dataType="integer"/>
<DataField name="CACH" optype="continuous" dataType="integer"/>
<DataField name="class" optype="continuous" dataType="double"/>
</DataDictionary>`
Some feature columns are missing:
<DataDictionary>
<DataField name="MYCT" optype="continuous" dataType="integer"/>
<DataField name="MMIN" optype="continuous" dataType="integer"/>
<DataField name="MMAX" optype="continuous" dataType="integer"/>
<DataField name="CACH" optype="continuous" dataType="integer"/>
<DataField name="class" optype="continuous" dataType="double"/>
</DataDictionary>
This has been explained numerous times already - JPMML conversion libraries only retain those features (as DataDictionary/DataField elements) that are actually needed by the model for making a prediction.
In the current case, MYCT, MMIN, MMAX and CACH are necessary features (they are retained), whereas POOD, CHMIN and CHMAX are unnecessary features (they are discarded).
Main point - is the model making correct predictions or not?
In the current case, the model is making correct predictions. QED.
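To make the "only needed features are retained" rule concrete, here is a minimal sketch in plain Scala. It is a toy model, not JPMML-SparkML code: the `Node`, `Leaf`, `Split`, `UsedFeatures` and `Demo` names are hypothetical, and the two trees are invented for illustration rather than taken from the fitted model. The idea is simply that a field which no tree ever splits on cannot influence the prediction, so it can be dropped from the field list.

```scala
// Toy illustration only: NOT the JPMML-SparkML implementation.
// All type and object names here are hypothetical.
sealed trait Node
final case class Leaf(value: Double) extends Node
final case class Split(feature: String, threshold: Double, left: Node, right: Node) extends Node

object UsedFeatures {
  // Collect every feature name that appears in a split of a single tree.
  def inTree(node: Node): Set[String] = node match {
    case Leaf(_)           => Set.empty
    case Split(f, _, l, r) => inTree(l) ++ inTree(r) + f
  }

  // Union of the split features over all trees of the ensemble.
  def inEnsemble(trees: Seq[Node]): Set[String] =
    trees.flatMap(inTree).toSet

  // Keep the declared fields in their original order, dropping unused ones.
  def retained(declared: Seq[String], trees: Seq[Node]): Seq[String] = {
    val used = inEnsemble(trees)
    declared.filter(used.contains)
  }
}

object Demo {
  // Input columns from the schema in the question (label column omitted).
  val declared: Seq[String] =
    Seq("POOD", "MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX")

  // Two made-up trees that happen to split only on four of the columns.
  val trees: Seq[Node] = Seq(
    Split("MMAX", 0.5, Split("MYCT", 0.2, Leaf(1.0), Leaf(2.0)), Leaf(3.0)),
    Split("CACH", 0.1, Leaf(0.5), Split("MMIN", 0.3, Leaf(-1.0), Leaf(1.5)))
  )

  val kept: Seq[String]    = UsedFeatures.retained(declared, trees)
  val dropped: Seq[String] = declared.filterNot(kept.contains)
}
```

With these made-up trees, `Demo.kept` contains MYCT, MMIN, MMAX and CACH, and `Demo.dropped` contains POOD, CHMIN and CHMAX, mirroring the DataDictionary above. On the real pipeline, one way to check which inputs the fitted trees rely on is to look at `featureImportances` on the fitted `GBTRegressionModel` stage; entries with zero importance correspond to inputs the trees never split on.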
OK, thanks.
When I used Spark 2.4.6 to train a GBTRegressor model, I selected 10 feature columns and a label column. When I exported the model to PMML, the DataDictionary element of the PMML file was missing some feature columns.