Closed hjfrank1991 closed 4 years ago
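import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{MinMaxScaler, OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.jpmml.sparkml.PMMLBuilder

val schema = StructType(Array(
  StructField("POOD", StringType, nullable = false).withComment("feature"),
  StructField("MYCT", IntegerType, nullable = false).withComment("feature"),
  StructField("MMIN", IntegerType, nullable = false).withComment("feature"),
  StructField("MMAX", IntegerType, nullable = false).withComment("feature"),
  StructField("CACH", IntegerType, nullable = false).withComment("feature"),
  StructField("CHMIN", IntegerType, nullable = false).withComment("feature"),
  StructField("CHMAX", IntegerType, nullable = false).withComment("feature"),
  StructField("class", IntegerType, nullable = false).withComment("label")))
// df is a DataFrame read with this schema (loading code not shown).
// Column names marked "assumed" were not visible in the original post.
val indexer = new StringIndexer()
  .setInputCol("POOD")          // assumed: the only string column
  .setOutputCol("POOD_index")   // assumed name
val onehot = new OneHotEncoderEstimator()
  .setInputCols(Array(indexer.getOutputCol))
  .setOutputCols(Array(indexer.getOutputCol + "_output"))
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", indexer.getOutputCol + "_output"))
  .setOutputCol("assembled")    // assumed name
val minmax = new MinMaxScaler()
  .setInputCol(vectorAssembler.getOutputCol)
  .setOutputCol("features")     // assumed name
val gbt = new GBTRegressor()
  .setFeaturesCol(minmax.getOutputCol)
  .setLabelCol("class")
// A Pipeline must be fitted to obtain a PipelineModel.
val pipelineModel: PipelineModel = new Pipeline()
  .setStages(Array(indexer, onehot, vectorAssembler, minmax, gbt))
  .fit(df)
val frame = pipelineModel.transform(df)
val pmml_1 = new PMMLBuilder(df.schema, pipelineModel).build
saveToLocalFile(pmml_1, "gbtTreeReg")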
**But I get this:**
<DataField name="MYCT" optype="continuous" dataType="integer"/>
<DataField name="MMIN" optype="continuous" dataType="integer"/>
<DataField name="MMAX" optype="continuous" dataType="integer"/>
<DataField name="CACH" optype="continuous" dataType="integer"/>
<DataField name="class" optype="continuous" dataType="double"/>
Some feature columns are missing.
This has been explained numerous times already: JPMML conversion libraries retain only those features (as DataDictionary/DataField elements) that the model actually needs to make a prediction.
In the current case, MYCT, MMIN, MMAX and CACH are necessary features (retained), whereas POOD, CHMIN and CHMAX are unnecessary features (discarded).
Main point - is the model making correct predictions or not?
In the current case, the model is making correct predictions. QED.
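To check which columns were retained and which were discarded, one option is to read the DataField names out of the exported PMML and compare them with the training schema. The following is only a minimal sketch using the JDK's XML parser: the object name `RetainedFields` and the helper `dataFieldNames` are hypothetical, and the PMML string is a hand-trimmed fragment matching the DataField lines quoted in this thread, not a complete PMML file.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory

object RetainedFields {
  /** Returns the `name` attribute of every DataField element in a PMML document. */
  def dataFieldNames(pmmlXml: String): Seq[String] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(
      new ByteArrayInputStream(pmmlXml.getBytes(StandardCharsets.UTF_8)))
    val nodes = doc.getElementsByTagName("DataField")
    (0 until nodes.getLength).map { i =>
      nodes.item(i).getAttributes.getNamedItem("name").getNodeValue
    }
  }

  // Hand-trimmed fragment (assumption): only the DataDictionary part is kept.
  val samplePmml: String =
    """<PMML>
      |  <DataDictionary>
      |    <DataField name="MYCT" optype="continuous" dataType="integer"/>
      |    <DataField name="MMIN" optype="continuous" dataType="integer"/>
      |    <DataField name="MMAX" optype="continuous" dataType="integer"/>
      |    <DataField name="CACH" optype="continuous" dataType="integer"/>
      |    <DataField name="class" optype="continuous" dataType="double"/>
      |  </DataDictionary>
      |</PMML>""".stripMargin

  // Column names from the training schema posted in this issue.
  val schemaColumns: Seq[String] =
    Seq("POOD", "MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", "class")

  /** Schema columns that do not appear as DataField elements. */
  def dropped(pmmlXml: String, columns: Seq[String]): Seq[String] = {
    val retained = dataFieldNames(pmmlXml).toSet
    columns.filterNot(retained.contains)
  }

  def main(args: Array[String]): Unit = {
    println("retained: " + dataFieldNames(samplePmml).mkString(", "))
    println("dropped:  " + dropped(samplePmml, schemaColumns).mkString(", "))
  }
}
```

For the fragment above, the dropped columns are POOD, CHMIN and CHMAX, which matches the explanation: the fitted model does not use them, so the converter omits them.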
OK, thanks.
When I used Spark 2.4.6 to train a GBTRegressor model, I selected 10 feature columns and a label column. When I exported the model to PMML, the DataDictionary element of the PMML file was missing some of the feature columns.