Spark 2.4.6 GBTRegressor PMML export: DataDictionary element is missing some feature columns

hjfrank1991 commented 4 years ago

I trained a GBTRegressor model with Spark 2.4.6, selecting 10 feature columns and one label column. When I exported the fitted pipeline to PMML, the DataDictionary element of the PMML file was missing some of the feature columns.

hjfrank1991 commented 4 years ago

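` import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{MinMaxScaler, OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.jpmml.sparkml.PMMLBuilder

// Input schema: seven feature columns and one label column ("class")
val schema = StructType(Array(
  StructField("POOD", StringType, nullable = false).withComment("feature"),
  StructField("MYCT", IntegerType, nullable = false).withComment("feature"),
  StructField("MMIN", IntegerType, nullable = false).withComment("feature"),
  StructField("MMAX", IntegerType, nullable = false).withComment("feature"),
  StructField("CACH", IntegerType, nullable = false).withComment("feature"),
  StructField("CHMIN", IntegerType, nullable = false).withComment("feature"),
  StructField("CHMAX", IntegerType, nullable = false).withComment("feature"),
  StructField("class", IntegerType, nullable = false).withComment("label")))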

val indexer = new StringIndexer()
  .setInputCol("POOD")
  .setOutputCol("label_index")
  .setHandleInvalid("keep")

val onehot = new OneHotEncoderEstimator()
  .setInputCols(Array(indexer.getOutputCol))
  .setOutputCols(Array(indexer.getOutputCol + "_output"))

val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", indexer.getOutputCol + "_output"))
  .setHandleInvalid("keep")
  .setOutputCol("vector_numcol")

val minmax = new MinMaxScaler()
  .setInputCol(vectorAssembler.getOutputCol)
  .setOutputCol("minmax_minm")

val gbt = new GBTRegressor()
  .setFeaturesCol(minmax.getOutputCol)
  .setLabelCol("class")
  .setPredictionCol("predictionCol")
  .setLossType("squared") 
  .setStepSize(0.1) 
  .setMaxDepth(5) 
  .setMaxBins(32) 
  .setSubsamplingRate(0.01) 
  .setMaxIter(20) 

val pipelineModel: PipelineModel = new Pipeline()
  .setStages(Array(indexer, onehot, vectorAssembler, minmax, gbt))
  .fit(df)

val frame = pipelineModel.transform(df)
frame.show(false)
frame.printSchema()
val pmml_1 = new PMMLBuilder(df.schema, pipelineModel).build

saveToLocalFile(pmml_1, "gbtTreeReg")

**But this is what I get:**

    <DataField name="MMIN" optype="continuous" dataType="integer"/>
    <DataField name="MMAX" optype="continuous" dataType="integer"/>
    <DataField name="CACH" optype="continuous" dataType="integer"/>
    <DataField name="class" optype="continuous" dataType="double"/>
</DataDictionary>`

Some feature columns are missing.

hjfrank1991 commented 4 years ago

<DataDictionary>
    <DataField name="MYCT" optype="continuous" dataType="integer"/>
    <DataField name="MMIN" optype="continuous" dataType="integer"/>
    <DataField name="MMAX" optype="continuous" dataType="integer"/>
    <DataField name="CACH" optype="continuous" dataType="integer"/>
    <DataField name="class" optype="continuous" dataType="double"/>
</DataDictionary>

vruusmann commented 4 years ago

This has been explained numerous times already: JPMML conversion libraries only retain those features (as DataDictionary/DataField elements) that the model actually needs for making a prediction.

In the current case, MYCT, MMIN, MMAX and CACH are necessary features (are retained) whereas POOD, CHMIN and CHMAX are unnecessary features (are discarded).
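
For readers who want to see the idea in code, here is a minimal sketch of that pruning rule in plain Scala. It is not the JPMML-SparkML implementation: `Node`, `Leaf`, `Split`, `FeaturePruning` and `Example` are hypothetical names, and the toy trees and thresholds are made up for illustration. It only shows that a declared column survives when at least one split in the ensemble tests it.

```scala
// Hypothetical, simplified tree types; these are NOT Spark or JPMML classes.
sealed trait Node
final case class Leaf(value: Double) extends Node
final case class Split(feature: String, threshold: Double, left: Node, right: Node) extends Node

object FeaturePruning {
  // Collect the names of all features tested by at least one split in a tree.
  def usedFeatures(node: Node): Set[String] = node match {
    case Leaf(_) => Set.empty[String]
    case Split(feature, _, left, right) =>
      (usedFeatures(left) ++ usedFeatures(right)) + feature
  }

  // Keep only the declared columns that some tree in the ensemble uses,
  // preserving the original column order.
  def retainedFields(declared: Seq[String], trees: Seq[Node]): Seq[String] = {
    val used = trees.flatMap(usedFeatures).toSet
    declared.filter(used.contains)
  }
}

object Example {
  // Feature columns from the schema in this issue.
  val declared: Seq[String] =
    Seq("POOD", "MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX")

  // Toy two-tree ensemble (made-up splits) that never tests POOD, CHMIN or CHMAX.
  val trees: Seq[Node] = Seq(
    Split("MMAX", 8000.0, Split("MYCT", 100.0, Leaf(1.0), Leaf(2.0)), Leaf(3.0)),
    Split("CACH", 16.0, Leaf(0.5), Split("MMIN", 2000.0, Leaf(1.5), Leaf(2.5)))
  )

  // Only the columns that appear in some split are kept.
  val retained: Seq[String] = FeaturePruning.retainedFields(declared, trees)
}
```

With this toy ensemble, `Example.retained` contains MYCT, MMIN, MMAX and CACH, which mirrors the DataDictionary above. On the Spark side, `featureImportances` on the fitted `GBTRegressionModel` should give a similar picture, since a feature that is never used in a split gets zero importance.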

vruusmann commented 4 years ago

Main point - is the model making correct predictions or not?

In the current case, the model is making correct predictions. QED.

hjfrank1991 commented 4 years ago

OK, thanks.

jpmml / jpmml-sparkml

Spark 2.4.6 GBTRegressor PMML export: DataDictionary element is missing some feature columns #100