jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Use spark2.4.6 GBTRegressor export pmml model datadictionary tag missing some feature columns #100

Closed hjfrank1991 closed 4 years ago

hjfrank1991 commented 4 years ago

When I used Spark 2.4.6 to train a GBTRegressor model, I selected 10 feature columns and a label column. When I exported the fitted pipeline as a PMML model, the DataDictionary element of the PMML file was missing some of the feature columns.

hjfrank1991 commented 4 years ago

` val schema = StructType(Array(
  StructField("POOD", StringType, nullable = false).withComment("feature"),
  StructField("MYCT", IntegerType, nullable = false).withComment("feature"),
  StructField("MMIN", IntegerType, nullable = false).withComment("feature"),
  StructField("MMAX", IntegerType, nullable = false).withComment("feature"),
  StructField("CACH", IntegerType, nullable = false).withComment("feature"),
  StructField("CHMIN", IntegerType, nullable = false).withComment("feature"),
  StructField("CHMAX", IntegerType, nullable = false).withComment("feature"),
  StructField("class", IntegerType, nullable = false).withComment("label")))

val indexer = new StringIndexer()
  .setInputCol("POOD")
  .setOutputCol("label_index")
  .setHandleInvalid("keep")

val onehot = new OneHotEncoderEstimator()
  .setInputCols(Array(indexer.getOutputCol))
  .setOutputCols(Array(indexer.getOutputCol + "_output"))

val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", indexer.getOutputCol + "_output"))
  .setHandleInvalid("keep")
  .setOutputCol("vector_numcol")

val minmax = new MinMaxScaler()
  .setInputCol(vectorAssembler.getOutputCol)
  .setOutputCol("minmax_minm")

val gbt = new GBTRegressor()
  .setFeaturesCol(minmax.getOutputCol)
  .setLabelCol("class")
  .setPredictionCol("predictionCol")
  .setLossType("squared") 
  .setStepSize(0.1) 
  .setMaxDepth(5) 
  .setMaxBins(32) 
  .setSubsamplingRate(0.01) 
  .setMaxIter(20) 

val pipelineModel: PipelineModel = new Pipeline()
  .setStages(Array(indexer, onehot, vectorAssembler, minmax, gbt))
  .fit(df)

val frame = pipelineModel.transform(df)
frame.show(false)
frame.printSchema()
val pmml_1 = new PMMLBuilder(df.schema, pipelineModel).build

saveToLocalFile(pmml_1, "gbtTreeReg")

**But I get this:**

    <DataField name="MMIN" optype="continuous" dataType="integer"/>
    <DataField name="MMAX" optype="continuous" dataType="integer"/>
    <DataField name="CACH" optype="continuous" dataType="integer"/>
    <DataField name="class" optype="continuous" dataType="double"/>
</DataDictionary>`

Some feature columns are missing.

hjfrank1991 commented 4 years ago

<DataDictionary>
    <DataField name="MYCT" optype="continuous" dataType="integer"/>
    <DataField name="MMIN" optype="continuous" dataType="integer"/>
    <DataField name="MMAX" optype="continuous" dataType="integer"/>
    <DataField name="CACH" optype="continuous" dataType="integer"/>
    <DataField name="class" optype="continuous" dataType="double"/>
</DataDictionary>

vruusmann commented 4 years ago

This has been explained numerous times already: JPMML conversion libraries retain only those features (as DataDictionary/DataField elements) that the model actually needs for making a prediction.

In the current case, MYCT, MMIN, MMAX and CACH are necessary features (they are retained), whereas POOD, CHMIN and CHMAX are unnecessary features (they are discarded).
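
The pruning rule described above can be illustrated with a minimal, self-contained sketch. It is a toy model, not JPMML-SparkML's actual implementation: `Node`, `Split`, `Leaf` and `UsedFeatures` are hypothetical names, and the trees are invented for illustration rather than taken from the fitted model in this issue. It also treats POOD as a single feature slot, whereas the one-hot encoded column would really occupy several slots of the feature vector. The idea is simply that a feature survives only if some tree in the ensemble splits on it.

```scala
// Toy tree structure (hypothetical; not Spark's or JPMML's classes).
sealed trait Node
final case class Split(featureIndex: Int, threshold: Double, left: Node, right: Node) extends Node
final case class Leaf(prediction: Double) extends Node

object UsedFeatures {
  // Collect the index of every feature that appears in a split of this tree.
  def usedIndices(node: Node): Set[Int] = node match {
    case Split(i, _, l, r) => (usedIndices(l) ++ usedIndices(r)) + i
    case Leaf(_)           => Set.empty[Int]
  }

  // Keep only the feature names that at least one tree actually splits on,
  // preserving the original column order.
  def retained(featureNames: Seq[String], trees: Seq[Node]): Seq[String] = {
    val used = trees.flatMap(usedIndices).toSet
    featureNames.zipWithIndex.collect { case (name, i) if used(i) => name }
  }

  def main(args: Array[String]): Unit = {
    // Column names from the issue, in VectorAssembler order (POOD simplified to one slot).
    val names = Seq("MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", "POOD")
    // Invented trees that happen to split only on indices 0 to 3.
    val trees = Seq(
      Split(0, 100.0, Leaf(10.0), Split(2, 8000.0, Leaf(50.0), Leaf(200.0))),
      Split(1, 2000.0, Split(3, 16.0, Leaf(-5.0), Leaf(5.0)), Leaf(30.0))
    )
    println(retained(names, trees).mkString(", "))
  }
}
```

With these invented trees, CHMIN, CHMAX and POOD never appear in a split, so they are dropped from the retained list, which mirrors what happens to the DataDictionary in the exported PMML.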

vruusmann commented 4 years ago

Main point - is the model making correct predictions or not?

In the current case, the model is making correct predictions. QED.

hjfrank1991 commented 4 years ago

OK, thanks.