Closed haofengrushui204 closed 5 years ago
Can you show me your Apache Spark pipeline definition?
You appear to be using a model chain. The first model is the GBT model (Segment@id=1), which is then followed by some other model (deleted from the above PMML snippet).
The most likely explanation is that you're using the label column as an input column to the GBT model.
OK, thanks for your reply. This is my code:
def train(trainRDD: RDD[LabeledPoint], iterNum: Int, spark: SparkSession): Unit = {
  import spark.implicits._

  val features_size = trainRDD.take(1)(0).features.size
  val featureNames = (0 until features_size).map(x => s"feature_$x") :+ "label"
  val schemaAnalysis = featureNames.map(featureName => StructField(featureName, DoubleType)).toArray
  val trainDFTmp = trainRDD.map { case LabeledPoint(label, features) =>
    val seq: Seq[Double] = features.toArray.toSeq
    Row.fromSeq(seq :+ label)
  }
  val trainDF = spark.createDataFrame(trainDFTmp, new StructType(schemaAnalysis))

  val labelIndexer = new StringIndexer()
    .setInputCol("label")
    .setOutputCol("indexedLabel")
  val vectorAssember = new VectorAssembler()
    .setInputCols(featureNames.toArray)
    .setOutputCol("features")
  // val featuresIndexer = new VectorIndexer()
  //   .setInputCol("features")
  //   .setOutputCol("indexedFeatures")
  //   .setMaxCategories(10)
  val gbt = new GBTClassifier()
    .setLabelCol("indexedLabel")
    .setFeaturesCol("features")
    .setMaxDepth(3)
    .setMaxIter(iterNum)
  // Convert indexed labels back to original labels.
  // val labelConverter = new IndexToString()
  //   .setInputCol("prediction")
  //   .setOutputCol("predictedLabel")
  //   .setLabels(labelIndexer.labels)
  val pipeline = new Pipeline()
    .setStages(Array(labelIndexer, vectorAssember, gbt))
  val pipelineModel = pipeline.fit(trainDF)
  // HDFSUtils.deleteDir("hdfs://bitautodmp/data/datamining/ctr/model_gbdt")
  // pipelineModel.save("hdfs://bitautodmp/data/datamining/ctr/model_gbdt")
  // val pipelineModel = PipelineModel.load("hdfs://bitautodmp/data/datamining/ctr/model_gbdt")

  /**
   * Write the model in PMML format to HDFS.
   */
  val pmml = ConverterUtil.toPMML(trainDF.schema, pipelineModel)
  HDFSUtils.deleteDir(modelPmmlPath)
  val fs: FileSystem = FileSystem.get(new Configuration())
  val path = new Path(modelPmmlPath)
  val out = fs.create(path)
  MetroJAXBUtil.marshalPMML(pmml, out)
}
val featureNames = (0 until features_size).map(x => s"feature_$x") :+ "label"
I'm not very familiar with the Scala language, but is it possible that the above line of code builds a collection of strings whose last element is "label"?
This collection is then used to define the VectorAssembler input columns, which would explain how/why the GBT model ends up including the "label" column as a regular input column.
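For reference, the collection-building line could be split so that "label" never reaches the assembler's input columns. A minimal sketch in plain Scala (here `features_size` is fixed to 3 for illustration, instead of being read from the RDD):

```scala
val features_size = 3 // stand-in for trainRDD.take(1)(0).features.size

// What the original line builds: the feature names PLUS "label".
val withLabel = (0 until features_size).map(x => s"feature_$x") :+ "label"

// Keeping the two roles apart: only these names should be passed
// to VectorAssembler.setInputCols(...).
val featuresOnly = (0 until features_size).map(x => s"feature_$x")
```

The `withLabel` collection is still useful for building the DataFrame schema, but `featuresOnly` is what the assembler should see.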
It would be nice if the JPMML-SparkML library performed some additional sanity checks on the model schema definition: it should throw an exception if the same column is used both as a label and as a feature.
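Such a check could look roughly like the following (a sketch only; this helper does not exist in JPMML-SparkML, and its name is made up):

```scala
// Hypothetical sanity check: refuse a schema in which the label column
// is also listed among the feature input columns.
def checkNoLabelAmongFeatures(labelCol: String, featureCols: Seq[String]): Unit =
  if (featureCols.contains(labelCol))
    throw new IllegalArgumentException(
      s"Column '$labelCol' is used both as the label and as a feature")
```

Calling it with the buggy column list from above, e.g. `checkNoLabelAmongFeatures("label", Seq("feature_0", "label"))`, would throw.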
OK, thanks very much. As you said, I made a mistake.
Hello, my PMML is as follows. I do not understand why the "label" field has usageType="target" in the top-level MiningSchema, but is an active field in MiningModel/Segmentation/Segment[@id=1]?