jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Support for `XGBoostRegressor.missing` property #127

Closed xianlin666 closed 1 year ago

xianlin666 commented 1 year ago

I trained an XGBoost model in Scala and recorded its predictions on the test data before exporting it to PMML. After importing the PMML model and predicting on the same test data, I got totally different results. I don't know what kind of situation could lead to this. Training code:

val xgboost = new XGBoostRegressor()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMissing(0.0F)
  .setMaxDepth(5)
  .setNumRound(100)
  .setColsampleBylevel(0.8)
  .setColsampleBytree(0.8)
val pipeline = new Pipeline().setStages(Array(vectorAssembler, xgboost))

println("Start fitting...")

val model = pipeline.fit(sample_train_data_)

// Get the outputs: out1 is the test-set prediction, out2 is the seed-sample prediction data
val schema = sample_train_data_.schema
val pmml: PMML = new PMMLBuilder(schema, model).build()
val hadoopConf = new Configuration()  // Load the Hadoop configuration
val fs = FileSystem.get(hadoopConf)   // Get the file system from the configuration
val fpath = new Path(path)            // Convert the target file string to a path on HDFS
if (fs.exists(fpath)) {
  fs.delete(fpath, true)
}
val fout: FSDataOutputStream = fs.create(fpath)  // Create the target file on HDFS

JAXBUtil.marshalPMML(pmml, new StreamResult(fout))

Import and prediction code:

val fs: FileSystem = FileSystem.get(new Configuration())
val pmml = fs.open(new Path("hdfs://ns00/lxl/xgb_pmml/xgb_model_3.pmml"))
val evaluator: ModelEvaluator[_] = new LoadingModelEvaluatorBuilder().load(pmml).build()
val modelBuilder: TransformerBuilder = new TransformerBuilder(evaluator)
  .withTargetCols()
  .withOutputCols()
  .exploded(true)
val model1: Transformer = modelBuilder.build()
val out1 = model1.transform(last)

vruusmann commented 1 year ago

> train code
>
> val xgboost = new XGBoostRegressor()
>   .setMissing(0.0F)

It's probably the XGBoostRegressor.missing property that is causing this - it's not converted automatically.

Open your PMML file in a text editor and check whether all continuous field declarations contain a DataField/Value child element, as shown below:

<DataField name="myfield">
  <Value value="0.0" property="missing"/>
</DataField>

If they are missing, then you might try adding them manually, and re-run the prediction - the results should be correct now.
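
For a PMML document with many fields, the manual edit described above could be scripted. The following is only a minimal sketch and not part of JPMML-SparkML: it relies on the JDK's DOM API alone, and the `MissingValuePatcher` object, its method names, and the `exclude` parameter are hypothetical names chosen for illustration. It appends a `Value` element with `property="missing"` to every `DataField` whose `optype` is `continuous`, skipping excluded fields (for example the target) and fields that already declare the same missing value.

```scala
import java.io.{StringReader, StringWriter}
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.transform.TransformerFactory
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult
import org.w3c.dom.{Document, Element}
import org.xml.sax.InputSource

// Hypothetical helper (not a JPMML API): patches DataField declarations in a PMML document.
object MissingValuePatcher {

  /** Parses a PMML document held in a string, with namespace support enabled. */
  def parse(xml: String): Document = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)))
  }

  /** True if the DataField already declares `value` as a missing value. */
  private def hasMissingValue(field: Element, value: String): Boolean = {
    val children = field.getChildNodes
    (0 until children.getLength).map(i => children.item(i)).exists {
      case e: Element =>
        e.getLocalName == "Value" &&
          e.getAttribute("value") == value &&
          e.getAttribute("property") == "missing"
      case _ => false
    }
  }

  /**
   * Appends <Value value="..." property="missing"/> to every continuous DataField
   * whose name is not in `exclude`. Returns the number of fields that were patched.
   */
  def addMissingValue(doc: Document, value: String, exclude: Set[String] = Set.empty): Int = {
    val fields = doc.getElementsByTagNameNS("*", "DataField")
    var patched = 0
    for (i <- 0 until fields.getLength) {
      val field = fields.item(i).asInstanceOf[Element]
      val isContinuous = field.getAttribute("optype") == "continuous"
      val isExcluded = exclude.contains(field.getAttribute("name"))
      if (isContinuous && !isExcluded && !hasMissingValue(field, value)) {
        // Create the new element in the same namespace as its parent DataField
        val valueElement = doc.createElementNS(field.getNamespaceURI, "Value")
        valueElement.setAttribute("value", value)
        valueElement.setAttribute("property", "missing")
        field.appendChild(valueElement)
        patched += 1
      }
    }
    patched
  }

  /** Serializes the document back to an XML string. */
  def serialize(doc: Document): String = {
    val writer = new StringWriter()
    TransformerFactory.newInstance().newTransformer()
      .transform(new DOMSource(doc), new StreamResult(writer))
    writer.toString
  }
}
```

With the pipeline from this issue, a call such as `MissingValuePatcher.addMissingValue(doc, "0.0", Set("label"))` would mirror `setMissing(0.0F)` while leaving the label field untouched. The value string should match whatever was passed to `setMissing` during training.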

vruusmann commented 1 year ago

What are your Apache Spark and JPMML-SparkML versions?

I refactored XGBoost missing value handling in JPMML-XGBoost 1.7.1: https://github.com/jpmml/jpmml-xgboost/commit/57192fb9835af9cf9fd8974034afcf76fc107d17

The newly introduced org.jpmml.xgboost.HasXGBoostOptions#OPTION_MISSING conversion option is not integrated into the JPMML-SparkML library yet.

xianlin666 commented 1 year ago

But I have not found any element like missing in my PMML file. There is only a "missingValueStrategy" attribute in the Segment, like this. Will it influence the result?

<Segment id="11">
                <True/>
                <TreeModel functionName="regression" missingValueStrategy="defaultChild" splitCharacteristic="binarySplit" x-mathContext="float">
                    <MiningSchema>
                        <MiningField name="float(Min_value_sent)"/>

My Spark version is 2.4.8, and jpmml-sparkml is 1.5.14.

vruusmann commented 1 year ago

> But I have not found any element like missing in my PMML file.

If there are no DataField/Value@property="missing" elements in your PMML document, then it means that the (J)PMML evaluator is not instructed to "re-classify" the 0.0 value from the valid value space to the missing value space. Sounds logical, no?

> There is only a "missingValueStrategy" attribute in the Segment, like this. Will it influence the result?

The TreeModel@missingValueStrategy attribute instructs what to do about a missing model prediction. It does not interact with model inputs (aka features) in any way.

> My Spark version is 2.4.8, and jpmml-sparkml is 1.5.14.

That's a really old version, which is no longer supported/maintained by me.

When I implement a fix for this issue, you will need to back-port it to the 1.5.X branch manually.

vruusmann commented 1 year ago

> train code
>
> val xgboost = new XGBoostRegressor()
>   .setMissing(0.0F)

In the meantime, it should be possible to make the (J)PMML prediction come out correct if you replace 0.0 values with Double.NaN values in your test set.

Something like:

// Spark Scala API; "*" applies the replacement to all numeric columns
val test_df_nan = test_df.na.replace("*", Map(0.0 -> Double.NaN))

The DataField/Value element would do this automatically inside the model, but since it's currently unavailable for you, you could do it manually outside of the model.

xianlin666 commented 1 year ago

Thanks for your patient answers. I have solved this by manually adding <Value value="0.0" property="missing"/> to my PMML file. Now I get the expected result when importing the PMML model.