jpmml / jpmml-sparkml-xgboost

JPMML-SparkML plugin for converting XGBoost4J-Spark models to PMML
GNU Affero General Public License v3.0

Support for `missing` attribute #19

Open liumy601 opened 2 years ago

liumy601 commented 2 years ago

Hi vruusmann,

Sorry to bother you again. I've been struggling with this inconsistency problem for several months. After checking the xgboost4j documentation, I saw that since version 0.9 they have made some fixes to the missing value handling, so I upgraded xgboost4j-spark to 1.2.0 with Spark 3. However, I still get inconsistent predictions.

[image]

You can see that I only have one categorical feature, hour, which doesn't contain missing values. However, if I remove the categorical feature and use only numeric features, the predictions are consistent.

Do you have any clues?

vruusmann commented 2 years ago

> I only have one categorical feature, hour, which doesn't contain missing values

What is your definition of a "missing value"? A Java null reference, Double.NaN (or Float.NaN value), or something else?

The JPMML-XGBoost library has been very thoroughly tested with continuous/categorical/missing/invalid data for 6+ years, without a single major issue. So, again, I must assume that the problem resides somewhere in your application code.

Please prepare & share a minimal reproducible example - a CSV data file plus an Apache Spark script (Scala or PySpark), which I can run and explore locally.

vruusmann commented 2 years ago

This project contains an integration test that uses sparse categorical data: https://github.com/jpmml/jpmml-sparkml-xgboost/blob/master/src/test/resources/XGBoostAudit.scala

This test is 100% reproducible.

liumy601 commented 2 years ago

> This project contains an integration test that uses sparse categorical data: https://github.com/jpmml/jpmml-sparkml-xgboost/blob/master/src/test/resources/XGBoostAudit.scala
>
> This test is 100% reproducible.

I've tried SparseToDenseTransformer before, and it does fix the inconsistency caused by sparse vectors. However, my dataset is large and has more than 28000 feature dimensions, so with dense vectors the XGBoost model can't be trained successfully because of memory problems.
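To make the sparse-versus-dense difference concrete, here is a minimal pure-Python sketch (no Spark or XGBoost involved). The `densify` and `missing_mask` helpers and the example vector are hypothetical, written only for illustration. The point is that elements absent from a sparse vector read back as 0.0 once densified, so a 0 sentinel for `missing` cannot tell "absent" from "really zero". The per-row size estimate assumes 8-byte doubles.

```python
# Toy model of a sparse feature vector: a size plus {index: value} for stored entries.
# This mimics the idea of a sparse vector; it is not the Spark ML API.

def densify(size, entries):
    """Expand a sparse {index: value} mapping to a dense list; absent slots become 0.0."""
    dense = [0.0] * size
    for index, value in entries.items():
        dense[index] = value
    return dense

def missing_mask(dense, missing):
    """Flag the slots that would be treated as missing for a given sentinel value."""
    if missing != missing:  # NaN sentinel: NaN is the only value unequal to itself
        return [v != v for v in dense]
    return [v == missing for v in dense]

size = 6
entries = {1: 3.5, 4: 0.0}  # index 4 is an explicitly stored zero
dense = densify(size, entries)

# With missing=0, the explicit zero at index 4 is indistinguishable from the absent slots.
mask_zero = missing_mask(dense, 0.0)
# With missing=NaN, nothing in the densified vector counts as missing.
mask_nan = missing_mask(dense, float("nan"))

# Rough dense storage cost for one row of 28000 features at 8 bytes per double.
bytes_per_dense_row = 28000 * 8

print(dense)
print(mask_zero)
print(mask_nan)
print(bytes_per_dense_row)
```

Under these assumptions a single dense row of 28000 doubles already takes about 224 KB before any framework overhead, which is consistent with densification becoming a memory concern on a large dataset.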

liumy601 commented 2 years ago

> > I only have one categorical feature, hour, which doesn't contain missing values
>
> What is your definition of a "missing value"? A Java null reference, Double.NaN (or Float.NaN value), or something else?
>
> The JPMML-XGBoost library has been very thoroughly tested with continuous/categorical/missing/invalid data for 6+ years, without a single major issue. So, again, I must assume that the problem resides somewhere in your application code.
>
> Please prepare & share a minimal reproducible example - a CSV data file plus an Apache Spark script (Scala or PySpark), which I can run and explore locally.

I set the missing value to 0. In xgboost4j-spark 1.2.0, if I set `missing` to any other value, XGBoost training fails with an error.

vruusmann commented 2 years ago

> I set missing value to 0

The DataField element for the "hour" column does not convey any information about the fact that in your case, the 0 value should be regarded as a missing value (and not as a numeric zero value).

How can the PMML engine make correct predictions if it is missing this critical piece of information?
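A minimal sketch of why this matters, in plain Python rather than any real PMML engine or XGBoost code: a tree split sends missing inputs down a default branch, and non-missing inputs down the branch chosen by the comparison. The threshold, the default direction, and the leaf scores below are made-up values for illustration only.

```python
# Toy single-split "tree". All numbers are hypothetical, chosen for illustration.
THRESHOLD = 0.5          # split condition: hour < 0.5
LEFT_SCORE = -0.2        # leaf reached when the condition is true
RIGHT_SCORE = 0.7        # leaf reached when the condition is false
DEFAULT_IS_LEFT = False  # branch taken by missing values

def score(hour, missing_values=()):
    """Evaluate the split, treating None and any value in missing_values as missing."""
    if hour is None or hour in missing_values:
        return LEFT_SCORE if DEFAULT_IS_LEFT else RIGHT_SCORE
    return LEFT_SCORE if hour < THRESHOLD else RIGHT_SCORE

# Training side: 0 was declared as the missing value, so hour=0 follows the default branch.
trained_view = score(0, missing_values=(0,))

# Scoring side without the declaration: 0 is an ordinary number and satisfies hour < 0.5.
undeclared_view = score(0)

print(trained_view, undeclared_view)
```

In this toy setup the same input lands in different leaves depending on whether the evaluator knows that 0 means "missing", while any other value (for example `score(5)`) is scored identically either way.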

Take the PMML document, and insert the following DataField/Value child element manually:

<DataField name="hour" optype="categorical" dataType="integer">
  <!-- THIS -->
  <Value property="missing" value="0"/>
</DataField>
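
For anyone who prefers to script this manual edit, the following is a rough sketch using only the Python standard library. The `declare_missing` helper is hypothetical (it is not part of any JPMML tool), and the namespace URI and version in `sample` are assumptions standing in for a real converter output. The new element is appended after any existing children so that it stays at the end of the DataField content.

```python
import xml.etree.ElementTree as ET

def declare_missing(pmml_text, field_name, missing_value):
    """Return PMML text where the named DataField declares missing_value as missing."""
    root = ET.fromstring(pmml_text)
    # Reuse the root's namespace (if any) so the new element matches the document.
    namespace = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    if namespace:
        # Serialize with a default namespace instead of generated ns0: prefixes.
        ET.register_namespace("", namespace[1:-1])
    for field in root.iter(namespace + "DataField"):
        if field.get("name") != field_name:
            continue
        already_declared = any(
            v.get("property") == "missing" and v.get("value") == missing_value
            for v in field.findall(namespace + "Value")
        )
        if not already_declared:
            value = ET.Element(
                namespace + "Value",
                {"property": "missing", "value": missing_value},
            )
            field.append(value)
    return ET.tostring(root, encoding="unicode")

# Assumed stand-in for a converter-generated document.
sample = (
    '<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">'
    "<DataDictionary>"
    '<DataField name="hour" optype="categorical" dataType="integer"/>'
    "</DataDictionary>"
    "</PMML>"
)

patched = declare_missing(sample, "hour", "0")
print(patched)
```

Calling the helper a second time on its own output leaves the document unchanged, because the declaration is only added when it is not already present.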

vruusmann commented 2 years ago

It would be nice to automate the generation of extra DataField/Value@property="missing" etc elements.

Here are some related feature requests: https://github.com/jpmml/jpmml-sparkml/issues/14 and https://github.com/jpmml/jpmml-sparkml/issues/25

Newer XGBoost versions also store this information in model dumps. Here's a related Scikit-Learn issue: https://github.com/jpmml/jpmml-sklearn/issues/166

liumy601 commented 2 years ago

Hi vruusmann,

Unfortunately, after I added the extra DataField/Value@property="missing" elements, the inconsistency still exists, and I'm frustrated. I've tried both xgboost4j-spark 0.82 and 1.2.0, and both give inconsistent predictions. I'm out of ideas now.