Open liumy601 opened 2 years ago
i only have one categorical feature hour which doesn't contain missing values
What is your definition of a "missing value"? A Java null reference, Double.NaN (or Float.NaN value), or something else?
The JPMML-XGBoost library has been very thoroughly tested with continuous/categorical/missing/invalid data for 6+ years, without a single major issue. So, again, I must assume that the problem resides somewhere in your application code.
Please prepare & share a minimal reproducible example - a CSV data file plus an Apache Spark script (Scala or PySpark), which I can run and explore locally.
This project contains an integration test that uses sparse categorical data: https://github.com/jpmml/jpmml-sparkml-xgboost/blob/master/src/test/resources/XGBoostAudit.scala
This test is 100% reproducible.
I've tried SparseToDenseTransformer before, and I see that it fixes the inconsistency caused by sparse vectors. But my dataset is big and has over 28,000 feature dimensions, so with dense vectors the XGBoost model can't be trained successfully because it runs into memory problems.
I set the missing value to 0. In xgboost4j-spark 1.2.0, if I set missing to any other value, it gives an "XGBoost training failed" error.
i set missing value to 0
The DataField element for the "hour" column does not convey any information about the fact that in your case, the 0 value should be regarded as a missing value (and not as a numeric zero value).
How can the PMML engine make correct predictions if it is missing this critical piece of information?
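To illustrate why this piece of information matters, here is a toy sketch in Python. It is not JPMML or XGBoost code: the split threshold, the leaf scores and the default branch are all made up for illustration. It only shows that a single-split tree can score the same input 0 differently depending on whether 0 is declared as the missing value.

```python
import math


def is_missing(value, missing):
    """True if `value` should be treated as missing under the given marker."""
    if value is None:
        return True
    if isinstance(missing, float) and math.isnan(missing):
        return isinstance(value, float) and math.isnan(value)
    return value == missing


def stump_predict(hour, missing):
    """Toy one-split tree (hypothetical threshold and leaf scores)."""
    if is_missing(hour, missing):
        return 0.9  # assumed default branch taken for missing values
    return 0.1 if hour < 12 else 0.9


# Training side: 0 was declared as the missing marker.
print(stump_predict(0, missing=0))
# Scoring side without that declaration: 0 is an ordinary numeric zero.
print(stump_predict(0, missing=math.nan))
```

The two calls take different branches, which is the kind of mismatch that shows up as inconsistent predictions between the training framework and the PMML engine.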
Take the PMML document, and insert the following DataField/Value child element manually:
<DataField name="hour" optype="categorical" dataType="integer">
<!-- THIS -->
<Value property="missing" value="0"/>
</DataField>
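If hand-editing is inconvenient, the same change can be scripted as a post-processing step. The following is a minimal sketch using only the Python standard library; the function name mark_missing_value and the sample document are hypothetical, and it assumes the PMML file declares its usual default namespace on the root element.

```python
import xml.etree.ElementTree as ET


def mark_missing_value(pmml_text, field_name, missing_value):
    """Return PMML text in which `field_name` declares `missing_value` as missing."""
    root = ET.fromstring(pmml_text)
    # PMML elements live in a versioned namespace, e.g. {http://www.dmg.org/PMML-4_4}PMML
    ns = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""
    if ns:
        ET.register_namespace("", ns)  # keep it as the default namespace on output
    prefix = "{%s}" % ns if ns else ""
    for field in root.iter(prefix + "DataField"):
        if field.get("name") != field_name:
            continue
        already = any(
            v.get("property") == "missing" and v.get("value") == str(missing_value)
            for v in field.findall(prefix + "Value")
        )
        if not already:
            # Appended last, so it follows any existing child elements.
            ET.SubElement(field, prefix + "Value",
                          {"property": "missing", "value": str(missing_value)})
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed document used only to exercise the function.
sample = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="hour" optype="categorical" dataType="integer"/>
  </DataDictionary>
</PMML>"""

print(mark_missing_value(sample, "hour", 0))
```

Running the function a second time on its own output adds nothing, because it checks for an existing matching Value element first.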
It would be nice to automate the generation of extra DataField/Value@property="missing" etc. elements.
Here are some related feature requests: https://github.com/jpmml/jpmml-sparkml/issues/14 and https://github.com/jpmml/jpmml-sparkml/issues/25
Newer XGBoost versions also store this information in model dumps. Here's a related Scikit-Learn issue: https://github.com/jpmml/jpmml-sklearn/issues/166
Hi vruusmann,
Unfortunately, after I added the extra DataField/Value@property="missing" elements, the inconsistency still exists, which is frustrating. I've tried both xgboost4j-spark 0.82 and 1.2.0, and both give inconsistent predictions. I'm out of ideas now.
Hi vruusmann,
Sorry to disturb you again. I've been struggling with this inconsistency for several months. After checking the xgboost4j docs, I saw that some fixes for the missing value problem were made after version 0.9, so I upgraded xgboost4j-spark to 1.2.0 with Spark 3. But I still get inconsistent predictions.
As you can see, I only have one categorical feature, hour, which doesn't contain missing values. But if I remove the categorical feature and use only numeric features, the predictions are consistent.
Do you have any clues?