jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Proper representation of missing values when using XGBoost estimators #106

Closed liumy601 closed 3 years ago

liumy601 commented 3 years ago

@vruusmann I used jpmml-sparkml to export the XGBoost model and jpmml-evaluator to evaluate the same records. The features include both numeric and categorical features, but I found that the prediction results differ. After I kept only the numeric features, the prediction results were the same, so I guess there may be some problem with the processing of categorical features. I remember there were issues about this, but I can't find them. Can you please give me some guidance?

vruusmann commented 3 years ago

I used jpmml-sparkml to export the XGBoost model

Most commonly, this issue is about improper encoding of missing values.

I give it a >95% chance that the entire JPMML software stack is correct, and your application code is wrong. Prove me wrong by presenting a reproducible test case.

I'm closing this issue now, because it does not contain any actionable information.

liumy601 commented 3 years ago

@vruusmann The DataField definition looks like this:

<DataField name="c_vip_14_crowd" optype="categorical" dataType="string">
    <Value value="310505"/>
    <Value value="310205"/>
    <Value value="310101"/>
    <Value value="310305"/>
    <Value value="__unknown" property="invalid"/>
</DataField>

It treats missing values as __unknown.

Maybe this is defined by StringIndexer; in the StringIndexer I added setHandleInvalid("keep").

Except for this, I didn't do anything special for missing values.

vruusmann commented 3 years ago

It treats missing values as __unknown.

The above PMML fragment clearly states that __unknown values are interpreted as invalid values, not as missing values.

PMML treats valid, invalid and missing value spaces as completely disjoint.
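To make the disjointness concrete, here is a minimal sketch in plain Java (illustrative only; this is not the JPMML-Evaluator API, and the names are made up) of how a raw input value is routed into exactly one of the three value spaces declared by a DataField like the one above:

```java
import java.util.Set;

public class ValueSpaces {

    enum Space { VALID, INVALID, MISSING }

    // Illustrative only: classifies a raw input value against a DataField
    // that declares a list of valid values plus an invalid-value sentinel.
    static Space classify(String value, Set<String> validValues, String invalidSentinel){
        if(value == null){
            return Space.MISSING; // absent input -> missing value space
        } else if(value.equals(invalidSentinel)){
            return Space.INVALID; // declared via <Value value=".." property="invalid"/>
        } else if(validValues.contains(value)){
            return Space.VALID;
        }
        return Space.INVALID; // outside the declared valid space
    }

    public static void main(String[] args){
        Set<String> valid = Set.of("310505", "310205", "310101", "310305");
        System.out.println(classify("310505", valid, "__unknown")); // VALID
        System.out.println(classify("__unknown", valid, "__unknown")); // INVALID
        System.out.println(classify(null, valid, "__unknown")); // MISSING
    }
}
```

The key point is that each value lands in exactly one space; downstream treatments (replacement, rejection, etc.) are then chosen per space.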

If you change the DataField@property attribute value from invalid to missing, does the prediction come out correct or not?
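For illustration, the DataField fragment above would then read:

```xml
<DataField name="c_vip_14_crowd" optype="categorical" dataType="string">
    <Value value="310505"/>
    <Value value="310205"/>
    <Value value="310101"/>
    <Value value="310305"/>
    <Value value="__unknown" property="missing"/>
</DataField>
```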

liumy601 commented 3 years ago

@vruusmann Sorry, I just replaced all DataField@property values from invalid to missing. The predictions are still different, and there isn't much difference compared with using invalid.

vruusmann commented 3 years ago

there isn't much difference compared with using invalid.

The DataField element configuration may be overridden by the MiningField element configuration (all attributes related to missing and invalid value treatment/replacement).

Honestly, it's impossible for me to say anything worthwhile if your complaint/feedback is limited to "predictions don't match". Please provide a reproducible test case, or stfu.

liumy601 commented 3 years ago

@vruusmann In https://github.com/jpmml/jpmml-sparkml-xgboost/issues/1#issuecomment-603816378 you said we need to be careful when passing one-hot encoded features from Spark to XGBoost4J, as they treat 0, 1 and NaN differently. I'm not very clear about how to do this; right now I just pass one-hot features into XGBoost4J by default.

vruusmann commented 3 years ago

you said we need to be careful when passing one-hot encoded features from Spark to XGBoost4J, as they treat 0, 1 and NaN differently.

Correct - NaN is a special value for XGBoost, whereas for Apache Spark ML and Scikit-Learn it's a common space-filling value (e.g. in the case of sparse data matrices).

Your Apache Spark ML pipeline is set up incorrectly (it treats 0 and NaN synonymously, but it shouldn't do so), and it is making incorrect predictions. The conversion to PMML makes this incorrect setup visible, because PMML natively treats 0 as a valid value and NaN as an invalid/missing value.
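The distinction can be stated in plain Java (illustrative only, not XGBoost4J API): 0.0 is an ordinary value meaning "this category is absent", while Double.NaN means "this value is unknown", and the two are never interchangeable:

```java
public class ZeroVsNaN {

    public static void main(String[] args){
        double zero = 0.0d;
        double nan = Double.NaN;

        // 0.0 is an ordinary, comparable value
        System.out.println(zero == 0.0d); // true

        // NaN is not equal to anything, including itself;
        // it must be detected with Double.isNaN
        System.out.println(nan == nan); // false
        System.out.println(Double.isNaN(nan)); // true

        // A one-hot encoded slot must carry an explicit 0.0 for
        // "category absent"; a NaN in that slot tells XGBoost
        // "category membership unknown", which is a different statement.
    }
}
```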

I'm not very clear about how to do this; right now I just pass one-hot features into XGBoost4J by default.

See this example: https://github.com/jpmml/jpmml-sparkml-xgboost/blob/master/src/test/resources/XGBoostAudit.scala

Specifically, convert NaN values to 0 values by inserting an org.jpmml.sparkml.xgboost.SparseToDenseTransformer pipeline stage: https://github.com/jpmml/jpmml-sparkml-xgboost/blob/master/src/test/resources/XGBoostAudit.scala#L16
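A sketch of that transformer's business logic in plain Java (this is not the actual SparseToDenseTransformer implementation, just the effect it has on a single vector): a sparse vector stores only its active slots, and densifying writes an explicit 0.0 into every other slot, so downstream XGBoost4J never sees those slots as missing.

```java
import java.util.Arrays;

public class SparseToDense {

    // Expands a sparse vector (size + active indices + their values)
    // into a dense array; all unlisted slots become an explicit 0.0.
    static double[] toDense(int size, int[] indices, double[] values){
        double[] dense = new double[size]; // Java zero-initializes to 0.0
        for(int i = 0; i < indices.length; i++){
            dense[indices[i]] = values[i];
        }
        return dense;
    }

    public static void main(String[] args){
        // One-hot vector over 5 categories with slot 2 active
        double[] dense = toDense(5, new int[]{2}, new double[]{1.0d});
        System.out.println(Arrays.toString(dense)); // [0.0, 0.0, 1.0, 0.0, 0.0]
    }
}
```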

The above integration test proves that the JPMML software stack (converter + evaluator) is doing a correct job.

liumy601 commented 3 years ago

@vruusmann Sorry, after I added the SparseToDenseTransformer pipeline stage, the execution failed with an error. I remember I used that transformer before and it failed as well; it looks like a memory error. Are there any other solutions?

vruusmann commented 3 years ago

after I added the SparseToDenseTransformer pipeline stage, the execution failed with an error.

Converting a dataset from sparse to dense will increase its size a lot. So, Apache Spark raised some out-of-memory error?

are there any other solutions?

Look inside the SparseToDenseTransformer to understand its business logic. You'd need to achieve the same effect by other means then.

liumy601 commented 3 years ago

Hi @vruusmann, sorry for disturbing you again. Recently I tried this model again. When I apply SparseToDenseTransformer to some low-dimensional categorical features, I find that the PMML result is consistent with XGBoost4J. But the program is very slow when I use high-dimensional categorical features. Do you have any suggestions for applying that transformer to those high-dimensional features?