I used jpmml-sparkml to export the XGBoost model
Most commonly, this issue is about improper encoding of missing values.
I give it a >95% chance that the whole JPMML software stack is correct, and that your application code is wrong. Prove me wrong by presenting a reproducible test case.
I'm closing this issue now, because it does not contain any actionable information.
@vruusmann The DataField definition looks like this:
```xml
<DataField name="c_vip_14_crowd" optype="categorical" dataType="string">
  <Value value="310505"/>
  <Value value="310205"/>
  <Value value="310101"/>
  <Value value="310305"/>
  <Value value="__unknown" property="invalid"/>
</DataField>
```
It treats missing values as __unknown.
Maybe this is defined by the StringIndexer; in the StringIndexer, I added setHandleInvalid("keep").
Apart from that, I didn't do anything special for missing values.
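For reference, a stage of that kind would look roughly like this (a minimal sketch; the output column name is illustrative and not taken from the actual pipeline):

```scala
import org.apache.spark.ml.feature.StringIndexer

// StringIndexer with handleInvalid = "keep" maps unseen/invalid categories to an
// extra index instead of failing; jpmml-sparkml then lists that extra category in
// the DataField, as with the "__unknown" Value shown above.
val indexer = new StringIndexer()
  .setInputCol("c_vip_14_crowd")       // field from the PMML fragment above
  .setOutputCol("c_vip_14_crowd_idx")  // illustrative output column name
  .setHandleInvalid("keep")
```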
It treats missing values as __unknown.
The above PMML fragment clearly states that __unknown values are interpreted as invalid values, not as missing values. PMML treats the valid, invalid and missing value spaces as completely disjoint.
If you change the DataField@property attribute value from invalid to missing, does the prediction come out correct or not?
@vruusmann Sorry, I just replaced all DataField@property values from invalid to missing; the predictions are still different, and there isn't much difference compared with using invalid.
there isn't much difference compared with using invalid.
The DataField element configuration may be overridden by the MiningField element configuration (all attributes related to missing and invalid value treatment/replacement).
Honestly, it's impossible for me to say anything worthwhile if your complaint/feedback is limited to "predictions don't match". Please provide a reproducible test case, or stfu.
@vruusmann In https://github.com/jpmml/jpmml-sparkml-xgboost/issues/1#issuecomment-603816378 you said we need to be careful when passing one-hot encoded features from Spark to XGBoost4J, as they treat 0, 1 and NaN differently. I'm not very clear about how to do that; right now I just pass one-hot features into XGBoost4J by default.
you said we need to be careful when passing one-hot encoded features from Spark to XGBoost4J, as they treat 0, 1 and NaN differently.
Correct - NaN is a special value for XGBoost, whereas for Apache Spark ML and Scikit-Learn it's a common space-filling value (e.g. in the case of sparse data matrices).
Your Apache Spark ML pipeline is set up incorrectly (it treats 0 and NaN synonymously, but it shouldn't do so), and it is making incorrect predictions. The conversion to PMML makes this incorrect setup visible, because PMML natively treats 0 as a valid value and NaN as an invalid/missing value.
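To illustrate the distinction (a minimal sketch, not part of the original exchange): the unstored entries of a Spark ML sparse vector are implicitly 0.0, i.e. ordinary valid values, whereas XGBoost4J interprets its missing value (NaN by default) as the absence of a value.

```scala
import org.apache.spark.ml.linalg.Vectors

// In Spark ML, entries that are not stored in a sparse vector are plain 0.0 values.
val v = Vectors.sparse(4, Seq((1, 1.0)))
println(v(0))       // 0.0 -- a regular, valid value as far as Spark ML is concerned
println(v.toDense)  // [0.0,1.0,0.0,0.0]

// XGBoost4J, in contrast, treats its `missing` value (NaN by default) as
// "no value at all", so 0.0 and NaN can take different paths through the trees.
```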
I'm not very clear about how to do that; right now I just pass one-hot features into XGBoost4J by default.
See this example: https://github.com/jpmml/jpmml-sparkml-xgboost/blob/master/src/test/resources/XGBoostAudit.scala
Specifically, convert NaN values to 0 values by inserting an org.jpmml.sparkml.xgboost.SparseToDenseTransformer pipeline stage:
https://github.com/jpmml/jpmml-sparkml-xgboost/blob/master/src/test/resources/XGBoostAudit.scala#L16
The above integration test proves that the JPMML software stack (converter + evaluator) is doing correct job.
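For orientation, the relevant wiring looks roughly like this (a sketch, assuming SparseToDenseTransformer exposes the usual input/output column setters as used in the linked XGBoostAudit.scala; column names and classifier settings are illustrative):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.jpmml.sparkml.xgboost.SparseToDenseTransformer

// Assemble the (possibly sparse) feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("c_vip_14_crowd_idx" /* plus the numeric feature columns */))
  .setOutputCol("featureVector")

// Densify it, so that XGBoost4J sees explicit 0.0 values instead of missing (NaN) entries.
val sparse2dense = new SparseToDenseTransformer()
  .setInputCol("featureVector")
  .setOutputCol("denseFeatureVector")

val classifier = new XGBoostClassifier()
  .setLabelCol("label")
  .setFeaturesCol("denseFeatureVector")

val pipeline = new Pipeline()
  .setStages(Array(assembler, sparse2dense, classifier))
```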
@vruusmann Sorry, after I added the SparseToDenseTransformer pipeline stage, the execution fails with an error. I remember I used that transformer before and it failed too; it looks like a memory error. Are there any other solutions?
after I added the SparseToDenseTransformer pipeline stage, the execution fails with an error.
Converting a dataset from sparse to dense will increase its size a lot. So, Apache Spark raised some out-of-memory error?
Are there any other solutions?
Look inside the SparseToDenseTransformer to understand its business logic. You'd need to achieve that by other means then.
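For reference, the same densification could also be expressed by hand as a Spark SQL UDF (a sketch; `assembled` stands for the DataFrame produced by the VectorAssembler stage, and this still materializes dense vectors, so it carries the same memory cost):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Replace the sparse feature vector with its dense equivalent, so that unstored
// entries become explicit 0.0 values before they reach XGBoost4J.
val toDense = udf { v: Vector => v.toDense }

val densified = assembled.withColumn("featureVector", toDense(col("featureVector")))
```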
Hi vruusmann, sorry for disturbing you again. Recently I tried this model again: when I apply SparseToDenseTransformer to some low-dimensional categorical features, I find the PMML result is consistent with XGBoost4J. But the program is very slow when I use high-dimensional categorical features. Do you have any suggestions for applying that transformer to those high-dimensional features?
@vruusmann I used jpmml-sparkml to export the XGBoost model and jpmml-evaluator to evaluate the same records. The features include both numeric and categorical features, but I found the prediction results are different. After I kept only numeric features, the prediction results are the same, so I guess there may be some problem with the processing of categorical features. I remember there were issues about this, but I can't find them. Can you please give me some guidance?