jpmml / r2pmml

R library for converting R models to PMML

xgboost PMML incorrect thresholding for negative integers #26

Closed: wynag01 closed this issue 6 years ago

wynag01 commented 6 years ago

After debugging differences between the R model's predictions and those of the generated PMML file, I found that the PMML file does not assign the threshold correctly for negative integer fields. The file also does not always round up integer thresholds correctly. For example, if the threshold on an integer field is 4.5, the PMML file rounds it up to 5 for splitting into "lessThan" or "greaterOrEqual", which makes sense. But even when the threshold is exactly 4, the PMML file still rounds it up to 5, which is not correct.

vruusmann commented 6 years ago

The rounding is applied to integer columns only. You may be able to work around the issue temporarily by changing the data type of the affected features from integer to float in your feature map specification.

The integer rounding algorithm should match that of the XGBoost codebase. JPMML-XGBoost currently implements "round up" (aka round towards positive infinity). But your analysis suggests that it should be "round away from zero" (i.e. round negative values towards negative infinity, and positive values towards positive infinity).
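As a rough sketch of that workaround (assuming genFMap() returns a data frame whose type column holds XGBoost feature-map codes, where "int" means integer and "q" means quantitative/float; df stands for your data frame of training features):

df.fmap = genFMap(df)
# Flip only the integer columns to quantitative/float; indicator ("i")
# columns and existing "q" columns are left untouched
df.fmap$type = ifelse(df.fmap$type == "int", "q", df.fmap$type)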

wynag01 commented 6 years ago

Hi vruusmann, actually my example used positive integer thresholds that have no decimal places. When there are no decimal places, the threshold should be kept as it is. What I meant in addition is that negative values should be rounded up, consistent with the XGBoost codebase; currently they are not. For example, if the threshold recorded in R is -5.5, it is strangely rounded to -4 instead of -5.

Also, I tried changing the feature map to floats, but I am getting an error. Would the following be the right way to do it?

xg.fmap = genFMap(segment)
xg.fmap$type = "q"
r2pmml(xg.model, fmap=xg.fmap, paste0('xgModel.pmml'))

vruusmann commented 6 years ago

The representation of integer split conditions: https://github.com/dmlc/xgboost/blob/master/src/tree/tree_model.cc#L71

So, the algorithm is (int)(split_cond + 1.0f): instead of performing normal rounding (e.g. as implemented by the Math.round(float) method), it simply casts the float value to an integer value.
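Transcribed into R for illustration (as.integer() truncates toward zero, just like the C cast), this reproduces both behaviors reported above:

adjust = function(split_cond) as.integer(split_cond + 1)

adjust(4.5)   # 5  - a fractional threshold is rounded up, as expected
adjust(4)     # 5  - a whole-number threshold gets bumped up too
adjust(-5.5)  # -4 - truncation toward zero yields -4, not -5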

wynag01 commented 6 years ago

Okay, but what is the rationale for producing a PMML file that uses different thresholds than the R object? This causes divergence in prediction results and makes it hard to put the code into production, especially since a lot of testing and decisioning was done in R.

Can you also clarify how I can change the feature map to floats without triggering an error?

Thank you :)

vruusmann commented 6 years ago

The JPMML-XGBoost library performs the encoding of integer split conditions using exactly the same algorithm: https://github.com/jpmml/jpmml-xgboost/blob/master/src/main/java/org/jpmml/xgboost/RegTree.java#L164

> different thresholds than the R object?

What do you mean exactly by "R object"? There is no XGBoost object in R - the XGBoost model is stored in its native data format, and R accesses it via the standard XGBoost C API.

What is your R code for visualizing thresholds? It could be that you're looking at the raw value of the split_cond tree field and accidentally comparing it against the "adjusted" value (int)(split_cond + 1). In that case the one-unit difference would make perfect sense.
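For example, a quick way to inspect the raw split_cond values on the R side (a sketch, reusing the xg.model object from earlier in this thread):

library("xgboost")

# Dump the trees as text; each split line looks like "0:[f2<4] yes=1,no=2"
text_dump = xgb.dump(xg.model)
head(text_dump)

# For an integer feature, the PMML file stores the adjusted threshold,
# so a raw split of f2<4 would appear as as.integer(4 + 1) = 5 there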

wynag01 commented 6 years ago

Could it be an issue with the predict method in R then? predict(xg.model, as.matrix(data))

The prediction results in R differ from those of the PMML file. When I perform manual calculations to check the prediction outcomes, the R and PMML implementations end up in different branches of the split for the relevant observations - specifically in the instances where the split_condition values that I extracted via xgb.dump in R differ from the values in the PMML file.

vruusmann commented 6 years ago

> Could it be an issue with the predict method in R then? predict(xg.model, as.matrix(data))

Maybe. Is it possible for the R matrix object to contain both float and integer columns? I'm afraid that your matrix object contains all-float columns - all integer features have been automatically "promoted" to float features.
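This promotion is easy to verify in plain R, because a matrix has a single storage mode:

m = cbind(int_col = as.integer(c(1, 2, 3)), num_col = c(0.5, 1.5, 2.5))
typeof(m)   # "double" - the integer column has been silently promoted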

Can you re-run your experiment using Scikit-Learn instead of R? Scikit-Learn has much better support for working with mixed data types.

sudarshan1413 commented 6 years ago

I don't know if I understood the above conversation correctly, but what did wynag01 do to match the results of PMML and R? Was it making sure that all the columns (features) are of the same type, i.e. float, before passing them for model training, or making some manual changes to the feature map? Any help/input would be really appreciated, as I am facing the same issue with my XGBoost model.

vruusmann commented 6 years ago

The problem appears to be that R's matrix can only hold values of a single data type. There was a situation where the matrix object had all-float columns, but the associated feature map specified a mix of float and integer columns.

The solution is to either replace the matrix with some other data container that can hold both float and integer values, or fix the feature map.

@sudarshan1413 Can you get reproducible results if you do exactly as detailed in the README file of the r2pmml package: https://github.com/jpmml/r2pmml#package-xgboost

Specifically, you should generate your data container using the r2pmml::genDMatrix() function, and the associated feature map using the r2pmml::genFMap() function. Once you have verified that this "base setup" works fine, only then should you start replacing those utility functions with something else.
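For reference, here is a minimal sketch of that base setup; the argument order of genDMatrix() is an assumption on my part, so consult the README linked above for the authoritative example:

library("xgboost")
library("r2pmml")

data(mtcars)

# Keep an integer feature on purpose, so that the feature map declares
# a mix of integer and float columns
mtcars$cyl = as.integer(mtcars$cyl)

mtcars_y = mtcars$mpg
mtcars_X = mtcars[, c("cyl", "disp", "wt")]

# Generate the feature map and the data container with the r2pmml
# utility functions, so that the declared feature types stay in sync
mtcars.fmap = genFMap(mtcars_X)
mtcars.dmatrix = genDMatrix(mtcars_y, mtcars_X)  # assumed argument order

mtcars.xgb = xgboost(data = mtcars.dmatrix, nrounds = 10,
  objective = "reg:squarederror")

r2pmml(mtcars.xgb, "mtcars_xgb.pmml", fmap = mtcars.fmap)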

If using the SkLearn2PMML/JPMML-SkLearn stack, the feature map part of the workflow is hidden from the end user. Hence, nobody ever complains about mismatching predictions when using scikit-learn + XGBoost - it is very hard to make mistakes, even when acting very carelessly.