jpmml / jpmml-xgboost

Java library and command-line application for converting XGBoost models to PMML
GNU Affero General Public License v3.0

Wrong number of features in the R script sample from README.md #58

Closed kdutkowski closed 2 years ago

kdutkowski commented 2 years ago

I tried running the mtcars training sample script from the example in your README, and it gives me the following error:

> # Dump the model in text format
> xgb.dump(mtcars.xgb, "xgboost.model.txt", fmap = "xgboost.fmap")
Error in xgb.dump(mtcars.xgb, "xgboost.model.txt", fmap = "xgboost.fmap") : 
  [16:03:26] amalgamation/../src/data/../c_api/c_api_utils.h:240: Check failed: feature_map.Size() == n_features (12 vs. 11) : 

I think it should be mtcars.fmap = as.fmap(mtcars.matrix) instead of mtcars.fmap = as.fmap(mtcars.frame), it works alright when I change it. Am I right?

vruusmann commented 2 years ago

> I think it should be mtcars.fmap = as.fmap(mtcars.matrix) instead of mtcars.fmap = as.fmap(mtcars.frame),

This is quite an old code example. I believe that older XGBoost versions did not perform this sanity check and were happy to accept an extra column.
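The same kind of size check can be run on the R side before calling xgb.dump, by comparing the number of feature map entries with the number of columns in the training matrix. The following is a minimal base R sketch, not the actual XGBoost check: the helper name check_fmap_size and the toy column names are hypothetical, and the feature map is represented as a plain data frame with one row per feature, which is an assumption about its shape rather than the object returned by as.fmap.

```r
# Hypothetical helper: stop early if the feature map and the training
# matrix disagree on the number of features
check_fmap_size = function(fmap, X){
    if(nrow(fmap) != ncol(X)){
        stop(sprintf("Feature map has %d entries, but the matrix has %d columns", nrow(fmap), ncol(X)))
    }
    invisible(TRUE)
}

# Toy training matrix with three feature columns (the label is kept separately)
X = matrix(0, nrow = 2, ncol = 3, dimnames = list(NULL, c("x1", "x2", "x3")))

# Feature map built from the feature columns only: the sizes agree
fmap.ok = data.frame(id = 0:2, name = colnames(X), type = "q")
check_fmap_size(fmap.ok, X)

# Feature map built from a data frame that still contains the label column "y"
fmap.bad = data.frame(id = 0:3, name = c(colnames(X), "y"), type = "q")
result = tryCatch(check_fmap_size(fmap.bad, X), error = function(e) conditionMessage(e))
print(result)
```

Failing in R with a message that names both sizes can be easier to diagnose than the error raised from inside the XGBoost library.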

This extra column represents the label. It's the last one (i.e. in the rightmost position), so it does not distort the indices or interpretations of the earlier feature columns. If this extra column were in the first position, then the exported model would reference the wrong features and would make incorrect predictions.
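The effect of the label position on feature indices can be illustrated with a small base R sketch. The column names x1, x2, x3 and y are hypothetical, and index_of is an illustrative helper that assigns zero-based indices in column order, which is assumed here to mirror how a feature map numbers its entries.

```r
# Zero-based indices, in the order in which the columns appear
index_of = function(columns){
    setNames(seq_along(columns) - 1, columns)
}

features = c("x1", "x2", "x3")  # hypothetical feature names
label = "y"                     # hypothetical label name

# Indices that the trained model expects: features only
expected = index_of(features)

# Feature maps that accidentally include the label column
label.last = index_of(c(features, label))
label.first = index_of(c(label, features))

# Label in the rightmost position: feature indices are unchanged
print(all(label.last[features] == expected))

# Label in the first position: the offset of every feature index
print(label.first[features] - expected)
```

With the label in the first position, each feature name ends up attached to the index of its neighbour, which is how an exported model would silently reference the wrong features.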

> it works alright when I change it. Am I right?

If the conversion process did not raise any errors, and the PMML model makes correct predictions when invoked with sample data, then it most definitely is correct. My approval is not needed.

There are more R code examples here (more complex stuff like categorical features, missing values, etc.): https://github.com/jpmml/jpmml-xgboost/blob/1.6.0/pmml-xgboost-testing/src/test/resources/xgboost.R

Some more examples are available in the JPMML-R project: https://github.com/jpmml/jpmml-r/blob/1.4.5/src/test/R/xgboost.R

One thing you could try is embedding a model verification dataset into the PMML document: https://github.com/jpmml/jpmml-r/blob/1.4.5/src/test/R/xgboost.R#L90

vruusmann commented 2 years ago

BTW: in future XGBoost versions (1.5.X and up), it should be possible to get rid of the "feature map" functionality, because the XGBoost model file will contain basic information about the training dataset: feature names, category levels for categorical features, etc.

I will close this issue with an updated code example at some later date.

kdutkowski commented 2 years ago

Ok, that's perfect. My only goal was to give you a heads up about the issue I stumbled on, thanks for your quick response!