jpmml / r2pmml

R library for converting R models to PMML
GNU Affero General Public License v3.0

preProcess is not working with xgboost? #41

wendywangwwt closed this issue 6 years ago

wendywangwwt commented 6 years ago

Hi,

I combined the preProcess example with the xgboost example from the example code, but got an error. Is this because the xgboost wrapper doesn't support the preProcess option, or did I make a mistake?

The reproducible code is here:

library("xgboost")
library("r2pmml")
library("caret")

data(iris)

# Create a preprocessor
iris.preProcess = preProcess(iris, method = c("range"))

# Use the preprocessor to transform raw Iris data to pre-processed Iris data
iris.transformed = predict(iris.preProcess, newdata = iris)

iris_X = iris.transformed[, 1:4]
iris_y = as.integer(iris.transformed[, 5]) - 1

# Generate XGBoost feature map
iris.fmap = genFMap(iris_X)

# Generate XGBoost DMatrix
iris.DMatrix = genDMatrix(iris_y, iris_X)

# Train a model
iris.xgb = xgboost(data = iris.DMatrix, missing = NULL, objective = "multi:softmax", num_class = 3, nrounds = 13)

# Export the model to PMML.
# Pass the feature map as the `fmap` argument.
# Pass the name and category levels of the target field as `response_name` and `response_levels` arguments, respectively.
# Pass the value of missing value as the `missing` argument
# Pass the optimal number of trees as the `ntreelimit` argument (analogous to the `ntreelimit` argument of the `xgboost::predict.xgb.Booster` function)
r2pmml(iris.xgb, "iris_xgb.pmml", 
       fmap = iris.fmap, response_name = "Species", response_levels = c("setosa", "versicolor", "virginica"), 
       missing = NULL, ntreelimit = 7, compact = TRUE,
       preProcess = iris.preProcess)

and the error I got is:

> r2pmml(iris.xgb, "iris_xgb.pmml", 
+        fmap = iris.fmap, response_name = "Species", response_levels = c("setosa", "versicolor", "virginica"), 
+        missing = NULL, ntreelimit = 7, compact = TRUE,
+        preProcess = iris.preProcess)
Mar 29, 2018 11:26:12 AM org.jpmml.rexp.Main run
INFO: Parsing RDS..
Mar 29, 2018 11:26:12 AM org.jpmml.rexp.Main run
INFO: Parsed RDS in 16 ms.
Mar 29, 2018 11:26:12 AM org.jpmml.rexp.Main run
INFO: Initializing default Converter
Mar 29, 2018 11:26:12 AM org.jpmml.rexp.Main run
INFO: Initialized org.jpmml.rexp.XGBoostConverter
Mar 29, 2018 11:26:12 AM org.jpmml.rexp.Main run
INFO: Converting..
Mar 29, 2018 11:26:12 AM org.jpmml.rexp.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException
    at org.jpmml.xgboost.RegTree.encodePredicate(RegTree.java:169)
    at org.jpmml.xgboost.RegTree.encodeNode(RegTree.java:115)
    at org.jpmml.xgboost.RegTree.encodeTreeModel(RegTree.java:95)
    at org.jpmml.xgboost.ObjFunction.createMiningModel(ObjFunction.java:65)
    at org.jpmml.xgboost.MultinomialLogisticRegression.encodeMiningModel(MultinomialLogisticRegression.java:55)
    at org.jpmml.xgboost.GBTree.encodeMiningModel(GBTree.java:77)
    at org.jpmml.xgboost.Learner.encodeMiningModel(Learner.java:148)
    at org.jpmml.rexp.XGBoostConverter.encodeModel(XGBoostConverter.java:124)
    at org.jpmml.rexp.XGBoostConverter.encodeModel(XGBoostConverter.java:39)
    at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:78)
    at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:70)
    at org.jpmml.rexp.Main.run(Main.java:149)
    at org.jpmml.rexp.Main.main(Main.java:97)

Exception in thread "main" java.lang.IllegalArgumentException
    at org.jpmml.xgboost.RegTree.encodePredicate(RegTree.java:169)
    at org.jpmml.xgboost.RegTree.encodeNode(RegTree.java:115)
    at org.jpmml.xgboost.RegTree.encodeTreeModel(RegTree.java:95)
    at org.jpmml.xgboost.ObjFunction.createMiningModel(ObjFunction.java:65)
    at org.jpmml.xgboost.MultinomialLogisticRegression.encodeMiningModel(MultinomialLogisticRegression.java:55)
    at org.jpmml.xgboost.GBTree.encodeMiningModel(GBTree.java:77)
    at org.jpmml.xgboost.Learner.encodeMiningModel(Learner.java:148)
    at org.jpmml.rexp.XGBoostConverter.encodeModel(XGBoostConverter.java:124)
    at org.jpmml.rexp.XGBoostConverter.encodeModel(XGBoostConverter.java:39)
    at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:78)
    at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:70)
    at org.jpmml.rexp.Main.run(Main.java:149)
    at org.jpmml.rexp.Main.main(Main.java:97)
Error in .convert(tempfile, file, converter, converter_classpath, verbose) : 
  1

I saw issue 40 and checked my Java version:

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

Thank you!

vruusmann commented 6 years ago

Thanks for the reproducible example.

The exception that I'm seeing is the following:

Exception in thread "main" java.lang.IllegalArgumentException
    at org.jpmml.xgboost.RegTree.encodePredicate(RegTree.java:169)

The R2PMML/JPMML-R stack has passed control to the JPMML-XGBoost library, which fails because it encounters an unsupported feature data type.

XGBoost can handle integer and float features. However, when Caret pre-processing is involved, the feature type is promoted from float to double.

In principle, Caret pre-processing can be combined with any R model type, including the xgb.Booster model type. The R2PMML/JPMML-R stack should simply perform an explicit downcast from double to float here in order to make the JPMML-XGBoost library happy. This is a simple fix that should become available in a couple of days.
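
To illustrate the double-versus-float distinction mentioned above, here is a minimal base R sketch. It is not the R2PMML/JPMML-R fix itself, which lives in the Java converter. The helper `to_float` is a hypothetical name introduced for this example; it emulates a single-precision downcast by round-tripping values through a 4-byte encoding, since R has no native float type. The min-max scaling is only assumed to resemble what Caret's "range" method does.

```r
# Hypothetical helper: emulate a double -> float (single-precision) downcast.
# R numerics are always doubles, so encode each value in 4 bytes and read it back.
to_float <- function(x) {
  raw_bytes <- writeBin(as.double(x), raw(), size = 4)
  readBin(raw_bytes, what = "numeric", n = length(x), size = 4)
}

# Min-max scaling of one Iris column (assumed to resemble Caret's "range" method)
x <- iris$Sepal.Length
scaled <- (x - min(x)) / (max(x) - min(x))

# The scaled feature is stored as a double-precision vector
scaled_type <- typeof(scaled)

# The same values after the emulated downcast to single precision
scaled_float <- to_float(scaled)

# Largest absolute difference introduced by the downcast
max_diff <- max(abs(scaled - scaled_float))
print(scaled_type)
print(max_diff)
```

The difference is tiny, but it is generally not zero, which is one reason a converter would want the downcast to be explicit rather than implicit.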

asuskitty commented 6 years ago

I get an error when converting an xgboost model with variable transformations to PMML, while the same process works for a linear regression:

library(pmml)
library(pmmlTransformations)
library(xgboost)
data(audit)
audit <- audit[, c("Age", "Deductions", "Hours", "Income", "Sex")]
audit$Sex <- as.character(audit$Sex)
audit$Income <- ifelse(audit$Income > 40000, 1, 0)

auditBox <- WrapData(audit[, c("Age", "Deductions", "Hours", "Sex", "Income")])
t <- list()
m <- data.frame(c("Sex", "string", "Male"),
                c("d_sex2", "integer", 1))
t[[1]] <- m
auditBox <- MapXform(auditBox, xformInfo = t, defaultValue = c(0),
                     mapMissingTo = "0")

#######################################
############ LM ##################
fit <- lm(Income ~ ., data = auditBox$data[, c("Age", "Deductions", "Hours", "d_sex2", "Income")])
fit_pmml <- pmml(fit, transforms = auditBox)

#######################################
############ XGBOOST ##################

tablonXg <- xgb.DMatrix(as.matrix(auditBox$data[, c("Age", "Deductions", "Hours", "d_sex2")]),
                        label = audit[, "Income"])
fit_xg <- xgb.train(data = tablonXg,
                    nrounds = 100,
                    objective = "binary:logistic",
                    eval_metric = "auc",
                    verbose = FALSE)
xgb.dump(fit_xg, "fitmodel")
xg_pmml <- pmml(fit_xg,
                transforms = auditBox,
                inputFeatureNames = c("Age", "Deductions", "Hours", "d_sex2"),
                outputLabelName = "Income",
                xgbDumpFile = "fitmodel")

Error in .pmmlLocalTransformations(field, transforms, ltNode) :
object 'ltNode' not found
In addition: Warning message:
In pmml.xgb.Booster(fit_xg, transforms = auditBox, inputFeatureNames = c("Age", :
No output categories given; regression model assumed.

vruusmann commented 6 years ago

@asuskitty Your code example uses the pmml and pmmlTransformations packages, whereas this issue tracker is for the r2pmml package.

Please switch from pmml to r2pmml.