jpmml / r2pmml

R library for converting R models to PMML
GNU Affero General Public License v3.0

Error converting Ranger with caret #23

Closed: edumucelli closed this issue 7 years ago

edumucelli commented 7 years ago

Hi Villu,

I am seeing different behavior when converting "rf" and "ranger" models: the latter fails with the following error:

Error in .convert(tempfile, file, ...) : 
  unused argument (variable.levels = c("false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", 
"false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", "false", "true", 

The variable.levels argument is not being taken into consideration, though. Converting an rf model with the same configuration works normally. Below is the reproducible code:

library(caret)
library(r2pmml)

NROW_TRAIN = 2000
columns = c(100, 500, 1000)
rows = c(100, 500, 1000)

train_with_matrix <- function(train_x, train_y, method) {
    crtl = trainControl(method = "cv", 
                        returnData = TRUE)
    tune_length = 1
    method_fit = train(x = train_x,
                       y = train_y, 
                       method = method,     
                       trControl = crtl,
                       tuneLength = tune_length)
    return (method_fit)
}

for (NCOL in columns) {
    CLASS = replicate(NROW_TRAIN, sample(c('true', 'false'), size = 1, replace=FALSE, prob=c(0.7, 0.3)))
    data = data.frame(CLASS = CLASS, replicate(NCOL, sample(rnorm(NROW_TRAIN), replace=TRUE)))

    trainIndex = createDataPartition(data$CLASS, p = .75, times = 1, list = FALSE)

    train = data[ trainIndex,]

    train_x = train[, -1]
    train_y = train[, 1]

    method = "ranger" # "rf" works

    model_fit = train_with_matrix(train_x, train_y, method)
    # Also tried sapply(data, levels), as I am not sure whether to pass the levels of all columns or only those of the target column.
    r2pmml(model_fit, paste0(method, ".benchmark.", NCOL, ".pmml"), variable.levels = sapply(train_y, levels))
}

And here is the sessionInfo

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ranger_0.6.0    e1071_1.6-8     r2pmml_0.12.3   caret_6.0-76   
[5] ggplot2_2.2.1   lattice_0.20-34

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9        magrittr_1.5       splines_3.3.2      MASS_7.3-45       
 [5] munsell_0.4.3      colorspace_1.3-2   foreach_1.4.3      minqa_1.2.4       
 [9] stringr_1.2.0      car_2.1-4          plyr_1.8.4         tools_3.3.2       
[13] parallel_3.3.2     nnet_7.3-12        pbkrtest_0.4-6     grid_3.3.2        
[17] gtable_0.2.0       nlme_3.1-131       mgcv_1.8-17        quantreg_5.29     
[21] class_7.3-14       MatrixModels_0.4-1 iterators_1.0.8    lme4_1.1-12       
[25] lazyeval_0.2.0     assertthat_0.1     tibble_1.2         Matrix_1.2-8      
[29] nloptr_1.0.4       reshape2_1.4.2     ModelMetrics_1.1.0 codetools_0.2-15  
[33] stringi_1.1.2      compiler_3.3.2     scales_0.4.1       stats4_3.3.2      
[37] SparseM_1.74    

I guess I am missing something about variable.levels that I have not been able to figure out. Thank you in advance for any tips on this problem. Cheers!

vruusmann commented 7 years ago

The order of function parameters for the r2pmml.ranger() function was changed between versions 0.13.0 and 0.13.1: https://github.com/jpmml/r2pmml/commit/15c432278ff7fe8effac2100452ed02eb355c01a

According to sessionInfo(), you're currently using r2pmml version 0.12.3. Your problem should be solved simply by upgrading to the latest version:

library("devtools")

remove.packages("r2pmml")
install_github(repo = "jpmml/r2pmml")

The variable.levels argument is supposed to provide the complete set of category levels for categorical features (as the ranger model object does not contain this information). So, you need to do sapply(train_x, levels). If your dataset does not contain any categorical features, then you may leave it empty.
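As a minimal base-R sketch of what sapply(train_x, levels) evaluates to, consider a toy data frame (the column names and values below are made up for illustration and are not from the reported dataset). It only shows the shape of the value; how r2pmml consumes it is not shown here.

```r
# Toy feature data frame (hypothetical): one categorical and one numeric column.
train_x <- data.frame(
    color = factor(c("red", "blue", "red")),
    size = c(1.5, 2.0, 3.5)
)

# levels() returns the category levels of a factor and NULL for non-factors,
# so the result here is a named list with one entry per feature.
variable_levels <- sapply(train_x, levels)
str(variable_levels)
```

Note that sapply() simplifies its result when it can: if every column is a factor with the same number of levels, it returns a character matrix rather than a list, whereas lapply(train_x, levels) always returns a list.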

Also, you might want to consider upgrading to ranger version 0.7.0.

vruusmann commented 7 years ago

Sorry, just realized that your issue is about passing variable.levels in a situation where the ranger function is invoked via the caret package, not directly.

In this case, you need to attach the variable.levels field directly to the train$finalModel object:

model_fit = train_with_matrix(train_x, train_y, method)
model_fit$finalModel$variable.levels = sapply(train_x, levels) # THIS
r2pmml(model_fit, paste0(method, ".benchmark.", NCOL, ".pmml"))

I'm reopening this issue in order to figure out a more elegant way of "decorating" caret-trained model objects with extra information.

It affects more model types, not just the ranger model type. For example, there needs to be an easy way to decorate caret-trained xgb.Booster objects with feature map information.

vruusmann commented 7 years ago

Now it's possible to simply do:

model_fit = train_with_matrix(train_x, train_y, method)
r2pmml(model_fit, paste0(method, ".benchmark.", NCOL, ".pmml"), variable.levels = sapply(train_x, levels))

edumucelli commented 7 years ago

Wow, great! Thanks for the quickness and the responsiveness!

EwelinaEwelina commented 6 years ago

It's not working for me for neural networks:

mynn <- nnet(Churn ~ ., data=trainNN, size=3, decay=1.0e-5, maxit=50, softmax = TRUE)
mynn$variable.levels <- lapply(trainNN, function(x){ if(is.factor(x)) { levels(x) } else { NULL }})

After:

r2pmml(mynn,"churn_nnet_pmml.xml", verbose = TRUE, variable.levels = mynn$variable.levels)

I get an error:

Error in decorate.default(x, ...) : 
  unused argument (variable.levels = list(X.area.code.408 = NULL, X.area.code.510 = NULL, X.international.plan.1 = NULL, X.number.vmail.messages. = NULL, X.total.day.charge. = NULL, Churn = NULL))

vruusmann commented 6 years ago

@EwelinaEwelina The "decoration" is different for each R model type. What works for ranger (this original issue) does not work for nnet (your issue).

The r2pmml::r2pmml() function does all the necessary decoration work behind the scenes automatically, so there is no need to assist it in any way.
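
The "unused argument" error above is the usual result of R's S3 dispatch when an extra argument reaches a method that does not declare it. The following is a rough base-R sketch of that mechanism only; decorate_sketch and the two placeholder objects are hypothetical names for illustration, not r2pmml's actual code.

```r
# Hypothetical S3 generic; illustrative only, not the r2pmml implementation.
decorate_sketch <- function(x, ...) UseMethod("decorate_sketch")

# The method for "ranger" objects declares the extra variable.levels argument.
decorate_sketch.ranger <- function(x, variable.levels = NULL, ...) {
    x$variable.levels <- variable.levels
    x
}

# The fallback method declares no extra arguments at all.
decorate_sketch.default <- function(x) x

# Empty placeholder objects that carry only a class attribute.
ranger_like <- structure(list(), class = "ranger")
nnet_like <- structure(list(), class = "nnet")

# Works: dispatches to the ranger method, which stores the levels.
decorated <- decorate_sketch(ranger_like, variable.levels = list(color = c("blue", "red")))

# Fails: "nnet" has no dedicated method, so the call falls through to the
# default method, which rejects the argument it does not know about.
result <- tryCatch(
    decorate_sketch(nnet_like, variable.levels = list()),
    error = function(e) conditionMessage(e)
)
print(result)
```

Under this reading, dropping the variable.levels argument from the nnet call avoids the error, which matches the advice above.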