topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html
1.62k stars 632 forks source link

Building custom Ordinal Forest Model in caret's train framework #902

Open HanjoStudy opened 6 years ago

HanjoStudy commented 6 years ago

Needing help with

I have recently come across the orindalForest package. This package focuses on an implementation of the well known ranger Random Forest model. With this in mind, I am trying to integrate the model into the caret framework without any luck. Here is what I have so far:

The code

Package can be installed from CRAN

# install.packages("ordinalForest")
library(ordinalForest)

# start by naming my method to pass to train
ordinalForest <- list(type = "Classification",
              library = "ordinalForest",
              loop = NULL)

# define the tuning parameters
prm <- data.frame(parameter = c("nsets", "ntreeperdiv", "ntreefinal", "npermtrial", "nbest"),
                  class = rep("numeric", 5),
                  label = c("Score sets", "Number of Trees (small)", "Number of Trees (final)", "Tried score sets" ,"Best score sets" ))

# append them to the list
ordinalForest$parameters <- prm

Now that the initial code is written, I can define the default training grid. Some models can do a random search, but I wont implement that as the [paper] (https://epub.ub.uni-muenchen.de/41183/1/TR.pdf) states the defaults are pretty good.

ordinalForestGrid <- function(x, y, len = NULL, search = grid) {
  if(search == grid) {
    out <- expand.grid(nsets = 1000,
                       ntreeperdiv = 100,
                       ntreefinal = 5000,
                       npermtrial = 500,
                       nbest = 10)
  } else {
    stop('random search not yet implemented')
  }
  out
}

# append to list
ordinalForest$grid <- ordinalForestGrid

For the fitting function, the ordfor function does have a strange implementation in defining X and Y. I try and overcome this by explicitly asking for X and Y, binding them and then implementing the ordfit function:

ordinalForestFit <- function(x, y) {
  library(ordinalForest)

  data <- data.frame(x, Class = y)

  ordfor(depvar = "Class", data, nsets = 1000, ntreeperdiv = 100, ntreefinal = 5000,
         perffunction = c("equal"), classimp,
         classweights, nbest = 10, naive = FALSE, num.threads = NULL,
         npermtrial = 500, permperdefault = FALSE)

}

# append to list
ordinalForest$fit <- ordinalForestFit

I also notice that the classes are stored in a strange slot as well:

ordinalForest$levels <- function(x) x@classes

Next I add the prediction functions:

ordinalForestPred <- function(modelFit, newdata, preProc = NULL, submodels = NULL) {
  predict(modelFit, newdata)
}

# append to list
ordinalForest$predict <- ordinalForestPred

ordinalForestProb <- function(modelFit, newdata, preProc = NULL, submodels = NULL) {
  predict(modelFit, newdata, type = "prob")
}

# append to list
ordinalForest$prob <- ordinalForestProb

Laslty I add the sorting, how the tuning parameters are ordered in case similar performance obtained

ordinalForestSort <- function (x) x[order(x$nsets, x$ntreeperdiv, x$ntreefinal, 
                                  x$npermtrial, x$nbest), ]

# append to list
ordinalForest$sort <- ordinalForestSort

The test

Now that the hard work is done, lets get the party started

# load caret and doParallel library
library(caret)
library(doParallel)
library(ordinalForest)
data(hearth)

# Still not working whether you use parallel or non parallel
#cl <- makePSOCKcluster(3)
#registerDoParallel(cl)

# define grid of parameter values
tuneGrid <- expand.grid(nsets = 1000,
                   ntreeperdiv = 100,
                   ntreefinal = 5000,
                   npermtrial = 500,
                   nbest = 10)

set.seed(825)
OFTune <- train(x = hearth[,-11],
                  y = hearth[,11],
                  method = ordinalForest,
                  tuneGrid = tuneGrid)
Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :1     NA's   :1    
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
#stopCluster(cl)

With great dissapointed, we get the usual fallback error: There were missing values in resampled performance measures

Basic Implementation

To test the basic implementation, we can use the following code:


library(ordinalForest)
data(hearth)

# wrapper for ease of use
ordinalForestFit <- function(x, y) {
  data <- data.frame(x, Class = y)

  ordfor(depvar = "Class", data, nsets = 60, ntreeperdiv = 100, ntreefinal = 5000,
         perffunction = c("equal"), classimp,
         classweights, nbest = 10, naive = FALSE, num.threads = NULL,
         npermtrial = 500, permperdefault = FALSE)
}

ordinalForestFit(x = hearth[,-11],
                 y = hearth[,11])

Session info

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_South Africa.1252 
[2] LC_CTYPE=English_South Africa.1252   
[3] LC_MONETARY=English_South Africa.1252
[4] LC_NUMERIC=C                         
[5] LC_TIME=English_South Africa.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ordinalForest_2.1    quotidieR_0.0.0.9000 usethis_1.3.0       
[4] devtools_1.13.5     

loaded via a namespace (and not attached):
[1] compiler_3.4.2  backports_1.1.1 tools_3.4.2     withr_2.1.2    
[5] yaml_2.1.19     Rcpp_0.12.16    memoise_1.1.0   digest_0.6.12
topepo commented 6 years ago

Can you try this without parallel processing and with a small, reproducible example that I can test with?

Can you also send the results of sessionInfo()?

HanjoStudy commented 6 years ago

Updated as requested. A small, reproducible example is available as a gist here

hadjipantelis commented 6 years ago

@HanjoStudy: Forget parallel altogether. The fit function is never called. The obvious fix for that is to redefine it as: ordinalForestFit <- function(x, y, param, lev, last, classProbs, ...) { ... } but then you have other issues as the predict function is also incomplete.

If you want, I can make a PR about ordinalForest during the weekend.

Suggestion for the future: Try and follow the style of other model files. It will save you time and trouble. I am not suggesting that your code structure is worse or the current style is better. Nevertheless, the current style is the existing one, is prevalent through the whole code base and it makes it easy to compare files.

HanjoStudy commented 6 years ago

Thanks @hadjipantelis for the advise! I do agree with you, I think next time I would rather start with a caret model skeleton and work it back from there. No rush in implementing the model, I am just slowly trying to wrap my head around the caret framework's underlying components and this ordinalForest seemed like a good example to try it out on since it hasn't been implemented. Ill also keep playing with the code a bit over the next week or so

topepo commented 6 years ago

@hadjipantelis is right that the fit module should have the right arguments.

Another issue is the predict (and likely the prob) module. This package follows the crappy precedent set by ranger that doesn't just return the predictions. You'll need to use the ypred component:

> ?ordfor
> data(hearth)
> 
> set.seed(123)
> trainind <- sort(sample(1:nrow(hearth), size=floor(nrow(hearth)*(1/2))))
> testind <- sort(sample(setdiff(1:nrow(hearth), trainind), size=20))
> 
> datatrain <- hearth[trainind,]
> datatest <- hearth[testind,]
> 
> ordforres <- ordfor(depvar="Class", data=datatrain, nsets=60, nbest=5)
> 
> preds <- predict(ordforres, newdata=datatest)
> 
> preds

Predicted values of 20 observations. 

Classes of ordinal target variable: 
"1", "2", "3", "4", "5" 
> names(preds)
[1] "ypred"         "classfreqtree"
> preds$ypred
 [1] 1 1 1 1 1 4 1 1 1 4 1 1 1 5 1 1 4 4 4 4
Levels: 1 2 3 4 5
hadjipantelis commented 6 years ago

@topepo Yes, I fully agree (that's why I said it is "incomplete"). OP seems busy so I will drop the PR in the next day or two.

hadjipantelis commented 6 years ago

@topepo Some quick questions so we save ourselves back and forth after the PR.

  1. I will put this model as "Classification" (as with polr), not "Regression", because the response variable is expected to be a factor. Are we OK with that?
  2. The response variable is not expected to be an "ordered" factor, just a "nominal" factor. Should this be enforced? (Probably this should be addressed at the base package directly but I see a lot of questions about incomprehensible results coming from not enforcing this.) I choose not for the time being but I can do so if you want.
    1. ordfor does not use sample weights but rather class weights. Weights themselves are used when a custom performance function is used (?ordinalForest::perff for details). The user can theoretically still pass them through the ... if inclined but I am not testing this or including this functionality out-of-the-box as it non-standard (usually weights are an nrow(x) long vector).
  3. LOOCV does not work. This is by design as the single row newdata wll err. e.g.
    
    library(ordinalForest)
    data(hearth) 
    datatrain <- hearth[-1,]
    datatest <- hearth[1,]

ordforres <- ordfor(depvar="Class", data=datatrain, nsets=60, nbest=5) predict( ordforres, datatest)

Error in rowSums(ynumpred == x, na.rm = TRUE) :

'x' must be an array of at least two dimensions


I can e-mail the maintainer about this but this is a separate issue.

Aside these the PR is ready on my fork. :)
mshad-star commented 2 years ago

Hello. shouldnt be set hyperparameters in ordinal forests : nbest = 50 and nset = 10 is not enough and ordinal forest will be suboptimal