imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Function for combining two ranger models #356

Open PhilippPro opened 5 years ago

PhilippPro commented 5 years ago

Hi Marvin,

is there already a function for combining two ranger models, as requested here: https://stackoverflow.com/questions/44444140/combing-two-objects-of-class-ranger-in-r-for-survival-analysis

Best regards, Philipp

mnwright commented 5 years ago

Yes, we should have that function. It might be as easy as the function in #351, but I have to check the details. If it's urgent you might try to adapt that function (PR welcome). If not, I will add it later in October (off for holidays from tomorrow).

PhilippPro commented 5 years ago

I wrote a function for combining them (it is not urgent, though). There are some open questions about list elements that are not easy to aggregate; a possible solution is to calculate a weighted average, see the code.

combineRanger = function(mod1, mod2) {
  # TODO: check that the models fit together
  # (formula, num.independent.variables, mtry, min.node.size, splitrule, treetype, call?, importance.mode, num.samples, replace, ...)

  res = mod1
  res$num.trees = res$num.trees + mod2$num.trees
  res$inbag.counts = c(res$inbag.counts, mod2$inbag.counts)
  res$forest$child.nodeIDs = c(res$forest$child.nodeIDs, mod2$forest$child.nodeIDs)
  res$forest$split.varIDs = c(res$forest$split.varIDs, mod2$forest$split.varIDs)
  res$forest$split.values = c(res$forest$split.values, mod2$forest$split.values)
  if (!is.null(res$forest$terminal.class.counts)) {
    res$forest$terminal.class.counts = c(res$forest$terminal.class.counts, mod2$forest$terminal.class.counts)
  }
  res$forest$num.trees = res$forest$num.trees + mod2$forest$num.trees
  res$call$num.trees = res$num.trees

  # list elements that are not simple to aggregate:
  # res$predictions
  # res$prediction.error
  # res$r.squared
  # res$variable.importance
  # possible solution (like in the randomForest combine function): just calculate the average for these, weighted by the number of trees.

  res
}

mod1 = ranger(Species ~ ., data = iris, keep.inbag = TRUE, probability = TRUE, num.trees = 100, mtry = 2, importance = "impurity")
mod2 = ranger(Species ~ ., data = iris, keep.inbag = TRUE, probability = TRUE, num.trees = 200)
mod_total = combineRanger(mod1, mod2)
mod_total
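The tree-weighted averaging suggested in the comments above could be sketched as follows (a minimal sketch; weightedAggregate is an illustrative name, not part of ranger, following what randomForest::combine does for its error estimates):

```r
# Hedged sketch: aggregate the elements that cannot simply be concatenated
# by an average weighted by the number of trees in each forest.
weightedAggregate = function(x1, x2, n1, n2) {
  (n1 * x1 + n2 * x2) / (n1 + n2)
}

# e.g. inside combineRanger:
#   res$prediction.error = weightedAggregate(mod1$prediction.error,
#     mod2$prediction.error, mod1$num.trees, mod2$num.trees)
```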

XavierPrudent commented 5 years ago

Hello Philipp, I have tried the function combineRanger, but when using the combined model for predicting I get:

Error in predict.ranger.forest(forest, data, predict.all, num.trees, type,  : 
  Error: Invalid forest object. Is the forest grown in ranger version <0.3.9? Try to predict with the same version the forest was grown.

I looked into the code, and surprisingly the condition that gives rise to this error is not fulfilled. Could you use it successfully? If needed I can give you access to the models I trained.

Regards,

Xavier Prudent

PhilippPro commented 5 years ago

> Hello Philipp, I have tried the function combineRanger, but when using the combined model for predicting I get:
>
> Error in predict.ranger.forest(forest, data, predict.all, num.trees, type,  : 
>   Error: Invalid forest object. Is the forest grown in ranger version <0.3.9? Try to predict with the same version the forest was grown.
>
> I looked into the code, and surprisingly the condition that gives rise to this error is not fulfilled. Could you use it successfully? If needed I can give you access to the models I trained.
>
> Regards,
>
> Xavier Prudent

Dear Xavier,

can you write some reproducible code that shows what does not work? My code from above is still working for me.

Cheers, Philipp

XavierPrudent commented 5 years ago

Thank you very much, Philipp. I'll prepare a full reproducible example and post it as soon as it's available. Xavier

jchen1981 commented 5 years ago

Add "res$forest$num.trees = res$forest$num.trees + mod2$forest$num.trees"

PhilippPro commented 5 years ago

> Add "res$forest$num.trees = res$forest$num.trees + mod2$forest$num.trees"

Thanks, I updated the code above.

RudolfJagdhuber commented 4 years ago

Hey everyone,

The given solution can produce some very nasty hidden errors.

The main problem is that the split.varIDs are simple integer indices, which are just concatenated with c(). Both ranger objects start their indexing at 1, so after combining, the IDs from the second model falsely refer to the first model's independent.variable.names.

This does not produce a visible error but gives completely wrong results in prediction.

Example:

rf1 = ranger(y ~ V1, data = dat, num.trees = 1)
rf2 = ranger(y ~ V2, data = dat, num.trees = 1)

rf = combineRanger(rf1, rf2)

rf$forest$split.varIDs
[[1]]
[1] 1 1 1 0 0 0 0

[[2]]
[1] 1 1 1 0 0 0 0

Both trees are now identical, but they originally referred to different variables.

I don't have a better solution, just trying to warn.

Best regards, Rudi

mnwright commented 4 years ago

I think the function is meant to combine two forests grown on the same data (for parallelization) and not with different variables. Anyway, we should check for that when we add it to the package.
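A minimal sketch of such a check (field names follow the ranger object layout used in the code above; checkCompatible is an illustrative name, not part of ranger):

```r
# Hedged sketch: refuse to combine forests that were grown on different
# variables or are of different tree types, which would silently corrupt
# the split.varIDs as described above.
checkCompatible = function(mod1, mod2) {
  stopifnot(identical(mod1$forest$independent.variable.names,
                      mod2$forest$independent.variable.names),
            identical(mod1$treetype, mod2$treetype))
  invisible(TRUE)
}
```

Such a check could be called at the top of combineRanger before any list elements are concatenated.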

ghost commented 4 years ago

What's wanted is something like the combine function in randomForest so it can be used with, for instance, the .combine= feature of the foreach package. For now because of that absence, I need to rely upon randomForest and I'd much rather be using ranger.

After examining randomForest further and thinking about it, this isn't at all what I want. The combine function in randomForest only facilitates concurrent processing of the same model across parts of a single dataset, not combination of arbitrary forests, even if the covariates are identical.

The further thinking suggests that this combination can be done with ranger as it is, but it entails building an appropriate training data frame. In my case, there are M datasets, all having the same covariates but arising from different samples. Say the kth dataset has n_k records in it, and there are N records overall. So what I want to do is step over the M datasets, picking the kth at the kth step, building a training dataset consisting of the N - n_k remaining records, and using the n_k records of the kth dataset as a test set. This involves running ranger M times.

No need for combine in my case.
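The leave-one-dataset-out scheme described above could be sketched in base R like this (loo_split is an illustrative helper, not part of ranger; it assumes `datasets` is a list of M data frames with identical columns):

```r
# Hedged sketch: at step k, the kth dataset is held out as a test set
# and the remaining M-1 datasets are stacked into the training set.
loo_split = function(datasets, k) {
  list(train = do.call(rbind, datasets[-k]),  # the N - n_k remaining records
       test  = datasets[[k]])                 # the n_k held-out records
}

# At each step k, grow one forest and evaluate it on the held-out part:
#   fit = ranger(y ~ ., data = loo_split(datasets, k)$train)
#   predict(fit, data = loo_split(datasets, k)$test)
```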