hadjipantelis opened this issue 6 years ago

I have a feature request: would it be possible for `predict.Rborist` to accept an `nTree` argument, similar to the standard `Rborist` call? Given that a forest has `N` trees, we should be able to produce predictions using only `N-M` trees. This functionality can be useful to see if/when additional trees lead to over-fitting.
This could easily be slipped into the upcoming version, but why not just retrain with `N-M` trees?

Prediction could be parametrized with a logical vector, for example, with `N-M` entries of `TRUE` and the remaining `M` entries `FALSE`. It seems, though, that these entries would need to be chosen at random, then applied iteratively, in order to get a good sense of over-fitting.
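A rough sketch of what that might look like; the `useTree` argument is purely hypothetical, just to fix the shape of the interface:

```r
library(Rborist)

rb <- Rborist(xTrain, yTrain, nTree = 500)

# Drop M = 50 trees at random; repeat with fresh permutations to
# gauge over-fitting. 'useTree' is a hypothetical argument, not
# part of the current predict.Rborist signature.
keep <- sample(rep(c(TRUE, FALSE), times = c(450, 50)))
pred <- predict(rb, newdata = xTest, useTree = keep)
```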
Thank you for the speedy response.
Wouldn't retraining a new RF require additional time? (and space)
The use-case I have in mind: just as one can select the number of iterations at prediction time with a GBM, one could do something similar with an RF.
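For reference, `gbm` exposes this at prediction time via `n.trees`; roughly:

```r
library(gbm)

fit <- gbm(y ~ ., data = train, distribution = "gaussian", n.trees = 500)

# One column of predictions per ensemble size -- handy for spotting
# where additional trees stop helping:
preds <- predict(fit, newdata = test, n.trees = c(100, 250, 500))
```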
I appreciate that tree order is irrelevant in the case of an RF, so the logical vector you describe is probably the ideal scenario, but for a quick check just having the first `N-M` entries set to `TRUE` and the final `M` to `FALSE` is probably fine; we are bootstrapping the original data anyway. Just using the first `N-M` trees will also be faster, because in a large forest we would not have to traverse all the trees and set some of them to zero.
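In code, the quick check I have in mind is little more than:

```r
n <- 500   # trees trained
m <- 50    # trees to ignore
keep <- rep(c(TRUE, FALSE), times = c(n - m, m))  # first N-M on, last M off
```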
> Wouldn't retraining a new RF require additional time? (and space)
Assuming a "moderate" number of predictors, it takes (very roughly) about ten times as long to train as to predict. So, yes, resampling from the same forest will be faster than retraining each time. The results will not be identical to to those obtained through retraining, but they may be suitable for your purposes. Unless forest are retained following prediction, though, there should not be a memory penalty.
> just having the first `N-M` entries set to `TRUE` and the final `M` to `FALSE` is probably fine.
Yes, but a new feature like this should be sufficiently general to support a variety of use cases.
> Just using the first `N-M` trees will also be faster because in a large forest we would not have to traverse all the trees and set them to zero.
I may be missing your point, but initializing an index vector seems like a two- or three-line operation at worst.
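For instance:

```r
keep <- logical(nTree)                  # all FALSE
keep[sample(nTree, nTree - m)] <- TRUE  # enable N-M trees at random
```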
I assume they are stored sequentially in memory, so we would have "unit stride" if we just used the first `N-M` trees rather than random access. Granted, it is a minor point!
Trees are stored sequentially but their sizes are not uniform so, in particular, stride is not fixed. In any case, bagging already introduces a precedent for ignoring a given tree at a given row. The feature you propose generalizes this a bit, when prediction is not bagged, by ignoring a given tree at all rows.
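Conceptually, with `treePreds` a hypothetical observations-by-trees matrix of per-tree (regression) predictions:

```r
# Bagged prediction ignores tree t at row r whenever r fell in t's bag;
# the proposed feature ignores tree t at every row:
maskedMean <- function(treePreds, keep) {
  rowMeans(treePreds[, keep, drop = FALSE])
}
```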