suiji / Arborist

Scalable decision tree training and inference.

Feature request: `predict.Rborist` with `nTree` argument #39

Open · hadjipantelis opened this issue 6 years ago

hadjipantelis commented 6 years ago

I have a feature request: would it be possible for `predict.Rborist` to accept an `nTree` argument, similar to the standard `Rborist` call? Given that a forest has N trees, we should be able to produce predictions using only N-M of them.

This functionality can be useful for seeing if/when additional trees lead to over-fitting.
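For concreteness, here is the kind of call the request implies. This is a hypothetical sketch: `predict.Rborist` does not currently accept an `nTree` argument, and the simulated data are only for illustration.

```r
## Hypothetical sketch: the nTree argument to predict.Rborist does not
## exist yet; its name and semantics are the feature being requested.
library(Rborist)

set.seed(1)
x     <- data.frame(matrix(rnorm(1000 * 10), ncol = 10))
y     <- rowSums(x[, 1:3]) + rnorm(1000)
xTest <- data.frame(matrix(rnorm(200 * 10), ncol = 10))
yTest <- rowSums(xTest[, 1:3]) + rnorm(200)

rb <- Rborist(x, y, nTree = 500)   # train once, 500 trees

## Requested: score the same forest using only its first 400 trees.
p400 <- predict(rb, newdata = xTest, nTree = 400)
```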

suiji commented 6 years ago

This could easily be slipped into the upcoming version, but why not just retrain with n-m trees?

Prediction could be parametrized with a logical vector, for example, with 'n-m' entries of 'true' and the remaining 'm' entries 'false'. It seems, though, that these entries would need to be chosen at random, then applied iteratively, in order to get a good sense of over-fitting.
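As a sketch of that resampling loop, assuming a hypothetical `treeMask` logical argument to `predict.Rborist` (the name is an invention here) and continuing the simulated data above:

```r
## Iterated random subsetting: drop m trees at random, predict, repeat.
nTree <- 500
m     <- 100

errs <- replicate(20, {
  keep <- rep(FALSE, nTree)
  keep[sample(nTree, nTree - m)] <- TRUE          # random n-m trees
  p <- predict(rb, newdata = xTest, treeMask = keep)
  mean((p$yPred - yTest)^2)   # assuming the result exposes yPred
})
mean(errs)   # average held-out error over random subsets of size n-m
```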

hadjipantelis commented 6 years ago

Thank you for the speedy response.

Wouldn't retraining a new RF require additional time? (and space)

The use-case I am thinking of is that, just as one can select the number of iterations with a GBM, one could do something similar with an RF. I appreciate that tree order is irrelevant in the case of an RF, so what you describe with the logical vector is probably the ideal scenario, but for a quick check just having the first N-M entries set to TRUE and the final M set to FALSE is probably fine. We are bootstrapping the original data anyway. Just using the first N-M trees will also be faster, because in a large forest we would not have to traverse all the trees and set them to zero.
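Under the same hypothetical `treeMask` interface, that quick check is a one-liner:

```r
## Keep the first n-m trees, drop the last m (hypothetical treeMask).
keep  <- c(rep(TRUE, nTree - m), rep(FALSE, m))
pHead <- predict(rb, newdata = xTest, treeMask = keep)
```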

suiji commented 6 years ago

> Wouldn't retraining a new RF require additional time? (and space)

Assuming a "moderate" number of predictors, it takes (very roughly) about ten times as long to train as to predict. So, yes, resampling from the same forest will be faster than retraining each time. The results will not be identical to those obtained through retraining, but they may be suitable for your purposes. Unless forests are retained following prediction, though, there should not be a memory penalty.
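A rough way to see the difference, again under the hypothetical `treeMask` argument and the simulated data above: retraining at each forest size versus subsetting a single large forest.

```r
## Retrain at each size: pays the ~10x training cost every iteration.
system.time(
  for (k in seq(100, 500, by = 100))
    predict(Rborist(x, y, nTree = k), newdata = xTest)
)

## Reuse one 500-tree forest: prediction cost only.
system.time(
  for (k in seq(100, 500, by = 100))
    predict(rb, newdata = xTest, treeMask = seq_len(nTree) <= k)
)
```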

> just having the first N-M entries set to TRUE and the final M set to FALSE is probably fine.

Yes, but a new feature like this should be sufficiently general to support a variety of use cases.

> Just using the first N-M trees will also be faster, because in a large forest we would not have to traverse all the trees and set them to zero.

I may be missing your point, but initializing an index vector seems like a two- or three-line operation at worst.
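For instance, in R the random mask can be built in a single line:

```r
## One way to initialize the mask: a shuffled block of TRUE/FALSE values.
keep <- sample(c(rep(TRUE, nTree - m), rep(FALSE, m)))
```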

hadjipantelis commented 6 years ago

  1. Cool, we are in agreement on that.
  2. Sure thing! I am mostly thinking about what the most straightforward interface for the user would be.
  3. Agreed, but I was mostly thinking of the overhead of accessing the trees. I assume they are stored sequentially in memory, so we would have "unit-stride" if we just use the first N-M trees rather than random access. Granted, it is a minor point!

suiji commented 6 years ago

> I assume they are stored sequentially in memory, so we would have "unit-stride" if we just use the first N-M trees rather than random access.

Trees are stored sequentially, but their sizes are not uniform, so the stride is not fixed. In any case, bagging already introduces a precedent for ignoring a given tree at a given row. The feature you propose generalizes this a bit when prediction is not bagged, by ignoring a given tree at all rows.
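A toy illustration of that generalization, in plain R rather than Arborist internals: with per-tree predictions held in a rows-by-trees matrix, bagged prediction masks per (row, tree) pair, while the proposed feature masks whole columns.

```r
## Toy sketch, not Arborist internals: 200 rows, 500 trees.
preds <- matrix(rnorm(200 * 500), nrow = 200, ncol = 500)

## Bagging precedent: a per-(row, tree) mask skips, at each row, the
## trees whose bag contained that row.
inBag   <- matrix(sample(c(TRUE, FALSE), 200 * 500, replace = TRUE),
                  nrow = 200)
oobPred <- rowSums(preds * !inBag) / rowSums(!inBag)

## Proposed feature: a per-tree mask skips a tree at all rows.
keep    <- seq_len(500) <= 400      # first 400 trees only
subPred <- rowMeans(preds[, keep])
```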