kogalur / randomForestSRC

DOCUMENTATION:
https://www.randomforestsrc.org/
GNU General Public License v3.0

Cross-validation #254

Closed MoMaz123 closed 2 years ago

MoMaz123 commented 2 years ago

Hi guys, I have read the section on "In-Sample and Out-of-Sample (In-Bag and Out-of-Bag)", but I am still not 100% sure whether this method requires us to create a train/test split or whether it performs the train/test split internally. Thanks, any explanation would be appreciated!

ishwaran commented 2 years ago

No, training/testing splits are not needed with random forests; this is one of their huge advantages. The out-of-bag (OOB) method gives an internal estimator of the error rate. Also, the OOB predicted values that are returned are cross-validated.

From the in-bag/out-of-bag vignette:

The IB ensemble is used for prediction on new data. It is almost never used for inference on the training data.

The OOB ensemble is used for inference on the training data and for obtaining OOB performance values such as the prediction error and variable importance.
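For concreteness, here is a minimal sketch of that division of labor (the train/test split below is purely for illustration, since rfsrc does not require one): predict() on new data draws on the ensemble grown from the in-bag data, while predicted.oob on the grow object gives the cross-validated OOB values for the training data.

## a minimal sketch contrasting the two ensembles; the split is only
## for illustration, rfsrc does not need it
library(randomForestSRC)
set.seed(1)
trn <- sample(nrow(mtcars), 25)

## grow the forest on the training rows
o <- rfsrc(mpg ~ ., data = mtcars[trn, ])

## OOB ensemble: cross-validated predictions for the training data
print(head(o$predicted.oob))

## IB ensemble: prediction on genuinely new data
print(head(predict(o, newdata = mtcars[-trn, ])$predicted))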

Here's an illustration for mtcars showing how the (cross-validated) error rate is obtained from the OOB predictor:

## run mtcars, and print the error rate and other information
library(randomForestSRC)
o <- rfsrc(mpg ~ ., data = mtcars)
print(o)

## we can get the error rate directly from the OOB estimator
print(mean((o$predicted.oob - o$yvar)^2))
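Since the vignette also lists variable importance among the OOB quantities, here is a short follow-up sketch (reusing the grow object o from the code above): VIMP can be requested after the fact with vimp(), or at grow time via importance = TRUE.

## OOB variable importance for the grow object above
print(vimp(o)$importance)

## equivalently, request importance while growing the forest
o2 <- rfsrc(mpg ~ ., data = mtcars, importance = TRUE)
print(o2$importance)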

MoMaz123 commented 2 years ago

Oh, great! Thanks. Also, I was wondering whether I can have a categorical variable among my variables (predictors), or whether they have to be continuous. I ran it with a categorical variable and haven't gotten any error.

ishwaran commented 2 years ago

You can use categorical variables; no one-hot encoding is needed at all! You will see how well the code handles categorical factors in test data.
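As a quick illustration, here is a minimal sketch (converting cyl in mtcars to a factor purely for the example) showing a categorical predictor used directly, with no encoding step:

## a categorical predictor works as-is; no one-hot encoding required
library(randomForestSRC)
d <- mtcars
d$cyl <- factor(d$cyl)   ## make one predictor categorical

o <- rfsrc(mpg ~ ., data = d)
print(o)

## factor levels in test data are handled the same way
print(predict(o, newdata = d[1:5, ])$predicted)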