kogalur / randomForestSRC

DOCUMENTATION:
https://www.randomforestsrc.org/
GNU General Public License v3.0

How to specify the type of forest when the auto recognition fails #358

Open HaloCollider opened 1 year ago

HaloCollider commented 1 year ago

I've been working on a dataset where y is a Boolean variable (for other models we pass a parameter such as family = binomial(link = "probit")), but randomForestSRC fits a regression forest no matter how I change the data type (bool, double, or int). The documentation says that randomForestSRC automagically recognizes the type of forest, but in this case I could not find a way to override it manually. I also couldn't find a way to do so in the traditional randomForest package for R. I am new to R and have spent a lot of time on this, so I hope you can provide a solution or point out my mistake. Thank you so much.

ishwaran commented 1 year ago

From the help file:

Types of forests

There is no need to set the type of forest as the package automagically determines the underlying random forest requested from the type of outcome and the formula supplied. There are several possible scenarios:

Regression forests for continuous outcomes.

Classification forests for factor outcomes.

Therefore, in order for the function to recognize the problem as being classification, the outcome has to be coded as a factor. Something like:

mydata$myoutcome <- factor(mydata$myoutcome)
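As a minimal sketch of the conversion above, assuming a made-up data frame (the columns `x1` and `x2` and the simulated values are hypothetical, only `mydata` and `myoutcome` come from the line above): the outcome starts as numeric 0/1, is recoded as a factor, and the forest is fitted only if randomForestSRC is installed. The `family` element of the fitted object should then report a classification forest rather than a regression forest.

```r
# Hypothetical data; only the names mydata and myoutcome come from the thread.
set.seed(1)
mydata <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  myoutcome = rbinom(100, 1, 0.4)  # numeric 0/1 outcome
)

# A numeric 0/1 column is not a factor, so it is treated as continuous.
is.factor(mydata$myoutcome)

# Recode the outcome as a factor so it is recognized as categorical.
mydata$myoutcome <- factor(mydata$myoutcome)
is.factor(mydata$myoutcome)
levels(mydata$myoutcome)

# Fit only when the package is available; ntree is kept small for speed.
if (requireNamespace("randomForestSRC", quietly = TRUE)) {
  fit <- randomForestSRC::rfsrc(myoutcome ~ ., data = mydata, ntree = 50)
  print(fit$family)  # family of the fitted forest
}
```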

On a side note, a regression tree fit to 0/1 binary values (under mean-squared error splitting, the default) is equivalent to a two-class classification tree (under Gini index splitting). So it doesn't really matter in practice, although I recommend converting the outcome to a factor (as above), because the output will then include values that are normally reported only for classification (such as misclassification error).
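The equivalence can be checked numerically for a single split. This is a sketch in base R on made-up data, not the package's splitting code; `node_sse`, `node_gini`, and `split_cost` are hypothetical helper names. For a 0/1 node with n cases and proportion p of ones, the sum of squares around the mean is n p (1 - p) and the size-weighted Gini impurity is 2 n p (1 - p), so the two criteria differ only by a constant factor and rank candidate cut points identically.

```r
# Toy 0/1 outcome and one numeric predictor (made-up data for illustration).
y <- c(0, 0, 1, 0, 1, 1, 1, 0, 1, 1)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Within-node sum of squares around the node mean (MSE criterion).
node_sse <- function(y) sum((y - mean(y))^2)

# Two-class Gini impurity 2 * p * (1 - p), weighted by node size.
node_gini <- function(y) {
  p <- mean(y)
  length(y) * 2 * p * (1 - p)
}

# Candidate cut points; the largest x is dropped so both children are non-empty.
cuts <- head(sort(unique(x)), -1)

# Total impurity of the two children produced by a cut.
split_cost <- function(cut, impurity) {
  impurity(y[x <= cut]) + impurity(y[x > cut])
}
sse_cost  <- sapply(cuts, split_cost, impurity = node_sse)
gini_cost <- sapply(cuts, split_cost, impurity = node_gini)

# The weighted Gini cost is twice the SSE cost at every cut ...
all.equal(gini_cost, 2 * sse_cost)
# ... so both criteria select the same cut point.
cuts[which.min(sse_cost)] == cuts[which.min(gini_cost)]
```

Other classification rules, such as cross-entropy, are not a constant multiple of the sum of squares, so they can choose different splits.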

HaloCollider commented 1 year ago

Thanks for your explanation. This worked. I think my limited knowledge of R caused the problem. We use the cross-entropy splitting rule for our other models, so there are some differences between MSE splitting and that rule. The RF model now works fine. Again, I am very grateful for your help!

Mosen111 commented 4 months ago

Hello. I'm encountering an issue with an analysis involving a binary outcome variable (0, 1), which I've set as a factor. Despite this, I get an error stating that the outcome must be either real or a factor. Could you please help me understand why this error occurs and how I can resolve it? Thank you in advance for your assistance.

data$d_w_c_r <- factor(data$d_w_c_r)

n <- rfsrc(d_w_c_r ~ ., data = data)
Error in parseFormula(formula, data, ytry) : the y-outcome must be either real or a factor.

In fact, it does not work as numeric either:

data$d_w_c_r <- as.numeric(data$d_w_c_r)
n <- rfsrc(d_w_c_r ~ ., data = data)
Error in parseFormula(formula, data, ytry) : the y-outcome must be either real or a factor.

ishwaran commented 4 months ago

Can you show the output from running the command print(summary(data$d_w_c_r))?

Mosen111 commented 4 months ago

Sure:

is.factor(data$d_w_c_r)
[1] TRUE
print(summary(data$d_w_c_r))
  0   1
462 258

ishwaran commented 4 months ago

Make sure that data is a data.frame:

data <- data.frame(data)
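A quick way to check this before fitting is to inspect `class(data)`. The thread does not say what class the original object had; the sketch below assumes a data.frame subclass (such as a tibble) and mimics one by adding extra class names to a made-up data frame, so it runs in base R without extra packages. The column `x` and all values are hypothetical.

```r
# Made-up stand-in for the poster's object. The extra classes mimic a tibble
# without needing the tibble package; the real object's class is unknown.
data <- data.frame(
  d_w_c_r = factor(c(0, 1, 1, 0)),
  x = c(2.1, 3.4, 1.8, 2.9)
)
class(data) <- c("tbl_df", "tbl", "data.frame")

# Anything listed ahead of "data.frame" means the object is a subclass,
# not a plain data.frame.
class(data)
is_plain <- identical(class(data), "data.frame")

# Coerce to a plain data.frame, as suggested above.
if (!is_plain) {
  data <- data.frame(data)
}

# The object is now a plain data.frame and the outcome is still a factor.
class(data)
is.factor(data$d_w_c_r)
```

Note that `data.frame()` applies its default `check.names = TRUE`, so column names that are not syntactically valid may be altered; `as.data.frame(data)` is an alternative that leaves names unchanged.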

Mosen111 commented 4 months ago

Thank you very much, it worked.