kogalur / randomForestSRC

DOCUMENTATION:
https://www.randomforestsrc.org/
GNU General Public License v3.0

Understanding randomForestSRC behaviour after variable selection and tuning #416

Open candelas762 opened 4 months ago

candelas762 commented 4 months ago

In a similar case to this question, I need to predict species proportions from remotely sensed variables (more than 100 of them). I would like to implement a random forest algorithm for this purpose, and the randomForestSRC package looks promising.

I have training data from field inventories and validation data from an independent dataset. I wanted to test different approaches to fitting a random forest model and then check the accuracy of the predictions.

The first approach is to simply fit a random forest model on the training data and apply it to the test data.

The second approach is to do variable selection (based on this vignette) before fitting the model with the best variables, and finally apply the model to the test data.

The third approach is to do variable selection, fit a model with the best variables, tune the model (based on the manual), and apply it to the test data.

To my surprise, I get exactly the same accuracy for all three approaches, and I am not sure why.

I have made a reproducible example to show this. In the example I build an artificial data frame that I split into train and test sets to mimic my study data.

library(randomForestSRC)
library(yaImpute)

##Creating Fake Data##
set.seed(88) # for reproducibility

number_of_observations = 100

#Response variables#
PSME_BA  <- rnorm(number_of_observations, 50, 15)
TSHE_BA  <- rnorm(number_of_observations, 40, 12)
ALRU2_BA <- rnorm(number_of_observations, 20, 0.5)
Total_BA <- PSME_BA + TSHE_BA + ALRU2_BA

#Predictor variables#
generate_variables <- function(num_vars, num_samples, min_val, max_val) {
  max_val <- rep(max_val, length.out = num_vars) # one maximum per variable
  variable_list <- list()
  for (i in 1:num_vars) {
    variable_list[[paste0("B", i)]] <- runif(num_samples, min_val, max_val[i])
  }
  as.data.frame(variable_list)
}
max_values <- c(2000, 1800, 3000, 5500, 3800, 5900, rep(10000, 24))
variables  <- generate_variables(30, number_of_observations, 0, max_values)

#Dataset for modeling#
DF<-data.frame(PSME=PSME_BA/Total_BA, TSHE=TSHE_BA/Total_BA, ALRU2=ALRU2_BA/Total_BA, # proportions
               variables)

# use 80% of dataset as training set and 20% as test set
sample <- sample(c(TRUE, FALSE), nrow(DF), replace = TRUE, prob = c(0.8, 0.2))
train  <- DF[sample, ]
test   <- DF[!sample, ]
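# (Optional: check the realized split sizes.)
cat(nrow(train), "train /", nrow(test), "test observations\n")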

## parse training data into y and x data
ydta = train[,c("PSME", "TSHE", "ALRU2")]
xdta = train[,-c(1:3)]

#--------------------------------------
#
# No variable selection
#
#--------------------------------------

rf_model <- rfsrc(get.mv.formula(colnames(ydta)),
                  train,
                  importance = 'permute', nsplit = 10)

# predict on the test data
pred1 <- predict(rf_model, test, na.action = "na.impute")
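# (Optional sanity check, not part of the comparison: standardized OOB error
# per response; get.mv.error() is the randomForestSRC utility for this.)
print(get.mv.error(rf_model, standardize = TRUE))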

#--------------------------------------
#
# With variable selection
#
#--------------------------------------

## default composite (independence) splitting
obj <- rfsrc(get.mv.formula(colnames(ydta)),
             train,
             importance=TRUE, nsplit = 10)

imp <- data.frame(default = rowMeans(get.mv.vimp(obj, standardize = TRUE)))
imp$var <- row.names(imp)
print(imp[order(imp$default, decreasing = TRUE)[1:25], ])

best25var = imp[order(imp$default, decreasing = TRUE)[1:25], ]$var
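# (Note: this fake data has only 30 predictors, so keeping the top 25 removes
# just 5 of them; with the real data's 100+ variables the selection is much
# more aggressive.)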

# Make model
rf_model <- rfsrc(get.mv.formula(colnames(ydta)),
                  train[,c(colnames(ydta),best25var)],
                  importance = 'permute', nsplit = 10)

# predict on the test data
pred2 <- predict(rf_model, test, na.action = "na.impute")

#--------------------------------------
#
# With variable selection and tuning
#
#--------------------------------------

## default composite (independence) splitting
obj <- rfsrc(get.mv.formula(colnames(ydta)),
             train,
             importance=TRUE, nsplit = 10)

imp <- data.frame(default = rowMeans(get.mv.vimp(obj, standardize = TRUE)))
imp$var <- row.names(imp)
print(imp[order(imp$default, decreasing = TRUE)[1:25], ])

best25var = imp[order(imp$default, decreasing = TRUE)[1:25], ]$var

# Search for the best nodesize and mtry
nodesize <- seq(1,20,2)

model <- tune(get.mv.formula(colnames(ydta)),
              data = train[, c(colnames(ydta), best25var)],
              mtryStart = 2,
              nodesizeTry = nodesize, ntreeTry = 100,
              block.size = 1, importance = TRUE, seed = 40)
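# (Optional: inspect the selected parameters before refitting; model$optimal
# is a named vector with the chosen nodesize and mtry.)
print(model$optimal)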

# Make model with the selected variables and the tuned parameters
rf_model <- rfsrc(get.mv.formula(colnames(ydta)),
                  data = train[, c(colnames(ydta), best25var)],
                  nodesize = unname(model$optimal["nodesize"]),
                  mtry = unname(model$optimal["mtry"]))
# predict on the test data
pred3 <- predict(rf_model, test, na.action = "na.impute")

##
# compare predictions using yaImpute::rmsd()
##
# rmsd() needs observed (".o") and predicted values
p1 = data.frame(PSME.o = test$PSME,
                TSHE.o = test$TSHE,
                ALRU2.o = test$ALRU2,

                PSME = pred1$regrOutput$PSME$predicted,
                TSHE = pred1$regrOutput$TSHE$predicted,
                ALRU2 = pred1$regrOutput$ALRU2$predicted)

p2 = data.frame(PSME.o = test$PSME,
                TSHE.o = test$TSHE,
                ALRU2.o = test$ALRU2,

                PSME = pred2$regrOutput$PSME$predicted,
                TSHE = pred2$regrOutput$TSHE$predicted,
                ALRU2 = pred2$regrOutput$ALRU2$predicted)

p3 = data.frame(PSME.o = test$PSME,
                TSHE.o = test$TSHE,
                ALRU2.o = test$ALRU2,

                PSME = pred3$regrOutput$PSME$predicted,
                TSHE = pred3$regrOutput$TSHE$predicted,
                ALRU2 = pred3$regrOutput$ALRU2$predicted)

# Calculate RMSD for each approach and assemble a data frame
RMSD = data.frame(
          Novarsel = unname(round(yaImpute::rmsd(p1),2)),
          Varsel = unname(round(yaImpute::rmsd(p2),2)),
          Varsel_tune = unname(round(yaImpute::rmsd(p3),2))
        )

print(RMSD)
>       Novarsel Varsel Varsel_tune
> PSME      0.09   0.09        0.09
> TSHE      0.09   0.09        0.09
> ALRU2     0.04   0.04        0.04
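One thing I am not sure about: since I round the RMSD values to two decimal places, small differences between the approaches could be hidden. A quick check (just a sketch, using only the objects created above) would be to compare the unrounded values and the raw predictions directly:

print(yaImpute::rmsd(p1))
print(yaImpute::rmsd(p2))
print(yaImpute::rmsd(p3))

# do approaches 1 and 2 produce literally identical predictions, or just very close ones?
all.equal(pred1$regrOutput$PSME$predicted,
          pred2$regrOutput$PSME$predicted)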