In a similar case as this question, I have to predict species proportions from some remotely sensed variables (more than 100 variables). I would like to implement a random forest algorithm for such purpose and the package randomForestSRC looks promisisng.
I have training data from field inventories and validation data from an independent dataset. I wanted to test different approaches of fitting a random forest model and then check for the accuracy of the predictions.
First approach is to simply fit a random forest model from the training data and apply it to the test data.
Second approach is to do a variable selection (based on this vignette) before fitting the model using the best variables and finally apply the model to the test data.
Third approach is to do a variable selection and fit a model with the best variables, tune the model (based on the manual) and apply it to the test data.
My surprise is that I get exacly the same accuracy for the three approaches and I am not sure what is the reason.
I have made a reproducible example to show this. In the example I make an artificial dataframe that I divide into train and test data to mimic my study data.
library(randomForestSRC)
library(yaImpute)
##Creating Fake Data##
set.seed(88)#For reproducibility
number_of_observations = 100
#Response variables#
PSME_BA <-rnorm(number_of_observations,50, 15)
TSHE_BA <-rnorm(number_of_observations,40,12)
ALRU2_BA<-rnorm(number_of_observations,20,0.5)
Total_BA<-PSME_BA+TSHE_BA+ALRU2_BA
#Predictor variables#
generate_variables <- function(num_vars, num_samples, min_val, max_val) {
variable_list <- list()
for (i in 1:num_vars) {
variable_list[[paste0("B", i)]] <- runif(num_samples, min_val, max_val)
}
return(as.data.frame(variable_list))
}
variables <- generate_variables(30, number_of_observations, 0, max_values <- c(2000, 1800, 3000, 5500, 3800, 5900, rep(10000, 24)))
#Dataset for modeling#
DF<-data.frame(PSME=PSME_BA/Total_BA, TSHE=TSHE_BA/Total_BA, ALRU2=ALRU2_BA/Total_BA, # proportions
variables)
#use 80% of dataset as training set and 20% as test set
sample <- sample(c(TRUE, FALSE), nrow(DF), replace=TRUE, prob=c(0.8,0.3))
train <- DF[sample, ]
test <- DF[!sample, ]
## parse training data into y and x data
ydta = train[,c("PSME", "TSHE", "ALRU2")]
xdta = train[,-c(1:3)]
#--------------------------------------
#
# No variable selection
#
#--------------------------------------
rf_model <- rfsrc(get.mv.formula(colnames(ydta)),
train,
importance = 'permute', nsplit = 10)
# predict all the random forest data
pred1 <- predict(rf_model, test, na.action= "na.impute")
#--------------------------------------
#
# With variable selection
#
#--------------------------------------
## default composite (independence) splitting
obj <- rfsrc(get.mv.formula(colnames(ydta)),
train,
importance=TRUE, nsplit = 10)
imp <- data.frame(default = rowMeans(get.mv.vimp(obj, standardize = TRUE)))
imp$var <- row.names(imp)
print(imp[order(imp$default, decreasing = TRUE)[1:25], ])
best25var = imp[order(imp$default, decreasing = TRUE)[1:25], ]$var
# Make model
rf_model <- rfsrc(get.mv.formula(colnames(ydta)),
train[,c(colnames(ydta),best25var)],
importance = 'permute', nsplit = 10)
# predict all the random forest data
pred2 <- predict(rf_model, test, na.action= "na.impute")
#--------------------------------------
#
# With variable selection and tuning
#
#--------------------------------------
## default composite (independence) splitting
obj <- rfsrc(get.mv.formula(colnames(ydta)),
train,
importance=TRUE, nsplit = 10)
imp <- data.frame(default = rowMeans(get.mv.vimp(obj, standardize = TRUE)))
imp$var <- row.names(imp)
print(imp[order(imp$default, decreasing = TRUE)[1:25], ])
best25var = imp[order(imp$default, decreasing = TRUE)[1:25], ]$var
# Look for best nodesize and trim
nodesize <- seq(1,20,2)
model <- tune(get.mv.formula(colnames(ydta)), data = train,
mtryStart = 2,
nodesizeTry= nodesize, ntreeTry=100,
blocksize=1, importance=TRUE, seed=40)
# Make model
rf_model <- rfsrc(get.mv.formula(colnames(ydta)), data = train,
nodesize = unname(model$optimal["nodesize"]), mtry = unname(model$optimal["mtry"]),
)
# predict all the random forest data
pred3 <- predict(rf_model, test, na.action= "na.impute")
##
# compare predictions using yaImpute::rmsd()
##
# rmsd() needs observed (".o") and predicted values
p1 = data.frame(PSME.o = test$PSME,
TSHE.o = test$TSHE,
ALRU2.o = test$ALRU2,
PSME = pred1$regrOutput$PSME$predicted,
TSHE = pred1$regrOutput$TSHE$predicted,
ALRU2 = pred1$regrOutput$ALRU2$predicted)
p2 = data.frame(PSME.o = test$PSME,
TSHE.o = test$TSHE,
ALRU2.o = test$ALRU2,
PSME = pred2$regrOutput$PSME$predicted,
TSHE = pred2$regrOutput$TSHE$predicted,
ALRU2 = pred2$regrOutput$ALRU2$predicted)
p3 = data.frame(PSME.o = test$PSME,
TSHE.o = test$TSHE,
ALRU2.o = test$ALRU2,
PSME = pred3$regrOutput$PSME$predicted,
TSHE = pred3$regrOutput$TSHE$predicted,
ALRU2 = pred3$regrOutput$ALRU2$predicted)
# Calculate rmsd and make dataframe
RMSD = data.frame(
Novarsel = unname(round(yaImpute::rmsd(p1),2)),
Varsel = unname(round(yaImpute::rmsd(p2),2)),
Varsel_tune = unname(round(yaImpute::rmsd(p3),2))
)
print(RMSD)
> print(RMSD)
> Novarsel Varsel Varsel_tune
> PSME 0.09 0.09 0.09
> TSHE 0.09 0.09 0.09
> ALRU2 0.04 0.04 0.04
In a similar case as this question, I have to predict species proportions from some remotely sensed variables (more than 100 variables). I would like to implement a random forest algorithm for such purpose and the package
randomForestSRC
looks promisisng.I have training data from field inventories and validation data from an independent dataset. I wanted to test different approaches of fitting a random forest model and then check for the accuracy of the predictions.
First approach is to simply fit a random forest model from the training data and apply it to the test data.
Second approach is to do a variable selection (based on this vignette) before fitting the model using the best variables and finally apply the model to the test data.
Third approach is to do a variable selection and fit a model with the best variables, tune the model (based on the manual) and apply it to the test data.
My surprise is that I get exacly the same accuracy for the three approaches and I am not sure what is the reason.
I have made a reproducible example to show this. In the example I make an artificial dataframe that I divide into train and test data to mimic my study data.