rvalavi / blockCV

The blockCV package creates spatially or environmentally separated training and testing folds for cross-validation to provide a robust error estimation in spatially structured environments. See
https://doi.org/10.1111/2041-210X.13107
GNU General Public License v3.0
106 stars 22 forks source link

Issue while performing blockCV and then training and testing the Random Forest and QRF for soil mapping #36

Closed rahulkgour closed 1 year ago

rahulkgour commented 1 year ago

Dear @rvalavi,

I have successfully executed the steps mentioned below without encountering any errors. However, I am facing issues with the modeling phase as the expected results are not being obtained.

I am unable to debug the issue especially while using the spatially blocked cross-validations which were created as mentioned in the below code.

May be the way I am creating the spatial_blocks or feeding them during training and testing the model is not correct.

Therefore, I would greatly appreciate your assistance in resolving it. I am intending to use blockCV for my publication.

# Load required packages
library(blockCV)
library(randomForest)
library(ggplot2)

# Reading the SOC (response variable) and covariates
soc_data <- read.csv(file.choose(), header = TRUE)

data <- sf::st_as_sf(soc_data, coords = c("Longitude", "Latitude"), crs = 4326)

# Reading the raster background map for blockCV
elevation <- raster("elevation.tif")

# Create spatial blocks using blockCV package for spatial cross-validation
spatial_blocks <- list()
for (j in 1:20) {  # j determines number of repetitions
  set.seed(j)
            sb <-   cv_spatial(x = data,     # data
                     column = "SOC",         # response variable (variable of interest)
                     r = elevation,          # raster background map
                     size = 1000,            # size of the blocks in meters
                     k = 10,                 # number of folds
                     selection = "random",   # find evenly dispersed folds
                     iteration = 100,        
                     biomod2 = TRUE,
                     raster_colors = terrain.colors(10, rev = TRUE))
  for (k in 1:5) {  # k = 5 -> for each fold
    ind <- sb[[3]][, k]  # [[3]] access biomodtable; [,k] access fold k
    ind_train <- which(ind == TRUE)  # extract indices for training data
    spatial_blocks[[(j - 1) * 5 + k]] <- ind_train  # create a list with j*k spatial blocks with indices of training folds
  }
}

# Train and test the Random Forest model using spatially blocked cross-validation
RF_results <- list()
for (i in 1:length(spatial_blocks)) {
  # Get the training and testing data for this fold
  test_indices <- spatial_blocks[[i]]
  test_data <- soc_data[test_indices, ]
  train_data <- soc_data[-test_indices, ]

  # Fitting the Random Forest model on the training data
  RF_model <- randomForest(SOC ~ ., data = train_data, ntree = 500)

  # Making predictions on the testing data
  RF_predictions <- predict(RF_model, newdata = test_data)

  # Calculating the performance metrics
  RF_MSE <- mean((test_data$SOC - RF_predictions)^2)
  RF_R2 <- cor(test_data$SOC, RF_predictions)^2

  # Save the results for this fold
  RF_results[[i]] <- list(MSE = RF_MSE, R2 = RF_R2)
}

# Combine the results from all folds
RF_MSE <- sapply(RF_results, function(x) x$MSE)
RF_R2 <- sapply(RF_results, function(x) x$R2)
RF_avg_MSE <- mean(RF_MSE)
RF_avg_R2 <- mean(RF_R2)

# Create scatterplot of predicted vs observed (estimated) SOC values for RF model
RFvalues<-data.frame(obs=test_data$SOC, pred=RF_predictions, res=RF_predictions-test_data$SOC)
ggplot(RFvalues, aes(x = obs, y = pred)) +
  geom_point(size = 3, shape = 21, color = "black", fill = "blue") + 
  stat_cor(p.accuracy = 0.001, r.accuracy = 0.01)+  stat_smooth(method=lm)+
  labs(y = "Predicted SOC", x = "Estimated SOC") + theme(plot.title = element_text(hjust = 0.5, size = 18), 
                                                         axis.title = element_text(size = 15), 
                                                         axis.title.y = element_text(angle = 0, vjust = 0.5), 
                                                         axis.text = element_text(size = 11))
+ theme_classic() # add this line to the above code snippet if you want background as plain.

# Create uncertainty map for RF model using the predict function
rf_uncertainty <- predict(RF_model, newdata = test_data, type = "response")
rf_uncertainty_df <- data.frame(test_data$Longitude, test_data$Latitude, rf_uncertainty)
colnames(rf_uncertainty_df) <- c("lon", "lat", "uncertainty")
ggplot() +
  geom_tile(data = rf_uncertainty_df, aes(x = lon, y = lat, fill = uncertainty)) +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Random Forest Model Uncertainty Map")

# Create a map of predicted SOC values using the ggplot2 library
pred_df <- data.frame(test_data$Longitude, test_data$Latitude, RF_predictions)
colnames(pred_df) <- c("lon", "lat", "pred")
ggplot() +
  geom_tile(data = pred_df, aes(x = lon, y = lat, fill = pred)) +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Random Forest Model Predicted SOC Map")

Thanks in advance!

Kind regards, Rahul

rahulkgour commented 1 year ago

Dear @rvalavi,

I have attached the snapshot of my data which is being used for the given code.

Screenshot 2023-05-04 at 15 46 49

Please let me know if anything further required. Thanks, Rahul

rvalavi commented 1 year ago

Hi @rahulkgour, just one question. Do you try to run multiple repeats of k-fold cross-validation? e.g. 10 times of 5-fold CV, resulting in fitting your model 50 times.

rahulkgour commented 1 year ago

Hi @rvalavi, Yes, we are trying to run multiple repeats of k-fold CV.

rvalavi commented 1 year ago

@rahulkgour I'm not sure what you have done would work. I think you need something more like the following. This gives you a dataframe with the number of folds and repeats.

Just remember if you choose the iteration too high and there are not many blocks in your landscape, the final folds could be very similar (not much variability). 100 should be fine, but I'm not aware of the number of blocks that are created for you.

# Load required packages
library(blockCV)
library(randomForest)
library(ggplot2)

# Reading the SOC (response variable) and covariates
soc_data <- read.csv(file.choose(), header = TRUE)

data <- sf::st_as_sf(soc_data, coords = c("Longitude", "Latitude"), crs = 4326)

# Reading the raster background map for blockCV
elevation <- raster("elevation.tif")

n_rep <- 20
n_fold <- 10
# Create spatial blocks using blockCV package for spatial cross-validation
model_eval <- data.frame(rep = rep(NA, n_rep*n_fold), fold = NA MSE = NA, R2 = NA)
n <- 0
for (j in 1:n_rep) {  # j determines number of repetitions
  set.seed(j)
  sb <- cv_spatial(x = data,     # data
                   column = "SOC",         # response variable (variable of interest)
                   r = elevation,          # raster background map
                   size = 1000,            # size of the blocks in meters
                   k = n_fold,                 # number of folds
                   selection = "random",   # find evenly dispersed folds
                   iteration = 100,        
                   # biomod2 = TRUE,
                   raster_colors = terrain.colors(10, rev = TRUE))

  for(i in 1:n_fold){
    n <- n + 1

    test_set <- which(sb$folds_ids == i)
    train_set <- which(sb$folds_ids != i)

    test_data <- soc_data[test_set, ]
    train_data <- soc_data[train_set, ]

    # Fitting the Random Forest model on the training data
    RF_model <- randomForest(SOC ~ ., data = train_data, ntree = 500)

    # Making predictions on the testing data
    RF_predictions <- predict(RF_model, newdata = test_data)

    # Calculating the performance metrics
    RF_MSE <- mean((test_data$SOC - RF_predictions)^2)
    RF_R2 <- cor(test_data$SOC, RF_predictions)^2

    # Save the results for this fold
    model_eval$rep[n] <- j
    model_eval$fold[n] <- i
    model_eval$MSE[n] <- RF_MSE
    model_eval$R2[n] <- RF_R2
  }

}
rahulkgour commented 1 year ago

Dear @rvalavi,

Thank you for your kind response. I tried your code snippet as well, still I can see the performance little skewed. However, when I am trying with random 10 Fold Cross Validation through usual means (not implying blockCV) while training the model, I am getting the expected validation parameters.

Regards, Rahul

rvalavi commented 1 year ago

What do mean by "skewed" @rahulkgour ?

rvalavi commented 1 year ago

BTW, now I see your response is a continuous variable, you should not use column for your cv_spatial. That is only for categorical and binary data. Just remove that argument (in your case column = "SOC" must be removed).

rvalavi commented 1 year ago

I'll add a warning in the code that warns you about using the column argument when you have continuous data.

rahulkgour commented 1 year ago

Dear @rvalavi,

Sorry for the late reply.

I tweaked back my code again, and seems working as expected. I am really thankful for your assistance.

Yes, adding a warning in the code about the usage of column for continuous data would be really helpful for the users.

I think we can close this ticket.

Thanks again, Rahul

rvalavi commented 1 year ago

Perfect! I'm glad that was helpful @rahulkgour.