topepo / caret

caret (Classification And Regression Training) is an R package containing miscellaneous functions for training and plotting classification and regression models.
http://topepo.github.io/caret/index.html

Holdout results different for caret and manual k-folds #1193

Closed Drew2019 closed 3 years ago

Drew2019 commented 3 years ago

I was comparing various resampling methods in caret and I'm a little thrown off by the cross-validation results for "lm" when using k-fold cross-validation. Across datasets and seeds, I'm finding much higher cross-validation model performance in caret than when I (a) manually create my own folds, (b) use LOOCV in caret, and (c) use boot in caret. Like I said, this is occurring across 5-6 datasets and irrespective of seed. I produce an example without seeds below. Perhaps I'm retrieving the cross-validation results incorrectly?

library(caret)

fit <- train(Sepal.Width ~ ., method = "lm", data = iris, trControl = trainControl(method = "repeatedcv", number=10, repeats=10))
fit

fit <- train(Sepal.Width ~ ., method = "lm", data = iris, trControl = trainControl(method = "LOOCV"))
fit

fit <- train(Sepal.Width ~ ., method = "lm", data = iris, trControl = trainControl(method = "boot", number=1000))
fit

### just quickly doing my own k-folds ###
# create folds
library(psych)   # corr.test() below comes from the psych package
iris <- iris[sample(nrow(iris)), ]
folds <- cut(seq(1, nrow(iris)), breaks = 10, labels = FALSE)

results <- data.frame(matrix(NA, nrow = 0, ncol = 1))  # store results
# perform 10-fold cross-validation
for (i in 1:10) {
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData <- iris[testIndexes, ]
  trainData <- iris[-testIndexes, ]
  print(nrow(trainData))
  print(nrow(testData))

  OLS <- lm(Sepal.Width ~ ., data = trainData)
  Predicted <- as.data.frame(predict(OLS, newdata = testData))
  results <- rbind(results, corr.test(cbind(dplyr::select(testData, Sepal.Width), Predicted))$r[2, 1])
}
mean(results[, 1])^2

Whenever I've used caret, I've always relied on my own outer loop and I've never used it for "lm" before this. Thus, I was pretty shocked when I found these results. Am I simply interpreting the output incorrectly, or is there an issue?

AndrewLawrence commented 3 years ago

Afraid I couldn't replicate. I needed to alter the code to get it to run without error, and when I did I got close enough R2 values from caret and from your method (in some runs caret was slightly better, in some runs yours was). Here's a reprex:

library(caret)
require(dplyr)

R2results <- c(kfold = NA, LOOCV = NA, boot = NA, manualkfold = NA)

fit <- train(Sepal.Width ~ ., method = "lm", data = iris, trControl = trainControl(method = "repeatedcv", number=10, repeats=10))
R2results[1] <- fit$results$Rsquared
fit <- train(Sepal.Width ~ ., method = "lm", data = iris, trControl = trainControl(method = "LOOCV"))
R2results[2] <- fit$results$Rsquared
fit <- train(Sepal.Width ~ ., method = "lm", data = iris, trControl = trainControl(method = "boot", number=1000))
R2results[3] <- fit$results$Rsquared

### just quickly doing my own k folds ###
#create folds#
iris <-iris[sample(nrow(iris)),]
folds <- cut(seq(1,nrow(iris)),breaks=10,labels=FALSE)

results <- as.data.frame(matrix(NA, nrow = 0, ncol = 1))  #store results
#Perform 10 fold cross validation
for(i in 1:10){
  testIndexes <- which(folds==i,arr.ind=TRUE)
  testData <- iris[testIndexes, ]
  trainData <- iris[-testIndexes, ]
  OLS <- lm(Sepal.Width ~ ., data=trainData)
  Predicted <- predict(OLS, newdata = testData)
  results <- rbind(results,
                   cor(dplyr::pull(testData, Sepal.Width), Predicted))
}
R2results[4] <- mean(results[,1])^2

print(R2results)
#>       kfold       LOOCV        boot manualkfold 
#>   0.6383806   0.6026316   0.6002337   0.5821995

Created on 2021-01-23 by the reprex package (v0.3.0)

Drew2019 commented 3 years ago

Well, you actually did reproduce what I was talking about. Here is a more robust example. The code below runs caret 500 times, and it runs a manual k-fold 500 times. The mean R^2 is .619 for the manual approach and .637 for caret. I've found this with every dataset. Perhaps it is some user error on my part, but I find the results concerning.

### Loop Caret ###
results_caret <- data.frame(matrix(NA, nrow = 0, ncol = 1))  #store results
for (i in 1:500) {
fit <- train(Sepal.Width ~ ., method = "lm", data = iris, trControl = trainControl(method = "repeatedcv", number=10, repeats=1))
results_caret <- rbind (results_caret, fit$results$Rsquared)
}
mean (results_caret[,1])

### Resampled Loop Manual ###
results_outer <- data.frame(matrix(NA, nrow = 0, ncol = 1)) 
for (loop in 1:500) {
  results <- data.frame(matrix(NA, nrow = 0, ncol = 1)) 
  iris <-iris[sample(nrow(iris)),]
  folds <- cut(seq(1,nrow(iris)),breaks=10,labels=FALSE)

    #10 fold
    for(i in 1:10){
      testIndexes <- which(folds==i,arr.ind=TRUE)
      testData <- iris[testIndexes, ]
      trainData <- iris[-testIndexes, ]

      OLS <- lm (Sepal.Width ~ ., data=trainData)
      Predicted <- as.data.frame (predict (OLS, newdata = testData))
      results <- rbind (results, corr.test(cbind (dplyr::select(testData, Sepal.Width), Predicted))$r[2,1])
    }
  print (results)
  results_outer <- rbind (results_outer, mean (results[,1]))
}
results_outer
mean (results_outer[,1])^2

AndrewLawrence commented 3 years ago

I don't think your implementation is correct.

results_outer is being incremented in the inner loop, so it contains 5000 values instead of 500 (one per loop). At the end you look at mean (results[,1])^2, when results contains the 10 values from the last run of the inner loop.

Also, you take the average of the R values and then square that average; this won't give quite the same result as taking the average of the R^2 values: mean((1:10 / 10))^2; mean((1:10 / 10)^2)
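For example, evaluating those two expressions directly:

r <- (1:10) / 10
mean(r)^2   # square of the average correlation: 0.3025
mean(r^2)   # average of the squared correlations: 0.385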

There's also a wider question of whether to stack the predictions and take the R2 over all folds: https://stats.stackexchange.com/questions/129937/how-to-compute-r-squared-value-when-doing-cross-validation Or to use Frank Harrell's approach: https://stats.stackexchange.com/a/186418
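A minimal sketch of those two aggregation strategies on the same folds (illustrative only: plain lm() on iris with uniform folds, nothing taken from caret's internals):

# per-fold: compute R^2 within each fold, then average
# pooled:   stack all out-of-fold predictions, then compute one R^2
set.seed(1)
fold_id <- sample(rep(1:10, length.out = nrow(iris)))
obs <- pred <- numeric(nrow(iris))
per_fold_r2 <- numeric(10)
for (k in 1:10) {
  test <- which(fold_id == k)
  mod <- lm(Sepal.Width ~ ., data = iris[-test, ])
  pred[test] <- predict(mod, iris[test, ])
  obs[test] <- iris$Sepal.Width[test]
  per_fold_r2[k] <- cor(obs[test], pred[test])^2
}
mean(per_fold_r2)   # average of the per-fold R^2 values
cor(obs, pred)^2    # one R^2 over the stacked predictions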

I don't know which way caret falls on this issue.

Drew2019 commented 3 years ago

Fixed the loop issue (thanks), with the same takeaway. The question of why the results differ persists. Thanks for the links about aggregating the results.

AndrewLawrence commented 3 years ago

Great, it's really interesting that the difference persists. I'll take a look at the fixed code.

I've found a relevant note in the documentation about the R2 calculation:

A note about how R2 is calculated by caret: it takes the straightforward approach of computing the correlation between the observed and predicted values (i.e. R) and squaring the value. When the model is poor, this can lead to differences between this estimator and the more widely known estimate derived from linear regression models. Most notably, the correlation approach will not generate negative values of R2 (which are theoretically invalid). A comparison of these and other estimators can be found in Kvalseth 1985.

This is how your code does it too, so the difference is not that caret uses Frank Harrell's approach from the link above.
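To make the distinction concrete, here's a quick sketch on an arbitrary mtcars holdout split (the split and the model are just for illustration); postResample() is the caret helper that reports the correlation-based value:

obs  <- mtcars$mpg[21:32]
pred <- predict(lm(mpg ~ wt + hp, data = mtcars[1:20, ]), mtcars[21:32, ])

cor(obs, pred)^2                                     # correlation-based R^2 (what caret reports)
1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)   # regression-style estimator; can go negative
caret::postResample(pred, obs)["Rsquared"]           # matches the correlation-based value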

AndrewLawrence commented 3 years ago

The difference appears to be the stratified sampling in the k-fold. If I force uniform sampling, I get a mean R2 over 500 reps of ~0.617; if I use caret's stratified sampling, I get ~0.635. That is approximately the gap in your results, right?

As I understand it, caret::train uses the caret::createMultiFolds function (which is just a repeated call to caret::createFolds) to do repeated k-fold CV. This function stratifies the data based on percentiles of the numeric outcome (here Sepal.Width). Your method assigns observations to the folds uniformly.

The relevant part of documentation for createMultiFolds:

For numeric y, the sample is split into groups based on percentiles and sampling is done within these subgroups. For createDataPartition, the number of percentiles is set via the groups argument. For createFolds and createMultiFolds, the number of groups is set dynamically based on the sample size and k. For smaller sample sizes, these two functions may not do stratified splitting and, at most, will split the data into quartiles.

Here's the code I used:

library(caret)
set.seed(42)
loops <- paste0("loop", 1:500)
kfold_for_iris <- function(k = 10, stratified = TRUE) {
  stratifying_y <- iris$Sepal.Width
  if ( ! stratified ) {
    stratifying_y <- rep("A", NROW(iris))
  }
  train_idx_list <- caret::createMultiFolds(y = stratifying_y,
                                            k = k,
                                            times = 1)
  inner_result <- sapply(train_idx_list,
                         function(train) {

                           train_data <- iris[train,]
                           test_data <- iris[-train,]

                           mod <- lm(Sepal.Width ~ ., data = train_data)

                           cor(test_data$Sepal.Width, predict(mod, newdata = test_data))^2
                         })
  return(mean(inner_result))
}
result_uniform <- sapply(loops, function(i) kfold_for_iris(k = 10, stratified = FALSE))
result_stratified <- sapply(loops, function(i) kfold_for_iris(k = 10, stratified = TRUE))

mean(result_uniform)
#> [1] 0.6173559
mean(result_stratified)
#> [1] 0.6346681

Created on 2021-01-25 by the reprex package (v0.3.0)
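A quick way to see the stratification itself is to compare how spread out the outcome means are across folds under the two schemes (a rough check, with an arbitrary seed and a hand-rolled uniform split):

set.seed(42)
strat_folds <- caret::createFolds(iris$Sepal.Width, k = 10)   # percentile-stratified held-out indices
unif_folds  <- split(sample(seq_len(nrow(iris))), rep(1:10, length.out = nrow(iris)))

sd(sapply(strat_folds, function(idx) mean(iris$Sepal.Width[idx])))   # fold means are more homogeneous
sd(sapply(unif_folds,  function(idx) mean(iris$Sepal.Width[idx])))   # fold means vary more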

Drew2019 commented 3 years ago

Well, that's quite interesting. I had no idea that caret stratifies automatically. Perhaps that does explain the difference between the manual k-fold and the caret results. However, my original inquiry stemmed from the differences between LOOCV and bootstrap, and from how these were lower than k-fold in caret. My example used the iris dataset, but I originally uncovered this while working with other data. What you discovered is indeed helpful, but I would assume that because LOOCV has a larger training N, it should exhibit better performance than k-fold.

AndrewLawrence commented 3 years ago

Ah, OK. LOOCV and (in caret at least) bootstrap don't use stratified sampling, so you would need to compare these against unstratified k-fold rather than the default caret stratified k-fold for a fair comparison.

I would assume that because LOOCV has a larger training N, it should exhibit better performance than k-folds.

Just a thought, but while LOOCV is less pessimistically biased than k-fold (because it uses more training data), it is also a higher-variance estimator (the bias-variance trade-off), so you might need to run this on multiple simulated datasets to rule out an effect due to variance.
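If you want that comparison entirely inside caret, one option (a sketch, using the index argument of trainControl()) is to build uniform folds yourself and pass the training indices in:

library(caret)
set.seed(1)
fold_id <- sample(rep(1:10, length.out = nrow(iris)))        # uniform fold assignment, no stratification
train_idx <- lapply(1:10, function(k) which(fold_id != k))   # training rows for each resample

fit_unif <- train(Sepal.Width ~ ., data = iris, method = "lm",
                  trControl = trainControl(method = "cv", index = train_idx))
fit_loo  <- train(Sepal.Width ~ ., data = iris, method = "lm",
                  trControl = trainControl(method = "LOOCV"))

fit_unif$results$Rsquared
fit_loo$results$Rsquared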

topepo commented 3 years ago

Maybe this will help:

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2

# To demonstrate, manually create the folds
set.seed(1)
folds <- createFolds(mtcars$mpg, k = 5, returnTrain = TRUE)

train_res <- train(mpg ~ ., data = mtcars, method = "lm", 
                   trControl = trainControl(method = "cv", index = folds,
                                            savePredictions = TRUE))
train_res
#> Linear Regression 
#> 
#> 32 samples
#> 10 predictors
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 26, 26, 26, 25, 25 
#> Resampling results:
#> 
#>   RMSE      Rsquared   MAE     
#>   3.552056  0.7170875  2.999363
#> 
#> Tuning parameter 'intercept' was held constant at a value of TRUE

# Now reproduce the results: 
for (i in seq_along(folds)) {
  for_model <- folds[[i]]
  for_perf <- (1:nrow(mtcars))[-unique(for_model)]
  mod_fit <- lm(mpg ~ ., data = mtcars[unique(for_model),])
  mod_pred <- predict(mod_fit, mtcars[-unique(for_model),])
  mod_rmse <- RMSE(mod_pred, mtcars$mpg[-unique(for_model)])
  cat(
    "manual RMSE:", mod_rmse,
    "train() RMSE:", train_res$resample$RMSE[i],
    "\n")
}
#> manual RMSE: 3.115265 train() RMSE: 3.115265 
#> manual RMSE: 2.544146 train() RMSE: 2.544146 
#> manual RMSE: 4.645231 train() RMSE: 4.645231 
#> manual RMSE: 3.810561 train() RMSE: 3.810561 
#> manual RMSE: 3.645076 train() RMSE: 3.645076

Created on 2021-02-09 by the reprex package (v1.0.0.9000)