topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Training a Linear Regression seems much slower with caret than with lm #1277

Open GFabien opened 2 years ago

GFabien commented 2 years ago

Hey! Thanks for the great package! I am using caret so that I can fit a wide range of models through a single interface, and it makes this really easy. However, I noticed that fitting a linear regression with caret's train function is much slower than calling stats::lm directly (a median of about 165 ms versus about 0.9 ms in the benchmark reported below). Is there anything I'm missing in how I set up the training with caret? In this case I don't want to tune any parameters or split my data.

Thank you for your help!

Minimal, runnable code:

library(microbenchmark)
data("iris")

X <- iris[, -5]

base_lm <- function() {
  stats::lm(Petal.Width ~ ., data = X)
}
caret_lm <- function() {
  caret::train(Petal.Width ~ .,
         data = X,
         method = "lm",
         trControl = caret::trainControl(method = "none")
  )
}

res <- microbenchmark(NULL, base_lm(), caret_lm(), times = 50L)
print(res, unit = "ms")
#> Unit: milliseconds
#>        expr        min         lq         mean     median         uq
#>        NULL   0.000009   0.000012   0.00003386   0.000030   0.000048
#>   base_lm()   0.764566   0.854957   0.96419726   0.891547   0.945221
#>  caret_lm() 154.067533 162.821262 196.87554234 164.597034 166.491618
#>          max neval
#>     0.000077    50
#>     3.186349    50
#>  1766.592617    50

Session Info:

sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS

packageVersion("caret")
#> [1] '6.0.91'

KURolfKF commented 2 years ago

As far as I know, caret's train() function by default performs 5 (?)-fold cross-validation while training the model. The lm() function of course doesn't, which is why the latter is much faster.
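The resampling question can be checked directly, without fitting a model, by inspecting the control objects. The sketch below assumes caret is installed; `default_ctrl` and `none_ctrl` are illustrative names that do not appear in the thread. It prints the resampling method and count that `trainControl()` produces when called with no arguments (which is what `train()` uses when `trControl` is omitted) and the method produced by the setting used in the benchmark above.

```r
# Inspect the resampling settings caret would use, without fitting anything.
library(caret)

# Control object train() falls back to when trControl is not supplied.
default_ctrl <- trainControl()

# Control object used in the benchmark: no resampling requested.
none_ctrl <- trainControl(method = "none")

print(default_ctrl$method)  # resampling scheme used by default
print(default_ctrl$number)  # number of resamples used by default
print(none_ctrl$method)     # resampling scheme in the benchmark
```

If the last line prints "none", the benchmarked call was not asked to resample, so any default resampling scheme would not account for the timing gap reported above.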

GFabien commented 2 years ago

It seems that the caret::train function calls stats::lm only once, so resampling does not explain the gap. I wonder whether the additional time comes from all the checks performed. I will try to dive deeper into this problem.
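One way to see where the extra time goes is R's built-in sampling profiler. The sketch below is only an illustration, not the analysis that was actually carried out in this thread: it assumes caret is installed, repeats the same fit 20 times so that the profiler collects enough samples, and uses the hypothetical names `ctrl`, `prof_file`, and `prof`.

```r
# Profile repeated caret::train() calls to see which functions dominate.
library(caret)

X <- iris[, -5]
ctrl <- trainControl(method = "none")  # same setting as the benchmark

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.005)     # sample the call stack every 5 ms
for (i in seq_len(20)) {
  fit <- train(Petal.Width ~ ., data = X, method = "lm", trControl = ctrl)
}
Rprof(NULL)                            # stop profiling

# Functions ranked by total time spent in them, including their callees.
prof <- summaryRprof(prof_file)
print(head(prof$by.total, 20))
```

Comparing the share of time attributed to `lm` with the share attributed to the surrounding caret functions gives a rough picture of how much of the cost is overhead rather than model fitting.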

GFabien commented 2 years ago

I looked into this problem and found two reasons responsible for this performance issue:

I would be glad to make a PR to fix the first point.