Training a Linear Regression seems much slower with caret than with lm

GFabien commented 2 years ago

Hey! Thanks for the great package! I am using caret so that I can work with a wide range of models through one interface, and caret makes this really easy. However, I noticed that fitting a linear regression with caret's train function is much slower than calling stats::lm directly (see the benchmark below). Am I missing something in how I set up the training with caret? In this case I do not want to tune any parameters or split my data.

Thank you for your help!

Minimal, runnable code:

library(microbenchmark)
data("iris")

X <- iris[, -5]

base_lm <- function() {
  stats::lm(Petal.Width ~ ., data = X)
}
caret_lm <- function() {
  caret::train(Petal.Width ~ .,
         data = X,
         method = "lm",
         trControl = caret::trainControl(method = "none")
  )
}

res <- microbenchmark(NULL, base_lm(), caret_lm(), times = 50L)
print(res, unit = "ms")
#> Unit: milliseconds
#>        expr        min         lq         mean     median         uq
#>        NULL   0.000009   0.000012   0.00003386   0.000030   0.000048
#>   base_lm()   0.764566   0.854957   0.96419726   0.891547   0.945221
#>  caret_lm() 154.067533 162.821262 196.87554234 164.597034 166.491618
#>          max neval
#>     0.000077    50
#>     3.186349    50
#>  1766.592617    50

Session Info:

sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS

packageVersion("caret")
#> [1] '6.0.91'

KURolfKF commented 2 years ago

As far as I know, caret's train() function performs a 5 (?) fold cross-validation by default while training the model. The lm() function of course doesn't, which is why the latter is much faster.

GFabien commented 2 years ago

It seems that the caret::train function calls stats::lm only once here. I wonder whether the additional time is due to all the checks it performs. I will try to dive deeper into this problem.
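One way to dig into where the time goes is R's sampling profiler. The snippet below is only a sketch: the loop count of 20 and the 0.01 s sampling interval are arbitrary assumptions, and the functions that show up in the profile will depend on the caret version and the machine.

```r
library(caret)

X <- iris[, -5]

# Write profiler samples to a temporary file (hypothetical choice of location).
prof_file <- tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.01)
# Repeat the fit so that the profiler collects enough samples.
for (i in seq_len(20)) {
  fit <- train(Petal.Width ~ .,
               data = X,
               method = "lm",
               trControl = trainControl(method = "none"))
}
Rprof(NULL)  # stop profiling

# by.total lists each function with the time spent in it and in its callees.
prof <- summaryRprof(prof_file)
head(prof$by.total, 15)
```

Functions near the top of by.total that are not stats::lm itself are the candidates for overhead.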

GFabien commented 2 years ago

I looked into this problem and found two reasons responsible for this performance issue:

1. In my case, the main bottleneck in the caret::train function is the call to system.time. Replacing it with two calls to proc.time divides the computation time by 10.
2. Once the first bottleneck is removed, a second one appears: the call to getModelInfo. If the model is trained many times, this lookup adds overhead on every call. To pay this cost only once, getModelInfo can be called outside of the function and its result passed directly as the method argument. Doing so reduces the computation time by a further factor of 10 in my simple example.
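The difference between the two timing idioms can be sketched with base R alone. The helper names below (fit_fun, time_with_system_time, time_with_proc_time) are hypothetical and this is not caret's internal code. One likely contributor to the gap, stated here as an assumption rather than a measured fact, is that system.time() defaults to gcFirst = TRUE and so runs a garbage collection before timing.

```r
# Hypothetical stand-in for the model fit that train() times internally.
fit_fun <- function() {
  stats::lm(Petal.Width ~ ., data = iris[, -5])
}

# Variant A: system.time(), which by default (gcFirst = TRUE) calls gc() first.
time_with_system_time <- function() {
  elapsed <- system.time(fit <- fit_fun())
  list(fit = fit, time = elapsed)
}

# Variant B: two calls to proc.time() around the fit, with no forced gc().
time_with_proc_time <- function() {
  start <- proc.time()
  fit <- fit_fun()
  elapsed <- proc.time() - start
  list(fit = fit, time = elapsed)
}

res_a <- time_with_system_time()
res_b <- time_with_proc_time()

# Both variants return the same fitted model; only the timing overhead differs.
all.equal(coef(res_a$fit), coef(res_b$fit))
```

Wrapping both helpers in microbenchmark(), as in the benchmark at the top of the thread, should show how much of the gap comes from the timing call on a given machine.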

I would be glad to make a PR to fix the first point.
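For the second point, a minimal sketch of the workaround, assuming a caret version in which train() accepts a model-definition list as its method argument (the documented mechanism for custom models). The variable names are illustrative only.

```r
library(caret)

X <- iris[, -5]
ctrl <- trainControl(method = "none")

# Look the model definition up once; regex = FALSE asks for an exact match on "lm".
lm_info <- getModelInfo("lm", regex = FALSE)[[1]]

# Usual call: train() resolves the string "lm" via getModelInfo on every call.
fit_by_name <- train(Petal.Width ~ ., data = X,
                     method = "lm", trControl = ctrl)

# Workaround: pass the pre-fetched definition so the lookup is skipped.
fit_by_info <- train(Petal.Width ~ ., data = X,
                     method = lm_info, trControl = ctrl)

# Both routes should yield the same underlying lm coefficients.
all.equal(coef(fit_by_name$finalModel), coef(fit_by_info$finalModel))
```

When many models are fitted in a loop, lm_info and ctrl can be created once outside the loop and reused.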

topepo / caret

Training a Linear Regression seems much slower with caret than with lm #1277
