tobigithub / caret-machine-learning

Practical examples for the R caret machine learning package

caret train binary glm fails on parallel cluster via doParallel #37

Closed Triamus closed 6 years ago

Triamus commented 7 years ago

Can you have a look at this SO question: caret train binary glm fails on parallel cluster via doParallel? Perhaps you have an idea what is going wrong, as I used pretty much your setup for Windows but it didn't work. Many thanks!

tobigithub commented 7 years ago

I did not get any error under R 3.3.1.

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)

Other attached packages: [1] doParallel_1.0.10 iterators_1.0.8 foreach_1.4.3
[4] microbenchmark_1.4-2.1 caret_6.0-73 ggplot2_2.2.0
[7] lattice_0.20-34

But then again I also don't know if the code is correct; plus it does not actually run fully in parallel and uses only about 25% of all cores when doParallel is registered. You can always open a ticket in the original caret developer repository (https://github.com/topepo/caret) and see if they can help; it may be related to a restriction, a feature, or a bug. Also see my comments below, starting with a quick diagnostic sketch before the reproduction code.
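A minimal diagnostic sketch (my addition, not from the original question) to confirm which foreach backend is registered and how many workers it will actually use; getDoParName(), getDoParWorkers(), and registerDoSEQ() are standard foreach functions:

library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

foreach::getDoParName()     # name of the registered backend
foreach::getDoParWorkers()  # number of workers foreach will use

# trivial %dopar% loop: distinct PIDs confirm work is spread across workers
foreach(i = 1:4, .combine = c) %dopar% Sys.getpid()

stopCluster(cl)
foreach::registerDoSEQ()    # restore the sequential backend

The original reproduction code from the question, with small fixes noted inline: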

library(caret)
library(microbenchmark)
library(doParallel)

set.seed(42)                   # fix the RNG so the results are reproducible
x1 = rnorm(1000)               # some continuous variables
x2 = rnorm(1000)
z = 1 + 2 * x1 + 3 * x2        # linear combination with a bias
pr = 1 / (1 + exp(-z))         # pass through an inv-logit function
y = rbinom(1000, 1, pr)        # Bernoulli response; n must match x1/x2 (was 100)
df = data.frame(y = as.factor(ifelse(y == 0, "no", "yes")), x1 = x1, x2 = x2)

# serial
# train control function
ctrl <- 
  trainControl(
    method = "repeatedcv", 
    number = 10,
    repeats = 5,
    classProbs = TRUE,
    summaryFunction = twoClassSummary)

# train function
microbenchmark(
  glm_nopar =
    train(y ~ .,
          data = df,
          method = "glm",
          family = "binomial",
          metric = "ROC",
          trControl = ctrl),
  times = 5)

# parallel
cores_2_use <- floor(0.8 * detectCores())   # use ~80% of cores, leave some free
cl <- makeCluster(cores_2_use, outfile = "parallel_log2.txt")  # log worker output
registerDoParallel(cl)

microbenchmark(
  glm_parP =
    train(y ~ .,
          data = df,
          method = "glm",
          family = "binomial",
          metric = "ROC",
          trControl = ctrl),
  times = 5)

parallel::stopCluster(cl)   # shut down the worker processes
foreach::registerDoSEQ()    # restore the sequential backend
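For the failure in the linked SO question, one common cause on Windows is that the worker processes do not have the required packages attached or custom objects exported. A hedged sketch of the usual workaround (clusterEvalQ and clusterExport are standard parallel functions; allowParallel is a real trainControl argument, TRUE by default):

cl <- makeCluster(cores_2_use)

# attach caret (and its dependencies) on every worker before registering
parallel::clusterEvalQ(cl, library(caret))

# export any custom objects the workers need, e.g. a custom summary function
# parallel::clusterExport(cl, varlist = c("mySummary"))

registerDoParallel(cl)

# for comparison, trainControl can also disable parallelism explicitly
ctrl_seq <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  allowParallel = FALSE)   # force train() to run sequentially

parallel::stopCluster(cl)
foreach::registerDoSEQ()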

Logfile: parallel_log2.txt

starting worker pid=3256 on localhost:11657 at 12:00:00.147
starting worker pid=816 on localhost:11657 at 12:00:00.323
starting worker pid=6976 on localhost:11657 at 12:00:00.499
starting worker pid=10908 on localhost:11657 at 12:00:00.674
starting worker pid=11164 on localhost:11657 at 12:00:00.849
starting worker pid=9836 on localhost:11657 at 12:00:01.026
Loading required package: caret
Loading required package: lattice
Loading required package: ggplot2
loaded caret and set parent environment

(the three "Loading required package" lines and the "loaded caret and set parent environment" line repeat once per worker, six times in total, followed by further "loaded caret and set parent environment" lines for each resampling task)

BTW, not trying to hijack the thread, but while GLM is fast it is also notoriously inaccurate compared to modern methods. For example, on the Pima Indians Diabetes dataset it ranks somewhere in the middle, around rank 60; see this TSV (it displays in the browser rather than downloading): https://github.com/tobigithub/caret-machine-learning/blob/master/caret-classification/caret-all-binary-class-PimaIndiansDiabetes.tsv which comes from here: https://github.com/tobigithub/caret-machine-learning/wiki/caret-ml-classification

I am sure it's still in use here and there, maybe even in production, but I would use newer models, such as the boosting methods XGBoost or LightGBM (which can handle gigabytes of data with high accuracy and very high native parallel efficiency): https://github.com/Microsoft/LightGBM, or CatBoost from Yandex: https://github.com/catboost
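As a rough illustration (my sketch, not from the thread): caret exposes xgboost through method = "xgbTree", so the same trainControl object and data frame from above can be reused; this assumes the xgboost package is installed:

library(caret)
library(xgboost)

# reuse the repeated-CV control object (ctrl) and data frame (df) from above
xgb_fit <- train(y ~ .,
                 data = df,
                 method = "xgbTree",   # caret's wrapper around xgboost
                 metric = "ROC",
                 trControl = ctrl,
                 tuneLength = 2)       # small default tuning grid

xgb_fit$results                        # ROC/Sens/Spec per parameter combination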