luca-scr / GA

An R package for optimization using genetic algorithms
http://luca-scr.github.io/GA/

Reproducible seeds on nodes #3

Closed ebyerly closed 8 years ago

ebyerly commented 8 years ago

To implement reproducible seeds on all nodes, the doRNG package is a great addition to doParallel. The changes required to GA were minimal and should have no impact on previous analyses (by my reading of the code; please let me know if I've overlooked any references to a random number generator after the call to registerDoRNG). Please note that this patch relies on a fix to doRNG that is currently up for review (renozao/doRNG#3).

Below is a minimal example showing the fix to #2.

devtools::install_github("ElizabethAB/GA")
devtools::install_github("ElizabethAB/doRNG")

library(GA)

data("fat", package = "UsingR")
nms <- c('age', 'weight', 'height', 'neck', 'chest', 'abdomen', 'hip', 'thigh',
         'knee', 'ankle', 'bicep', 'forearm', 'wrist')

fitness <- function(string) {
  equation <- paste(c("body.fat.siri ~",
                      paste(nms[which(string == 1)], collapse = " + ")),
                    collapse = " ")
  # ga() maximizes fitness, so an empty model must get the worst possible score
  if (equation == "body.fat.siri ~ ") return(-.Machine$integer.max)
  nfold <- 10
  folds <- sample(1:nrow(fat)) %% nfold
  tmp <- sapply((1:nfold) - 1, function(fold) {
    indices <- folds %in% fold
    mod <- lm(equation, data = fat[!indices,])
    fat[indices, "body.fat.siri"] - predict(mod, fat[indices,])
  })
  -sum(unlist(tmp)^2)
}

eg <- function(par = FALSE) {
  ga("binary", fitness = fitness, nBits = length(nms), maxiter = 5,
     names = nms, monitor = NULL, seed = 20160505, parallel = par)
}

eg()@summary == eg()@summary
eg(2)@summary == eg(2)@summary

luca-scr commented 8 years ago

Dear Elizabeth,

thanks for your message & commit. I have just uploaded the new major version (3.0), which I plan to release on CRAN soon. This version fixes the problem you mentioned, although in a different way. Basically, I define a new operator %DO%, which is equivalent to %do% if parallel = FALSE and to %dorng% otherwise. As you can see, I did use the doRNG package, as you suggested. That makes the results reproducible, even in batch runs.
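The operator switch described above can be sketched roughly as follows. This is only an illustration of the idea, not the code shipped in GA 3.0: the helper name pick_do() is hypothetical, and the sketch assumes the foreach package is installed. doRNG is needed only for the parallel branch, which is not exercised here.

```r
library(foreach)

# Hypothetical helper (not part of GA): return the foreach operator to use,
# depending on whether parallel evaluation was requested.
pick_do <- function(parallel = FALSE) {
  if (parallel) {
    doRNG::`%dorng%`   # reproducible parallel loop; requires doRNG
  } else {
    foreach::`%do%`    # plain sequential loop
  }
}

# Bind the chosen operator to a single name so the loop body is written once.
`%DO%` <- pick_do(parallel = FALSE)

squares <- foreach(i = 1:3, .combine = "c") %DO% {
  i^2
}
```

With parallel = TRUE, a parallel backend (for example via doParallel) would also need to be registered before the loop runs.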

On a side note: if you want to check whether two objects are identical, it is advisable to use identical(). For example:

GA1 = ga("binary", fitness = fitness, nBits = length(nms), maxiter = 5,
         names = nms, monitor = NULL, seed = 20160505, parallel = TRUE)
GA2 = ga("binary", fitness = fitness, nBits = length(nms), maxiter = 5,
         names = nms, monitor = NULL, seed = 20160505, parallel = TRUE)
identical(GA1, GA2)
GA1@fitnessValue
GA2@fitnessValue
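To show why identical() is the safer check, here is a small base-R sketch using made-up matrices (the values are purely illustrative and unrelated to the GA output above). It contrasts the element-wise result of == with the single logical returned by identical(), and with the tolerance-based all.equal().

```r
a <- matrix(c(1, 2, 3, 4), nrow = 2)
b <- a

# `==` compares element by element and returns a logical matrix.
elementwise <- a == b

# identical() returns a single TRUE/FALSE for the whole object.
same <- identical(a, b)

# A tiny numerical difference breaks exact identity ...
b[1, 1] <- 1 + 1e-10
exact <- identical(a, b)

# ... but all.equal() compares within a numerical tolerance.
approx <- isTRUE(all.equal(a, b))
```

Here same is TRUE and exact is FALSE, while approx is TRUE because the difference is far below the default tolerance of all.equal().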

Best,

Luca

ebyerly commented 8 years ago

Luca, my timing is either excellent or terrible. Thank you for the update!

Apparently resolved with 9f333cf6

luca-scr commented 8 years ago

It was excellent, because it forced me to commit all the changes I had collected over the past months.

Maybe you noticed that I have also added the option to provide a cluster of machines for parallelisation. See the brief description in vignette("GA"). I did some checks, but only with a simple network of two computers. It should work more generally, but I would be glad if you could test this feature. It seems you have expertise in setting up and running clusters on Amazon. Thanks.

Luca