mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

resampling with many iterations takes too long to "merge the results" #371

Closed giuseppec closed 9 years ago

giuseppec commented 9 years ago

With 500 iterations, for example, the 'mergeResampleResult' function, which is called within the resample function, seems to take much more time than all resampling iterations together (which are run by the doResampleIteration function). Try the following:

resamp = makeResampleDesc("Bootstrap", stratify = TRUE, iters = 500, predict = "both")
samp = resample("classif.rpart", bc.task, resampling = resamp, show.info = TRUE)
# after all 500 iterations are done, it takes much longer until the function returns a result

The following code is much faster:

iterations = 500
# do only one iteration...
resamp1 = makeResampleDesc("Bootstrap", stratify = TRUE, iters = 1, predict = "both")
sList = vector("list", iterations)
# ... and repeat it 500 times
for(i in 1:iterations){
  sList[[i]] = resample("classif.rpart", bc.task, resampling = resamp1, show.info = FALSE)
  message(i)
}

It seems that the runtime grows much faster than linearly with the number of iterations...

berndbischl commented 9 years ago

Are you able to make a suggestion to speed this up?

giuseppec commented 9 years ago

I think I found why it takes so long. Profiling with Rprof tells me that rbind is the slowest function; this is due to the missing preallocation in:

makeResamplePrediction = function(instance, preds.test, preds.train) {
  # FIXME: prealloc
  data = data.frame()
  for (i in seq_len(instance$desc$iters)) {
    if (!is.null(preds.test[[i]]))
      data = rbind(data, cbind(preds.test[[i]]$data, iter = i, set = "test"))
    if (!is.null(preds.train[[i]]))
      data = rbind(data, cbind(preds.train[[i]]$data, iter = i, set = "train"))
  }
  p1 = preds.test[[1L]]
  setClasses(list(
    instance = instance,
    predict.type = p1$predict.type,
    data = data,
    threshold = p1$threshold,
    task.desc = p1$task.desc,
    time = extractSubList(preds.test, "time")
  ), c("ResamplePrediction", "Prediction"))
}

I'll try to fix it with do.call and some apply functions instead of the for-loop.
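
A minimal sketch of that idea, using base R only. The toy prediction lists and the helper `tagPred` are made up for illustration and are not part of mlr; the point is just that each iteration's rows are built separately and bound once at the end, instead of growing `data` with rbind inside the loop. This is not necessarily the fix that ended up in mlr.

```r
# Toy stand-ins for the per-iteration predictions (hypothetical, not mlr objects):
# each element is a list with a `data` data.frame, or NULL if nothing was predicted.
makeToyPred = function(n) {
  list(data = data.frame(id = seq_len(n), response = rep(c("a", "b"), length.out = n),
    stringsAsFactors = FALSE))
}
iters = 4L
preds.test = lapply(rep(3L, iters), makeToyPred)
preds.train = lapply(rep(5L, iters), makeToyPred)
preds.train[3L] = list(NULL)  # pretend iteration 3 has no train predictions

# Hypothetical helper: tag one prediction with its iteration and set, or pass NULL through.
tagPred = function(pred, i, set) {
  if (is.null(pred))
    return(NULL)
  data.frame(pred$data, iter = i, set = set, stringsAsFactors = FALSE)
}

# Build the rows of every iteration independently (test rows first, then train rows) ...
chunks = lapply(seq_len(iters), function(i) {
  rbind(tagPred(preds.test[[i]], i, "test"), tagPred(preds.train[[i]], i, "train"))
})

# ... and bind them in a single call; NULL entries are ignored by rbind.
data = do.call(rbind, chunks)
```

The row order matches the original loop (per iteration: test, then train). With data.table available, `rbindlist(chunks)` would be the analogous single call.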

berndbischl commented 9 years ago

rbind is the slowest thing on earth. We have programmed around this in BatchJobs and BatchExperiments, too. Look at the reduce* type of functions that return data frames; there you should see how to do this faster.

mllg commented 9 years ago

Just for the record. We should really use data.table or dplyr in the future:

library(microbenchmark)
library(dplyr)
library(data.table)

x = replicate(100, iris, simplify = FALSE)
microbenchmark(bind_rows(x), rbindlist(x), do.call(rbind, x), times = 10, unit = "relative")

-> speedup of 50-100x.

berndbischl commented 9 years ago

Agreed. When do we talk about this?

jakob-r commented 9 years ago

+1 for data.table. Although I prefer the dplyr syntax, data.table is known to be faster.

mllg commented 9 years ago

As I've learned at useR!, dplyr nowadays also serves as an abstraction for database access (see http://cran.r-project.org/web/packages/dplyr/vignettes/databases.html). This is something I really want to have in mlr. As a first step, we need to ensure that we do not access the tasks directly, i.e., we need to make sure to always use getters and setters. The next step would be to port the getters and setters to dplyr. But note that dplyr can also be used as a frontend to data.table, and at least for some operations you get the best of both worlds: the speed of data.table and the possibility of having a database in the background. Yet we should carefully benchmark whether using data.table is still worth the effort.
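
To illustrate the getter idea, here is a rough sketch with a made-up toy task. `makeToyTask` and the `getToyTask*` functions are hypothetical names, not mlr's API. Only the getters know how the data is stored, so the storage (data.frame, data.table, a database table) could be swapped without touching the calling code.

```r
# Hypothetical toy task object; this is NOT mlr's Task class.
makeToyTask = function(data, target) {
  stopifnot(is.data.frame(data), target %in% names(data))
  structure(list(backend = data, target = target), class = "ToyTask")
}

# Getters: the only functions that touch task$backend directly.
getToyTaskData = function(task, rows = NULL, cols = NULL) {
  d = task$backend
  if (!is.null(rows))
    d = d[rows, , drop = FALSE]
  if (!is.null(cols))
    d = d[, cols, drop = FALSE]
  d
}
getToyTaskTargets = function(task) getToyTaskData(task, cols = task$target)[[1L]]
getToyTaskSize = function(task) nrow(task$backend)

# Calling code never indexes into the task itself.
task = makeToyTask(iris, target = "Species")
n = getToyTaskSize(task)
y = getToyTaskTargets(task)
head5 = getToyTaskData(task, rows = 1:5, cols = c("Sepal.Length", "Species"))
```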

The only downside so far: the syntax of dplyr is horrible, especially because we need all the non-standard evaluation stuff (http://cran.r-project.org/web/packages/dplyr/vignettes/nse.html).

berndbischl commented 9 years ago

We CANNOT decide this in an ad hoc manner. If we want to do this, we need to schedule some time to look at this more closely. Maybe with some people in the same room.

zmjones commented 9 years ago

data.table's syntax is more difficult to follow, but I think it would probably be a better choice because dplyr seems to be very much oriented towards interactive use (hence the hybrid evaluation stuff). For most of the operations in mlr I am aware of (admittedly a limited set) we don't need the abstraction of dplyr: i.e., we do not need (for now, anyhow) to operate on data.frames and databases of various sorts, which seems to me to be the power of an abstraction layer like dplyr.