Are you able to make a suggestion to speed this up?
I think I found why it takes so long. Profiling with Rprof tells me that rbind is the slowest function; this is due to missing preallocation in:
```r
makeResamplePrediction = function(instance, preds.test, preds.train) {
  # FIXME: prealloc
  data = data.frame()
  for (i in seq_len(instance$desc$iters)) {
    if (!is.null(preds.test[[i]]))
      data = rbind(data, cbind(preds.test[[i]]$data, iter = i, set = "test"))
    if (!is.null(preds.train[[i]]))
      data = rbind(data, cbind(preds.train[[i]]$data, iter = i, set = "train"))
  }
  p1 = preds.test[[1L]]
  setClasses(list(
    instance = instance,
    predict.type = p1$predict.type,
    data = data,
    threshold = p1$threshold,
    task.desc = p1$task.desc,
    time = extractSubList(preds.test, "time")
  ), c("ResamplePrediction", "Prediction"))
}
```
I'll try to fix it with do.call and some apply functions instead of the for-loop.
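Something along these lines (an untested sketch; it builds the per-iteration chunks first and binds them only once at the end, keeping the test/train row order per iteration):

```r
makeResamplePrediction = function(instance, preds.test, preds.train) {
  # build one chunk per iteration; 'if' without 'else' yields NULL for a missing set,
  # and rbind simply ignores NULL arguments
  chunks = lapply(seq_len(instance$desc$iters), function(i) {
    rbind(
      if (!is.null(preds.test[[i]])) cbind(preds.test[[i]]$data, iter = i, set = "test"),
      if (!is.null(preds.train[[i]])) cbind(preds.train[[i]]$data, iter = i, set = "train")
    )
  })
  # bind everything exactly once instead of growing the data.frame in a loop
  data = do.call(rbind, chunks)

  p1 = preds.test[[1L]]
  setClasses(list(
    instance = instance,
    predict.type = p1$predict.type,
    data = data,
    threshold = p1$threshold,
    task.desc = p1$task.desc,
    time = extractSubList(preds.test, "time")
  ), c("ResamplePrediction", "Prediction"))
}
```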
rbind is the slowest thing on earth. We have programmed around this in BatchJobs and BatchExperiments too. Look at the reduce* type of functions that return data frames; there you should see how to do this faster.
Just for the record: we should really use data.table or dplyr in the future:
```r
library(microbenchmark)
library(dplyr)
library(data.table)
x = replicate(100, iris, simplify = FALSE)
microbenchmark(bind_rows(x), rbindlist(x), do.call(rbind, x), times = 10, unit = "relative")
```

-> speedup of 50-100x.
Agreed. When do we talk about this?
+1 for data.table. Although I prefer the dplyr syntax, data.table is also known to be faster.
As I've learned at useR, dplyr nowadays also serves as an abstraction for database access (see http://cran.r-project.org/web/packages/dplyr/vignettes/databases.html). This is something I really want to have in mlr. As a first step we need to ensure that we do not access the tasks directly, i.e. we need to make sure to always use getters and setters. The next step would be to port the getters and setters to dplyr. But note that dplyr can also be used as a frontend to data.table, and at least for some operations you get the best of both worlds: the speed of data.table and the possibility to have a database in the background. Still, we should carefully benchmark whether using data.table is then still worth the effort.
The only downside so far: the syntax of dplyr is horrible, especially because we need all the non-standard evaluation stuff (http://cran.r-project.org/web/packages/dplyr/vignettes/nse.html).
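To illustrate what that means in practice, a tiny sketch (not mlr code) contrasting the interactive syntax with the standard-evaluation verbs from the linked vignette, which is what we would need inside package code where column names are only known at runtime:

```r
library(dplyr)
library(lazyeval)

# interactive (NSE) syntax: fine at the prompt
filter(iris, Sepal.Length > 5)

# inside a package the column name is typically a variable, so we need the
# standard-evaluation counterpart and explicit expression building
col = "Sepal.Length"
filter_(iris, interp(~ x > 5, x = as.name(col)))
```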
We CANNOT decide this in an ad-hoc manner. If we want to do this, we need to schedule some time to look at this more closely, maybe with some people in the same room.
data.table's syntax is more difficult to follow, but I think it would probably be a better choice, because it seems that dplyr is very much oriented towards interactive use (hence the hybrid evaluation stuff). For most of the operations in mlr I am aware of (admittedly a limited set) we don't need the abstraction of dplyr: i.e., we do not need (for now, anyhow) to operate on data.frames and databases of various sorts, which seems to me to be the power of an abstraction layer like dplyr.
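For a feel of the difference, a small side-by-side sketch (plain iris, not mlr code) of the same grouped aggregation in both packages:

```r
library(data.table)
library(dplyr)

dt = as.data.table(iris)

# data.table: the terse i / j / by syntax
dt[Sepal.Length > 5, .(mean.width = mean(Sepal.Width)), by = Species]

# dplyr: more verbose, but each step reads off directly
iris %>%
  filter(Sepal.Length > 5) %>%
  group_by(Species) %>%
  summarise(mean.width = mean(Sepal.Width))
```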
For 500 iterations, for example, it seems that the mergeResampleResult function, which is called within the resample function, takes much more time than all the resampling iterations together (which are done with the doResampleIteration function). Try the following:
The following code is much faster:
It seems that the runtime grows much faster than linearly with the number of iterations (roughly quadratically, as you would expect from growing a data.frame with rbind inside a loop)...
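A toy benchmark (iris chunks instead of mlr prediction objects) that mimics the pattern from makeResamplePrediction above; doubling the number of iterations should much more than double the time for the growing variant, while binding once at the end stays roughly linear:

```r
library(microbenchmark)

# grow a data.frame inside the loop, as in the current implementation
grow.rbind = function(n) {
  data = data.frame()
  for (i in seq_len(n))
    data = rbind(data, cbind(iris, iter = i))
  data
}

# collect the chunks first and bind them exactly once
bind.once = function(n) {
  do.call(rbind, lapply(seq_len(n), function(i) cbind(iris, iter = i)))
}

microbenchmark(grow.rbind(250), grow.rbind(500), bind.once(500), times = 3)
```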