mlr-org / mlr3misc

Miscellaneous helper functions for mlr3
https://mlr3misc.mlr-org.com
GNU Lesser General Public License v3.0

map_dtc is unreasonably slow when .f returns data.table #78

Open mb706 opened 1 year ago

mb706 commented 1 year ago

When the function in map_dtc returns a data.table with many rows, map_dtc appears to be slower than it needs to be by a factor of about 100.

system.time(mlr3misc::map_dtc(1:3, function(x) runif(1e6, max = x)))
#>    user  system elapsed 
#>   0.043   0.000   0.044 
system.time(mlr3misc::map_dtc(1:3, function(x) data.table(x = runif(1e6, max = x))))
#>    user  system elapsed 
#>   5.124   0.006   5.147 

profvis tells me this is because name_dots is called in data.table.

m-muecke commented 2 months ago

@mb706 I've found the same, though on a much smaller scale; the memory allocation was also higher than it should be. This is due to the do.call(data.table, c(cols, list(check.names = TRUE))) call in https://github.com/mlr-org/mlr3misc/blob/main/R/purrr_map.R#L129. As a fix I've used the following, i.e. using setDT():

map_dtc = function(.x, .f, ...) {
  cols = map(.x, .f, ...)
  setDT(unlist(cols, recursive = FALSE))[]
}

Perhaps we can do something like the following to accommodate both use cases:

map_dtc = function(.x, .f, ...) {
  cols = map(.x, .f, ...)
  j = map_lgl(cols, function(x) !is.null(dim(x)) && !is.null(colnames(x)))
  names(cols)[j] = ""
  if (inherits(cols[[1L]], "data.table")) {
    cols = unlist(cols, recursive = FALSE)
  }
  setDT(cols)[]
}

There is also a PR for a C implementation of cbindlist, but it seems it will take quite a while until that is merged: https://github.com/Rdatatable/data.table/pull/4370
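
For anyone who wants to compare the two construction paths discussed in this thread, here is a minimal sketch. The helper names bind_via_docall and bind_via_setdt are made up for illustration and are not part of mlr3misc. It assumes every element of cols is a data.table and that the outer list is unnamed. The input is kept smaller than in the original report so it runs quickly.

```r
library(data.table)

# Three one-column data.tables, as .f would return them in the report above.
cols = lapply(1:3, function(x) data.table(x = runif(1e5, max = x)))

# Hypothetical helper: the do.call() path quoted from purrr_map.R.
bind_via_docall = function(cols) {
  do.call(data.table, c(cols, list(check.names = TRUE)))
}

# Hypothetical helper: flatten one level, then convert the list in place.
# check.names = TRUE keeps the de-duplication of repeated column names.
bind_via_setdt = function(cols) {
  flat = unlist(cols, recursive = FALSE)
  setDT(flat, check.names = TRUE)[]
}

a = bind_via_docall(cols)
b = bind_via_setdt(cols)

# Both paths should yield the same table; timings will vary by machine.
stopifnot(isTRUE(all.equal(a, b)))
print(system.time(bind_via_docall(cols)))
print(system.time(bind_via_setdt(cols)))
```

Note that setDT() does not copy the column vectors, which is presumably where the memory saving comes from; whether sharing columns with the inputs is acceptable for map_dtc is a separate design question.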