mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

Scale data in `data.table` #602

Closed kadyb closed 2 years ago

kadyb commented 2 years ago

I think it would be a good idea to implement data scaling with data.table instead of base::scale(), which is slower and requires more memory. Below is a benchmark. I also included collapse::fscale(), but that would require adding a new dependency.

library("data.table"); setDTthreads(1)

set.seed(123)
mat = matrix(c(rnorm(1e7, 30, 0.2), runif(1e7, 3, 5), runif(1e7, 10, 20)),
             ncol = 3)
dt = data.table(mat)
cols = colnames(dt)
scale_fun = function(x) {(x - mean(x)) / sd(x)}

result = bench::mark(
  iterations = 10, check = FALSE,
  base = base::scale(mat),
  collapse = collapse::fscale(mat),
  dt = dt[, (cols) := lapply(.SD, scale_fun), .SDcols = cols]
  )

result
#> expression      min   median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          3.04s    3.21s     0.314    2.35GB     1.51
#> 2 collapse   260.92ms 268.41ms     3.12   228.91MB     1.25
#> 3 dt         397.71ms 440.14ms     2.26   230.57MB     1.13
mb706 commented 2 years ago

Thanks for pointing this out. In reality, things are a bit more complicated...

Calling PipeOpScale currently takes 16.4 arbitrary units of time on the given data. Of this, 9.4 units are mlr3 backend overhead (which we could improve), and 1.0 unit is PipeOpTaskPreproc overhead (probably due to the conversion from data.table to matrix), leaving 6.0 units for the actual scale() call.

The actual scale() call does more than just (x - mean(x)) / sd(x): it (1) records the center and scale values (needed for prediction), and (2) also has to work when center = FALSE, in which case sd() cannot be used. I tried writing an alternative version for PipeOpScale, which takes 13.9 units overall, of which 4.6 are spent in PipeOpTaskPreproc (scaling plus PipeOpTaskPreproc overhead together; compare with 6.0 + 1.0 above).
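To illustrate the two points above, here is a minimal sketch (not the actual PipeOpScale code; function names are made up for illustration) of a scale() replacement that records the center and scale values for reuse at prediction time, and that mirrors base::scale()'s documented behavior when center = FALSE: it then divides by the root mean square sqrt(sum(x^2) / (n - 1)) rather than sd(x).

```r
# Training step: scale a numeric vector and remember the parameters.
scale_train = function(x, center = TRUE, scale = TRUE) {
  cnt = if (center) mean(x) else 0
  x = x - cnt
  # Equals sd(x) only when the data was centered first; with
  # center = FALSE this is the root mean square, as in base::scale().
  scl = if (scale) sqrt(sum(x^2) / (length(x) - 1)) else 1
  list(x = x / scl, center = cnt, scale = scl)
}

# Prediction step: apply the stored center/scale to new data.
scale_predict = function(x, state) (x - state$center) / state$scale
```

This matches base::scale() numerically in both the centered and uncentered cases while keeping the state needed to transform new data consistently.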

The situation changes when the data has more columns and fewer rows. Using ncol = 300 instead of ncol = 3, i.e. a 100000 x 300 matrix: 7.7 units for the current PipeOpScale using scale(), of which 1.0 is mlr3 overhead; 3.6 units for the PipeOpScale using the faster method, of which 0.8 is mlr3 overhead.

I will therefore opt for the new method, which I hope strikes a good tradeoff between verbosity and speed.