mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

Scale data in `data.table` #602

Closed kadyb closed 2 years ago

kadyb commented 2 years ago

I think it would be a good idea to implement data scaling with data.table instead of base::scale(), which is slower and requires more memory. Below is a benchmark. I also included collapse::fscale(), but that would require adding a new dependency.

library("data.table"); setDTthreads(1)

set.seed(123)
mat = matrix(c(rnorm(1e7, 30, 0.2), runif(1e7, 3, 5), runif(1e7, 10, 20)),
             ncol = 3)
dt = data.table(mat)
cols = colnames(dt)
scale_fun = function(x) {(x - mean(x)) / sd(x)}

result = bench::mark(
  iterations = 10, check = FALSE,
  base = base::scale(mat),
  collapse = collapse::fscale(mat),
  dt = dt[, (cols) := lapply(.SD, scale_fun), .SDcols = cols]
  )

result
#> expression      min   median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          3.04s    3.21s     0.314    2.35GB     1.51
#> 2 collapse   260.92ms 268.41ms     3.12   228.91MB     1.25
#> 3 dt         397.71ms 440.14ms     2.26   230.57MB     1.13
mb706 commented 2 years ago

Thanks for pointing this out. In reality, things are a bit more complicated...

Calling PipeOpScale currently takes 16.4 arbitrary units of time on the given data. Of this, 9.4 units are mlr3 backend overhead (which we could improve), and 1.0 unit is PipeOpTaskPreproc overhead (probably due to the conversion from data.table to matrix), leaving 6.0 units for the actual scale() call.

The actual scale() call does more than just (x - mean(x)) / sd(x): it (1) records the center and scale values (needed for prediction), and (2) also has to work when center = FALSE, in which case sd() cannot be used. I tried writing an alternative version for PipeOpScale, which takes 13.9 units overall, of which 4.6 are spent in PipeOpTaskPreproc (scaling plus PipeOpTaskPreproc overhead together; compare with 6.0 + 1.0 above).
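To illustrate the two points above, here is a minimal sketch (not the actual PipeOpScale code; function names are made up for illustration) of a scale() replacement that records the center and scale values for reuse at prediction time, and that mirrors base::scale()'s documented behavior when center = FALSE: it then divides by the root mean square sqrt(sum(x^2) / (n - 1)) rather than sd(x).

```r
# Training step: scale a numeric vector and remember the parameters.
scale_train = function(x, center = TRUE, scale = TRUE) {
  cnt = if (center) mean(x) else 0
  x = x - cnt
  # Equals sd(x) only when the data was centered first; with
  # center = FALSE this is the root mean square, as in base::scale().
  scl = if (scale) sqrt(sum(x^2) / (length(x) - 1)) else 1
  list(x = x / scl, center = cnt, scale = scl)
}

# Prediction step: apply the stored center/scale to new data.
scale_predict = function(x, state) (x - state$center) / state$scale
```

This matches base::scale() numerically in both the centered and uncentered cases while keeping the state needed to transform new data consistently.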

The situation changes when the data has more columns and fewer rows. Using ncol = 300 instead of ncol = 3, i.e. a 100000 x 300 matrix: 7.7 units for the current PipeOpScale using scale(), of which 1.0 is mlr3 overhead; 3.6 units for the PipeOpScale using the faster method, of which 0.8 is mlr3 overhead.

I will therefore opt for the new method, which I hope strikes a good tradeoff between verbosity and speed.