kadyb closed this issue 2 years ago
Thanks for pointing this out. In reality, things are a bit complicated...
Calling the `PipeOpScale` currently takes 16.4 arbitrary units of time with the given data. Of this, 9.4 units are spent on mlr3 backend overhead (we could make this better) and 1.0 unit on `PipeOpTaskPreproc` overhead (probably because of the conversion from data.table to matrix), leaving 6.0 units for the actual `scale()` call.
The actual `scale()` call does more than just `(x - mean(x)) / sd(x)`, since it (1) records the center and scale values (for prediction), and (2) also needs to work when `center = FALSE`, in which case `sd()` cannot be used. I tried to write an alternative version for `PipeOpScale`, which takes 13.9 units overall, of which 4.6 are spent in `PipeOpTaskPreproc` (scaling and `PipeOpTaskPreproc` overhead together; compare to 6.0 + 1.0 above).
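A minimal sketch of what such a faster replacement could look like (a hypothetical helper, not the actual patch): it records center/scale for prediction and mimics `base::scale()`'s behavior, where the divisor with `center = FALSE` is the root mean square rather than the standard deviation.

```r
# Hypothetical sketch, not the actual PipeOpScale code.
fast_scale <- function(x, center = TRUE, scale = TRUE) {
  n <- length(x)
  cnt <- if (center) mean(x) else 0
  x <- x - cnt
  scl <- if (scale) {
    # When center = FALSE, base::scale() divides by the root mean
    # square sqrt(sum(x^2) / (n - 1)), not sd(); after centering,
    # this same formula reduces to sd(x), so it covers both cases.
    sqrt(sum(x^2) / (n - 1))
  } else {
    1
  }
  # Return the stored values so prediction can reuse them.
  list(x = x / scl, center = cnt, scale = scl)
}
```

At prediction time, the stored values can then be applied to new data as `(x - center) / scale`.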
The situation changes when the data has more columns and fewer rows, using `ncol = 300` instead of `ncol = 3`, resulting in a `100000 x 300` matrix: 7.7 units for the current `PipeOpScale` using `scale()`, of which 1.0 is mlr3 overhead; 3.6 units for `PipeOpScale` using faster methods, of which 0.8 are mlr3 overhead.
I guess I will therefore opt to use the new method and hope I have found a good tradeoff between verbosity and speed.
I think it would be a good idea to implement data scaling in data.table instead of using `base::scale()`, because the latter is slower and requires more memory. Below is the benchmark. I also included `collapse::fscale()`, but that would require adding a new dependency.
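For illustration, a minimal sketch of what scaling via data.table could look like (column setup is illustrative, not taken from the benchmark): columns are updated by reference with `:=`, which avoids the full matrix copy that `base::scale()` makes.

```r
library(data.table)

# Illustrative data, not the benchmark input.
dt <- data.table(a = rnorm(1e5), b = rnorm(1e5))

# Standardize every column in place by reference.
cols <- names(dt)
dt[, (cols) := lapply(.SD, function(x) (x - mean(x)) / sd(x)),
   .SDcols = cols]
```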