syberia / mungebits2

Atomic production-ready data preparation in R
MIT License

Mungebits2 is 4x slower than mungebits #11

Closed robertzk closed 9 years ago

robertzk commented 9 years ago

For simple operations, at least. This might be due to the avoidance of non-standard evaluation in the train and predict functions. We really need to figure out where the discrepancy comes from.

robertzk commented 9 years ago

It might boil down to:

> df <- foo <- data.frame(replicate(1000, 1:5, simplify = FALSE))
> fn1 <- function(df) {
+   for(i in colnames(df)) { df[[i]] <- df[[i]] + 1}
+   df
+ }
> fn2 <- function(df) {
+   df[colnames(df)] <- lapply(seq_along(df), function(i) { .subset2(df, i) + 1 })
+   df
+ }
> system.time(fn1(df))
   user  system elapsed
  0.050   0.005   0.056
> dim(df)
[1]    5 1000
> system.time(fn2(df))
   user  system elapsed
  0.032   0.003   0.035

TLDR: Use lapply?
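
A self-contained version of the comparison above, as a sketch. The names `add_one_loop` and `add_one_lapply` are just illustrative stand-ins for `fn1` and `fn2`. It first checks that both approaches give the same result, then times a few repetitions. Timings depend on machine and R version, so no expected numbers are shown.

```r
# Build the same 5 x 1000 data.frame used in the transcript above.
foo <- data.frame(replicate(1000, 1:5, simplify = FALSE))

# Replace columns one at a time with `[[<-` inside a for loop.
add_one_loop <- function(df) {
  for (i in colnames(df)) {
    df[[i]] <- df[[i]] + 1
  }
  df
}

# Compute all new columns with lapply, then assign them in a single step.
add_one_lapply <- function(df) {
  df[colnames(df)] <- lapply(seq_along(df), function(i) .subset2(df, i) + 1)
  df
}

# Both approaches should produce the same data.frame.
same_result <- isTRUE(all.equal(add_one_loop(foo), add_one_lapply(foo)))
print(same_result)

# Repeat each call a few times so the measurement is less noisy.
print(system.time(for (k in 1:20) add_one_loop(foo)))
print(system.time(for (k in 1:20) add_one_lapply(foo)))
```

The lapply version assigns all 1000 columns with one call to `[<-`, instead of 1000 separate calls to `[[<-`, which is presumably where the difference comes from.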

robertzk commented 9 years ago

Solved by environment magic. https://github.com/robertzk/mungebits2/pull/12

robertzk commented 9 years ago

Latest speedups bring it to roughly 3x the runtime of the most optimized base R code (median 3.47 ms vs. 1.13 ms):

> load_all(); doubler <- mungebit$new(column_transformation(function(x) { 2 * x })); microbenchmark( doubler$run(foo, 1:1000), raw_double(foo, 1:1000), times = 5L)
Loading mungebits2
Unit: milliseconds
                     expr      min       lq    mean   median       uq      max
 doubler$run(foo, 1:1000) 3.366846 3.432370 4.00230 3.466510 3.610990 6.134785
  raw_double(foo, 1:1000) 1.093207 1.100542 1.16094 1.129146 1.151777 1.330030
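
The definition of `raw_double` is not shown in this thread. A minimal sketch of what such a base R baseline could look like, assuming it simply doubles the selected columns, is given here. The body is an assumption, not the code that was actually benchmarked.

```r
# Hypothetical stand-in for `raw_double`: the thread does not show its
# definition, so this is only an assumption about what it does.
raw_double <- function(df, cols) {
  # Double each selected column and assign them all back in one step.
  df[cols] <- lapply(cols, function(i) 2 * .subset2(df, i))
  df
}

# Same 5 x 1000 data.frame as earlier in the thread.
foo <- data.frame(replicate(1000, 1:5, simplify = FALSE))
doubled <- raw_double(foo, 1:1000)
```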

robertzk commented 9 years ago

Additionally, train and predict are about 1.5x faster than in mungebits1:

> load_all(); column_transformation(function(x) x + 1) -> ct; mp <- mungepiece$new(mungebit$new(ct)); system.time(mp$run(df)) # train
   user  system elapsed
  0.007   0.000   0.006
> system.time(mp$run(df)) # predict
   user  system elapsed
  0.004   0.000   0.004

Compared to mungebits1:

> load_all(); column_transformation(function(x) x + 1) -> ct; mp <- mungepiece$new(mungebit$new(ct)); df2 <- mungeplane(data.frame(foo)); system.time(mp$run(df2))
Loading mungebits
   user  system elapsed
  0.008   0.000   0.008
> system.time(mp$run(df2)) # predict
   user  system elapsed
  0.009   0.000   0.009
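
Single `system.time` calls in the 4 to 9 ms range are close to the timer's resolution, so repeating the measurement may give a steadier comparison. A rough sketch in base R follows. `median_elapsed` and `add_one` are hypothetical names and not part of either package. If this were applied to `mp$run(df)`, every call after the first would presumably measure predict rather than train, as in the transcripts above.

```r
# Hypothetical helper (not part of mungebits or mungebits2): call `f`
# `times` times and return the median elapsed time in seconds.
median_elapsed <- function(f, times = 25L) {
  elapsed <- vapply(
    seq_len(times),
    function(k) system.time(f())[["elapsed"]],
    numeric(1)
  )
  median(elapsed)
}

foo <- data.frame(replicate(1000, 1:5, simplify = FALSE))

# Stand-in workload that adds 1 to every column; in the thread this would
# be a call such as mp$run(df).
add_one <- function() {
  foo[colnames(foo)] <- lapply(foo, function(x) x + 1)
  foo
}

print(median_elapsed(add_one))
```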