Closed: robertzk closed this issue 9 years ago.
It might boil down to:
```r
> data.frame(replicate(1000, 1:5, simplify = FALSE)) -> df
> fn1 <- function(df) {
+   for (i in colnames(df)) { df[[i]] <- df[[i]] + 1 }
+   df
+ }
> fn2 <- function(df) {
+   df[colnames(df)] <- lapply(seq_along(df), function(i) { .subset2(df, i) + 1 })
+   df
+ }
> system.time(fn1(df))
   user  system elapsed
  0.050   0.005   0.056
> dim(df)
[1]    5 1000
> system.time(fn2(df))
   user  system elapsed
  0.032   0.003   0.035
```
TLDR: Use lapply?
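The gap between `fn1` and `fn2` mostly comes down to how the columns are written back: each `df[[i]] <- ...` inside the loop dispatches through `[[<-.data.frame` and can trigger copy-on-modify on the whole frame, while the `lapply` version reads columns with `.subset2` (skipping S3 dispatch) and assigns them all in a single `[<-` call. A self-contained version of the same comparison, with the wide test frame defined up front:

```r
# Wide, short data frame: 1000 columns, each 1:5.
df <- data.frame(replicate(1000, 1:5, simplify = FALSE))

# Loop version: one dispatched `[[<-` per column.
fn1 <- function(df) {
  for (i in colnames(df)) df[[i]] <- df[[i]] + 1
  df
}

# lapply version: compute all replacement columns first with the
# dispatch-free .subset2, then write them back in one `[<-` call.
fn2 <- function(df) {
  df[colnames(df)] <- lapply(seq_along(df), function(i) .subset2(df, i) + 1)
  df
}

stopifnot(identical(fn1(df), fn2(df)))  # same result, fewer copies
```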
Solved by environment magic. https://github.com/robertzk/mungebits2/pull/12
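The linked PR isn't quoted here, but "environment magic" presumably means wrapping the data frame in an environment so that transformations can update it through a reference rather than returning modified copies up the call chain. A minimal sketch of that idea (the names `make_plane` and `add_one` are illustrative, not the mungebits2 API):

```r
# Wrap a data frame in an environment; environments have reference
# semantics, so callees can update the wrapped data for the caller.
make_plane <- function(df) {
  env <- new.env(parent = emptyenv())
  env$data <- df
  env
}

# A transformation that updates one column through the wrapper.
add_one <- function(plane, column) {
  plane$data[[column]] <- plane$data[[column]] + 1
  invisible(plane)
}

plane <- make_plane(data.frame(x = 1:5))
add_one(plane, "x")
```

Because `plane` is an environment, the change made inside `add_one` is visible to the caller without passing the whole data frame back through every function in the pipeline.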
Latest speedups bring it within 3x of the most optimized base R code.
```r
> load_all(); doubler <- mungebit$new(column_transformation(function(x) { 2 * x })); microbenchmark(doubler$run(foo, 1:1000), raw_double(foo, 1:1000), times = 5L)
Loading mungebits2
Unit: milliseconds
                     expr      min       lq    mean   median       uq      max
 doubler$run(foo, 1:1000) 3.366846 3.432370 4.00230 3.466510 3.610990 6.134785
  raw_double(foo, 1:1000) 1.093207 1.100542 1.16094 1.129146 1.151777 1.330030
```
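`raw_double` is never defined in this thread; for context, a plausible stand-in for such a baseline (the name, signature, and body are all assumptions, not the actual benchmark code) would be a direct column-wise doubling with no mungebit machinery around it:

```r
# Hypothetical baseline: double the selected columns of a data frame
# directly, using .subset2 to avoid S3 dispatch on column reads.
raw_double <- function(df, cols) {
  df[cols] <- lapply(cols, function(i) 2 * .subset2(df, i))
  df
}

foo <- data.frame(replicate(1000, 1:5, simplify = FALSE))
out <- raw_double(foo, 1:1000)
```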
Additionally, train and predict are roughly 1.5x faster than in mungebits1:
```r
> load_all(); column_transformation(function(x) x + 1) -> ct; mp <- mungepiece$new(mungebit$new(ct)); system.time(mp$run(df)) # train
   user  system elapsed
  0.007   0.000   0.006
> system.time(mp$run(df)) # predict
   user  system elapsed
  0.004   0.000   0.004
```
Compared to mungebits1:
```r
> load_all(); column_transformation(function(x) x + 1) -> ct; mp <- mungepiece$new(mungebit$new(ct)); df2 <- mungeplane(data.frame(foo)); system.time(mp$run(df2))
Loading mungebits
   user  system elapsed
  0.008   0.000   0.008
> system.time(mp$run(df2)) # predict
   user  system elapsed
  0.009   0.000   0.009
```
For simple operations, the remaining difference might be due to avoiding nonstandard evaluation in the train and predict functions. I still need to pin down the exact source of the discrepancy.
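To make the NSE point concrete, here is a toy illustration (not mungebits code) of the two styles: a standard-evaluation function that takes its argument as a plain value, versus a nonstandard-evaluation one that captures the unevaluated expression with `substitute` and re-evaluates it in the caller's frame. The results are identical, but the `substitute`/`eval` machinery adds per-call overhead that matters in tight loops:

```r
# Standard evaluation: the argument arrives as an ordinary value.
se_add <- function(df, n) {
  df$x <- df$x + n
  df
}

# Nonstandard evaluation: capture the unevaluated argument, then
# evaluate it in the caller's environment before using it.
nse_add <- function(df, expr) {
  captured <- substitute(expr)
  df$x <- df$x + eval(captured, parent.frame())
  df
}

df <- data.frame(x = 1:5)
stopifnot(identical(se_add(df, 1), nse_add(df, 1)))
```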