renkun-ken / pipeR

Multi-Paradigm Pipeline Implementation

value vs magrittr? #13

Closed dpastoor closed 10 years ago

dpastoor commented 10 years ago

Out of curiosity, are you familiar with the magrittr package? Since it already has a robust implementation of piping and is incorporated into Hadley's dplyr and ggvis packages, it seems your development effort might be better spent rolling your additional ideas into that package. See here for a link to the package.

You could also take a look at how he handled lambdas and aliases.

renkun-ken commented 10 years ago

See issue #12.

renkun-ken commented 10 years ago

Thanks for the reminder! I'm an F# developer and was a magrittr user. I'm happy to see that the developers of magrittr brought the F# pipeline idea into R. In the end, though, I decided to create a lightweight version that does not try to combine multiple pipe mechanisms in one operator. I find that combining them adds cognitive burden and significantly lowers performance, because the operator must handle every mechanism even when I only need one. Code in which a single operator is responsible for running multiple pipe mechanisms can also be confusing to read.

That's why I created this package, which follows different principles:

  1. Each operator should do one thing, as simply as possible.
  2. The performance of an operator should be high enough.
  3. The overhead of an operator should be low enough.

I believe that, for piping, each operator should be as simple as possible. If you review the code of magrittr, you will find that its performance cannot be very high. Although it is a bit unfair to demand performance from pipe operators, it still matters to those who use them heavily.

Here is a set of benchmarks for the two packages at their latest development versions:

First-argument piping:

library(microbenchmark)
set.seed(1)
null <- function() {
  c(rnorm(100),1)
}
pipeR <- function() {
  pipeR::`%>%`(rnorm(100),c(1))
}
magrittr <- function() {
  magrittr::`%>%`(rnorm(100),c(1))
}
microbenchmark(null(),pipeR(),magrittr(),times=10000L)

and the result is

Unit: microseconds
       expr     min      lq  median      uq      max neval
     null()  11.496  13.138  14.369  15.190   90.728 10000
    pipeR()  47.211  51.727  52.959  55.012 4339.295 10000
 magrittr() 139.170 144.507 147.791 155.181 5643.957 10000

Piping with .:

set.seed(1)
null <- function() {
  c(rnorm(100),1)
}
pipeR <- function() {
  pipeR::`%>>%`(rnorm(100),c(.,1))
}
magrittr <- function() {
  magrittr::`%>%`(rnorm(100),c(.,1))
}
microbenchmark(null(),pipeR(),magrittr(),times=10000L)

and the result is

Unit: microseconds
       expr     min      lq  median      uq      max neval
     null()  11.495  13.138  14.369  15.190 1019.756 10000
    pipeR()  34.896  39.822  42.695  44.748 1083.798 10000
 magrittr() 133.012 141.223 144.096 150.255 2388.871 10000

Nested piping:

set.seed(1)
null <- function() {
  x <- rnorm(100)
  c(x,c(x,c(x,c(x,1))))
}
pipeR <- function() {
  pipeR::`%>>%`(rnorm(100),c(.,c(.,c(.,1))))
}
magrittr <- function() {
  magrittr::`%>%`(rnorm(100),magrittr::lambda(. -> c(.,c(.,c(.,1)))))
}
microbenchmark(null(),pipeR(),magrittr(),times=10000L)

the result is

Unit: microseconds
       expr     min      lq  median      uq       max neval
     null()  14.780  16.832  17.654  18.885  1324.368 10000
    pipeR()  38.179  42.285  44.337  45.980  2704.157 10000
 magrittr() 191.718 200.749 204.033 213.886 23477.348 10000

Lambda piping:

set.seed(1)
null <- function() {
  x <- rnorm(100)
  c(x,c(x,c(x,c(x,1))))
}
pipeR <- function() {
  pipeR::`%|>%`(rnorm(100),. ~ c(.,c(.,c(.,1))))
}
magrittr <- function() {
  magrittr::`%>%`(rnorm(100),magrittr::lambda(. -> c(.,c(.,c(.,1)))))
}
microbenchmark(null(),pipeR(),magrittr(),times=10000L)

the result is

Unit: microseconds
       expr     min      lq  median      uq      max neval
     null()  14.369  17.242  18.064  19.295 1092.008 10000
    pipeR()  47.211  52.548  54.191  56.244 2679.114 10000
 magrittr() 191.307 201.160 205.266 216.349 2714.009 10000

In these benchmarks, pipeR's median time is roughly 3 to 5 times lower than magrittr's.

I know that piping is designed not for performance but for ease of writing fluent code. It is also unfair to compare a toolbox-like operator in magrittr with each specialized operator in pipeR. But since I use piping heavily, I want several operators, each doing one simple thing that I understand exactly, with less overhead. That's why I created this package: to write fluent code that is readable and robust, without much overhead or ambiguity.
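
To illustrate what a single-purpose operator can look like, here is a minimal base-R sketch of a pipe that only inserts the left-hand value as the first argument. The name `%first>%` is hypothetical, and this is not pipeR's or magrittr's actual implementation. It only handles plain calls such as `f(a)` and bare function names such as `f`.

```r
# Hypothetical single-purpose operator (assumption: not the real pipeR code).
# It only does first-argument piping: x %first>% f(a) evaluates f(x, a).
`%first>%` <- function(x, call) {
  caller <- parent.frame()   # where the pipeline was written
  rhs <- substitute(call)    # right-hand side, unevaluated
  if (is.call(rhs)) {
    # f(a, b) becomes f(., a, b)
    parts <- as.list(rhs)
    new_call <- as.call(c(parts[1], list(quote(.)), parts[-1]))
  } else {
    # a bare function name f becomes f(.)
    new_call <- as.call(list(rhs, quote(.)))
  }
  # Bind `.` to the left-hand value and evaluate the rewritten call.
  eval(new_call, list(. = x), caller)
}

c(1, 2, 3) %first>% sum(10)  # evaluates sum(c(1, 2, 3), 10)
c(4, 9) %first>% sqrt        # evaluates sqrt(c(4, 9))
```

Because the operator has only one job, it needs no checks to decide which piping mechanism applies.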

renkun-ken commented 10 years ago

In addition, %>>% in pipeR pipes the last value to . in the next expression, which allows you to write the following code:

library(pipeR)
rnorm(100) %>>% {
  z <- head(.,60)
  list(sample5=sample(.,5,replace=FALSE),
    sample10=sample(.,10,replace=FALSE))
}

because %>>% directly evaluates the next expression. To do the same thing with magrittr:

library(magrittr)
rnorm(100) %>% lambda(. -> {
  z <- head(.,60)
  list(sample5=sample(.,5,replace=FALSE),
    sample10=sample(.,10,replace=FALSE))
})

which is roughly equivalent, but the pipeR version is a bit simpler.
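
The idea of evaluating the next expression directly with `.` bound to the previous value can be sketched in a few lines of base R. The operator name `%dot>%` is hypothetical and the body is an assumption about the general technique, not pipeR's actual source.

```r
# Hypothetical dot-pipe (assumption: illustrates the technique only).
`%dot>%` <- function(x, expr) {
  caller <- parent.frame()
  # Fresh environment so that `.` and any assignments made inside the
  # braces do not leak into the caller's environment.
  env <- new.env(parent = caller)
  assign(".", x, envir = env)
  # Evaluate the unevaluated right-hand side with `.` available.
  eval(substitute(expr), env)
}

result <- c(1, 2, 3) %dot>% {
  total <- sum(.)
  list(total = total, doubled = . * 2)
}
```

Here `result$total` is the sum of the piped vector and `result$doubled` is the vector multiplied by 2. No lambda wrapper is needed because the braces are just an ordinary expression.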

dpastoor commented 10 years ago

Thanks for the thorough response! This is actually quite interesting/useful.

renkun-ken commented 10 years ago

I have run a set of performance tests on different aspects of these two packages and found that pipeR is generally 3-7 times faster than magrittr, and can be much faster still when the chain is long. I will file an issue at dplyr about this, saying that the implementation of magrittr potentially reduces overall performance when the chain is long and the data is big.

More specifically, the computing time of magrittr grows polynomially with chain length, while that of pipeR grows more slowly than linearly.

Here's a benchmark result from first-argument piping for 50 steps:

Chaining test
Unit: microseconds
     expr      min       lq   median       uq      max neval
     null  358.803  383.435  400.266  418.741 27234.92 10000
    pipeR 1065.324 1129.366 1186.020 1327.653 26022.22 10000
 magrittr 8770.547 9206.325 9489.795 9912.229 35055.09 10000

Another, from dot piping for 50 steps:

Unit: microseconds
     expr      min       lq    median        uq      max neval
     null  354.697  386.309   406.835   429.825 26849.85 10000
    pipeR  655.616  715.553   752.500   860.470 27847.43 10000
 magrittr 9221.719 9749.865 10061.457 10565.381 43313.71 10000

Although a chain of 50 steps looks unrealistic, the result does show how the overhead grows with chain length.
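
The code behind the 50-step benchmarks is not shown in this thread. One possible way to generate a long chain programmatically is sketched below. The helper `make_chain` and the stand-in operator `%dot>%` are hypothetical and defined here only so that the example runs without either package; to compare the real packages, one would load them and pass their operator names as `op`.

```r
# Stand-in dot-pipe (assumption: not pipeR's or magrittr's code).
`%dot>%` <- function(x, expr) {
  caller <- parent.frame()
  eval(substitute(expr), list(. = x), caller)
}

# Hypothetical helper: builds the call `x op step op step ...` with n_steps steps.
make_chain <- function(n_steps, op = "%dot>%", step = "c(., 1)") {
  text <- paste(c("x", rep(step, n_steps)), collapse = paste0(" ", op, " "))
  parse(text = text)[[1]]
}

x <- numeric(0)
chain50 <- make_chain(50)
out <- eval(chain50)   # each step appends one element to the vector

# Rough timing of 100 evaluations of the 50-step chain.
timing <- system.time(for (i in seq_len(100)) eval(chain50))
```

Repeating this for several values of `n_steps` and plotting the elapsed time against chain length would show whether the overhead grows linearly or faster.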

smbache commented 10 years ago

Performance can be essential, and one can definitely argue that there is room for pipeR too. One needs to trade off speed and comfort, and choose one's battles. It is also faster to call lm.fit than to use the lm interface:

> microbenchmark(
+   lm.fit = lm.fit(cbind(1, iris[["Sepal.Width"]]), iris[["Sepal.Length"]]), 
+   lm     = lm(Sepal.Length ~  Sepal.Width, iris)
+ )
Unit: microseconds
   expr      min        lq   median        uq      max neval
 lm.fit   82.959   90.8705   94.933   98.7815  248.877   100
     lm 1075.474 1097.2830 1108.401 1138.9760 2205.683   100

However, lm.fit is rarely seen in practice, although one might use it in situations where speed matters, e.g. simulations. (Again, my main concern is that pipeR uses the same operator name as magrittr, so even if you wanted to, you could not have both available.)

rdinnager commented 10 years ago

As a potential user of pipeR, I agree that it would be better if pipeR used a different operator in place of %>%. I for one would be more likely to use it if it did not conflict with magrittr, which I use extensively. I would like to be able to use both, rather than being forced to choose (or being very careful with the way that the two packages are loaded, which is an unnecessary hassle). I like certain aspects of pipeR, but using %>% seems to create unnecessary competition, where I would prefer coexistence (as an ecologist, I know that coexistence is promoted by reducing overlap in resources). Just my two cents.

smbache commented 10 years ago

That was my exact point in https://github.com/renkun-ken/pipeR/issues/12. Maybe the "issue" was taken the wrong way.

renkun-ken commented 10 years ago

Thanks for your kind replies! This issue should be only about the value of this package. Let's talk about naming operators in issue #12.