renkun-ken / pipeR

Multi-Paradigm Pipeline Implementation

Add the syntax only for side effect #30

Closed renkun-ken closed 10 years ago

renkun-ken commented 10 years ago

Consider the following syntax:

x %>>% (~ expr)         # evaluate expr with . = x and return x
x %>>% ((m) ~ expr)     # evaluate expr with m = x and return x
mtcars %>>%
  (~ cat("Number of columns:",ncol(.),"\n")) %>>%
  (mpg) %>>%
  summary
Number of columns: 11 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.42   19.20   20.09   22.80   33.90 

or

mtcars %>>%
  ((x) ~ cat("Number of columns:",ncol(x),"\n")) %>>%
  (mpg) %>>%
  summary
Number of columns: 11 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.42   19.20   20.09   22.80   33.90 

Here (~ expr) or ((x) ~ expr) means that the value of expr is ignored and the input is returned unchanged, so the expression is evaluated only for its side effect. (Only one side of the formula is stressed, which also makes expr look like it is evaluated as a side branch.)
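To make the rule concrete, here is a minimal base R sketch of the intended semantics. It is not pipeR's actual implementation, and the helper name side_effect is made up for illustration: it evaluates a one-sided formula with . bound to the input, throws the result away, and returns the input.

```r
# Hypothetical helper (not part of pipeR): evaluate the expression of a
# one-sided formula with `.` bound to x, discard the result, return x.
side_effect <- function(x, formula) {
  expr <- formula[[2]]                       # the expression after `~`
  eval(expr, list(. = x), environment(formula))
  x                                          # the input passes through unchanged
}

# cat() returns NULL, but the data frame is what comes back.
result <- side_effect(mtcars, ~ cat("Number of columns:", ncol(.), "\n"))
identical(result, mtcars)
```

The point of the sketch is only that the value of the expression never reaches the next step of the pipeline.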

Note that all syntax in () automatically applies to .() in Pipe. Therefore,

Pipe(mtcars)$
  .(~ cat("Number of columns:",ncol(.),"\n"))$
  .(mpg)$
  summary()
Number of columns: 11 
$value : summaryDefault table 
------
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.42   19.20   20.09   22.80   33.90

renkun-ken commented 10 years ago

A practical case is setting the plot layout before plotting a linear model.

mtcars %>>%
  (~ par(mfrow=c(2,2))) %>>%    # only for side effect (par() returns the arg list)
  (lm(mpg ~ cyl + wt, data = .)) %>>%
  plot()

renkun-ken commented 10 years ago

Other examples:

mtcars %>>%
  (~ par(mfrow=c(1,2))) %>>%
  (~ plot(mpg ~ cyl, data = .)) %>>%
  (~ plot(mpg ~ wt, data = .)) %>>%
  (lm(mpg ~ cyl + wt, data = .)) %>>%
  summary() %>>%
  (coefficients)

Pipe(mtcars)$
  .(~ par(mfrow=c(1,2)))$
  .(~ plot(mpg ~ cyl, data = .))$
  .(~ plot(mpg ~ wt, data = .))$
  .(lm(mpg ~ cyl + wt, data = .))$
  summary()$
  .(coefficients)

Do you think it is useful? Do you think it looks ambiguous even if you know the rule and what it means?

@timelyportfolio @ramnathv @yanlinlin82

timelyportfolio commented 10 years ago

Is the question whether to have this functionality at all or what syntax is best?

I will very clearly demonstrate my ignorance here, but just to make sure I am clear: would this accomplish the objective of the magrittr %T>% tee operator? I looked quickly for an equivalent in F#, but could not find any readily available discussions or examples. Are there parallels in F# or other languages where we could borrow the syntax?

Although I use it rarely, it is very nice to have in those rare use cases, even beyond logging. I will try to work up the examples where I find it handy and see how this syntax looks. As I work through many examples, I will also see how sticky it is.

I am assuming that deprecation of lambda #31 will be mandatory to prevent confusion.

renkun-ken commented 10 years ago

Yes, it is like magrittr's %T>% operator for side effect and it is not from F# or any other language as far as I know. It helps avoid breaking pipes in some cases where we want some side effects in between, sometimes helpful for me.

magrittr introduces a new operator to do this and more operators to do other things. At early times, I saw only one or two operators in magrittr, and now I see 5 or 6. Instead of introducing new operators, I would like to carefully introduce new syntax that is not confusing and not easily abused.

The feature has been committed to branch 0.4. Would you please try it and give some suggestions? Thanks a lot!

timelyportfolio commented 10 years ago

I updated to the newest 0.4 and will test. In the past, I used it mostly with reference classes (R5).

renkun-ken commented 10 years ago

Thanks! If you think there is better syntax, please let me know. If the feature costs more than the value it brings, it would not go to master.

timelyportfolio commented 10 years ago

Figured I would borrow some code from a package that uses reference classes so I arbitrarily chose lme4. Here is a small snippet where I try to overuse the side effect functionality.

#think this will be useful for reference classes (R5)
install.packages('lme4')
library(lme4)

#borrow from lme4 vignette to test side effects operator
#str(sleepstudy)
#fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
sleepstudy %>>% 
  ( ~ str(.) ) %>>%  #note ( ~ str ) does not print str but still passes through
  #found this in the .Rnw but code is not in final vignette output  
  (~ 
     print(lattice::xyplot(Reaction ~ Days | Subject, ., aspect = "xy",
                    layout = c(9, 2), type = c("g", "p", "r"),
                    index.cond = function(x, y) coef(lm(y ~ x))[2],
                    xlab = "Days of sleep deprivation",
                    ylab = "Average reaction time (ms)",
                    as.table = TRUE))
  ) %>>%
  { lmer( Reaction ~ Days + ( Days | Subject ), . ) } %>>%
  ( ~assign( "fm1", ., envir = .GlobalEnv ) )

#the hard way to accomplish the fm1 above
# formula module
#   parsedFormula <- lFormula(formula = Reaction ~ Days + (Days|Subject),
#                                data = sleepstudy)
# 
#   # objective function module
#   devianceFunction <- do.call(mkLmerDevfun, parsedFormula)
# 
#   # optimization module
#   optimizerOutput <- optimizeLmer(devianceFunction)
# 
#   # output module
#   mkMerMod( rho = environment(devianceFunction),
#                opt = optimizerOutput,
#                reTrms = parsedFormula$reTrms,
#                fr = parsedFormula$fr)

#probably not a likely candidate for pipelining but do it nevertheless
#don't know enough yet about lme4 design to recode yet
sleepstudy %>>%
  ( ~ print("# formula module")) %>>%
  { 
    lFormula (
      formula = Reaction ~ Days + (Days|Subject)
      , data = .
    )
  } %>>%
  ( ~ assign( "parsedFormula", ., envir = .GlobalEnv ) ) %>>%
  ( ~ cat( "test parsedFormula$fr == fm1@frame\n" ) ) %>>%
  ( ~ testthat::is_identical_to(fm1@frame,parsedFormula$fr) ) %>>%
  ( ~ print( "objective function module" ) ) %>>%
  { do.call( mkLmerDevfun, . ) } %>>%
  ( ~ assign( "devianceFunction", ., envir = .GlobalEnv ) ) %>>%
  ( ~ print( "optimization module" ) ) %>>%
  optimizeLmer %>>%
  ( ~ print( "output module" ) ) %>>%
  {
    mkMerMod (
      rho = environment( devianceFunction )
      ,opt = .
      ,reTrms = parsedFormula$reTrms
      ,fr = parsedFormula$fr
    )
  }

timelyportfolio commented 10 years ago

Another use similar to logging/documenting that I had not considered would be to test in the pipeline with testthat or rtype.

renkun-ken commented 10 years ago

Here's a pseudo computing example :)

Pipe(1:3)$
    .(~ cat("connect",length(.),"elements with 2 more\n"))$
    .(~ Sys.sleep(1))$
    c(4,5)$
    .(~ cat("calculating mean\n"))$
    .(~ Sys.sleep(1))$
    mean()

timelyportfolio commented 10 years ago

hardest thing for me so far has been ~ inside of () rather than ~(), but the ( ~ ) makes more sense to me. just harder to type for some reason (probably muscle memory).

renkun-ken commented 10 years ago

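Unfortunately, ~(expr) does not work: unary ~ has lower precedence than %>>%, so the parser groups everything after it into the formula, which breaks the evaluation order and does not allow chaining.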

> as.list(quote(a %>>% ~(x) %>>% y()))
[[1]]
`%>>%`

[[2]]
a

[[3]]
~(x) %>>% y()

Neither does ~ expr work:

> as.list(quote(a %>>% ~x %>>% y()))
[[1]]
`%>>%`

[[2]]
a

[[3]]
~x %>>% y()

But (~expr) parses as desired:

> as.list(quote(a %>>% (~x) %>>% y()))
[[1]]
`%>>%`

[[2]]
a %>>% (~x)

[[3]]
y()
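
The difference can also be checked programmatically. This is a small base R sketch (quote() only parses, so %>>% does not need to be defined), and rhs is a hypothetical helper that picks the right operand of the outermost call:

```r
# Hypothetical helper: the right operand of the outermost call.
rhs <- function(e) e[[3]]

bare   <- quote(a %>>% ~x %>>% y())     # formula without parentheses
parens <- quote(a %>>% (~x) %>>% y())   # formula wrapped in parentheses

# Without parentheses the formula swallows the rest of the pipeline,
# so the right operand of the outermost %>>% is itself a call to `~`.
identical(rhs(bare)[[1]], as.name("~"))

# With parentheses the outermost call is the last pipe step, as intended.
identical(rhs(parens), quote(y()))
```

Both comparisons should come out TRUE, which is why only the parenthesized form can be chained.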

renkun-ken commented 10 years ago

In the syntax I designed, () is the feature hub: it supports more than one feature depending on the inner syntax, which may cause some confusion. But so far it looks unlikely that someone would use a feature by mistake without knowing about it.

timelyportfolio commented 10 years ago

btw, I like the computing example...

I found the lattice plot in the lme4 vignette by digging in the .Rnw source file, so I added to the example above, but thought it would be good to paste separately to see how it looks in isolation.

sleepstudy %>>% 
  ( ~ str(.) ) %>>%  #note ( ~ str ) does not print str but still passes through
  #found this in the .Rnw but code is not in final vignette output  
  (~ 
    print(lattice::xyplot(Reaction ~ Days | Subject, ., aspect = "xy",
                    layout = c(9, 2), type = c("g", "p", "r"),
                    index.cond = function(x, y) coef(lm(y ~ x))[2],
                    xlab = "Days of sleep deprivation",
                    ylab = "Average reaction time (ms)",
                    as.table = TRUE))
  ) %>>%
  { lmer( Reaction ~ Days + ( Days | Subject ), . ) } %>>%
  ( ~ assign( "fm1", ., envir = .GlobalEnv ) ) %>>%
  #test some nested calls with the profile from vignette conclusion
  ( ~ profile( . ) %>>% { print(lattice::splom(.) ) } )

and then if I have it right as a Pipe.

Pipe(sleepstudy)$
  .( ~ str(.) )$
  .( ~ 
     print(lattice::xyplot(Reaction ~ Days | Subject, ., aspect = "xy",
                           layout = c(9, 2), type = c("g", "p", "r"),
                           index.cond = function(x, y) coef(lm(y ~ x))[2],
                           xlab = "Days of sleep deprivation",
                           ylab = "Average reaction time (ms)",
                           as.table = TRUE))
  )$
  .( lmer( Reaction ~ Days + ( Days | Subject ), . ) )$
  .( ~ assign( "fm1", ., envir = .GlobalEnv ) )$
  .( ~ profile( . ) %>>% { print(lattice::splom(.)) } )[]

renkun-ken commented 10 years ago

Another step-by-step plotting example:

m <- data.frame(x=1:100,y=rnorm(100))
par(mfrow=c(2,2))
Pipe(m)$
  .(~ plot(y ~ x, data = .))$
  transform(z = y^2)$
  .(~ plot(y ~ z, data = .))$
  transform(w = (y + z))$
  .(~ plot(y ~ w, data = .))$
  transform(q = sin(x)+cos(y))$
  .(~ plot(y ~ q, data = .))

renkun-ken commented 10 years ago

I consider the main use of this feature is to

The (~ expr) syntax seems easy to distinguish from non-side-effect use and unlikely to be used by mistake. I can't imagine a user typing this syntax without knowing what it means.

timelyportfolio commented 10 years ago

lambda conflict (which I think you have decided to deprecate/eliminate) really is the only side effect of the side effect that I have thought of

timelyportfolio commented 10 years ago

I cannot think of much more to throw at it than this monstrosity replicating a post I had done previously.

# from timelyportfolio lme4 error bar post
# http://timelyportfolio.github.io/rCharts_errorbar/ucla_melogit.html
"http://www.ats.ucla.edu/stat/data/hdp.csv" %>>%
  read.csv %>>%
  within( {
    Married <- factor(Married, levels = 0:1, labels = c("no", "yes"))
    DID <- factor(DID)
    HID <- factor(HID)
  } ) %>>%
  {
    glmer(remission ~ Age + LengthofStay + FamilyHx + IL6 + CRP +
          CancerStage + Experience + (1 | DID) + (1 | HID),
          data = ., family = binomial, nAGQ=1)
  } %>>%  # show the dotplot as a reference
  (~
     print(lattice::dotplot(
       ranef(., which = "DID", postVar = TRUE),
       scales = list(y = list(alternating = 0))
     ))
  ) %>>%
  { ranef(object  = ., which = "DID", postVar = TRUE)$DID } %>>%
  {
    data.frame(
      "id" = rownames(.),  #this will be our x
      "intercept" = .[,1],            #this will be our y
      "se" = as.numeric(attr( ., "postVar" ))  #this will be our se
    )
  } %>>%  #had not thought of this use to add library
  (~ library(rCharts) ) %>>%  
  #rCharts good ref class reference for side effect helpfulness
  {
    setRefClass(
      "rChartsError"
      ,contains="rCharts"
      ,methods=list(
        initialize = function(){
          callSuper()
        }
        ,getPayload = function(chartId){
          list(chartParams = toJSON2(params), chartId = chartId, lib = basename(lib), liburl = LIB$url)
        }
      )
    )$new() %>>%
        (~ .$setLib("http://timelyportfolio.github.io/rCharts_errorbar") ) %>>%
        (~ .$setTemplate (
          script = "http://timelyportfolio.github.io/rCharts_errorbar/layouts/chart.html"
          ,chartDiv = "<div></div>"
        ) ) %>>%
        (~ .$set(
          data = get(".",parent.env(environment())),  #ugly but don't know better way
          height = 500,
          width = 1000,
          margin = list(top = 10, bottom = 10, right = 50, left = 100),
          x = "id",
          y = "intercept",
          radius = 2,
          sort = list( var = "intercept" ),
          whiskers = "#!function(d){return [d.intercept - 1.96 * d.se, d.intercept + 1.96 * d.se]}!#",
          tooltipLabels = c("id","intercept","se") 
        ))
  }

timelyportfolio commented 10 years ago

another use similar to logging would be to write a file with results for reproducibility.

timelyportfolio commented 10 years ago

more plotting examples

pdf("test.pdf")
data.frame( x = 1:10, y = 1:10 ) %>>%
  ( ~ plot( x = .[,"x"], y = .[,"y"], type = "b" ) ) %>>%
  ( ~ library(latticeExtra) ) %>>%
  ( ~ xyplot( y ~ x, data = ., type = c("p","l") ) %>>% print %>>%  ( ~ asTheEconomist(.) %>>% print ) ) %>>%
  ( ~ library(ggplot2) ) %>>% 
  ( ~ ggplot( ., aes( x = x, y = y) )  %>>% + geom_line() %>>% + geom_point() %>>% print )
dev.off()

a little different look at it with a focus on ggplot2

data.frame( x = 1:10, y = 1:10 ) %>>%
    ggplot( aes(x=x,y=y) ) %>>%
    ((g1) ~ print( g1  + geom_point()) ) %>>%
    ( ~ print( . + geom_line() )) %>>%
    str

renkun-ken commented 10 years ago

A mix of all features:

library(pipeR)
mtcars %>>%
  (~ cat("data:",ncol(.),"columns\n")) %>>%
  subset(mpg >= quantile(mpg, 0.05) & mpg <= quantile(mpg,0.95)) %>>%
  ( lm(mpg ~ cyl + disp + wt + factor(vs), data = .) ) %>>%
  summary() %>>%
  (coefficients) %>>%
  ((coe) ~ cat("coefficients:",class(coe),"\n")) %>>%
  ((coe) ~ print(coe)) %>>%
  (coe ~ coe[-1,1]) %>>%
  barplot(main = "coefficients")

I think the ((x) ~ expr) form is not distinct enough from the other syntax, and it is not obvious that it denotes a side effect.

I'm considering adopting the following syntax instead, with (~ x ~ expr) replacing ((x) ~ expr):

The syntax looks more uniform and makes more sense to me. And luckily it parses in the desired way.


> as.list(quote(~ x ~ x + 1))
[[1]]
`~`

[[2]]
~x

[[3]]
x + 1

What do you think?

renkun-ken commented 10 years ago

It's very interesting that my expression analyzer directly supports the ~ x ~ expr syntax. In fact, any formula whose lhs has length 2 is treated as a side-effect expression, with the 2nd element of the lhs taken as the symbol.
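The length-2 rule can be seen with base R alone. This is only a sketch of the parse trees, not the analyzer's code, and lhs is a hypothetical helper:

```r
# Hypothetical helper: the left-hand side of a two-sided formula call.
lhs <- function(e) e[[2]]

paren_form <- quote((x) ~ cat(x))   # lhs is the call `(`(x)
tilde_form <- quote(~ x ~ cat(x))   # lhs is the call `~`(x)
plain_form <- quote(x ~ x + 1)      # lhs is the bare symbol x

# Both side-effect forms have an lhs of length 2; the lambda form has length 1.
c(length(lhs(paren_form)), length(lhs(tilde_form)), length(lhs(plain_form)))

# In both length-2 cases the second element of the lhs is the symbol x.
lhs(paren_form)[[2]]
lhs(tilde_form)[[2]]
```

So ((x) ~ expr) and (~ x ~ expr) look the same to a check on the length of the lhs, while (x ~ expr) stays an ordinary lambda.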

The following code will run without having to change any code:

mtcars %>>%
  (~ cat("data:",ncol(.),"columns\n")) %>>%
  subset(mpg >= quantile(mpg, 0.05) & mpg <= quantile(mpg,0.95)) %>>%
  ( lm(mpg ~ cyl + disp + wt + factor(vs), data = .) ) %>>%
  summary() %>>%
  (coefficients) %>>%
  (~ coe ~ cat("coefficients:",class(coe),"\n")) %>>%
  (~ coe ~ print(coe)) %>>%
  (coe ~ coe[-1,1]) %>>%
  barplot(main = "coefficients")

yanlinlin82 commented 10 years ago

I was wondering why this would be treated as a "side effect".

I prefer to look at it directly as the final return value of the whole pipe expression (A %>>% fun). Since the default return value of a pipe expression is the rhs, why not define another operator for this "returning lhs" requirement? I think that may leave the pipe itself clearer.

For example:

1:10 %>>% mean  # return mean(1:10)
1:10 %<<% mean  # calculate mean(1:10) but only return lhs, i.e. 1:10
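
A minimal base R sketch of what such an operator could look like, assuming rhs is always a plain function. %<<% here is only the hypothetical operator proposed in this comment, written out for illustration:

```r
# Hypothetical "return lhs" operator: call rhs on lhs for its side
# effect, discard the result, and hand lhs back unchanged.
`%<<%` <- function(lhs, rhs) {
  rhs(lhs)
  lhs
}

out <- 1:10 %<<% print   # print() shows the vector; out is still 1:10
identical(out, 1:10)
```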

yanlinlin82 commented 10 years ago

A more comprehensive example could be like this:

x <- 1:10              # First I have a data set
print(x %>>% plot)     # Plot the data set, and the whole pipe expression returns NULL
x %<<% plot %>>% mean  # What if I want to calculate mean() while plotting it

I think this should be more clear than:

x %>>% (~ plot) %>>% mean

Because in the latter scenario, I need to understand the pipe expression first and only then find that it is a "side effect".

renkun-ken commented 10 years ago

Thanks @yanlinlin82 for your opinion. You just pointed out the core problem in this issue: more operators or more syntax?

Let's see the example with %<<% being the side-effect operator or simply use magrittr's %T>%.

mtcars %<<%
  (cat("data:",ncol(.),"columns\n")) %>>%
  subset(mpg >= quantile(mpg, 0.05) & mpg <= quantile(mpg,0.95)) %>>%
  (lm(mpg ~ cyl + disp + wt + factor(vs), data = .)) %>>%
  summary() %>>%
  (coefficients) %<<%
  (coe ~ cat("coefficients:",class(coe),"\n") ) %<<%
  (coe ~ print(coe)) %>>%
  (coe ~ coe[-1,1]) %>>%
  barplot(main = "coefficients")

I feel I must scan the code very carefully to understand which line is forward piping and which line is only a side effect. In this line-by-line example, only by looking back to see which operator was used can I be sure whether a line is a side effect or not. Nor can I quickly find the input of the "normal" lines without carefully looking back through the code.

I think the same problem exists with magrittr's %T>%:

mtcars %T>%
  (l(. ~ cat("data:",ncol(.),"columns\n"))) %>%
  subset(mpg >= quantile(mpg, 0.05) & mpg <= quantile(mpg,0.95)) %>%
  lm(mpg ~ cyl + disp + wt + factor(vs), data = .) %>%
  summary() %$%
  coefficients %T>%
  (l(. ~ cat("coefficients:",class(.),"\n"))) %>%
  print %>%
  (l(coe ~ coe[-1,1])) %>%
  barplot(main = "coefficients")

Do you feel you can quickly understand which object is piped to where and quickly pick out the important "really-doing-stuff" lines? Frankly speaking, I can't, because a little operator is too small to distinguish and in line-by-line piping, the operator must be written in the previous line which determines how the next piping works.

Look at the new syntax where one wants to do some logging between pipes:

library(pipeR)
mtcars %>>%
  (~ cat("data:",ncol(.),"columns\n")) %>>%
  subset(mpg >= quantile(mpg, 0.05) & mpg <= quantile(mpg,0.95)) %>>%
  ( lm(mpg ~ cyl + disp + wt + factor(vs), data = .) ) %>>%
  summary() %>>%
  (coefficients) %>>%
  (~ coe ~ cat("coefficients:",class(coe),"\n")) %>>%
  (~ coe ~ print(coe)) %>>%
  (coe ~ coe[-1,1]) %>>%
  barplot(main = "coefficients")

The code is clear to me at a glance once I know that (~ expr) or (~ x ~ expr) indicates a side effect (the formula has only one side). I don't have to care about the operator anymore, and thus don't have to look back, because there is only one.

That's why I feel there are too many operators, and they are hard to distinguish at first glance. With syntax, it should be much easier to understand the code at first glance. That's also why I make () more special: it is an alert that something special happens, and it can be seen directly inline rather than as an operator at the end of the previous line.

Typically one uses this feature only occasionally rather than heavily. Therefore, the code would look more like

Pipe(mtcars)$
  .(~ cat("data:",ncol(.),"columns\n"))$
  subset(mpg >= quantile(mpg, 0.05) & mpg <= quantile(mpg,0.95))$
  .(lm(mpg ~ cyl + disp + wt + factor(vs), data = .))$
  summary()

Just take a glimpse at the code, and it should be easy to find all lines that start with (~. If you want to understand the code quickly, just ignore those lines and see what's being done and piped. But if the code uses more operators, I believe you won't understand it or scan it so quickly, because you have to look carefully at the little symbol at the end of each line.

If you want to find out the input of a normal line, it should be pretty easy if you only look at the header of each line and look back until a line that does not start with (~, that is the line whose output is the input you want to know.

timelyportfolio commented 10 years ago

I agree x %>>% ( ~ p ~ expr ) for side effect with p = x is clearer to me.

I also vote against %<<%.

yanlinlin82 commented 10 years ago

I finally see your point. You are using the "side effect" syntax to mark branch steps that can be ignored, which makes it easy to find the main pipe stream. I admit that is better than introducing another operator.

yanlinlin82 commented 10 years ago

By the way, it just occurred to me to ask: will there always be a main stream in a pipe, with or without other branches? That is to say, if a data set is to be processed by different procedures simultaneously, and we want them all in one pipe, then we need to arbitrarily make one procedure primary and the other procedures branches. Is this right?

For example:

x <- c(... some data ...)
proc1: foo1A(x); foo1B(x); foo1C(x); ...
proc2: foo2A(x); foo2B(x); foo2C(x); ...
proc3: foo3A(x); foo3B(x); foo3C(x); ...

Then it could be written like this:

x %>>%
  (~ foo1A %>>% foo1B %>>% foo1C %>>% ...) %>>%
  (~ foo2A %>>% foo2B %>>% foo2C %>>% ...) %>>%
  foo3A %>>% foo3B %>>% foo3C ...

renkun-ken commented 10 years ago

@yanlinlin82 That's a very interesting insight! I have not yet considered much about "branching" in pipeline. It looks quite interesting. For example,

m <- data.frame(x=1:10)
par(mfrow=c(2,2))
m %>>%
  (~ . %>>% transform(y=x) %>>% plot(type="l")) %>>%
  (~ . %>>% transform(y=x^2) %>>% plot(type="l")) %>>%
  (~ . %>>% transform(y=sin(x/2)) %>>% plot(type="l")) %>>%
  (~ . %>>% transform(y=cos(x/2)) %>>% plot(type="l"))

which has four branches to manipulate one piece of data :)