xtensor lazy evaluation

juangomezduaso commented 5 years ago

It would be good to take advantage of the lazy nature of xtensor, be it to have a special function to calculate the most performance troublesome expressions; or to constitute the normal way of working of a rray future lazy version. I thought that a solution to jump the languages chasm would be to have a function in C++ that admitted not just one operation, but complex xexpressions (prepared in R where flexibility shines) to execute in C++ (where performance shines) as a whole.

I have no idea of C++ (in fact I wonder why did I post this issue at all), but I stumbled with an article: https://gallery.rcpp.org/articles/rcpp-wrap-and-recurse/ that might interest you, and that allowed me to: a) Do a simple parser in R to convert expresions involving some "fake rray" functions into a recursive list RObject b) And a simple interpreter in C++ to read that list and be instructed to call the corresponding "fake xtensor" functions recursively.

Most xtensor functions have a fixed and limited set of arguments which are either (a)rrays (of the three allowed inner types) or integer vectors (for indexes and axes). This, hopefully, makes the design of the interpreter relatively simple.

Unfortunately my ignorance of C++ hardly allowed me to do just this toy example, but I hope that its very simplicity will let you get the idea at a glance and judge it without loosing too much time.

Let f1,f2,f3 be dummy xtensor functions and rray_Fun1, rray_Fun2, rray_Fun3 their R rray counterparts We can call arbitrary expresions in R involving rray functions mixed with other calls. These expressions can be thougth as the composition of: 1) an external subexpresion (from the root and as far as it can go not traversing any non-rray call), that will go to C++ to be calculated as a whole and 2) the "leaf" or internal subexpresions, that will be evaluated in R beforehand (and if they had rray functions calls, they would have provoked other previous calls to C++. See last example below).

In C++ I have:

//RAYCALC:CPP
//  based on: 
//  https://gallery.rcpp.org/articles/rcpp-wrap-and-recurse/

#include <Rcpp.h>
using namespace Rcpp;

int f1(int a, int b){
  return 100*a+b;
}

int f2(int a, int b){
  return a*(b+1);
}

int f3(int a) {
  return a*a;
}

// [[Rcpp::export]]
int calcInt(List x) {
  int res; 
  switch((int) x[0]){
  case 1:{
    res = f1(calcInt(x[1]), calcInt(x[2]) );
    break;
  }
  case 2:{
    res = f2(calcInt(x[1]), calcInt(x[2]) );
    break;
  } 
  case 3:{
    res = f3(calcInt(x[1]) );
    break;
  } 
      case 0:{
    res = x[1];
    break;
  }
  default: {
    stop("incompatible op number encountered");
  }
  }
  return res;
}

//DOUBLE
double f1(double a, double b){
  return 100*a+b;
}

double f2(double a, double b){
  return a*(b+1);
}

double f3(double a) {
  return a*a;
}

// [[Rcpp::export]]
double calcDbl(List x) {
  double res; 
  switch((int) x[0]){
  case 1:{
    res = f1(calcDbl(x[1]), calcDbl(x[2]) );
    break;
  }
  case 2:{
    res = f2(calcDbl(x[1]), calcDbl(x[2]) );
    break;
  } 
  case 3:{
    res = f3(calcDbl(x[1]) );
    break;
  } 
  case 0:{
    res = (double)x[1];
    break;
  }
  default: {
    stop("incompatible op number encountered");
  }
  }
  return res;
}

And in R:


library(Rcpp)
sourceCpp("C:/Users/juang_000/documents/pruebarray/rraycalc.cpp")

library(rlang)

######## rray functions
rray_Fun1<-function(x,y){
  xe<-enexpr(x)
  ye<-enexpr(y)
  wholexp<- expr(rray_Fun1(!!xe,!!ye))
  explist<- parsexp(wholexp); #print(explist)
  if(any(is.double(unlist(explist)))) {
    print("double evaluation")
    calcDbl(explist)
  }else {
    print("int evaluation")
    calcInt(explist)
  } 

}
# rray_Fun2() and rray_Fun3() would be similar

### R parser
parsexp <-function(e){
  if( e[[1]] == sym("rray_Fun1") ) {
    res <- list(1L,parsexp(e[[2]]),parsexp(e[[3]]))
  }  else if(e[[1]] == sym("rray_Fun2")){
    res <- list(2L,parsexp(e[[2]]),parsexp(e[[3]]))
  }  else if(e[[1]] == sym("rray_Fun3")){
    res <- list(3L,parsexp(e[[2]]) )
  }  else {
    res <- list(0L,eval(e))
  }
  res
}

# EXAMPLES
# Integer evaluation
rray_Fun1(1L,2L) == 100 * 1 + 2
#> [1] "int evaluation"
#> [1] TRUE

# Double evaluation. And any call (like 3*4) would be done before the rest goes to c++
rray_Fun1(3*4,2L) == 100 * 12 +2
#> [1] "double evaluation"
#> [1] TRUE
rray_Fun1(3*4,rray_Fun2(4.2,rray_Fun3(5))) == 100 *12 + ( 4.2 * ((5*5) + 1) )
#> [1] "double evaluation"
#> [1] TRUE

# Any call not under our control will be evaluated beforehand 
# For instance, here the minus unary op takes rray_Fun1(1L,2L) out of the main formula
# and it will be evaluated before the list is passed to c++
rray_Fun1(-rray_Fun1(1L,2L),
          rray_Fun2(4.2,rray_Fun3(5))) == 100 *(-102) + ( 4.2 * ((5*5) + 1) )
#> [1] "int evaluation"
#> [1] "double evaluation"
#> [1] TRUE

^{Created on 2019-06-04 by the reprex package (v0.2.1)}

DavisVaughan commented 5 years ago

I've thought long and hard about the lazy evaluation model of xtensor, and how we could use it in rray. It's tough. I've even explored external pointers and returning them to the R side (they hold onto an xtensor C object), but it wont work based on how xtensor does their type system. Eventually I decided we can take advantage of the laziness when we have complex manipulations (like rray_bind) but generally we stick to R semantics and let x + y return the answer without being lazy. Unfortunately I don't think we can do much more unless we add the ability to compile on the fly (like the compile branch), but that is a bit out of scope for rray directly, I think, and would be slow because compilation speeds aren't great. When you have a brms model, that also compiles on the fly, but the time it takes to run the model generally greatly outweighs the amount of time you have to wait for it to compile. Here, the opposite would be true. You expect operations to be fast, but you have to wait for them to compile on the fly and that would be slow

juangomezduaso commented 5 years ago

But the point of this example was precisely that we are not compiling anything at runtime. Are we? (I hope I didn't missunderstand what I have done so badly) My hope is that there can be a C++ interpreter valid for any expression, and thus already compiled. In that case, the two preparation steps ( Rparser-> list-> C++interpreter) seemed instantaneous to me (taking into account that I am always considering only "human scale" expressions, with tens, not thousands of calls into them)

On the other hand, and to be honest, I am now not so sure that this translation design would also be "relatively simple" as I said yesterday. So, close the issue when you see fit, of course

wolfv commented 5 years ago

yes, I think what you are looking at is a JIT compiler - one can use for example cling, or one can write a simple R (rray) expression -> C++ translator, that then compiles a R extension (using xtensor-r) and binds that into the running R process somehow ...

That is similar to what Pythran does for Python.

juangomezduaso commented 5 years ago

Well, I think I used the wrong terms in my example, making it unnecesarily onfusing. What I called "parser" converts R expressions into an R recursive list which represents the expression tree but with some ("leaf") subtrees already evaluated. All nodes from the root will remain unevaluated until we reach a function not included in our "language" (of rray functions that we dont want to evaluate eagerly). From those nodes downwards, the subtree will be inmediately evaluated (even if it had more rray calls deeper down). To illustrate: If we have functions F1,F2,....Fn in R, to be substituted, unevaluated, to xtensor functions f1,f2,...fn. And g1,g2,...gm are other R functions or objects, an expression like: F1( F2( g1(g2,F3(F4)), F5(g3) ), g4(g5) ) would: evaluate A= g1(g2,F3(F4)) In this case this would involve another (different) conversion chunk evaluate B= g4(g5) Produce the list: list(1, list(2,list(0,A),list(5,list(0,g3)),list(0,B)) This is passed to the C++ "calculator" , that recursively walks this tree and calls xtensor functions. If i dont get it wrong, I think all this will result in a lazy construction of an increasingly more complex xexpression with no real array work until it is finally asigned and returned to R. My point was that all this conversion of expressions, whatever we call it, is based in an extremely limited and simple DSL, taylored to the concrete rray needs of taking advantage of xtensor lazyness. This has nothing to do with general C++ complation, or more general and ambitious solutions (like Pythran?), which perhaps wouldnt be as instantaneous as I think our translation would be. But has nothing to do, either, with a just in time compilation of a C++ function for a concrete call.

Having said that, I ask you both to keep in mind that, as is easy to deduce viewing my program, I have no idea of C++ (and I suppose doing DSL compilations isn´t the best way to start ;). My knowledge of xtensor is also very limited, jut discovered it through rray and my interest in array computing in R. So I am sorry if all or part of all this has no sense.

wolfv commented 5 years ago

xtensor is (and can only) be lazy when the program is compiled. If it's not compiled, one needs to write out temporary arrays between the computation steps...

So walking the (dynamic) expression tree in C++ would probably not yield many benefits at this point.

For an actual speed boost, a compilation stage is necessary. The other option would be to provide many pre-compiled kernels, such as

sin(a) * b
sin(a) + b
cos(a) * b ...

and then match the R expression to these kernels. But you will have to do this for all combinations of types for a and b (double, complex, int ...) and for many functions, and then one would need to match these kernels.

In short, it's not feasible and would result in monstrous binaries :)

So JIT compiling the C++ code, or adding a JIT compilation stage in xtensor (through Halide or asmjit) are the more feasible options (and JIT compiling the C++ would be really straightforward with rray / xtensor-r!) :)

Does that make sense?

juangomezduaso commented 5 years ago

In my (now almost surelly wrong ) model I omitted one more C++ function in the example, a necesary one if it was to be applicable to xtensor at all. R should call an array returning function (say "CALC1") passing theLlist to it. CALC1 would then call the recursive one: CALC2(theList), whose return type would be an xexpression. My ignorance of xtensor and C++ made me the illusion that I could create automatic variables and pass them as an xexpression return value between the successive recursive calls of CALC2; and use a container (and thus producing the whole actual calculation) only at the end, in CALC1 But this wont probably do it anyway. Ah, It's much better to listen than to talk about what you don't know. Thank you for your alternative suggestions, which at least I hope could help @DavisVaughan in future improvements.

juangomezduaso commented 5 years ago

Back to R kingdom (which I shuldn't have left ;) , but still thinking about performance, I was surprised when trying to compare rray and Rbase with results like the ones below. I used an obvious slice repetition as the way to attain broadcasting in Rbase. Am i being unfair somehow in this comparation? Any subtelty that I am not aware of? The fact is that I only find advantage for rray in the simplest case. And memory doesnt seem key either.


library(rray)

set.seed(92136)
n <- 1e3
# build n x n test matrix and 2 margins
x <- rray(rnorm(n), c(1, n))
y <- rray_transpose(x)
z <- as_rray(crossprod(x, x)) # symmetric
# Base array versions:
ax<-as_array(x)
ay<-as_array(y)
az<-as_array(z)
# A single operation:
calcInR1<-function(a,b,c){
  a <- a[rep(1,n)]
  b <- b[,rep(1,n)]
  as_rray(  a+b   )
}
bench::mark(x+y)
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x + y        10.9ms     12ms      84.1    7.79MB     30.3
bench::mark(calcInR1(ax,ay,az))
#> # A tibble: 1 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 calcInR1(ax, ay, az)   11.3ms   12.1ms      78.5    15.3MB     168.

# A simple expression:
calcInR2<-function(a,b,c){
  a <- a[rep(1,n)]
  b <- b[,rep(1,n)]
 as_rray(  (a+b)/c   )
}

bench::mark((x+y)/z)
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 (x + y)/z    21.1ms   21.8ms      45.5    30.5MB     114.
bench::mark(calcInR2(ax,ay,az))
#> # A tibble: 1 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 calcInR2(ax, ay, az)   16.1ms   16.4ms      61.0    22.9MB     285.

# A more complex expression:
calcInR3<-function(a,b,c){
  a <- a[rep(1,n)]
  b <- b[,rep(1,n)]
  as_rray(   (1+a*b)/c     )
}
bench::mark((1+x*y)/z)
#> # A tibble: 1 x 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 (1 + x * y)/z     35ms   36.3ms      27.5    38.2MB     27.5
bench::mark(calcInR3(ax,ay,az))
#> # A tibble: 1 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 calcInR3(ax, ay, az)   16.5ms   17.5ms      53.0    22.9MB     66.3

# With  args repetition:
calcInR4<-function(a,b,c,expr){
  a <- a[rep(1,n)]
  b <- b[,rep(1,n)]
  as_rray( (a*b+a*b)/c )
}
bench::mark((x*y+x*y)/z)
#> # A tibble: 1 x 6
#>   expression             min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 (x * y + x * y)/z   51.5ms   51.5ms      19.4    53.4MB     97.0
bench::mark(calcInR4(ax,ay,az))
#> # A tibble: 1 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 calcInR4(ax, ay, az)   20.5ms   20.9ms      44.7    30.6MB     71.5

^{Created on 2019-06-08 by the reprex package (v0.2.1)}

DavisVaughan commented 5 years ago

This is a bit unfair, but the reasons are very subtle. It's a fairly unknown thing, but R is very clever when it comes to performing arithmetic operations. Essentially, when R does (x * y + z) / z it calculates x * y, which allocates memory, and a temporary result is returned but is not assigned to any symbol (meaning it isn't assigned to any named variable). Then R does x * y + z and reuses the memory allocated to the temporary result of x * y in the addition operation with z. Because R doesn't broadcast, it can always reuse this memory without any issues. So essentially that whole chain of operations only allocates once. This has major speed implications.

Here is some proof:

x <- rep(1, times = 10000)
y <- x
z <- x

trash <- profmem::profmem(res <- (x * y + z) / z)

profmem::profmem(res <- (x * y + z) / z)
#> Rprofmem memory profiling of:
#> res <- (x * y + z)/z
#> 
#> Memory allocations:
#>        what bytes      calls
#> 1     alloc 80040 <internal>
#> total       80040

Unfortunately, as of right now this optimization is not available to the R developer. So rray has to allocate for each one of these operations. This is why x + y looks to be about the same speed, but going beyond that things get complicated.

For the interested, here is the function that is used in arithmetic operations that makes this happen. Apparently it was adapted from the pqr project! (pretty quick R). https://github.com/wch/r-source/blob/02c7a25c5c579a3d254ffcf4e6b77cab2f0b7d80/src/main/arithmetic.h#L65

More intensive comparison:

x <- rep(1, times = 10000000)
y <- x
z <- x

trash <- profmem::profmem(res <- (x * y + z) / z)

# with optimization
profmem::profmem(res <- (x * y + z) / z)
#> Rprofmem memory profiling of:
#> res <- (x * y + z)/z
#> 
#> Memory allocations:
#>        what    bytes      calls
#> 1     alloc 80000040 <internal>
#> total       80000040

# without optimization
profmem::profmem(
  {
    tmp1 <- x * y
    tmp2 <- tmp1 + z
    res <- tmp2 / z
  }
)
#> Rprofmem memory profiling of:
#> {
#>     tmp1 <- x * y
#>     tmp2 <- tmp1 + z
#>     res <- tmp2/z
#> }
#> 
#> Memory allocations:
#> Number of 'new page' entries not displayed: 7
#>        what     bytes      calls
#> 8     alloc  80000040 <internal>
#> 9     alloc  80000040 <internal>
#> 10    alloc  80000040 <internal>
#> total       240000120

# speed compare
bench::mark(
  (x * y + z) / z,
  {
    tmp1 <- x * y
    tmp2 <- tmp1 + z
    tmp2 / z
  }
)
#> # A tibble: 2 x 10
#>   expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 (x * y + … 104ms 146ms  146ms 188ms      6.85    76.3MB     2     2
#> 2 {...       205ms 205ms  205ms 205ms      4.87   228.9MB     1     1
#> # … with 1 more variable: total_time <bch:tm>

wolfv commented 5 years ago

@DavisVaughan as far as i know you are using the reshape_view for col-wise broadcasting, right?

I do hope that certain changes to strided_view's (and reshape-view) will help speed your way of broadcasting up substantially (the changes are here: https://github.com/QuantStack/xtensor/pull/1627 )

Also, once rray stabilizes a bit, we can run a profiler.

Regarding temporaries, I agree, I wonder if we could do something clever there (in the future).

DavisVaughan commented 5 years ago

@wolfv yea I use reshape_view() so col-wise broadcasting works correctly (to pad the number of dimensions first, before applying the broadcasting operations, as we have discussed before!)

I saw that PR! I'm interested in trying it out to see if I notice a difference.

Yea I'm interested to see how this can improve down the line (I'm sure there something we can do), but for the first version I'm okay without it.

juangomezduaso commented 5 years ago

Interesting explanation. I'm sure that improvements will be made, but as it is now I believe that the vast majority of rray users will not have any performance problems. I think the main aspect to watch out for would be the way performance scales as expression lengthens. I have tried to test it, but I trigger the GC in every iteration; so filtering in the benchmark is disabled. I dont think that anybody will write expressions with dozens of operations, but some careless application could generate them automatically (imagine a selection of all but a few items in an axis leading to the sum of a hundred slices, for instance )

DavisVaughan commented 5 years ago

I'm going to close this for now. In the future we can revisit how to improve on the temporaries

r-lib / rray

xtensor lazy evaluation #205