I've thought long and hard about the lazy evaluation model of xtensor and how we could use it in rray. It's tough. I've even explored external pointers and returning them to the R side (they hold onto an xtensor C++ object), but that won't work because of how xtensor's type system is designed. Eventually I decided we can take advantage of the laziness when we have complex manipulations (like rray_bind()), but generally we stick to R semantics and let x + y return the answer without being lazy. Unfortunately, I don't think we can do much more unless we add the ability to compile on the fly (like the compile branch), but that is a bit out of scope for rray directly, I think, and would be slow because compilation speeds aren't great. A brms model also compiles on the fly, but the time it takes to run the model generally greatly outweighs the time you wait for it to compile. Here, the opposite would be true: you expect operations to be fast, but you would have to wait for them to compile on the fly, and that would be slow.
But the point of this example was precisely that we are not compiling anything at runtime. Are we? (I hope I didn't misunderstand what I have done so badly.) My hope is that there can be a C++ interpreter valid for any expression, and thus already compiled. In that case, the two preparation steps (R parser -> list -> C++ interpreter) seemed instantaneous to me (taking into account that I am always considering only "human scale" expressions, with tens, not thousands, of calls in them).
On the other hand, and to be honest, I am now not so sure that this translation design would be as "relatively simple" as I said yesterday. So, close the issue when you see fit, of course.
Yes, I think what you are looking for is a JIT compiler. One could use cling, for example, or one could write a simple R (rray) expression -> C++ translator that compiles an R extension (using xtensor-r) and binds it into the running R process somehow... That is similar to what Pythran does for Python.
Well, I think I used the wrong terms in my example, making it unnecessarily confusing. What I called the "parser" converts R expressions into a recursive R list which represents the expression tree, but with some ("leaf") subtrees already evaluated. All nodes from the root remain unevaluated until we reach a function not included in our "language" (the rray functions that we don't want to evaluate eagerly). From those nodes downwards, the subtree is immediately evaluated (even if it has more rray calls deeper down).
To illustrate: suppose we have functions F1, F2, ..., Fn in R, to be substituted, unevaluated, with the xtensor functions f1, f2, ..., fn, and that g1, g2, ..., gm are other R functions or objects. Then an expression like F1( F2( g1(g2, F3(F4)), F5(g3) ), g4(g5) ) would:
evaluate A = g1(g2, F3(F4)) (in this case this would involve another, different, conversion chunk)
evaluate B = g4(g5)
produce the list: list(1, list(2, list(0, A), list(5, list(0, g3))), list(0, B))
This is passed to the C++ "calculator", which recursively walks this tree and calls xtensor functions. If I'm not mistaken, all this will result in the lazy construction of an increasingly complex xexpression, with no real array work until it is finally assigned and returned to R.
My point was that all this conversion of expressions, whatever we call it, is based on an extremely limited and simple DSL, tailored to the concrete rray need of taking advantage of xtensor laziness. This has nothing to do with general C++ compilation, or with more general and ambitious solutions (like Pythran?), which perhaps wouldn't be as instantaneous as I think our translation would be. But it has nothing to do, either, with just-in-time compilation of a C++ function for a concrete call.
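To make this concrete, here is a minimal R sketch of the conversion I am describing (all names here are hypothetical placeholders; the real version would map actual rray functions):
# Hypothetical map from the R-level functions F1, ..., Fn to integer codes;
# anything not in the map is evaluated eagerly and wrapped as list(0, value)
rray_funs <- c(F1 = 1L, F2 = 2L, F3 = 3L, F4 = 4L, F5 = 5L)
parse_rray <- function(expr, env = parent.frame()) {
  if (is.call(expr)) {
    code <- rray_funs[as.character(expr[[1]])]
    if (!is.na(code)) {
      # An rray call: keep it unevaluated and recurse into its arguments
      args <- lapply(as.list(expr)[-1], parse_rray, env = env)
      return(c(list(unname(code)), args))
    }
  }
  # A non-rray call, symbol, or constant: evaluate the whole subtree now
  list(0L, eval(expr, env))
}
# parse_rray(quote(F1(F2(g1(g2, F3(F4)), F5(g3)), g4(g5))))
# would evaluate g1(g2, F3(F4)) and g4(g5) eagerly (to A and B) and return
# list(1, list(2, list(0, A), list(5, list(0, g3))), list(0, B))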
Having said that, I ask you both to keep in mind that, as is easy to deduce from looking at my program, I have no idea of C++ (and I suppose doing DSL compilation isn't the best way to start ;). My knowledge of xtensor is also very limited; I just discovered it through rray and my interest in array computing in R. So I am sorry if all or part of this makes no sense.
xtensor is (and can only be) lazy when the program is compiled. If it's not compiled, one needs to write out temporary arrays between the computation steps... So walking the (dynamic) expression tree in C++ would probably not yield many benefits at this point.
For an actual speed boost, a compilation stage is necessary. The other option would be to provide many pre-compiled kernels, such as sin(a) * b, sin(a) + b, cos(a) * b, ..., and then match the R expression to these kernels. But you would have to do this for all combinations of types for a and b (double, complex, int, ...) and for many functions.
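A toy R-level illustration of the matching step (purely hypothetical; the real kernels would be compiled C++ functions, with one entry per type combination):
# Hypothetical table of "pre-compiled" kernels, keyed by expression shape
kernels <- list(
  "sin(a) * b" = function(a, b) sin(a) * b,
  "sin(a) + b" = function(a, b) sin(a) + b,
  "cos(a) * b" = function(a, b) cos(a) * b
)
# Match an unevaluated R expression against the kernel table
dispatch_kernel <- function(expr, a, b) {
  key <- deparse(expr)
  kernel <- kernels[[key]]
  if (is.null(kernel)) stop("no pre-compiled kernel for: ", key)
  kernel(a, b)
}
dispatch_kernel(quote(sin(a) * b), a = runif(5), b = runif(5))
Even this toy version shows the combinatorial problem: every expression shape, crossed with every argument type, needs its own entry.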
In short, it's not feasible and would result in monstrous binaries :)
So JIT compiling the C++ code, or adding a JIT compilation stage in xtensor (through Halide or asmjit) are the more feasible options (and JIT compiling the C++ would be really straightforward with rray / xtensor-r!) :)
Does that make sense?
In my (now almost surely wrong) model I omitted one more C++ function from the example, a necessary one if it was to be applicable to xtensor at all. R should call an array-returning function (say CALC1), passing the list to it. CALC1 would then call the recursive one, CALC2(theList), whose return type would be an xexpression. My ignorance of xtensor and C++ gave me the illusion that I could create automatic variables and pass them as an xexpression return value between the successive recursive calls of CALC2, and use a container (thus producing the whole actual calculation) only at the end, in CALC1. But this probably won't work anyway. It's much better to listen than to talk about what you don't know. Thank you for your alternative suggestions, which I hope can at least help @DavisVaughan with future improvements.
Back to the R kingdom (which I shouldn't have left ;), but still thinking about performance, I was surprised when comparing rray and base R with results like the ones below. I used an obvious slice repetition as the way to attain broadcasting in base R. Am I being unfair somehow in this comparison? Is there any subtlety I am not aware of? The fact is that I only find an advantage for rray in the simplest case, and memory doesn't seem to be key either.
library(rray)
set.seed(92136)
n <- 1e3
# build n x n test matrix and 2 margins
x <- rray(rnorm(n), c(1, n))
y <- rray_transpose(x)
z <- as_rray(crossprod(x, x)) # symmetric
# Base array versions:
ax <- as_array(x)
ay <- as_array(y)
az <- as_array(z)
# A single operation:
calcInR1 <- function(a, b, c) {
  a <- a[rep(1, n), ]  # repeat the single row n times
  b <- b[, rep(1, n)]  # repeat the single column n times
  as_rray(a + b)
}
bench::mark(x+y)
#> # A tibble: 1 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 x + y 10.9ms 12ms 84.1 7.79MB 30.3
bench::mark(calcInR1(ax,ay,az))
#> # A tibble: 1 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 calcInR1(ax, ay, az) 11.3ms 12.1ms 78.5 15.3MB 168.
# A simple expression:
calcInR2 <- function(a, b, c) {
  a <- a[rep(1, n), ]
  b <- b[, rep(1, n)]
  as_rray((a + b) / c)
}
bench::mark((x+y)/z)
#> # A tibble: 1 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 (x + y)/z 21.1ms 21.8ms 45.5 30.5MB 114.
bench::mark(calcInR2(ax,ay,az))
#> # A tibble: 1 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 calcInR2(ax, ay, az) 16.1ms 16.4ms 61.0 22.9MB 285.
# A more complex expression:
calcInR3 <- function(a, b, c) {
  a <- a[rep(1, n), ]
  b <- b[, rep(1, n)]
  as_rray((1 + a * b) / c)
}
bench::mark((1+x*y)/z)
#> # A tibble: 1 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 (1 + x * y)/z 35ms 36.3ms 27.5 38.2MB 27.5
bench::mark(calcInR3(ax,ay,az))
#> # A tibble: 1 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 calcInR3(ax, ay, az) 16.5ms 17.5ms 53.0 22.9MB 66.3
# With args repetition:
calcInR4 <- function(a, b, c) {
  a <- a[rep(1, n), ]
  b <- b[, rep(1, n)]
  as_rray((a * b + a * b) / c)
}
bench::mark((x*y+x*y)/z)
#> # A tibble: 1 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 (x * y + x * y)/z 51.5ms 51.5ms 19.4 53.4MB 97.0
bench::mark(calcInR4(ax,ay,az))
#> # A tibble: 1 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 calcInR4(ax, ay, az) 20.5ms 20.9ms 44.7 30.6MB 71.5
Created on 2019-06-08 by the reprex package (v0.2.1)
This is a bit unfair, but the reasons are very subtle. It's not widely known, but R is very clever when it comes to performing arithmetic operations. Essentially, when R evaluates (x * y + z) / z, it first calculates x * y, which allocates memory, and a temporary result is returned but is not assigned to any symbol (meaning it isn't bound to any named variable). Then R computes x * y + z and reuses the memory allocated for the temporary result of x * y in the addition with z. Because R doesn't broadcast, it can always reuse this memory without any issues. So essentially that whole chain of operations only allocates once. This has major speed implications.
Here is some proof:
x <- rep(1, times = 10000)
y <- x
z <- x
trash <- profmem::profmem(res <- (x * y + z) / z)
profmem::profmem(res <- (x * y + z) / z)
#> Rprofmem memory profiling of:
#> res <- (x * y + z)/z
#>
#> Memory allocations:
#> what bytes calls
#> 1 alloc 80040 <internal>
#> total 80040
Unfortunately, as of right now this optimization is not available to the R developer, so rray has to allocate for each one of these operations. This is why x + y looks to be about the same speed, but beyond that things get complicated.
For the interested, here is the function used in arithmetic operations that makes this happen. Apparently it was adapted from the pqR project (pretty quick R)! https://github.com/wch/r-source/blob/02c7a25c5c579a3d254ffcf4e6b77cab2f0b7d80/src/main/arithmetic.h#L65
More intensive comparison:
x <- rep(1, times = 10000000)
y <- x
z <- x
trash <- profmem::profmem(res <- (x * y + z) / z)
# with optimization
profmem::profmem(res <- (x * y + z) / z)
#> Rprofmem memory profiling of:
#> res <- (x * y + z)/z
#>
#> Memory allocations:
#> what bytes calls
#> 1 alloc 80000040 <internal>
#> total 80000040
# without optimization
profmem::profmem(
{
tmp1 <- x * y
tmp2 <- tmp1 + z
res <- tmp2 / z
}
)
#> Rprofmem memory profiling of:
#> {
#> tmp1 <- x * y
#> tmp2 <- tmp1 + z
#> res <- tmp2/z
#> }
#>
#> Memory allocations:
#> Number of 'new page' entries not displayed: 7
#> what bytes calls
#> 8 alloc 80000040 <internal>
#> 9 alloc 80000040 <internal>
#> 10 alloc 80000040 <internal>
#> total 240000120
# speed compare
bench::mark(
(x * y + z) / z,
{
tmp1 <- x * y
tmp2 <- tmp1 + z
tmp2 / z
}
)
#> # A tibble: 2 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc n_itr
#> <chr> <bch> <bch> <bch:> <bch> <dbl> <bch:byt> <dbl> <int>
#> 1 (x * y + … 104ms 146ms 146ms 188ms 6.85 76.3MB 2 2
#> 2 {... 205ms 205ms 205ms 205ms 4.87 228.9MB 1 1
#> # … with 1 more variable: total_time <bch:tm>
@DavisVaughan as far as I know you are using reshape_view for col-wise broadcasting, right?
I do hope that certain changes to strided_view (and reshape_view) will substantially speed up your way of broadcasting (the changes are here: https://github.com/QuantStack/xtensor/pull/1627).
Also, once rray stabilizes a bit, we can run a profiler.
Regarding temporaries, I agree. I wonder if we could do something clever there (in the future).
@wolfv yeah, I use reshape_view() so col-wise broadcasting works correctly (to pad the number of dimensions first, before applying the broadcasting operations, as we have discussed before!).
I saw that PR! I'm interested in trying it out to see if I notice a difference.
Yeah, I'm interested to see how this can improve down the line (I'm sure there's something we can do), but for the first version I'm okay without it.
Interesting explanation. I'm sure that improvements will be made, but as it is now I believe the vast majority of rray users will not have any performance problems. I think the main aspect to watch out for is the way performance scales as expressions lengthen. I have tried to test it, but I trigger the GC in every iteration, so GC filtering in the benchmark has to be disabled. I don't think anybody will write expressions with dozens of operations by hand, but some careless application could generate them automatically (imagine a selection of all but a few items along an axis leading to the sum of a hundred slices, for instance).
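For reference, this is roughly the kind of scaling test I had in mind (a hypothetical sketch; filter_gc = FALSE keeps the iterations that trigger the GC instead of filtering them out):
# A machine-generated chain of k slice additions on a base matrix
m <- matrix(rnorm(1e6), 1e3, 1e3)
slice_chain <- function(m, k = 100) {
  acc <- m[1, ]
  for (i in seq_len(k - 1)) {
    acc <- acc + m[i + 1, ]
  }
  acc
}
bench::mark(slice_chain(m), filter_gc = FALSE)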
I'm going to close this for now. In the future we can revisit how to improve on the temporaries.
It would be good to take advantage of the lazy nature of xtensor, be it with a special function to calculate the most performance-troublesome expressions, or as the normal way of working in a future lazy version of rray. I thought that a solution to jump the chasm between the languages would be a C++ function that accepted not just one operation, but complex xexpressions (prepared in R, where flexibility shines) to execute in C++ (where performance shines) as a whole.
I have no idea of C++ (in fact I wonder why I posted this issue at all), but I stumbled upon an article, https://gallery.rcpp.org/articles/rcpp-wrap-and-recurse/, that might interest you and that allowed me to: a) write a simple parser in R to convert expressions involving some "fake rray" functions into a recursive list R object, and b) write a simple interpreter in C++ that reads that list and is instructed to call the corresponding "fake xtensor" functions recursively.
Most xtensor functions have a fixed and limited set of arguments, which are either (a)rrays (of the three allowed inner types) or integer vectors (for indexes and axes). This, hopefully, makes the design of the interpreter relatively simple.
Unfortunately, my ignorance of C++ barely allowed me to do even this toy example, but I hope that its very simplicity will let you get the idea at a glance and judge it without losing too much time.
Let f1, f2, f3 be dummy xtensor functions and rray_Fun1, rray_Fun2, rray_Fun3 their R rray counterparts. We can call arbitrary expressions in R involving rray functions mixed with other calls. These expressions can be thought of as the composition of: 1) an external subexpression (from the root and as far as it can go without traversing any non-rray call), which goes to C++ to be calculated as a whole, and 2) the "leaf" or internal subexpressions, which are evaluated in R beforehand (and if they contain rray function calls, they will have provoked other, previous calls to C++; see the last example below).
In C++ I have:
And in R:
Created on 2019-06-04 by the reprex package (v0.2.1)