greta-dev / greta

simple and scalable statistical modelling in R
https://greta-stats.org

(Discussion) A generalization of `greta`'s DAG construction for matrices/arrays #247

Open DavisVaughan opened 5 years ago

DavisVaughan commented 5 years ago

Hi @goldingn, I've been reading through some of greta's source code, and I'm pretty impressed with the R6 setup for nodes of the DAG. I've been thinking a bit about how that could generalize to be used outside of greta, and wanted to get your thoughts.

Imagine a world where, instead of executing z <- x + y on two R matrices immediately, we capture the fact that you are doing addition on two objects and return something with the correct shape of the output, but with ? in each cell rather than the result. greta essentially does this already. Then the user can run calculate() on z to actually get the result. This can continue, meaning that we can do z + x to get another object without ever calculating z. You already know all this.
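Roughly, in greta today (a minimal sketch; exact printing and the calculate() signature vary a little by version):

library(greta)

x <- as_data(matrix(c(1, 2, 3)))  # known data: a 3x1 greta array
y <- variable(dim = c(3, 1))      # unknown values print as ?

z <- x + y  # the addition is captured, not executed; z still has shape 3x1

# supply values for the unknowns to actually evaluate the captured graph
calculate(z, values = list(y = matrix(0, nrow = 3)))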

The best part, in my opinion, is that since you have captured just the operations, it is very much like dplyr in that the backend can be anything: base R, tensorflow, or xtensor (a C++ library I've been working with that allows for lazy and broadcasted array semantics), or even a database where the table is purely numeric.

So essentially what I propose is extracting the node scaffolding from greta and making it a bit more generic (so tf_operation would just be operation, which could store "add", and you'd specify later that you are using tensorflow, or xtensor, or whatever), and then greta is just one extension of that system.
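In other words, something like this (purely illustrative, not greta's actual internals):

# backend-neutral nodes are just data describing the computation
leaf <- function(value) list(type = "data", value = value)
op <- function(name, ...) list(type = "operation", op = name, args = list(...))

x <- leaf(matrix(1:3))
y <- leaf(matrix(4:6))
z <- op("add", x, y)  # records "add"; a backend interprets it later

str(z, max.level = 1)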

What do you think? It's obviously a huge project, and nothing that I'd be requesting you do alone, but I think it could provide for some really powerful abstraction.

PS) I have done a little work on an integration of xtensor into R in the rray package here. It doesn't take advantage of the fact that the operations are lazy, though, which this abstraction could.

goldingn commented 5 years ago

Hey @davisvaughan. Yes, this is something I have thought about a little. In building greta, I've actually ended up writing a rudimentary dataflow programming version of R. I think it would be a great idea to pull out that bit of the code into its own more general package and tighten it up. Time (as ever) is the only thing that has stopped me from doing that.

One of the reasons I went with this design for greta was so that I could drop in a different computational backend if TensorFlow became obsolete, or there was a need for some lighter-weight backend.

It would be really nice to be able to run computations efficiently with different backends, but without changing the R code at all. The package has a function to convert an arbitrary R function to a tensorflow function. The same thing could be done for other backends (and expressions). How cool would it be to GPU-ify and scale up something like stats::glm() without touching the package code? Users could then just switch out their backend with some command, and run code from existing packages more efficiently on modern hardware.
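A minimal sketch of the "switch out their backend with some command" idea; all of these names are hypothetical, not an existing API:

backends <- new.env()
backends$impls <- list(
  base = list(add = `+`),
  rray = list(add = function(a, b) rray::rray_add(a, b))
)
backends$active <- "base"

# the user-facing switch
set_backend <- function(name) assign("active", name, envir = backends)

# every captured "add" operation is executed via the active backend
backend_add <- function(a, b) backends$impls[[backends$active]]$add(a, b)

set_backend("base")
backend_add(matrix(1:3), matrix(4:6))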

Some current pain points that wouldn't be too hard to resolve:

- e.g. no private methods/objects and some fairly complex objects that could be split up
- greta only handles arrays (and matrices); adding support for vectors won't be hard and is on the to do list
- I think references to node 'parents' and 'children' are also the wrong way around

So yeah, definitely keen. The only thing slowing all this down is the limited amount of time I can spend on software development in my academic job. So it's great that you're keen and see the potential too!

goldingn commented 5 years ago

And rray and xtensor look awesome! One of the frustrating things about making greta work with standard R syntax is that working with arrays in greta models is much more painful than it needs to be. rray has the matrix/array interface I'd like to be able to use.

DavisVaughan commented 5 years ago

I spent some of the day working on pulling out a little of the data flow style code. It lives here: https://github.com/DavisVaughan/nodegraph (forgive me for copying a bit of yours)

> e.g. no private methods/objects and some fairly complex objects that could be split up

I tried to be a bit cleaner about this, with getters/setters and private methods and a few other design decisions that are different from yours.

> Users could then just switch out their backend with some command

I imagine this would work a little like greta. Users create an xtensor_array vs a greta_array, and then sin() instantly works differently depending on the input object.
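A toy sketch of that dispatch, using base R's Math group generic (the tf_array class name is made up for illustration):

# one method per class is enough to change sin(), exp(), log(), etc.
Math.tf_array <- function(x, ...) {
  message("queueing a TensorFlow '", .Generic, "' node")
}

a <- structure(1, class = "tf_array")
sin(a)
#> queueing a TensorFlow 'sin' node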

> greta only handles arrays (and matrices). Adding support for vectors won't be hard and is on the to do list.

For matrix/array-like calculations, I find it somewhat easier to reason about things if you throw out the concept of vectors entirely and treat them as one-column matrices (as greta seems to do). But obviously if you want to do the data frame abstraction then you'd need to support them. That's interesting, though; I hadn't thought about doing that.

One of the things I'm currently struggling to come up with a solution for is: at what point do you actually say "okay, go do the actual computation"?

Like, if I have a DAG representing the expressions z <- x + y; k <- z + 4, how do I force computation of k in a general way? I was thinking about exposing some kind of calculate() method on one of the R6 classes that a subclass could override with the real implementation of the calculation, but it's still a little fuzzy to me. I couldn't figure out exactly how greta does this.
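For what it's worth, here is a minimal sketch of the overridable-method idea (none of this is greta's or nodegraph's actual API):

library(R6)

Node <- R6Class("Node", public = list(
  op = NULL, children = NULL, value = NULL,
  initialize = function(op = NULL, children = list(), value = NULL) {
    self$op <- op
    self$children <- children
    self$value <- value
  },
  # recursively force the children, then apply this node's operation;
  # a subclass could override compute() to target another backend
  compute = function() {
    if (is.null(self$value)) {
      args <- lapply(self$children, function(child) child$compute())
      self$value <- do.call(self$op, args)  # memoise the result
    }
    self$value
  }
))

x <- Node$new(value = matrix(1:3))
y <- Node$new(value = matrix(4:6))
z <- Node$new(op = `+`, children = list(x, y))
k <- Node$new(op = `+`, children = list(z, Node$new(value = 4)))
k$compute()  # forces z, then k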

DavisVaughan commented 5 years ago

> I think references to node 'parents' and 'children' are also the wrong way around.

I was a little confused when I was reading your code regarding this too. I think I figured out that when you do something like x + y, then + is the parent and x and y are the children, and the children point upward to their parent. I'm normally used to a parent pointing to a child (as in a binary search tree), but once I made this distinction everything was fine, and I think it is correct this way.
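So for x + y the picture is (my own sketch of the convention):

#        (+)        <- parent: the operation node
#       /   \
#    (x)     (y)    <- children: the operands, each pointing up to (+)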

goldingn commented 5 years ago

Nice!

Rather than have different classes of object for the different backends, I was imagining a single class of array object, with the backend registered at calculation time (e.g. calculate(z, list(x = 1, y = 1), backend = "rray")). Though the advantage of different classes is that you can error earlier if an operation isn't available with that backend.

Not sure I understand your last point there. calculate() (with a list of values) constructs the DAG and pushes the values through it. Because TF is also a dataflow system, the TF graph is built, then sess$run() is called on the tensor for the result, feeding values of the inputs via a feed dict. In the xtensor case, I imagine you could just define objects in an environment for all intermediate values and the final value, which you then return.
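A minimal sketch of that environment idea (the node layout here is invented for illustration):

# each node: a name, a function, and the names of the nodes it consumes
nodes <- list(
  list(name = "z", op = `+`, children = c("x", "y")),
  list(name = "k", op = `+`, children = c("z", "four"))
)

run_dag <- function(nodes, values) {
  env <- list2env(values, envir = new.env())  # leaf values from the user
  for (node in nodes) {                       # nodes in topological order
    args <- lapply(node$children, get, envir = env)
    assign(node$name, do.call(node$op, args), envir = env)
  }
  get(nodes[[length(nodes)]]$name, envir = env)  # the final value
}

run_dag(nodes, list(x = 1, y = 2, four = 4))
#> [1] 7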

DavisVaughan commented 5 years ago

> Though the advantage of different classes is that you can error earlier if an operation isn't available with that backend.

Also, some backends support broadcasting (like xtensor) and others might not, so you probably need to know which backend you have when performing operations so that the dim construction is done correctly along the way.
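For example (assuming rray is installed; base R refuses what xtensor-backed rray happily broadcasts):

# base R: adding a 3x2 and a 3x1 matrix is an error
matrix(1, nrow = 3, ncol = 2) + matrix(1, nrow = 3, ncol = 1)
#> Error: non-conformable arrays

# rray broadcasts the 3x1 across the columns instead, giving a 3x2 result
rray::rray_add(matrix(1, nrow = 3, ncol = 2), matrix(1, nrow = 3, ncol = 1))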

I forgot you had already defined calculate(). I was talking about something slightly different, not a big deal, sorry! But what you say makes sense, thanks.

goldingn commented 5 years ago

Right, I was still thinking of the user syntax just being the native R syntax, in which case they'd all have the same interface and would just handle things differently in the backend.

But yeah, to make use of the nice syntax stuff with rray, it would be useful to have the different classes. Possibly with the ability to translate unknown values between the types.

DavisVaughan commented 5 years ago

Wanted to share an update. It's all very rough, but kind of exciting. I've got nodegraph holding the core implementation of the Node and DAG classes, and then a delay_array S3 class that implements delayed evaluation for base R objects.

But it's all extensible through two core S3 methods (and a few other things), so other classes can "easily" create their own lazy systems. For example, I went ahead and bootstrapped a lazy version of rray arrays as well. The system that computes dimensions is one of the two extensible methods, so you can compute broadcasted dimensions correctly if your class requires it. The other S3 method controls how the computation is done (i.e. essentially specifying xtensor as the backend by giving me a chance to make rray objects). A sketch of these two extension points follows the reprex below.

library(rray)
library(nodegraph)

# //////////////////////////////////////////////////////////////////////////////
# Use delay_arrays. They work like R matrices/arrays, but are lazy

x <- as_delay_array(matrix(c(1,2,3)))
y <- as_delay_array(matrix(c(1,2,4)))

# x and y look like matrices, but they are more complex
x
#> <delay_array>
#>      [,1]
#> [1,]    1
#> [2,]    2
#> [3,]    3
y
#> <delay_array>
#>      [,1]
#> [1,]    1
#> [2,]    2
#> [3,]    4

# + is lazy
z <- x + y

# we know the shape of z and it uses base R shape rules
z
#> <delay_array>
#>      [,1]
#> [1,]  ?  
#> [2,]  ?  
#> [3,]  ?

# scalars work
zz <- z - 2

zz
#> <delay_array>
#>      [,1]
#> [1,]  ?  
#> [2,]  ?  
#> [3,]  ?

# okay, let's actually compute zz now
compute(zz)
#> <delay_array>
#>      [,1]
#> [1,]    0
#> [2,]    2
#> [3,]    5

# now, all upstream dependencies of zz are known, along with zz
zz
#> <delay_array>
#>      [,1]
#> [1,]    0
#> [2,]    2
#> [3,]    5

# z was required to compute zz, so it is known
z
#> <delay_array>
#>      [,1]
#> [1,]    2
#> [2,]    4
#> [3,]    7

# //////////////////////////////////////////////////////////////////////////////
# Use delay_rrays. Before computation they are converted to rrays and xtensor
# is used for the computation

x <- as_delay_rray(matrix(c(1,2,3), nrow = 3, ncol = 2))
y <- as_delay_rray(matrix(c(1,2,4)))

# Note that you can't add these together in base R!
x
#> <delay_rray>
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> [3,]    3    3
y
#> <delay_rray>
#>      [,1]
#> [1,]    1
#> [2,]    2
#> [3,]    4

z <- x + y

# Shape rules are correct here because the function that computes
# output dimensions is extensible, so broadcasting is handled properly
z
#> <delay_rray>
#>      [,1] [,2]
#> [1,]  ?    ?  
#> [2,]  ?    ?  
#> [3,]  ?    ?

zz <- z - 2

# compute zz using xtensor
compute(zz)
#> <delay_rray>
#>      [,1] [,2]
#> [1,]    0    0
#> [2,]    2    2
#> [3,]    5    5

zz
#> <delay_rray>
#>      [,1] [,2]
#> [1,]    0    0
#> [2,]    2    2
#> [3,]    5    5

z
#> <delay_rray>
#>      [,1] [,2]
#> [1,]    2    2
#> [2,]    4    4
#> [3,]    7    7

Created on 2018-11-09 by the reprex package (v0.2.0).
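The two extension points look roughly like this (these generic names are made up; nodegraph's real API may differ):

# generic 1: how output dimensions are computed for an operation
delay_dim <- function(x, y) UseMethod("delay_dim")

# base-R-style rules: dims must match exactly
delay_dim.delay_array <- function(x, y) {
  stopifnot(identical(dim(x), dim(y)))
  dim(x)
}

# broadcasting rules: length-1 dimensions stretch to match
delay_dim.delay_rray <- function(x, y) pmax(dim(x), dim(y))

# generic 2: how the recorded operations are actually executed;
# the delay_rray method would hand off to rray (and thus xtensor)
delay_compute <- function(x, ...) UseMethod("delay_compute")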

goldingn commented 5 years ago

Nice! That was rapid progress.

Looks great, and I like that this has memoisation. Have you encoded the ability to change the values of nodes and rerun the graph?