stan-dev / posterior

The posterior R package
https://mc-stan.org/posterior/
Other
167 stars 24 forks source link

rv-like interface #8

Closed mjskay closed 3 years ago

mjskay commented 5 years ago

I wanted to gather conversations about a potential rv-like interface here so as not to derail other conversations (like #4).

I think a solid rv-like interface would be incredibly useful. More specifically, something that:

Rv already has those down, but would be even better if it:

With all of those requirements in place, I could see tidybayes moving largely towards using tables of these rv-like objects. It would be very useful for a lot of the posterior manipulation/summarization/visualization tasks tidybayes is designed for.

If you all are interested in supporting such a format here, then the question is what's the best way to get there? The options might be:

  1. Reach out to the rv maintainer and ask if they are willing to let us take it over and make backwards-incompatible changes to it, followed by a new major release deprecating some stuff.
  2. Bring some of the existing rv code that we want to build on into this package, come up with a new class name (to not clobber "rv") and go from there.
  3. Start from scratch here.
  4. Start from scratch in a different package and add that dependency here.

(1) Could work if the new maintainer is not planning much with the package and if there aren't a lot of users. Currently the package has ~400 downloads/month and no revdeps on CRAN. (2) Would be doable depending on license preferences (it is GPL-2). This could also be aided by the fact that rv looks to have been written by one of Andrew Gelman's former students, Jouni Kerman (@jgabry do you know him?).

I would be willing to float (1) to the current rv maintainer, Joseph Stachelek --- I've interacted with him once or twice on twitter and github so it wouldn't be a complete cold email (unless either of you know him better). If we'd rather go for (2) or (3) it might be good to reach out to Jouni Kerman to get his thoughts (either on using his code or on things he would have done differently if he wrote the package again).

paul-buerkner commented 5 years ago

Just adding the link to the rv website for easy reference: https://jsta.github.io/rv/

paul-buerkner commented 5 years ago

I am definitely in favor of supporting this format! I will need to take a look at the source code though before I can have an informed opinion about the best way to get there.

jgabry commented 5 years ago

I am definitely in favor of supporting this format! I will need to take a look at the source code though before I can have an informed opinion about the best way to get there.

Same!

jgabry commented 5 years ago

Yeah I forgot that it was Jouni who wrote that! He left before I started working with Andrew but Andrew has mentioned him a few times. Andrew has always wanted a more rv like implementation for working with Stan output but I don’t think he ended up using the rv package that much (I forget why but I should ask again). I’m definitely comfortable reaching out to Jouni if we want to go that route.

mjskay commented 5 years ago

Okay --- maybe as a first step, @jgabry could you reach out to Jouni and/or Andrew? Perhaps they would have insight on either lessons learned from implementing rv and what a good path forward for us might be. Then we can use that to help decide a path forward.

jgabry commented 5 years ago

Yeah I can do that. I haven’t found a current email address for Jouni but I can ask Andrew if he has it.

mjskay commented 5 years ago

Thanks!

jgabry commented 5 years ago

Andrew responded fast and connected me to Jouni. I just sent an email to everyone.

paul-buerkner commented 5 years ago

Since rv objects are just named lists, I think we could actually replace some or even all of the desired list classes (as per https://github.com/jgabry/posterior/issues/4#issuecomment-538865681) by rv objects, or lists of rv objects, since each rv represents only on vector/matrix/array of random variables and a posterior contains multiple of those.

We need to think of how to represent chains though, but I am confident we can find an elegant solution to this.

If we build rv natively into posterior this way, I would still like to have more control over it in order not to depend on a not-so-much maintained package.

mjskay commented 5 years ago

Since rv objects are just named lists, I think we could actually replace some or even all of the desired list classes (as per #4 (comment)) by rv objects, or lists of rv objects, since each rv represents only on vector/matrix/array of random variables and a posterior contains multiple of those.

I like this idea.

If we build rv natively into posterior this way, I would still like to have more control over it in order not to depend on a not-so-much maintained package.

Agreed. Is this worth raising in the email thread?

We need to think of how to represent chains though, but I am confident we can find an elegant solution to this.

I was wondering what this would look like... My first instinct was that the chain information is best put inside the rv objects, but this would require changes to the rv package no?

paul-buerkner commented 5 years ago

We may indeed need to raise the maintaining issue in the email.

I would also prefer to put the chain information directly into the rv objects so that we do not have to take care of it in the list of rv objects that make up the whole posterior and can pass single rv objects to convergence diagnostics etc.

Another aspect of rv is that they are represented internally as lists even if they can have a dimension assigned. This makes it potentially quite computationally inefficient to work this format for larger objects. For instance, a matrix multiplication is syntactically possible but I would guess quite inefficient (I haven't tested it yet). In any case, this is something to keep in mind.

mjskay commented 5 years ago

Hmm yeah, I see your point about the internal organization.

E.g., here is the internal representation for a 16x16 matrix:

> x1 = rvmatrix(rvnorm(256), nrow = 16)
> str(x1, list.len = 5)
List of 256
 $ : num [1:4000] 1.415 -0.593 -0.605 -0.16 -0.233 ...
 $ : num [1:4000] 0.561 -1.522 -0.757 -0.589 0.105 ...
 $ : num [1:4000] -0.452 0.369 0.626 1.168 -0.923 ...
 $ : num [1:4000] -0.806 -0.669 -1.443 -0.395 0.815 ...
 $ : num [1:4000] 1.822 1.416 0.213 -0.433 -0.186 ...
  [list output truncated]
 - attr(*, "class")= chr "rv"
 - attr(*, "dim")= int [1:2] 16 16

And a matrix multiply takes a little bit of time:

> x2 = rvmatrix(rvnorm(256), nrow = 16)
> system.time(x1 %**% x2)
   user  system elapsed 
   0.15    0.00    0.15 

(Not sure why %*% isn't overloaded to do the matrix multiply, maybe there's a limitation in what operators can be overloaded? Seems surprising to me)

Meanwhile you can re-organize the objects into lists of matrices by draw using the internal rv:::.sims.as.list function:

> x1_list = rv:::.sims.as.list(x1)
> x2_list = rv:::.sims.as.list(x2)
> str(x1_list, list.len = 5)
List of 4000
 $ : num [1:16, 1:16] 1.415 0.561 -0.452 -0.806 1.822 ...
 $ : num [1:16, 1:16] -0.593 -1.522 0.369 -0.669 1.416 ...
 $ : num [1:16, 1:16] -0.605 -0.757 0.626 -1.443 0.213 ...
 $ : num [1:16, 1:16] -0.16 -0.589 1.168 -0.395 -0.433 ...
 $ : num [1:16, 1:16] -0.233 0.105 -0.923 0.815 -0.186 ...
  [list output truncated]

This format is more what I would have imagined as the internal storage format before looking, as it makes implementing functions very straightforward (just a map operation). E.g. a quick-and-dirty matrix multiply using this format is much faster:

> system.time(purrr::map2(x1_list, x2_list, `%*%`))
   user  system elapsed 
   0.01    0.00    0.01 

This also makes it easier to support more arbitrary object types by overloading the typical math operators and functions to just be applied to every list element. This might even make it easy for other people to create custom objects compatible with our rv-like interface for other classes if they wanted to (e.g. if someone wanted to use rray or Matrix or some other particular array implementation we might be able to make it possible to do that without a whole lot of work).

Given the simplicity of an implementation with that format, I am perhaps coming around to just rolling our own. There are other things that rv does that I don't think are strictly necessary with some careful API design; e.g. rather than implementing an rv*** for every distribution (rvnorm, rvgamma, etc) it strikes me that we could construct a function that creates an appropriate rv object given an RNG function from base R; something like:

x = rv(rnorm, 0, 1)

Or:

x = rv("norm", 0, 1)

Another question there (maybe out of scope for now but something to consider in the future) is whether we would want to support rv objects that represent analytical distributions. These might have draws associated with them (for operations with other RVs) but supply their known density and distribution functions when requested, which could be useful for visualizing priors, for example. This is maybe a longer-term thing, but I think there would be something useful to a class that unified both representations: for example, the dev version of tidybayes has a bunch of variants of geoms that can either take samples or named distributions and their parameters to make visualizations (see here), but unifying these into a single object type that can be put into data frames would allow both to be used with the same geom.

mjskay commented 5 years ago

Also, if we want to get really crazy with this (just brainstorming here), a kind of "out there" feature could be an rv_do operation that automatically rewrites expressions with random variables into appropriate apply() or map() calls. This could be seen as a bit of an abuse of non-standard evaluation, but does have the nice property of allowing arbitrary functions to be used with rvs (sort of getting around the problems Jouni was talking about re: object orientation in R).

Say we created an NSE function rv_do() that takes names used in the passed-in code expression and checks the calling environment to see if those names correspond to rv objects and then did a parallel map over all used rv objects. It might turn code like this:

n_obs = 10
mu = rv(rnorm, 0, 1)
sigma = rv(rgamma, 1, 1)
y = rv_do(rnorm(n_obs, mu, sigma))

into the equivalent of this:

# everything else the same
y = purrr::pmap(list(mu = mu, sigma = sigma), function(mu, sigma) rnorm(n_obs, mu, sigma))

In fact, then mu = rv(rnorm, 0, 1) becomes equivalent to mu = rv_do(rnorm(1, 0, 1)).

paul-buerkner commented 5 years ago

Thanks for looking into this! The problem with the list-by-draw representation is that summary statistics are quite inefficient to compute I would expect but I am unsure how much this is a problem as compared to the efficiency of numerical transformations.

I am wondering if we need all the extra features from rv to sample from distribution etc. in the posterior package? Sure, if we reuse the existing rv, it will be shipped with it but if we build our own, I don't see too much need to support all this natively, at least not from the start. Of course, if we have an rv_do function that we don't need to worry about all those specifics as users can simply work with this.

More generally, I think we run into the same problems of different formats in rv as we run into posterior. So, to me, the question is if we accidentally build a posterior package within the posterior package by not only supporting different draws representations (of which one is rv-like objects) and then again support different representation of rv objects themselves. I am not sure yet, how a good combination of the two purposes (representing posterior distributions in different common formats on the one hand, and having a nice random-variable syntax on the other) could look like.

We may need to also develop a clearer of the use cases of these formats. If, for instance, rv-like objects were mostly for teaching purposes (not saying they should necessarily be) than speed might be less of an issue.

mjskay commented 5 years ago

Excellent points. I really like the idea of developing clear use cases: if we write down what operations we want to be efficient for a given use case, then we can choose a format for each use case that ensures those operations are efficient (or we might identify through that process a single format that is efficient for all the operations we care about --- does that seem likely?). Seems a nice, systematic way to avoid the package blowing up.

I am wondering if we need all the extra features from rv to sample from distribution etc. in the posterior package? Sure, if we reuse the existing rv, it will be shipped with it but if we build our own, I don't see too much need to support all this natively, at least not from the start. Of course, if we have an rv_do function that we don't need to worry about all those specifics as users can simply work with this.

I think an interface to sample from distributions would be useful as it would make it easy to do things like construct posterior predictions without round-tripping back through Stan (or whatever sampler the user uses) if desired (not important for brms since it does it for you, but can be useful with custom models). However, I think an interface like rv_do() (or perhaps a less esoteric form of it) should make this easy without implementing a bunch of custom functions like rv currently does.

jgabry commented 5 years ago

Cool ideas and lots to think about! I’ve been super busy recently but looking forward to getting back into this when we talk Monday and after

mjskay commented 5 years ago

I've been playing with this stuff some more and have some sketches of two different potential directions for this format on the rv-like branch, currently called rvar and rvar_array (crappy names chosen for testing). I'm not proposing to use both (or even either) of these, but more using them as early explorations of what might be possible.

The rvar experimental type just uses a list of draws, so it can support distributions over arbitrary objects. This is obviously a far end of the spectrum in terms of flexibility versus performance tradeoffs --- not necessarily fast, but you can do basically anything with it.

It comes with rfun() and rdo() functions for making and manipulating objects (the names aren't great, but this is just experimental :)). It also has a printing method that summarizes numerical objects using their original structure (I'll get to that).

rfun() turns a function into something that accepts and produces rvars:

> x = rfun(rnorm)(10)
> x
rvar<4000>:
 [1]  0.00950±1.01 -0.01749±0.99  0.00048±1.00 -0.00052±1.00  0.00540±0.99 -0.00353±0.99 -0.00794±1.01
 [8]  0.01832±1.00  0.00723±1.00 -0.00667±1.01

The internal structure contains a single element, draws, that looks like this:

> str(x$draws, list.len = 5)
List of 4000
 $ : num [1:10] -1.3053 0.4274 -1.5071 0.2732 -0.0349 ...
 $ : num [1:10] 0.258 -0.866 0.13 1.164 1.193 ...
 $ : num [1:10] 0.737 0.251 -0.172 -0.376 1.163 ...
 $ : num [1:10] -0.5474 0.0898 -0.6763 -1.4545 1.37 ...
 $ : num [1:10] 0.226 -0.903 0.553 -1.96 -0.337 ...
  [list output truncated]

rdo, in contrast to rfun, wraps an expression that can contain rvars (or not) and produces rvars. Basically it turns the expression into a function, takes any variables the expression uses from the environment that are rvars, wraps the function using rfun, then evaluates it using those rvars. The upshot (from the user's perspective) is you can wrap fairly arbitrary R code with rdo and get back rvars (saves us implementing a bunch of stuff, like random number generators). Like this:

> mu = rdo(rnorm(1))
> sigma = rdo(rgamma(1, 1, 1))
> y = rdo(rnorm(10, mu, sigma))
> y
rvar<4000>:
 [1] -0.0189±1.8 -0.0058±1.8 -0.0116±1.8  0.0095±1.8 -0.0033±1.7 -0.0359±1.7 -0.0511±1.7  0.0037±1.7  0.0095±1.7
[10] -0.0398±1.8

Basic math ops also work:

> y + 1
rvar<4000>:
 [1] 0.98±1.8 0.99±1.8 0.99±1.8 1.01±1.8 1.00±1.7 0.96±1.7 0.95±1.7 1.00±1.7 1.01±1.7 0.96±1.8

rvars can also be put into tibbles:

> tibble(y)
# A tibble: 10 x 1
             y
        <rvar>
 1 -0.0189±1.8
 2 -0.0058±1.8
 3 -0.0116±1.8
 4  0.0095±1.8
 5 -0.0033±1.7
 6 -0.0359±1.7
 7 -0.0511±1.7
 8  0.0037±1.7
 9  0.0095±1.7
10 -0.0398±1.8

I haven't quite figured out how to get these objects to play perfectly nicely with all tibble-related things. The pivoting functions in tidyr especially are giving me a bunch of grief. I think I need to rebuild these types on top of vctrs (currently I have just been hacking with base-R approaches and the occasional stuff from vctrs).

rvar also works for matrices and arrays of arbitrary dimension. E.g.:

> x = rdo(array(rnorm(2*2*2), dim = c(2,2,2)))
> x
rvar<4000>:
, , 1

     [,1]         [,2]        
[1,] -0.0121±0.99 -0.0014±0.99
[2,]  0.0116±0.98 -0.0171±1.00

, , 2

     [,1]         [,2]        
[1,] -0.0022±1.00 -0.0084±1.01
[2,] -0.0080±1.00 -0.0124±0.99

The alternative format I have been playing with is called rvar_array (just for testing). I haven't implemented rfun / rdo for it, but in principle it should be possible (though with some sanity checks since rvar_array is a more strict format), and everything else works similarly (printing, sticking it in tibbles, etc). The main difference is that the internal format is just an array, which means some operations will be faster, but arbitrary object types are not supported (only numerics, logicals, and characters really --- we could probably also hack a factor-like thing onto it if desired).

Since it doesn't have a nice constructor I'll make one manually:

> xa = rvar_array(array(rnorm(2*2*2*4000), dim = c(2,2,2,4000)))
> xa
rvar_array<4000>:
, , 1

     [,1]         [,2]        
[1,]  0.0142±1.01  0.0126±0.99
[2,]  0.0059±0.98  0.0141±0.99

, , 2

     [,1]         [,2]        
[1,] -0.0124±0.99 -0.0181±1.00
[2,]  0.0114±1.00 -0.0088±0.99

Internally we have:

> str(xa$draws)
 num [1:2, 1:2, 1:2, 1:4000] -0.0969 0.0888 1.2426 1.4416 1.1179 ...

And indexing and such work (this would also work with rvar):

> xa[1:2,1,1]
rvar_array<4000>:
, , 1

     [,1]       
[1,] 0.0142±1.01
[2,] 0.0059±0.98

I haven't implemented math ops yet for rvar_array but I'm thinking I might be able to use the broadcasting stuff from rray to do it efficiently.

In terms of formats for posterior, one way forward might be to include something like the rvar_array format (with a better name) as the format for draws with arbitrary dimensions (for #13).

If we adopt one of these styles of interface, we might also want something like a draws_rvar format that is just a list of rvar-like objects (for representing all the variables in a posterior).

Thoughts on something like this? I admit the implementation is pretty hackish, but I thought I'd share what I've been playing with before I spend more time on it.

paul-buerkner commented 5 years ago

Wow, this looks so nice already! I am very exited for this feature! Just some quick thoughts/questions:

mjskay commented 5 years ago

Thanks, this is exactly the kind of feedback I needed for the next round of prototyping!

I think the rvar_array version is possibly preferable in the long run

Works for me. The computer scientist in me really liked that it was even possible to build the rvar version to work with any type of object, but in the end the efficiency issues with that format are not worth it.

Depending on the implementation, rvar_array might indeed be a proper backend for #13, but for this purpose we need to make sure it is basically as efficient as working with standard arrays and similar in flexibility

The current design attempts to make this possible in a couple of ways. First, most basic math operations should I think be just as efficient. Second, by having the array format stored internally rather than subclassing from array directly, if you have an operation you want to do that is best done directly on the array you can always bypass the wrapper by acting on x$draws directly. Obviously that's a potentially "dangerous" operation (if you modify x$draws to something invalid), but seems like a reasonable escape hatch. I might wrap it in a function (can't think of a good name... both draws() and draws_array() are obvious candidates but conflict with other concepts already in the package).

Currently, the draws dimension is last in the dimension list of the array version which deviates a little from the existing array-like formats. What would be your argument for this order?

Good question. I think when I was early-on prototyping something with one of the *apply functions it was spitting out stuff joined by the last dimension, so it seemed a natural enough way to implement everything for consistency. However, if having the iteration dimension be first would be better for your use cases I'm happy to try doing that for the next round.

How can we best store chain information in those objects? Provided that we do want to store them.

Good question. I wasn't sure if we wanted it or not yet so I left off prototyping it before solving other stuff. I can try adding a dimension for that in the next prototyping round. Another option (maybe easier) would be to let this be handled at the level of a draws_rvar object that could store different chains as different rvar objects for the same variable.

What is the interface to decide how many draws are generated by default? I see that they are 4000 in your examples but I am not sure where this is coming from?

Currently there's an argument to rvar / rdo (.ndraws, not a great name) with a default value, which I could also make be set by a global option (that's how rv does it). This argument is used for expressions without existing rvars; for expressions containing existing rvars the number of draws in those objects is what is used. I haven't decided what to do if someone tries to combine rvars with different numbers of iterations (probably throw an error?).

One problem with vctrs and rray is that, as I understand it, you cannot built child classes on top of them that are retained by standard operations. At least this is true for rray and is bacially the reason I decided against using that package for now in the standard formats. See r-lib/rray#247

Ah yeah. From what I'm reading of vec_proxy / vec_restore and what I've played with so far, I think you can build custom types on top of vctrs but not rray currently? Or I might have missed something...

I will have to think more about proper function names but I am already exited about the interface in general. It looks so clean to me!

Great! I am going to continue playing with this then. Possibly not this week though (bunch of other stuff on my plate...)

jgabry commented 5 years ago

@mjskay Thanks for working on this! I started writing comments before looking at @paul-buerkner's but he basically mentioned all my initial questions! Thanks!

jgabry commented 5 years ago

Works for me. The computer scientist in me really liked that it was even possible to build the rvar version to work with any type of object

Totally get that :) Definitely cool!

However, if having the iteration dimension be first would be better for your use cases I'm happy to try doing that for the next round.

There's a chance that future RStan v3 and some other package won't have iteration first, but that's certainly the standard at the moment (at least for Stan)

How can we best store chain information in those objects? Provided that we do want to store them.

Good question. I wasn't sure if we wanted it or not yet so I left off prototyping it before solving other stuff. I can try adding a dimension for that in the next prototyping round. Another option (maybe easier) would be to let this be handled at the level of a draws_rvar object that could store different chains as different rvar objects for the same variable.

I'm wondering if we need chain information. It could make sense for this format to be recommended only once the user is satisfied that diagnostics are ok and they can go ahead and use the draws for interesting stuff. I'm certainly not opposed to having chain information, but maybe it's not essential.

I haven't decided what to do if someone tries to combine rvars with different numbers of iterations (probably throw an error?).

An error seems reasonable to me.

Great! I am going to continue playing with this then. Possibly not this week though (bunch of other stuff on my plate...)

Awesome, thanks again for working on this!

mjskay commented 5 years ago

There's a chance that future RStan v3 and some other package won't have iteration first, but that's certainly the standard at the moment (at least for Stan)

Makes sense. Since the position of that dimension is hidden from everyone except power users by this format, one thing that might help decide is if there are existing use cases that argue for one versus the other (e.g., due to it being easier/faster to do some needed operations given one order versus another). @paul-buerkner I recall from awhile back you mentioned needing to do operations on these kinds of objects; is there a use case you have that would be easier or harder given the choice of dimension order?

I'm wondering if we need chain information. It could make sense for this format to be recommended only once the user is satisfied that diagnostics are ok and they can go ahead and use the draws for interesting stuff. I'm certainly not opposed to having chain information, but maybe it's not essential.

This makes sense to me. I'll give it a try and see if it complicates life a lot, but if it does I won't worry about it too much.

TODOs so I don't forget:

paul-buerkner commented 5 years ago

With regard to order of dimension I see two possibly relevant aspects: (1) Convention. This favors iteration as first dimension. (2) Efficiency. If it makes a difference at all. R stores arrays in column major order (last dimension first for higher dimensional arrays). As a result, values of parameters are consequtive in memory if they come at the later dimensions (columns in case of matrices) so I would presume storing parameters in later dimensions is, if it makes any difference, probably preferable. I am definitely out of depth here though.

With regard to the chain stuff, I am not sure what would be the best approch. Storing chains as an additional dimension prevents a lot of the matrix multiplication-like operations that I would like to use arrays for (and hence the rv-like objects we decide to use them for #13). So if we would still like to use chain info, we could store them as a separete element next to $draws in the rvar object, where we could basically store .chain, .iteration, .draws as in draws_df Manual subsetting inside the draws element would of course invalidate this but that is not something common users should be doing. Not sure if this proposal is sensible but perhaps it helps us thinking about if/how to incorporate chains.

mjskay commented 5 years ago

Some updates:

> A = rdo(matrix(rnorm(6, 1:6), ncol = 3))
> A
rvar<4000>[2,3]:
     [,1]      [,2]      [,3]     
[1,] 0.97±1.00 2.99±1.00 5.00±1.01
[2,] 1.99±1.00 3.98±0.98 6.01±1.01
> B = rdo(matrix(rnorm(6, 1:6), ncol = 2))
> B
rvar<4000>[3,2]:
     [,1]      [,2]     
[1,] 0.98±1.01 3.98±1.01
[2,] 2.02±0.99 4.99±0.99
[3,] 3.01±1.01 6.00±1.00
> A %*% B
rvar<4000>[2,2]:
     [,1]    [,2]   
[1,] 22±7.3  49±10.7
[2,] 28±8.5  64±11.5

And it's not bad speed wise either (compare to rdo(A %*% B), which does the naive thing of a matrix multiply within each draw):

> microbenchmark(A %*% B, rdo(A %*% B))
Unit: milliseconds
         expr      min       lq       mean   median       uq      max neval
      A %*% B   1.1267   1.2926   1.658963   1.4364   1.5539  17.2971   100
 rdo(A %*% B) 109.1457 127.0384 139.782670 136.3684 145.5802 391.3957   100

This did require treating the objects as S4 (at least, setting the S4 object flag on them) as %*% does not dispatch correctly on S3 objects. So it is a sort of mostly-S3-except-for-%*% implementation at the moment.

> str(A)
 rvar<4000>[2,3] 0.97±1.00 1.99±1.00 2.99±1.00 3.98±0.98 ...
> str(draws_of(A))
 num [1:2, 1:3, 1:4000] 0.748 1.275 5.071 3.194 5.397 ...
 - attr(*, "dimnames")=List of 3
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
paul-buerkner commented 5 years ago

This all looks really good! Please tell us when you need more feedback on something or want us to try something out.

jgabry commented 5 years ago

Very cool! (And what Paul said.)

paul-buerkner commented 4 years ago

There has been some discussion about features in posteriordb, that I think will essentially come down to features in posterior and perhaps speciall the rv-like interface (not sure about the latter yet). Here is the link:https://discourse.mc-stan.org/t/beta-release-bayesian-posterior-database/12141/19

@mjskay what is your current status with the feature? No rush of course, but I just super excited for it!

mjskay commented 4 years ago

Ah cool! So, the basic rvar works if you check out the rv-like branch. Over the Christmas break I am hoping to put together a little notebook (maybe a vignette) introducing the syntax and demo-ing it a bit in order to get feedback, but if you want to give it a try sooner you can check out that branch.

There are two big TODOs at the moment:

paul-buerkner commented 4 years ago

Brilliant, thank you! Did you had some more thoughts about the positioning of the iteration dimension already (first or last)?

mjskay commented 4 years ago

I haven't had a chance to experiment with a version with the iteration dimension first, so I don't have any new thoughts on that yet. Would that be helpful?

paul-buerkner commented 4 years ago

I am afraid I didnt understand your question. What would be helpful? My argument for having it as the first dimension is for consistency with the other formats. Presumably is does not matter too much efficiency wise but that remains to be checked.

Matthew Kay notifications@github.com schrieb am Fr., 6. Dez. 2019, 21:15:

I haven't had a chance to experiment with a version with the iteration dimension first, so I don't have any new thoughts on that yet. Would that be helpful?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jgabry/posterior/issues/8?email_source=notifications&email_token=ADCW2AC7F67BIDHPG3VGZPDQXKQFNA5CNFSM4I6MCLG2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGFCNAQ#issuecomment-562701954, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCW2ACKNVDQ3D2SMG7F653QXKQFNANCNFSM4I6MCLGQ .

mjskay commented 4 years ago

Ah, I meant: would me experimenting with that format be helpful? I take it from your response the answer is yes :).

I have waited on playing with that in the short term because I want to figure out the issues with vctrs first.

paul-buerkner commented 4 years ago

I think it would be good to have a beta release of posterior soon so I am trying to clean up the remaining issues with the beta-release tag.

@mjskay you indicated that you currently do not have much time to work on the rv-like feature. Does that still hold? If yes, I would like to drop this issue from the beta-release list as I don't see a problem in adding this feature later. Would you agree?

If we decide to drop this issue from the beta-release, we need to figure out what happens to issue #13 in that regard. I would like to have draws arrays of arbitrary dimension for brms, ideally already in the beta/CRAN release although it is not a problem if it does not end up there. Last time, we spoke we agreed on that the rv-like interface could provide us with such arrays quite naturally. So the question would be whether to have an rv-independent implementation of multi-dimensional arrays or to post-pone #13 as well.

mjskay commented 4 years ago

Good question! My time remains pretty limited (currently changing a class over to online instruction and preparing another one). Probably won't have dedicated time until the end of semester (late April).

I just took a look to see if the tibble/dplyr support was working now with recent changes in the dev versions of vctrs and dplyr. I think it's closer to working and will be easier to get right. I had hoped there would be a quick fix for the outstanding problems with tibble integration, but I expect I'll still need some dedicated time to get everything I want working working on that front.

What's your timeline from beta to CRAN? If a CRAN submission is not going to happen until May anyway I could see merging an experimental version of the rvar interface into the master branch by the end of this week (as that would not be much work, compared to getting it integrated into tibbles properly). Then it can at least be in the beta. The main concern I have with pushing a version to CRAN before tibble integration works is I don't want to have to re-architect the structure of the rvar type much after it goes to CRAN, and I can see that potentially being needed to properly support tibbles. However, I suspect the part of the interface you would need for brms would not change much.

So if the planned CRAN timeline is further out (May or later) we could consider rolling it into the beta.

If the planned CRAN timeline is sooner we might want to wait until the next version.

To your last question, either way I don't think we'll need an additional interface for draws of arbitrary dimension.

paul-buerkner commented 4 years ago

Thanks, that makes sense! I am not too worried about having a CRAN release soon but having posterior in a state in which it is releasable in principle. This is important as posterior already serves as the backend of two important packages, cmdstanr by @jgabry and posteriordb by @MansMeg. If one of these packages have to be released, so does posterior. I am not sure about the schedule for these projects but we should make sure posterior is in a good shape regardless. Accordingly, I think it makes sense to postpone merging the rv-interface until you have the time to make sure it works as expected. If that means it won't make it into beta/first CRAN release that this is totally fine.

The rv-like interface has some potential implications not only for #13 but also for some of the generics, we want to move to posterior (see #39). I will comment on this separately in the corresponding issue.

mjskay commented 4 years ago

Alright! Finally getting back to this. Lots of updates on the rv-like branch if folks want to check it out. Definitely some stuff I need feedback/potential help with now. More details below + questions summarized at the end.

tibbles work

The basic rvar implementation is now fairly feature-complete. Some TODOs/stubs remain but should be marked. New stuff includes the ability to put rvars in data.frames / tibbles, which previously did not work due to issues with vctrs:

x = rdo(rnorm(4, mean = 1:4))
df = tibble(y = letters[1:4], x)
df
# A tibble: 4 x 2
  y          x
  <chr> <rvar>
1 a     0.99±1
2 b     1.98±1
3 c     2.99±1
4 d     4.00±1

compatibility with {ggdist} and {distributional}

I was able to update the ggdist::stat_dist_ family to support visualizing rvars. That means if you install the dev branch of {ggdist} you can do stuff like this:

library(tidyverse)
library(ggdist)    # install the dev branch 
df %>% 
  ggplot(aes(y = y, dist = x)) + 
  stat_dist_halfeye()

image

Or any of the other stat_dist... geoms.

We also now have cdf(), density(), and quantile() functions patterned after how distributional does this. Combining {distributional} functions for analytical distributions (like dist_normal(0,1)) with rvars we can do stuff like this:

library(distributional)
tibble(
  x = list(dist_normal(0,1), rvar(rnorm(4000, 0.5, 0.5))),
  which = c("prior", "posterior")
) %>%
  ggplot(aes(y = "", dist = x, color = which, fill = which)) +
  stat_dist_slab(alpha = 0.5)

image

(as an aside, it would be awesome if eventually brms / rstanarm could output their priors in the {distributional} format so plots like this would be easier to make --- see also https://github.com/mjskay/ggdist/issues/21).

internal format changes

expectations

x
rvar<4000>[4] mean±sd:
[1] 0.99±1 2.02±1 2.99±1 3.96±1
mean(x)
rvar<4000>[1] mean±sd:
[1] 2.5±0.5

This was intended to mimic base-R mean() (e.g. mean(1:4) -> 2.5) but in retrospect I think that use case might be better served by a different function name (like rvar_mean()). Then mean() would be an alias for E(), which does this:

E(x)   # could have mean(x) do this as well
[1] 0.9935992 2.0236997 2.9876973 3.9577762

What do folks think? That would also make the interface consistent with the {distributional} package, which provides a similar vector format for analytical distributions.

Container type

I have started a basic implementation of a container type (with lots of stop("TODO: IMPLEMENT") stubs) called draws_rvars. I'm not sold on the name, but that's easily changed. The basic format currently is a named list where each element is an rvar:

d = draws_rvars(mu = rvar(rnorm(4000)), Rho = rvar(rethinking::rlkjcorr(4000, 5, 3)))
d
$mu
rvar<4000>[1] mean±sd:
[1] 0.02±1

$Rho
rvar<4000>[5,5] mean±sd:
     [,1]             [,2]             [,3]             [,4]             [,5]            
[1,]  1.00000±0.0e+00 -0.00199±3.2e-01 -0.00208±3.2e-01  0.00027±3.2e-01  0.00020±3.2e-01
[2,] -0.00199±3.2e-01  1.00000±9.1e-17 -0.00050±3.2e-01 -0.00086±3.2e-01  0.00251±3.2e-01
[3,] -0.00208±3.2e-01 -0.00050±3.2e-01  1.00000±9.9e-17 -0.00047±3.1e-01 -0.00739±3.1e-01
[4,]  0.00027±3.2e-01 -0.00086±3.2e-01 -0.00047±3.1e-01  1.00000±1.1e-16 -0.00084±3.1e-01
[5,]  0.00020±3.2e-01  0.00251±3.2e-01 -0.00739±3.1e-01 -0.00084±3.1e-01  1.00000±1.1e-16

attr(,"class")
[1] "draws_rvars" "draws"       "list" 

(I haven't implemented a print function yet, it just uses list print + the print function for rvars). As a proof of concept I have implemented basic conversion to/from draws_matrix:

dm = as_draws_matrix(d)
dm
# A draws_matrix: 4000 draws, and 26 variables
    variable
draw    mu Rho[1,1] Rho[2,1] Rho[3,1] Rho[4,1] Rho[5,1] Rho[1,2] Rho[2,2]
  1   0.52        1    0.070   -0.232    0.261   -0.647    0.070        1
  2   0.29        1   -0.245   -0.347   -0.142   -0.236   -0.245        1
  3  -0.59        1    0.480   -0.081    0.211   -0.219    0.480        1
  4  -0.51        1    0.060    0.476   -0.587    0.381    0.060        1
  5   1.26        1    0.354    0.097    0.398    0.276    0.354        1
  6   2.32        1    0.193    0.093   -0.279    0.388    0.193        1
  7  -1.06        1   -0.049    0.013    0.127   -0.104   -0.049        1
  8   0.05        1   -0.184   -0.115    0.189   -0.107   -0.184        1
  9   1.18        1    0.246    0.335   -0.141    0.524    0.246        1
  10  0.84        1   -0.284    0.122   -0.046   -0.059   -0.284        1
# ... with 3990 more draws, and 18 more variables

And back --- note the dimensions gain names on the way back; this is so that character dimensions (e.g. if it were Rho[x,y]) would get names:

as_draws_rvars(dm)
$mu
rvar<4000>[1] mean±sd:
[1] 0.02±1

$Rho
rvar<4000>[5,5] mean±sd:
  1                2                3                4                5               
1  1.00000±0.0e+00 -0.00199±3.2e-01 -0.00208±3.2e-01  0.00027±3.2e-01  0.00020±3.2e-01
2 -0.00199±3.2e-01  1.00000±9.1e-17 -0.00050±3.2e-01 -0.00086±3.2e-01  0.00251±3.2e-01
3 -0.00208±3.2e-01 -0.00050±3.2e-01  1.00000±9.9e-17 -0.00047±3.1e-01 -0.00739±3.1e-01
4  0.00027±3.2e-01 -0.00086±3.2e-01 -0.00047±3.1e-01  1.00000±1.1e-16 -0.00084±3.1e-01
5  0.00020±3.2e-01  0.00251±3.2e-01 -0.00739±3.1e-01 -0.00084±3.1e-01  1.00000±1.1e-16

attr(,"class")
[1] "draws_rvars" "draws"       "list" 

The back-conversion will also sort dimensions and fill in missing dimensions, in case the input does not include every cell of the array:

as_draws_rvars(dm[,c("Rho[3,2]", "Rho[1,1]", "Rho[2,2]")])
$Rho
rvar<4000>[3,2] mean±sd:
  1               2              
1  1.0000±0.0e+00      NA±NA     
2      NA±NA       1.0000±9.1e-17
3      NA±NA      -0.0005±3.2e-01

attr(,"class")
[1] "draws_rvars" "draws"       "list" 

Although I'm not sure about the sorting aspect here (possibly dimensions should be left in whatever the input order is? That would preserve order for dimensions with string names and numeric dimensions as long as they are ordered on input...). It seems there are some corner cases to consider here.

Another bit to consider is what the API for the draws_rvars constructor should be. Currently it expects input to already be rvars, and casts anything that isn't an rvar to one using as_rvar(). However, this yields behavior that is not consistent with the constructors of other draws types, like this:

draws_matrix(x = rnorm(10))
# A draws_matrix: 10 draws, and 1 variables
    variable
draw     x
  1   0.55
  2   0.27
  3   0.44
  4   0.14
  5  -0.55
  6  -1.47
  7   0.82
  8   0.55
  9  -0.83
  10 -1.16

Whereas draws_rvars() does this:

draws_rvars(x = rnorm(10))
$x
rvar<1>[10] mean±sd:
 [1]  0.412±NA -1.416±NA -0.311±NA  0.364±NA  1.326±NA  0.171±NA -0.525±NA -0.070±NA  2.325±NA  0.021±NA

attr(,"class")
[1] "draws_rvars" "draws"       "list"  

This is because it just uses as_rvar(), which converts numerics to constant rvars (rvars with one draw). This is the correct behavior for as_rvar as it allows casting to work correctly when doing math with rvars and numerics. However, it causes an inconsistency in this part of the API --- one option would be to have draws_rvars use rvar() instead of as_rvar() when passed objects that are not rvars... then the behavior would be more consistent with the other draws_ constructors.

Questions / feedback

Summarizing some questions above (plus some new ones):

What do folks think? @paul-buerkner @jgabry?

mjskay commented 4 years ago

Followup on this:

How should we deal with character-vector versus numeric indices when converting to draws_rvars? Is sorting a good or bad idea?

I just changed the behavior so that numeric indices are sorted and character indices are not. This seems more sensible to me than sorting character indices, but this is easily changed.

paul-buerkner commented 4 years ago

Thank you @mjskay! This is really exciting!

I need to understand some details a little better, but here are some thoughts on the questions you had:

jgabry commented 4 years ago

Yeah this is exciting, thanks for working on this! I'm in the process of moving this week so I might not have time to really dive into this until next week. Definitely feel free to keep discussing without me and I'll catch up.

For now just one comment on the chains question. If there's an option include chains that doesn't come with too much overhead, that'd be fine, but I'm coming around to the idea that we don't really need chain information. The ability to work with draws as random variables is most useful after we're happy the chains are ok, so I would think we'd convert to an rvar from e.g. a draws_array after we've checked diagnostics using the draws_array. Does that make sense?

mjskay commented 4 years ago

Great, thanks both! Some TODOs for myself where there's pretty clear consensus:

Re; chains I am opening a separate issue since I think we need a bit of discussion there.

mjskay commented 4 years ago

One minor note worth raising as I implement more of draws_rvars: because draws_rvars considers all cells in an array variable as part of a single variable, it counts variables differently from other formats:

d = draws_rvars(x = rnorm(1000), y = rethinking::rlkjcorr(1000, 3, 1))
d
# A draws_rvars: 1000 iterations, 1 chains, and 2 variables
$x: rvar<1000>[1] mean±sd:
[1] -0.067±0.94

$y: rvar<1000>[3,3] mean±sd:
     [,1]            [,2]            [,3]           
[1,]  1.0000±0.0e+00  0.0081±5.1e-01 -0.0054±4.9e-01
[2,]  0.0081±5.1e-01  1.0000±8.1e-17  0.0230±4.8e-01
[3,] -0.0054±4.9e-01  0.0230±4.8e-01  1.0000±1.2e-16

That is 2 variables as a draws_rvars but 10 as any other format:

as_draws_matrix(d)
# A draws_matrix: 1000 draws, and 10 variables
    variable
draw     x y[1,1] y[2,1] y[3,1] y[1,2] y[2,2] y[3,2] y[1,3]
  1   1.29      1   0.68 -0.039   0.68      1  0.133 -0.039
  2  -0.40      1  -0.11 -0.618  -0.11      1  0.235 -0.618
  3  -0.75      1  -0.75 -0.353  -0.75      1  0.712 -0.353
  4  -1.42      1  -0.17  0.046  -0.17      1  0.068  0.046
  5   0.56      1  -0.26  0.257  -0.26      1 -0.391  0.257
  6  -0.54      1  -0.98  0.615  -0.98      1 -0.746  0.615
  7  -0.88      1  -0.60  0.051  -0.60      1  0.506  0.051
  8   0.27      1   0.92  0.632   0.92      1  0.299  0.632
  9   0.92      1  -0.81 -0.642  -0.81      1  0.082 -0.642
  10 -1.15      1  -0.50 -0.812  -0.50      1  0.743 -0.812
# ... with 990 more draws, and 2 more variables

I don't think adjusting nvariables.draws_rvars() to report the same number as the other formats makes sense (because then it would be inconsistent with variables.draws_rvars()), but I wanted to surface this as a minor inconsistency between formats to be aware of.

paul-buerkner commented 4 years ago

I think it is ok to have this inconsistency given that the definition of 'variable' varies between the rvar format and the others.