stan-dev / posterior

The posterior R package
https://mc-stan.org/posterior/

Adding generics of general Bayesian methods #39

Open paul-buerkner opened 4 years ago

paul-buerkner commented 4 years ago

We discussed at some point that we want a common place where a lot of the generics used by packages such as rstanarm, brms, etc. should live. Right now, a lot of them are in rstantools, but I don't think this is a good place for them, as the methods are not only relevant for rstan-based packages. Some methods are better suited to specialized packages and should live there instead, for instance, in bayesplot for all plotting-related generics and in loo for all cross-validation related procedures.

Please amend this initial comment to add more methods that we may want to put into posterior.

mjskay commented 4 years ago

Maybe this is the wrong place for this, but I wonder if part of putting those generics here might include some more standardization of their interfaces. At present there are some idiosyncrasies across brms and rstanarm that make the functions a bit harder to program against (I have to special-case stuff in tidybayes::add_predicted_draws / add_fitted_draws, for example). There are also some aspects of the packages that are the same but don't appear in the generic (even things like newdata). This is not really a problem for end users, but those arguments can't be relied upon generically when writing a wrapper around these functions. I can understand why some of these things aren't in the generic (you wouldn't want to assume every possible model type has them), but it seems to me that there's a wide class of even vaguely glmm-ish models that could benefit from having a standard interface.
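As a concrete (purely illustrative) sketch of the kind of standardization meant here, compare the two hypothetical generic declarations below; neither is an agreed API, and today's generics typically promise only object and dots:

# Hypothetical signatures, for illustration only. When the generic promises
# only object and dots, a wrapper cannot rely on arguments like newdata existing:
posterior_predict <- function(object, ...) UseMethod("posterior_predict")

# A more prescriptive generic would let wrappers program against newdata directly:
posterior_predict <- function(object, newdata = NULL, ...) UseMethod("posterior_predict")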

As a very selfish wishlist item for me, thinking about things like nobs makes me think it would be nice to also have a common interface to things like the response variables for multivariate models, parameters in distributional regression, etc. This would make it easier to program against any models that employ those concepts (not just brms).

jgabry commented 4 years ago

@mjskay yeah, I'd like to see more standardization in the interfaces too. The rstantools package (which includes lots of generics) was an initial attempt to get some unification of method names, but @paul-buerkner and I have also discussed other ways that we could share more between the packages. Feedback on that is definitely welcome!

jgabry commented 4 years ago

@paul-buerkner I added posterior_interval and predictive_interval to the list above

mjskay commented 4 years ago

Following on this issue from brms (https://github.com/paul-buerkner/brms/issues/552), I think it would be nice to make sure these interfaces have enough standardization that we could add a marginalization interface within posterior that could cover these three cases identified by @paul-buerkner:

  1. marginalize over categorical variables (relatively easy)
  2. marginalize over continuous variables, perhaps by converting them to categorical variables with a sufficient number of categories
  3. marginalize over random effects

I think this might require being a bit more prescriptive about arguments to functions like pp_expect and/or providing functions that can inspect a model (e.g. get info about predictors, random effects, etc.). For example, I'd love to think about what a better alternative to the somewhat cryptic re_formula special cases (NA and ~ 0 and whatnot) might look like as part of a unified marginalization interface for all three types defined above.
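To make that concrete, here is a deliberately hypothetical sketch of such an interface; marginalize() and its arguments are placeholders, not a proposal for final names:

# Purely hypothetical sketch (names and arguments are placeholders, not an agreed API):
# a marginalization generic that names what is being integrated out, instead of
# encoding it through re_formula special cases such as NA or ~ 0.
marginalize <- function(object, over, ...) UseMethod("marginalize")

# Intended usage (fit and nd are placeholders for a fitted model and new data):
# marginalize(fit, over = "patient", newdata = nd)  # case 3: integrate out a grouping factor
# marginalize(fit, over = "age", newdata = nd)      # case 2: integrate over a continuous predictor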

statsccpr commented 4 years ago

Following this from the brms marginalization thread, I think this makes a lot of sense, since integrating out/over a distribution is a generic idea.

I think the tricky part is extracting standardized information about the distribution family used for the random effect.

For 90% of use cases, it'll be a normal distribution centered at 0: re ~ Normal(0, sd).

I think the flexibility of brms is interesting here, since users might experiment with non-normal distributions (but this is rare, and the McCulloch paper cited in https://github.com/paul-buerkner/brms/issues/231#issuecomment-370238182 argues that it's not that useful).

Nevertheless, if you wanted to fully support 'integrating out' non-normal distributions, I think brms / Stan / your favorite Bayes software would need to report what type of distribution was used for the random effect, so it could be used in something like:

# numerical / Monte Carlo integration over the random effect distribution
integrate_wrt_re(lin_pred, distribution_re, ...)

I think you'll need to ask the Bayes software to meet you halfway and do the heavy lifting of returning simulations from the specified RE distribution (for Monte Carlo integration), or alternatively give you enough information about the distribution to generate a templated R-side closure that can then be integrated with variants of https://stat.ethz.ch/R-manual/R-devel/library/stats/html/integrate.html.
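As a rough sketch of what such an R-side closure might look like in the common normal-RE case (the function names below are made up for illustration; only rnorm, plogis, dnorm, and stats::integrate are existing functions):

# Sketch only, assuming a logistic model with a single normal random effect.
# lin_pred is one draw of the linear predictor excluding the RE,
# sd_re is the matching draw of the RE standard deviation.

# Monte Carlo integration: simulate from the RE distribution and average
integrate_wrt_re_mc <- function(lin_pred, sd_re, n_sim = 1000) {
  re <- rnorm(n_sim, mean = 0, sd = sd_re)
  mean(plogis(lin_pred + re))
}

# Quadrature alternative via stats::integrate(), same normality assumption
integrate_wrt_re_quad <- function(lin_pred, sd_re) {
  integrate(function(u) plogis(lin_pred + u) * dnorm(u, mean = 0, sd = sd_re),
            lower = -Inf, upper = Inf)$value
}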

paul-buerkner commented 4 years ago

Thinking more about the generics above, I think there is one conceptual issue that we need to resolve, related to #8 and #13. posterior_predict and related methods usually return a matrix of draws (ndraws x nobservations), which would fit into the structure of draws_matrix objects. However, this structure is not general enough, I think. For more complex models, such as multivariate models, we need a third dimension for the different response variables. Perhaps there are even cases where we need more than 3 dimensions. Basically, what we need are multidimensional arrays as proposed in #13. We should provide package developers with clear guidance as to what should be returned by methods such as posterior_predict so that users can expect a consistent interface across packages.
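A toy example of the shape issue, with made-up dimensions:

# Univariate model: posterior_predict() conventionally returns ndraws x nobs
ndraws <- 4000; nobs <- 25; nresp <- 3
yrep_uni <- matrix(rnorm(ndraws * nobs), nrow = ndraws, ncol = nobs)

# Multivariate model: a third dimension is needed for the response variables,
# which a plain draws_matrix cannot represent
yrep_mv <- array(rnorm(ndraws * nobs * nresp), dim = c(ndraws, nobs, nresp))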

The conceptual problem arises because we have thought of #13 as being resolved as part of the rv-like interface in #8. However, I am not sure if we should enforce the usage of rvars (or whatever we call the objects of the rv-like interface) on all users of posterior_predict etc. (so essentially the vast majority of rstanarm and brms users). I had previously thought of rvars as providing a nice but optional interface, not something that we build so strongly into common interfaces, but I am happy to change my mind.

Essentially, I think we have two options:

  1. Indeed use multidimensional rvars as the required/recommended output format of posterior_predict etc.
  2. Have support for multidimensional arrays (as per #13) independently of rvars and use those as the required/recommended output format of posterior_predict etc. Those multidimensional arrays (let's call them draws_multi_array for now) could then serve a dual purpose: (a) they would be usable standalone and (b) they would be the storage format inside rvar. It would be easy to transform between draws_multi_array and rvar via conversion functions, one of which, draws_of(rvar), already exists (see the sketch below).
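A small sketch of the dual-purpose idea in option 2, assuming the rvar interface roughly as proposed in #8 (rvar() and draws_of() are used here as they later appeared in posterior):

library(posterior)

# Multidimensional draws array: draws first, then observations and response variables
arr <- array(rnorm(4000 * 25 * 3), dim = c(4000, 25, 3))

x <- rvar(arr)       # wrap the array as an rvar; the draws dimension becomes implicit
dim(x)               # 25 x 3
arr2 <- draws_of(x)  # recover the underlying 4000 x 25 x 3 array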

@mjskay I know you won't have time to work on the rvars interface for the time being, but I would still appreciate your input on this topic. If we decide for Option 1, I think we should postpone moving several generics to posterior until the rvar interface is fully operational. If we decide for Option 2, I could already work on draws_multi_array objects, which you could then simply add as the storage format in rvar objects once you continue working on this feature.

mjskay commented 4 years ago

This is a really tough call!

I think my instinct is to have fewer redundant ways of doing things if at all possible (lots of friction happens from a user's perspective when their data needs to be in one format and they have it in another, especially when the formats are superficially similar). FWIW, my long-term goal for the corresponding functions in tidybayes is that they be built on the generics in posterior and give long-format tables with rvars in them instead of (or perhaps as an alternative to) the current long-format data frames, and then to modify the existing geoms and stats in tidybayes to accept rvars directly. I suspect this will end up being a particularly powerful and flexible approach for building custom visualizations.

The counter-argument to this is that if someone needs to do custom operations on the underlying array and then take advantage of our summary functions, they would have to wrap/unwrap that data structure using draws_of(). This operation should be fast, but the question is whether it would be tedious. My hope is that, for the vast majority of things people need to do, they wouldn't need to touch the underlying array anyway, so maybe this would not be tedious.

paul-buerkner commented 4 years ago

Your points make a lot of sense. I guess what I am worried about is that there are a lot of potential custom operations that only work on the underlying array directly. For example, if I want to apply a sin transform to all draws in the array, can I simply call sin(rvar) or do I need to use rdo(sin(rvar))? For all kinds of operations that are naturally vectorized over arrays already, working with rvars might add another layer of complexity, if I understand it correctly.

By having common methods such as posterior_predict return rvars, we are imposing this interface on a lot of users. So we need to be quite confident in this approach, as the impact of such a decision will be large.

mjskay commented 4 years ago

Ah yeah, that makes sense. Some of those functions are already covered, including everything in the Math generic, which includes sin() --- but I see that your point is not about sin() in particular, but more generally about any function that someone might have already implemented for arrays that isn't explicitly supported. For such functions, unless those functions are written strictly in terms of the functions we do provide wrappers for, you'd have to use rdo() or apply them directly to draws_of(x) (the latter would be faster provided the function does not change the shape of the array).
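Spelled out with a toy example (again assuming the rvar interface roughly as it later shipped in posterior), the three routes look like this:

library(posterior)

x <- rvar(rnorm(4000))   # scalar rvar with 4000 draws

sin(x)                   # covered: dispatches through the Math group generic
rdo(sin(x))              # same result via rdo(); works for arbitrary expressions but is slower
rvar(sin(draws_of(x)))   # apply the function to the raw draws array and re-wrap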

So perhaps it is a question of whether the Math generic (perhaps plus some other reasonable set of functions we can wrap) provides enough coverage for the majority of users? I could see the answer being no...

paul-buerkner commented 4 years ago

Having every method in the Math group generic covered is quite good already. I think we need to see rvars in action before we can make an informed decision on whether we want to enforce them as the output format of a lot of important methods.

There are more decisions to be made with regard to the generics, for example, whether to add more arguments to the generics to enforce their consistent use, and what defaults they should have (e.g., the prob argument of posterior_interval). Since both of you, @jgabry and @mjskay, seem quite busy at the moment, we should perhaps postpone this to a later point (by which we perhaps also have rvars) and then implement all generics at once after discussing them on a call.
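For example, one way to enforce a consistent default would be to pin it in the generic itself; the sketch below is illustrative, not a decision about the actual default:

# Hypothetical sketch: declaring the default in the generic so every package
# inherits the same behaviour unless a method deliberately deviates
# (prob = 0.9 is just an example value)
posterior_interval <- function(object, prob = 0.9, ...) {
  UseMethod("posterior_interval")
}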

Depending on when we need to first release posterior, the generics, rvar, etc. could go into the first release or not, but there would be no harm in adding them later either.