Support for multiple response variables

eb8680 commented 4 years ago

The mvbrmsformula function in brms provides support for multiple formulas with shared inputs:

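library(brms)

# x is the response in one formula and a predictor in the other
xf <- bf(x ~ z + 1)
yf <- bf(y ~ x + z + 1)
# combine both formulas into a single multivariate model
formula <- mvbf(yf, xf, ...)
fitted_model <- brm(formula, ...)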

It would be nice if brmp supported this behavior as well. I believe that in code generation this would just correspond to generating a single model with multiple response sample statements.
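To make the code generation idea concrete, here is a minimal sketch that emits Pyro-style source with one observed sample statement per response. It is not brmp's actual generator: `generate_model`, the `data['X_<response>']` design-matrix keys, and the Normal/HalfCauchy priors are all assumptions made for illustration.

```python
def generate_model(formulas):
    """Emit source for one model function with one observed sample site per response.

    `formulas` maps each response name to the columns of its design matrix.
    Hypothetical helper: priors and the data layout are placeholders.
    """
    lines = ["def model(data):"]
    for resp, cols in formulas.items():
        lines += [
            f"    # {resp} ~ {' + '.join(cols)}",
            f"    b_{resp} = pyro.sample('b_{resp}', dist.Normal(0., 1.).expand([{len(cols)}]).to_event(1))",
            f"    sigma_{resp} = pyro.sample('sigma_{resp}', dist.HalfCauchy(1.))",
            f"    mu_{resp} = data['X_{resp}'] @ b_{resp}",
            f"    pyro.sample('{resp}', dist.Normal(mu_{resp}, sigma_{resp}).to_event(1), obs=data['{resp}'])",
        ]
    return "\n".join(lines)


# The two formulas from the example above: y ~ x + z + 1 and x ~ z + 1.
source = generate_model({"y": ["x", "z", "1"], "x": ["z", "1"]})
print(source)
```

The generated text is only printed here; running it would additionally require `import pyro` and `import pyro.distributions as dist`, plus the design matrices.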

Additional question for discussion: would this require implementing brms's (expr | ID | group) grouping syntax as well?

eb8680 commented 4 years ago

@null-a what do you think about the feasibility of this? Are there any major stumbling blocks you're aware of?

null-a commented 4 years ago

> what do you think about the feasibility of this? Are there any major stumbling blocks you're aware of?

@eb8680: To the extent that I'm familiar with this feature, it seems perfectly feasible to me -- I see no major stumbling blocks.

> I believe that in code generation this would just correspond to generating a single model with multiple response sample statements.

Yes, that's my understanding too. When multiple responses share the same family, could this also be a single sample statement from a multivariate distribution? It seems that when the response is either multivariate normal or Student's t, brms models residual correlations by default. (In particular, the response distribution is parameterised by standard deviations and a correlation matrix.)
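As a worked illustration of that parameterisation, the covariance implied by standard deviations and a correlation matrix is diag(sds) @ corr @ diag(sds). The sketch below uses plain Python lists; the function name and the example numbers are made up for illustration.

```python
def covariance(sds, corr):
    """Return Sigma with Sigma[i][j] = sds[i] * corr[i][j] * sds[j]."""
    n = len(sds)
    return [[sds[i] * corr[i][j] * sds[j] for j in range(n)] for i in range(n)]


sds = [2.0, 0.5]                  # residual standard deviations, one per response
corr = [[1.0, 0.3], [0.3, 1.0]]   # residual correlation matrix
sigma = covariance(sds, corr)
print(sigma)
```

Setting the off-diagonal correlations to zero recovers the case of independent responses, which is the same as using separate sample statements.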

> would this require implementing brms's (expr | ID | group) grouping syntax as well?

My understanding is that it is possible to implement multivariate responses without this, but that once we do so this would be a useful extension to add? (Because it allows group-level terms in different formulas to be modeled as correlated.)
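To illustrate what the ID buys us, here is a toy sketch that scans formula strings for (expr | ID | group) terms and collects the responses whose group-level terms share an ID, and so would be modeled as correlated. This is not how brms or brmp parse formulas; the regular expression and `correlated_groups` are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Matches "(expr | group)" and "(expr | ID | group)"; the ID part is optional.
TERM = re.compile(r"\(([^|()]+)\|(?:([^|()]+)\|)?([^|()]+)\)")


def correlated_groups(formulas):
    """Map (ID, grouping factor) to the (response, expr) terms that share that ID."""
    shared = defaultdict(list)
    for formula in formulas:
        response, rhs = (s.strip() for s in formula.split("~", 1))
        for expr, term_id, group in TERM.findall(rhs):
            if term_id.strip():  # terms without an ID stay independent
                key = (term_id.strip(), group.strip())
                shared[key].append((response, expr.strip()))
    return dict(shared)


formulas = ["y1 ~ x + (1 | p | g)", "y2 ~ x + (1 | p | g)", "y3 ~ x + (1 | g)"]
print(correlated_groups(formulas))
```

Here the intercepts for y1 and y2 share the ID p and would get a joint prior with a correlation matrix, while the term in y3 has no ID and stays independent.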

Some further thoughts, noted for future reference:

- Would we want to support models in which response variables have different response families? (e.g. y1 comes from a normal and y2 from a Bernoulli.) I don't see how to do this in brms, but it wouldn't surprise me if it's possible; it's very flexible.
- There's perhaps some overlap here with models with a categorical response, as they also result in models in which mu is a vector rather than a scalar. "Distributional models" are also similar (though perhaps only superficially), since they too are specified with multiple formulas.
- It seems most of the work would be in extending code generation. The mechanism used to specify priors would need to be extended so that parameters from individual formulas can be picked out. (brms appears to do this using the name of the response variable.) fitted would return an array with an extra dimension. I guess formula parsing and design matrix coding would be unchanged.
- The mvbind notation for setting up a multivariate model (in which each element of the response uses the same formula) seems like it would be useful eventually. Perhaps that could also be written as e.g. [y1,y2] ~ 1 + x. Perhaps for high-dimensional data there's a way to specify a range of columns, e.g. have y[0:2] ~ 1 + x be equivalent to [y0,y1,y2] ~ 1 + x.
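The bracket and range notations in the last point could be expanded into one formula per response before any further processing. A minimal sketch, assuming the inclusive range reading used in the example (y[0:2] covers y0, y1 and y2); both function names are hypothetical and neither notation exists in brmp.

```python
import re


def expand_responses(lhs):
    """Expand a response spec such as "[y1,y2]" or "y[0:2]" into column names."""
    lhs = lhs.strip()
    m = re.fullmatch(r"(\w+)\[(\d+):(\d+)\]", lhs)
    if m:
        name, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
        return [f"{name}{i}" for i in range(lo, hi + 1)]  # inclusive upper bound
    m = re.fullmatch(r"\[(.*)\]", lhs)
    if m:
        return [col.strip() for col in m.group(1).split(",")]
    return [lhs]  # ordinary single response


def expand_formula(formula):
    """Return one "response ~ rhs" formula per expanded response."""
    lhs, rhs = formula.split("~", 1)
    return [f"{resp} ~ {rhs.strip()}" for resp in expand_responses(lhs)]


print(expand_formula("[y1,y2] ~ 1 + x"))
print(expand_formula("y[0:2] ~ 1 + x"))
```

Expanding early like this would leave formula parsing and design matrix coding untouched, since each expanded formula is an ordinary single-response formula.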

eb8680 commented 4 years ago

> Would we want to support models in which response variables have different response families?

I think a first version supporting only Normal responses with uncorrelated residuals would be fine, but in general we should be able to support all response families in the case where responses are fully observed, and at least Normal and Categorical/Bernoulli when responses are missing (#43, #44). Support for the general case might involve doing relatively naive code generation in brmp and expecting the Pyro backend to be smarter about simplifying the resulting model.

Modelling correlations in residuals across arbitrary families seems more difficult; we can punt on that for now.

> There's perhaps some overlap here with models with a categorical response, as they also result in models in which mu is a vector rather than a scalar ...

I wonder if we might want to draw a distinction between multiple variables and vector/tensor-valued variables or means, in the same way that Pyro allows tensor-valued sample statements via Distribution.to_event. That way brmp could support non-scalar responses of different shapes, families, etc, and we could naturally support mvbind and generalizations. That's sort of what I was thinking when I opened #46 although I don't have a fully formed proposal for that.
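To spell out the shape bookkeeping behind that distinction, here is a toy model of what to_event does to a distribution's shapes: it reinterprets the rightmost batch dimensions as event dimensions. The `Shape` class is a stand-in written for this comment, not a Pyro class; in Pyro the corresponding call would be something like `dist.Normal(mu, sigma).to_event(1)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Toy stand-in for a distribution's batch_shape and event_shape."""
    batch_shape: tuple
    event_shape: tuple = ()

    def to_event(self, n):
        """Reinterpret the n rightmost batch dims as event dims."""
        if n > len(self.batch_shape):
            raise ValueError("not enough batch dimensions")
        split = len(self.batch_shape) - n
        return Shape(self.batch_shape[:split], self.batch_shape[split:] + self.event_shape)


# 100 rows of data and 3 responses, as independent scalar Normals.
independent = Shape(batch_shape=(100, 3))
# The same values viewed as one vector-valued response per row.
vector_valued = independent.to_event(1)
print(vector_valued)
```

Under this view, "multiple variables" would be several sample statements, each with its own shape and family, while a vector-valued response would be a single statement with a non-empty event shape.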

I would also like to see categorical responses supported eventually, since one of my motivating examples for this series of issues is a hierarchical HMM where the transition and emission distributions can each be written as GLMMs.

> It seems most of the work would be in extending code generation

Yeah, that sounds right. Maybe a good starting point would be to collect some examples that brms handles?

null-a commented 4 years ago

> I wonder if we might want to draw a distinction between multiple variables and vector/tensor-valued variables or means

This sounds like a promising direction to me.

pyro-ppl/brmp issue #42