paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0
1.27k stars 183 forks source link

Beta regression total sample size #144

Closed inti closed 7 years ago

inti commented 7 years ago

Hi, I have some count data where I observe counts: for each level of groupID I obtain multiple pairs of observations #A and #B nested within groupID. I want to model average prob of observing A #A/(#A+#B) as a beta distribution and model the groupID as a random effect. I see there is a Beta() family on brms. I am writing my model as

library(brms)
counts$n = = counts$alleleOne + counts$alleleTwo
mod1 = brm(alleleOne/n ~ 1  + (1|featureName),data=counts,family = Beta())

but it does not do anything :( mod1 = brm(alleleOne/n ~ 1 + (1|featureName),data=counts) runs but it is quite what I want since it does not model the observation sample size (n=A+B)

the data looks like this https://gist.github.com/inti/02eaf3916908b3307e369b7fe5684ad3

how do I give it the total number of observations for each pair A and B? why does it not run when I set the beta() family

many thanks in advance

paul-buerkner commented 7 years ago

"It does not do anything" is not very specific. What does brms complain about?

inti commented 7 years ago

sorry, I did not mean to be unhelpful. it literally does not complain. this a full example I have brms Version: 1.1.0 Date: 2016-10-11

> library(RCurl)
> library(bitops)
> url <- 'https://gist.githubusercontent.com/inti/02eaf3916908b3307e369b7fe5684ad3/raw/f183c49ca5ecd362ab49597d4b32c3859ae74048/counts.txt'
> df <- getURL(url)
> counts = read.table(text = df,header = T)
> library(brms)
> counts$n = counts$alleleOne + counts$alleleTwo
> mod1 = brm(alleleOne/n ~ 1  + (1|featureName),data=counts,family = Beta())
> 

I was using RStudio, if I do it on the shell it does complain Error: The beta distribution requires responses between 0 and 1. should the total sample size be provided as an offset? or is there another way to specify it?

paul-buerkner commented 7 years ago

If there are outcomes that are exactly zero or one, brms has to throw this error as zeros and ones are not working with the beta distribution in Stan (only everything in between works). I wonder however, if a simple binomial model might be appropriate for your data. You could try

mod1 = brm(alleleOne | trials(n) ~ 1  + (1|featureName), 
           data=counts, family = binomial()) 
inti commented 7 years ago

ok. Actually the full hierarchical model is: observations A/(A+B) are binomial and within each featureName they are modeled using a Beta, so it is beta binomial distribution. Makes sense? I have coded it before in Stan and I am trying to build it with brms. Each featureName_{j} has its own Beta(alpha_{j},beta_{j}), p_{ij} ~Beta(alpha_{j},beta_{j} and #alleleOne ~Binomial(n_i,p_{ij})

Is there a way to hack this beta-binomial in brms?

thanks again!

paul-buerkner commented 7 years ago

Unfortunately not. This model does not fit nicely in the usual regression framework, where the mean (p for binomial models) is modeled via inv_link(linear_predictor). One could think of adding a family beta_binomial that allows to estimate alpha and beta, but I am not sure how much sense that makes.

paul-buerkner commented 7 years ago

After furthing thinking about this, I am convincent that the beta-binomial family can be reasonably implemented in brms. However, I wouldn't use the shape1, shape2 parameterization of the beta distribution but the mean, precision parameterization, instead. This fits much more nicely into the existing framework. I put the beta-binomial family on the list in #110 so that I don't forget about it.

I will close this issue here to keep the issue tracker clear. If you feel that there is more to discuss regarding the issue feel free to reopen it.

schmettow commented 7 years ago

The beta-binomial is often chosen when overdispersion is an issue. Formally, overdispersion can be thought of as unexplained variance, and is therefore the rule rather than exception. Thinking about it that way, there is a rather straight-forward solution: adding an observation-level random effect. This "pulls" the unexplained variance into the linear predictor of the binomial (=logistic) model, where it is assumed to be normally distributed. In practice, the beta-binomial and the logit-normal distributions are very similar. It also allows to compare degree of variations, as they are all on the same scale.

You do it by adding an identifier for observations to the data set, say ObsID and then specify the model as:

mod1 = brm(alleleOne | trials(n) ~ 1 + (1|featureName) + (1|ObsID), data=counts, family = binomial())

Dunno whether this is valid in the context of genetics, though ...

inti commented 7 years ago

Thanks for the comment, I guess it would need to be mod1 = brm(alleleOne | trials(n) ~ 1 + (1|featureName/ObsID), data=counts, family = binomial())

since ObsID are nested within featureName.

I think you are right in terms of how to handle the extra variation. The models are not fully equivalent but it might do.

Cheers, Inti

On Nov 22, 2016, at 08:34, schmettow notifications@github.com wrote:

The beta-binomial is often chosen when overdispersion is an issue. Formally, overdispersion can be thought of as unexplained variance, and is therefore the rule rather than exception. Thinking about it that way, there is a rather straight-forward solution: adding an observation-level random effect. This "pulls" the unexplained variance into the linear predictor of the binomial (=logistic) model, where it is assumed to be normally distributed. In practice, the beta-binomial and the logit-normal distributions are very similar. It also allows to compare degree of variations, as they are all on the same scale.

You do it by adding an identifier for observations to the data set, say ObsID and then specify the model as:

mod1 = brm(alleleOne | trials(n) ~ 1 + (1|featureName) + (1|ObsID), data=counts, family = binomial())

Dunno whether this is valid in the context of genetics, though ...

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/paul-buerkner/brms/issues/144#issuecomment-262218495, or mute the thread https://github.com/notifications/unsubscribe-auth/AAXApXSFFojun3Zmni_XKtTwclLzYtqOks5rAtNigaJpZM4KxyRf.

paul-buerkner commented 7 years ago

In your case, (1|featureName/ObsID) and (1|featureName) + (1|ObsID) will be equivalent since ObsID is already unique across all feature names.

inti commented 7 years ago

Cheers, Inti

On Nov 22, 2016, at 08:47, Paul-Christian Bürkner notifications@github.com wrote:

In your case, (1|featureName/ObsID) and (1|featureName) + (1|ObsID) will be equivalent since ObsID is already unique across all feature names.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/paul-buerkner/brms/issues/144#issuecomment-262220785, or mute the thread https://github.com/notifications/unsubscribe-auth/AAXApczBnuH40Gkm3hQ0zR_MSPlHZyyEks5rAtY2gaJpZM4KxyRf.