paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0
1.28k stars, 181 forks

implementation of zero-inflated gamma family #742

Status: Closed (ThomasKraft closed this issue 5 years ago)

ThomasKraft commented 5 years ago

One class of data that is currently difficult to model in brms is skewed continuous outcomes with a high density of exact zeros. Although brms does support hurdle gamma models, my understanding is that this essentially requires fitting two separate models, which is not the most theoretically appropriate approach in all cases. I am therefore wondering whether it would be possible to implement a zero-inflated gamma family option.

The exact kind of model that I am looking for is described in the following paper: https://link.springer.com/article/10.1007/s12110-014-9193-4. Notably, the supplement of that paper provides full Stan code for the implementation (see here for the R code as well as additional documentation: https://github.com/rmcelreath/mcelreath-koster-human-nature-2014), as well as a link to Richard McElreath's glmer2stan package, which can be used to generate these types of models using glmer syntax with the "zigamma" family (see part (3) in https://github.com/rmcelreath/glmer2stan). Finally, I've discovered that McElreath's more recent rethinking package, which provides map2stan(), has an option for zero-inflated gamma distributions, in case that is useful (https://www.rdocumentation.org/packages/rethinking/versions/1.59/topics/dzagamma2).

My goal here is to be able to run a model nearly identical to the one in the paper linked above, but with some of the additional functionality and convenience from the brms package (mainly the seamless incorporation of splines through mgcv). My hope is that the existence of resources on this topic makes this something that would be relatively straightforward to implement, although unfortunately I don't feel qualified to do so myself. Thanks in advance for the help.

paul-buerkner commented 5 years ago

Since the gamma distribution has zero probability at zero, the corresponding hurdle and zero-inflated distributions are identical, as far as I understand. Or can you explain how they are supposed to differ?
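The equivalence can be written out as a short sketch. The notation here is an assumption and does not come from the thread: \pi is the probability of a zero (which would correspond to the `hu` parameter of brms' hurdle families) and g(y | \alpha, \beta) is the gamma density.

```latex
% Assumed notation: \pi = probability of a zero outcome,
% g(y \mid \alpha, \beta) = gamma density with shape \alpha and rate \beta.
\begin{align*}
\text{Zero-inflated:}\quad
  & P(Y = 0) = \pi + (1 - \pi)\,\underbrace{P_{\text{gamma}}(Y = 0)}_{=\,0} = \pi, \\
  & f(y) = (1 - \pi)\, g(y \mid \alpha, \beta), \qquad y > 0, \\[4pt]
\text{Hurdle:}\quad
  & P(Y = 0) = \pi, \\
  & f(y) = (1 - \pi)\, \frac{g(y \mid \alpha, \beta)}{1 - P_{\text{gamma}}(Y = 0)}
         = (1 - \pi)\, g(y \mid \alpha, \beta), \qquad y > 0.
\end{align*}
```

For a count distribution such as the Poisson, the base distribution itself puts positive mass on zero (e^{-\lambda}), so the zero-inflated and hurdle versions do differ there. For a continuous distribution like the gamma that term vanishes and the two likelihoods coincide.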


ThomasKraft commented 5 years ago

It is possible that I am misunderstanding and that they do not in fact differ. I assumed they were different because I understood a hurdle model to be equivalent to running two separate models and combining the posteriors (one for zero/nonzero and the other for gamma-distributed nonzero data), and because the authors of the linked paper seem to draw an explicit distinction between their modelling approach and a "separate regression models" approach. One benefit they promote is the ability to estimate the correlation, within an individual-level random intercept, between the zero and nonzero parts of the model. But perhaps that is already captured in the existing hurdle_gamma structure? See the description from the paper below:

"This paper presents a unified Bayesian analysis of variation in human foraging returns. Variation in these data arise from differences in age, skill, and hunt duration, as well as many unmeasured and un-modeled factors. Instead of coercing the outcome measure, kilograms of meat returned to camp, into a convenient distribution, we modeled these returns using a two-process zero-inflated gamma mixture model. The benefit of this additional complexity is that we are able to discuss risk as well as average returns within the same model. When analyses focus on only one part of the mixture, whether zeros or non-zeros, or average across trips to blend zeros and non-zeros together, information is lost. In addition, the models estimate the correlation between the components of hunting returns, a superior approach to running separate regression models on zero and non-zero outcomes, because it allows information about failures to inform estimates about harvests, and vice versa"

paul-buerkner commented 5 years ago

What they mean by separate regressions is literally fitting two separate models, which is clearly not what happens in brms with the hurdle gamma family. In other words, their zero-inflated gamma and brms' hurdle gamma family are the same.
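As a rough illustration of fitting both parts in a single model, here is a minimal sketch rather than the paper's actual model. The data are simulated, and the names `hunter`, `age_z`, and `kg` are hypothetical. It assumes the `|<ID>|` grouping syntax documented for brms formulas, which requests correlated group-level effects across the formula for the gamma mean and the formula for the hurdle probability `hu`.

```r
library(brms)

set.seed(1)

# Hypothetical data: 30 hunters with 20 trips each (names are illustrative only)
n_hunters <- 30
n_trips   <- 20
d <- data.frame(
  hunter = factor(rep(seq_len(n_hunters), each = n_trips)),
  age_z  = rnorm(n_hunters * n_trips)
)

# Roughly 40% failed trips (exact zeros), gamma-distributed returns otherwise
is_zero <- rbinom(nrow(d), size = 1, prob = 0.4)
d$kg <- ifelse(is_zero == 1, 0, rgamma(nrow(d), shape = 2, rate = 0.5))

# One joint model: the first formula is for the gamma mean, the second for the
# hurdle (zero) probability. The shared ID "p" asks brms to estimate the
# correlation between the two hunter-level intercepts.
form <- bf(
  kg ~ age_z + (1 | p | hunter),
  hu ~ age_z + (1 | p | hunter)
)

# Short run for illustration only; real analyses need more iterations and checks
fit <- brm(form, data = d, family = hurdle_gamma(), chains = 2, iter = 1000)
summary(fit)
```

Because the simulated data contain no real hunter-level variation, the estimated correlation will be poorly identified here; the sketch is only meant to show the syntax.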


ThomasKraft commented 5 years ago

Ah OK, very sorry about the confusion and for wasting your time. Thank you for clarifying, Paul!