paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0
1.27k stars 183 forks source link

Conway-Maxwell-Poisson distribution #607

Open paul-buerkner opened 5 years ago

paul-buerkner commented 5 years ago

The Conway-Maxwell-Poisson distribution (https://en.wikipedia.org/wiki/Conway%E2%80%93Maxwell%E2%80%93Poisson_distribution) is a generalization of the Poisson distribution that allows for both over- and underdispersion. There have been some attempts to get this distribution running in Stan, but I haven't seen one that works fast enough for most practical applications. Once we got such an implementation though, I think the distribution would make a very valuable addition to brms.

paul-buerkner commented 5 years ago

An experimental version of the COM Poisson distribution is now available on GitHub. It is not yet documented or officially supported but can already be accessed via brmsfamily("com_poisson").

I currently use the mode parameterization as proposed by Guikema and Goffelt (https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1539-6924.2008.01014.x) with a (hopefully) accurate approximation of the normalizing constant via truncation. The implementation is not terribly efficient (read: an order of magnitude or more slower than the Poisson distribution), but it seems reasonably stable at least.

Ideally, I would like to parameterize the mean, but I haven't found a way, yet, to get this parameterization working in a Bayesian framework.

In any case, I would love if users could try out the distribution and provide feedback.

The Stan functions for the current COM Poisson implementation of brms are available at https://github.com/paul-buerkner/brms/blob/master/inst/chunks/fun_com_poisson.stan

bgoodri commented 5 years ago

In some situations, it would make sense to utilize the results in https://arxiv.org/abs/1612.06618 to approximate the normalizing constant well-enough.

StaffanBetner commented 5 years ago

An other option could be to implement Consul's generalized Poisson distribution, which could be parameterized by mean and variance.

paul-buerkner commented 5 years ago

Thanks @bgoodri, just implemented the approximation.

@StaffanBetner Do you have a reference for me about the Consul's generalized Poisson distribution?

StaffanBetner commented 5 years ago

@paul-buerkner The two main references are Consul & Jain (http://dx.doi.org/10.1080/00401706.1973.10489112) and Consul & Shoukri (https://doi.org/10.1080/03610918508812463) (discussing some truncation problems).

Regarding COM Poisson: Huang (https://doi.org/10.1177/1471082X17697749) would probably be a complement to the pre-print suggested by @bgoodri.

paul-buerkner commented 5 years ago

I now realize that the Consul's generalized Poisson distribution is what I know just as generalized Poisson distribution. The disadvantage of that distribution is that we can't properly model underdispersion unless we allow the model to have some very weird support that is no longer just x >= 0 for integer x. It may still be a nice model to look at when it comes to overdispersion, though.

StaffanBetner commented 5 years ago

Yes, that's true, and the suggested rule for the support isn't entirely proper either (might not result in unity of the sum of the support). I think this could be remedied by ignoring that rule and thus truncating below zero probabilities to zero and add a normalizing constant when the variance is lower than the mean instead. The normalizing constant would probably be easier to calculate than the constant for COM Poisson since it (probably) won't be a summation over an infinite support. But that is just what I suspect is the best implementation.

paul-buerkner commented 5 years ago

That would probably justify its own research project I guess, but yes, I agree this would likely be the way to go. Still, I think the GPD becomes a weird model once we allow for lambda < 0 that is allow for underdispersion.

jeffpollock9 commented 5 years ago

Many thanks for implementing this distribution and all the great work on brms!

I tried out this distribution but was getting odd results so wrote a test - I think the implementation has a bug somewhere as the PMF does not match any other R packages on CRAN. I did the following by copy-pasting fun_com_poisson.stan into a functions block and named the file test.stan:

library(rstan)
library(CompGLM)
library(compoisson)
library(COMPoissonReg)

expose_stan_functions("test.stan")

y <- 0:10
lambda <- 4.0
nu <- 1.5

br <- numeric(length(y))

for (i in seq_along(y)) {
    br[i] <- exp(com_poisson_lpmf(y[i], lambda, nu))
}

results <- data.frame(
    y = y,
    brms = br,
    CompGLM  = dcomp(y, lambda, nu),
    compoisson  = dcom(y, lambda, nu),
    COMPoissonReg = dcmp(y, lambda, nu)
)

which outputs:

> print(results)
    y        brms      CompGLM   compoisson COMPoissonReg
1   0 0.006732772 5.483936e-02 5.484348e-02  5.483936e-02
2   1 0.053862177 2.193575e-01 2.193739e-01  2.193575e-01
3   2 0.152345243 3.102183e-01 3.102416e-01  3.102183e-01
4   3 0.234550845 2.388061e-01 2.388241e-01  2.388061e-01
5   4 0.234550845 1.194031e-01 1.194120e-01  1.194031e-01
6   5 0.167830923 4.271894e-02 4.272215e-02  4.271894e-02
7   6 0.091355583 1.162662e-02 1.162750e-02  1.162662e-02
8   7 0.039461903 2.511115e-03 2.511303e-03  2.511115e-03
9   8 0.013951889 4.439065e-04 4.439399e-04  4.439065e-04
10  9 0.004133893 6.576393e-05 6.576887e-05  6.576393e-05
11 10 0.001045801 8.318553e-06 8.319177e-06  8.318553e-06

it works for nu = 1 but not any other values, as far as I can tell.

paul-buerkner commented 5 years ago

brms uses a different parameterization, which is called the "mode parameterization" not the original one. To get mu (the mode) from lambda (the original parmater), compute mu = lambda^(1 / nu) and plug that into the brms function.

Changing the code to

for (i in seq_along(y)) {
  br[i] <- exp(com_poisson_lpmf(y[i], lambda^(1/nu), nu))
}

solves the problem on my machine.

jeffpollock9 commented 5 years ago

@paul-buerkner yes that fixes it - thanks again! Sorry I missed that, it might be worth making this very clear in the docs as it is different from all the other packages on CRAN which people might be accustomed with.

paul-buerkner commented 5 years ago

No worries. You couldn't have known. Of course, this needs to be stated in the docs but since the COM-Poisson distribution is currently kept (semi-)internal and not officially supported, there is intentionally no doc related to this distribution yet.

MilaniC commented 5 years ago

The exact mean parametrisation of Huang (2017) is more helpful here than the mode parameterisation. The mean form is now used in the glmmTMB wrapper for TMB and works well - and can account for zero-inflation. Highly recommend implementing the exact mean parameterisation as is far more stable and provides direct comparison of effects between associated likelihoods like Poisson, NB etc (this is all clearly addressed in Huang 2017).

Huang, A. (2017). Mean-parametrized Conway–Maxwell–Poisson regression models for dispersed counts. Statistical Modelling, 17(6), 359–380. https://doi.org/10.1177/1471082X17697749

paul-buerkner commented 5 years ago

Thanks! I am aware of this parameterization and I prefer it conceptually as well. However, it requires an inner optimization to obtain the mean, which, if implemented in a fully Bayesian setting, will be even slower than anything else...

Anyway, one of the reasons I am not exporting the COM-Poisson distribution at the moment is that I actually want to use this mean parameterization. So suggestions how to speed up this parameterization (ideally without an inner optimization if at all possible) are more than welcome!

MilaniC commented 5 years ago

While based on the mode parameterisation, perhaps the fast rejection sampler proposed here for the COM-Poisson might be helpful:

Benson A, Friel N (2017) Bayesian inference, model selection and likelihood estimation using fast rejection sampling: the Conway-Maxwell-Poisson distribution. https://arxiv.org/abs/1709.03471

joshua-goldberg commented 4 years ago

I have a follow-up question regarding the mode parametrization currently implemented. In particular, how should one calculate fitted values on the link/response scale? If I understand correctly, the parameter estimates represent the mode on the log-link scale. Let's call the mode "mu", following Guikema & Goffelt. The mode is related to the "mean" – lambda – by mu = lambda^(1/nu), or lambda = mu^nu, after some algebraic rearrangement.

If the above holds, then response scale estimates of lambda, on the response scale, should be given by lambda = exp(fitted values from model)^nu. However, when I make this calculation using tidybayes::add_fitted_draws(), I see some minor deviations between the fitted value reported directly and the value estimated from transformation from the link to response scale and it first glance the transformed values are all larger than the direct estimates. It seems I have missed something, but I cannot see what. I'd like to make sure that I am plotting output from COM-Poisson models and performing comparisons among groups correctly.

MilaniC commented 4 years ago

If I understand your query correctly - then this discrepancy is probably because the approach using tidybayes (which is correct) is to transform-then-summarise using the posterior draws. But if you just back-transform the parameter point estimates in the brm parameter summary then that is using a summarise-then-transform approach (which is not necessarily appropriate), and that will invariably give you different point summary estimates. Go with the summary via tidybayes using the posterior draws from your model fit, as that is the appropriate step in a Bayesian workflow. I think this helps to answer your query.

joshua-goldberg commented 4 years ago

Thanks for the quick response. I agree that transform-then-summarise is the correct approach, but I feel quite certain that my issue does not emerge out of a summaries-then-transform error. To be a bit more concrete, here's a pseudo-code sketch of my approach and what I've been able to take away thus far:

drawsLink = expand.grid( SOME MODEL TERMS ) %>% add_fitted_draws(model, re_formula=NA, scale='linear', dpar = c(mu='mu', shape='shape))

drawsResp = expand.grid( SOME MODEL TERMS ) %>% add_fitted_draws(model, re_formula=NA, scale='response', dpar = c(mu='mu', shape='shape'))

On the linear scale, drawsLink, I can see that .value == mu for all cases. I can then partially relate this output to the response scale, drawsResp, where drawsResp$mu == exp(drawsLink$mu), or equivalently, drawsResp$mu == exp(drawsLink$.value). On the response scale, .value != mu in any cases, which suggests that the nu/shape parameter is being taken into account to compute .value. However, drawsResp$.value != drawsResp$mu^drawsResp$shape, so it's unclear what transformation is getting used to compute .value on the response scale. While these two values are usually quite close, drawsResp$.value < drawsResp$mu^drawsResp$shape in all cases and sometimes the difference can be quite large, a factor of 2 or more.

As such, there's clearly something more to the transformation that I cannot pin down without a deeper understanding of the internal parameterization of the COM-Poisson distribution in brms and/or its interpretation via tidybayes. Given the experimental nature of the distribution, I am a bit leery of blindly trusting output on the assumption that it works and gives the expected/correct result. Also, please note, I am not venturing to make any summaries in my exploration thus far, but am instead working with full posterior distributions. I hope this fuller explanation makes my question clearer.

paul-buerkner commented 4 years ago

brms computes the mean by approximating the infinite sum that is the mean instead of using any simple mean = mode + f(nu) transformation. See function mean_com_poisson in https://github.com/paul-buerkner/brms/blob/master/R/distributions.R

joshua-goldberg commented 4 years ago

Ah, that makes sense. Thanks for clearing up my confusion.

liamkendall commented 2 years ago

Hi Paul, I seem to be having some problems using the conway-maxwell poisson distribution. I feel like it was working a while ago but now i cannot run any model with it that converges. Running it on the fish dataset you provide in your brms_distreg vignette returns the same issues as any model of mine: returning this error and then giving large rhats and tiny ESSs.

" Exception: Exception: Exception: vector[multi] indexing: accessing element out of range. index 52 out of range; expecting index to be between 1 and 51 (in 'model2d2193c5cbd_cb1899a62439dae64d4593250519d3d8' at line 64)"

paul-buerkner commented 2 years ago

Can you please provide a minimial reproducible example of this error with the exact code you are using?

liamkendall commented 2 years ago

Apologises. See below

zinb <- read.csv("https://paul-buerkner.github.io/data/fish.csv")

fit <- brm(count ~ 1, cores=4, iter=1000, data = zinb, family = "com_poisson")

paul-buerkner commented 2 years ago

I don't know what happens with this example for you as it works well for me. Have you tried installing the latest CRAN or github version of brms?

MathsBrens commented 2 years ago

Hi Paul, how are you? Paul, I would like to initially acknowledge the work in which you develop in providing computational routines, as well as the dedication in writing quality scientific articles, contributing to the theoretical and applied statistics scenario. Paul, I am interested in modeling count data in STAN using some specific models, among them COM-Poisson. I found your github link and there I noticed that you made available the codes for implementing this model in STAN, however, I am having difficulties understanding the structure of the model in terms of the likelihood function. I would like to ask you if it would be possible for you to give me the implementation of the likelihood function of the COM-Poisson distribution using the STAN language. I congratulate you once again for the work you have been doing. I look forward to hearing from you! Regards!

paul-buerkner commented 2 years ago

I am unsure what exactly you mean. You can use the make_stancode(..., family = "com_poisson") to get the Stan code.

AliceCade commented 2 years ago

Hi, I am new to STAN and Bayesian modeling, and am trying to fit a brms model with the COM-poisson distribution for a some count data that is under dispersed. However I am struggling with STAN diverging and crashing a lot. Warning message tells me to try to increase iterations and/or choose stronger priors. I have looked at the STAN code using the make_stancode(), but I don't understand all of it, or how exactly the priors are chosen. Can you explain how the priors are chosen in your STAN code, and if it's possible/viable to change them?

data : DataHI.csv

My model is brm(EstAgeInYrs ~ poly(KnwnAgeInYrs,3) + (1|ReaderID), cores = 4, iter = 1000 data = DataHI, family = "com_poisson") I arrived at this model from testing with glmmTMB.

paul-buerkner commented 2 years ago

Please ask brms-related questions on https://discourse.mc-stan.org/

DoktorMike commented 1 year ago

Just a friendly question if this has been formally added yet to brms? If not what is holding it back? If it's just more testing I'd be happy to make a few test models and report back here.

paul-buerkner commented 1 year ago

use brmsfamily("com_poisson"). this is a Mode parameterization which is Experimental and not officially supported.

Michael Green @.***> schrieb am Mo., 13. Feb. 2023, 13:53:

Just a friendly question if this has been formally added yet to brms? If not what is holding it back? If it's just more testing I'd be happy to make a few test models and report back here.

— Reply to this email directly, view it on GitHub https://github.com/paul-buerkner/brms/issues/607#issuecomment-1427890569, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCW2AFZRZI56XK5CVNBVBLWXIVDBANCNFSM4GZGQPIA . You are receiving this because you were mentioned.Message ID: @.***>

gaarbet commented 1 year ago

I have been trying to use the CMP distribution here but I consistently receive this error: " Exception: Exception: Exception: vector[multi] indexing: accessing element out of range. index 52 out of range; expecting index to be between 1 and 51 (in 'model2d2193c5cbd_cb1899a62439dae64d4593250519d3d8' at line 64)". I noticed previously someone else had this same error. I have installed the most recent version of brms but that did not seem to change anything.

I tried the fish data referenced in that previous issue as well. Furthermore, i tried to generate my own data with the make_stancode for the com poisson and I receive this same error. Anything else I could try?

paul-buerkner commented 1 year ago

which brms version are you using? If the latest github doesn't work, please provide a minimal reproducible example.

gaarbet commented 1 year ago

I am using 2.19.6. Here is my data generation process:

library(brms) set.seed(123) n_id <- 200 n_items <- 15 N <- n_id *n_items

random_effect_id <- rnorm(n_id, 0, 1) # ability fixed_effect_items <- rnorm(n_items,0,1) #item content easiness nu <- 1

id <- rep(1:n_id, each = n_items) items <- rep(1:n_items, length.out = N)

lambda <- exp(nu * (random_effect_id[id] + fixed_effect_items[items]))

count <- COMPoissonReg::rcmp(N, lambda, 1) longdata <- data.frame(id = factor(id), items = factor(items), count = count) mod <- brm(count ~ -1 + items + (1 | id), data = longdata, family = "com_poisson")

tpbilton commented 9 months ago

I am using 2.19.6. Here is my data generation process:

library(brms) set.seed(123) n_id <- 200 n_items <- 15 N <- n_id *n_items

random_effect_id <- rnorm(n_id, 0, 1) # ability fixed_effect_items <- rnorm(n_items,0,1) #item content easiness nu <- 1

id <- rep(1:n_id, each = n_items) items <- rep(1:n_items, length.out = N)

lambda <- exp(nu * (random_effect_id[id] + fixed_effect_items[items]))

count <- COMPoissonReg::rcmp(N, lambda, 1) longdata <- data.frame(id = factor(id), items = factor(items), count = count) mod <- brm(count ~ -1 + items + (1 | id), data = longdata, family = "com_poisson")

I tried running this code and got the following error:

> mod <- brm(count ~ -1 + items + (1 | id), data = longdata, family = "com_poisson")
Compiling Stan program...
Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  parser failed badly; maybe try installing the V8 package

SessionInfo:

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_New Zealand.utf8  LC_CTYPE=English_New Zealand.utf8   
[3] LC_MONETARY=English_New Zealand.utf8 LC_NUMERIC=C                        
[5] LC_TIME=English_New Zealand.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] brms_2.20.4 Rcpp_1.0.11

loaded via a namespace (and not attached):
 [1] nlme_3.1-163         matrixStats_1.0.0    xts_0.13.1          
 [4] threejs_0.3.3        rstan_2.32.3         numDeriv_2016.8-1.1 
 [7] tensorA_0.36.2       tools_4.2.1          backports_1.4.1     
[10] utf8_1.2.3           R6_2.5.1             DT_0.29             
[13] colorspace_2.1-0     tidyselect_1.2.0     gridExtra_2.3       
[16] prettyunits_1.1.1    processx_3.8.2       Brobdingnag_1.2-9   
[19] emmeans_1.8.8        curl_5.0.2           compiler_4.2.1      
[22] cli_3.6.1            shinyjs_2.1.0        sandwich_3.0-2      
[25] colourpicker_1.3.0   posterior_1.5.0      scales_1.2.1        
[28] dygraphs_1.1.1.6     checkmate_2.2.0      mvtnorm_1.2-3       
[31] COMPoissonReg_0.8.1  callr_3.7.3          QuickJSR_1.0.7      
[34] stringr_1.5.0        digest_0.6.33        StanHeaders_2.26.28 
[37] base64enc_0.1-3      pkgconfig_2.0.3      htmltools_0.5.6     
[40] fastmap_1.1.1        htmlwidgets_1.6.2    rlang_1.1.1         
[43] shiny_1.7.5          farver_2.1.1         generics_0.1.3      
[46] zoo_1.8-12           jsonlite_1.8.7       crosstalk_1.2.0     
[49] gtools_3.9.4         dplyr_1.1.3          distributional_0.3.2
[52] inline_0.3.19        magrittr_2.0.3       loo_2.6.0           
[55] bayesplot_1.10.0     Matrix_1.6-1         munsell_0.5.0       
[58] fansi_1.0.4          abind_1.4-5          lifecycle_1.0.3     
[61] stringi_1.7.12       multcomp_1.4-25      MASS_7.3-60         
[64] pkgbuild_1.4.2       plyr_1.8.8           grid_4.2.1          
[67] parallel_4.2.1       promises_1.2.1       crayon_1.5.2        
[70] miniUI_0.1.1.1       lattice_0.21-8       splines_4.2.1       
[73] ps_1.7.5             pillar_1.9.0         igraph_1.5.1        
[76] markdown_1.8         estimability_1.4.1   shinystan_2.6.0     
[79] reshape2_1.4.4       codetools_0.2-19     stats4_4.2.1        
[82] rstantools_2.3.1.1   glue_1.6.2           V8_4.4.1            
[85] RcppParallel_5.1.7   vctrs_0.6.3          httpuv_1.6.11       
[88] gtable_0.3.4         ggplot2_3.4.4        mime_0.12           
[91] xtable_1.8-4         coda_0.19-4          later_1.3.1         
[94] survival_3.5-7       tibble_3.2.1         shinythemes_1.2.0   
[97] TH.data_1.1-2        ellipsis_0.3.2       bridgesampling_1.1-2

Any idea what the problem might be? Thanks