rmcelreath / rethinking

Statistical Rethinking course and book package

difference in n_eff of population mean depending on model specification #405

Open cjungerius opened 1 year ago

cjungerius commented 1 year ago

EDIT: after playing around with this and reading up a bit more, I understand now that non-centering (of which I suppose m2 is a clumsy attempt; the same thing happens with transpars> vector[I]:a <<- abar + z*asigma) isn't always good for your sampling, which is the mistaken idea I was laboring under. Feel free to disregard the question below if it seems silly or trivial.


I'll admit upfront that I'm asking this question mostly out of curiosity. I'm not entirely sure it's appropriate to post here, since it doesn't concern a code issue, but I don't know of a more appropriate place, so I figured I'd drop the question here:

I'm using ulam alongside some other Stan frontends (e.g. brms) to model data obtained from multiple participants hierarchically, with observations nested within participants. What I've noticed is that, depending on the model specification, the effective sample size (n_eff) of the population mean can vary widely.

For a reproducible example, I coded up a rudimentary simulation of participants drawn from a population to illustrate my point:

library(rethinking)
library(tidyverse)

set.seed(42)

abar <- 10      # population mean
asigma <- 5     # population sd (rnorm takes a standard deviation)
n <- 100        # number of subjects
n_trials <- 100 # trials per subject (of course)
sigma <- 2      # within-subject/residual sd

as <- rnorm(n, abar, asigma) # draw some participants from the population

# now generate some data
d <- tibble(
  I = rep(1:n, each = n_trials),
  A = rep(as, each = n_trials)
) %>%
  rowwise() %>%
  mutate(Y = rnorm(1, A, sigma)) %>%
  ungroup()

I can model this data in (at least) two equivalent ways:


# abar is explicitly the mean of the a[I] distribution
m1 <- ulam(
  alist(
    Y ~ normal(mu, sigma),
    mu <- a[I],

    a[I] ~ normal(abar,asigma),

    abar ~ normal(0,10),
    asigma ~ exponential(1),

    sigma ~ exponential(1)
  ), data=d, chains=4, iter=2000, cores=4
)

# abar is pulled out of the a[I] distribution, which is now centered on 0
m2 <- ulam(
  alist(
    Y ~ normal(mu, sigma),
    mu <- abar + a[I],

    a[I] ~ normal(0,asigma),

    abar ~ normal(0,10),
    asigma ~ exponential(1),

    sigma ~ exponential(1)
  ), data=d, chains=4, iter=2000, cores=4
)

These two models have the same parameters and arrive at the same estimates for all coefficients, including abar. However, in my testing the n_eff for abar in model 1 is ~8000, while in model 2 it only reaches 15(!). While they reach the same conclusions numerically, I worry about what this implies for the stability of posterior predictive checks, etc.
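For reference, the numbers above are the n_eff column from the precis summaries (precis reports the top-level parameters by default, which includes abar):

precis(m1) # n_eff for abar ~ 8000 in my runs
precis(m2) # n_eff for abar ~ 15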

I'm sure there's something obvious I'm missing, but I'm curious whether there are known reasons for this discrepancy. Generally I'd be happy to just always specify my model as in m1, but for trickier models with multiple varying effects, or non-centered parameterizations, as I understand it I have no choice but to zero-center the effects, which in my experience always leads to low sampling efficiency for these group means. Any advice would be appreciated!
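For completeness, here's a minimal sketch of the fully non-centered version I have in mind (same data d as above; the transpars line follows the ulam idiom quoted in my edit, so treat the exact syntax as approximate):

# fully non-centered: a[I] is deterministic, z holds standardized offsets
m3 <- ulam(
  alist(
    Y ~ normal(mu, sigma),
    mu <- a[I],

    # reconstruct a from standardized offsets
    transpars> vector[I]:a <<- abar + z*asigma,
    vector[I]:z ~ normal(0, 1),

    abar ~ normal(0, 10),
    asigma ~ exponential(1),

    sigma ~ exponential(1)
  ), data=d, chains=4, iter=2000, cores=4
)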

rmcelreath commented 1 year ago

Hi Chris. This is a major topic in my book and course: the centered vs non-centered parameterizations of multilevel priors. In the book, see Chapter 13. In the most recent lectures, Lecture 13. Here's a link to the beginning of the relevant section:

https://www.youtube.com/watch?v=sgqMkZeslxA&list=PLDcUM9US4XdPz-KxHM4XHt7uUVGWWVSus&index=13&t=2432s

Lecture 14 has more content on the topic, including some technical aspects with multivariate priors.

cjungerius commented 1 year ago

Thank you! These are exactly the bits of your lecture series I watched last night, which led to my edit and taught me a valuable lesson about not taking shortcuts 😅. Because of various factors, including brms using the non-centered parameterization as its default, I assumed it was strictly better than the alternative. I understand now that my type of data (many trials, with approximately equal trial counts per condition per participant) might be an example of a case where it is not.

Out of curiosity, is there any intuition for why the non-centered parameterization does worse in such data-dominant cases? I mean to say: I understand the opposite case, where un-nesting allows more efficient exploration of, e.g., Neal's-funnel-type posteriors, but how can un-nesting the parameters make the posterior harder for the MCMC chains to explore? Is it because the chains spend a lot of time in low-probability areas when the exploration of, e.g., a_bar is made independent of the lower-level parameters?
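One symptom I can check directly (a quick diagnostic sketch using rethinking's extract.samples; the expectation in the comments is my guess): since the data pin down each sum abar + a[i] tightly in m2, abar and the offsets should trade off almost perfectly in the posterior, which would explain the tiny n_eff:

post <- extract.samples(m2)
# if abar + a[i] is tightly identified, abar and a[,i] should be
# strongly negatively correlated
cor(post$abar, post$a[,1])
plot(post$abar, post$a[,1])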

I clearly have some more reading to do, so I shall return to the relevant chapters of your book. Thanks again!

rmcelreath commented 1 year ago

Michael Betancourt has a long post about this: https://betanalpha.github.io/assets/case_studies/hierarchical_modeling.html

I didn't review it this morning, but if I remember right, you get an inverted funnel problem when the likelihood is strong and you use the non-centered parameterization.
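To make that geometry concrete (my sketch, not taken from the post): with a strong likelihood each a_i is pinned near some value a_hat, so in the non-centered space z must track (a_hat - abar)/asigma, a curved ridge that narrows as asigma grows:

# the constraint curve a strong likelihood imposes on (asigma, z),
# for illustrative values of a_hat and abar
a_hat <- 12
abar <- 10
asigma_grid <- seq(0.5, 10, length.out = 200)
plot(asigma_grid, (a_hat - abar)/asigma_grid, type = "l",
     xlab = "asigma", ylab = "implied z")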