mjskay / tidybayes

Bayesian analysis + tidy data + geoms (R package)
http://mjskay.github.io/tidybayes
GNU General Public License v3.0
712 stars 59 forks source link

Condition means with interacting random effects #255

Closed jamesrrae closed 4 years ago

jamesrrae commented 4 years ago

I'm unsure if tidybayes is working correctly when pulling draws for interacting random effects. Below is an example.

Here's the set up:


# List required packages
list.of.packages <- c("brms", "tidyverse", "randomizr", "tidybayes")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]

# Install packages if missing
if(length(new.packages)) install.packages(new.packages)

# Generate data
dat <- tibble(
  y = rnorm(n = 100, mean = 0, sd = 1),
  treatment = randomizr::simple_ra(N = 100, 
                           prob_each = c(.22, .22, .22, .22, .12), 
                           conditions = c("treat1", "treat2","treat3","treat4","control")),
  gender = randomizr::simple_ra(N = 100, 
                                prob_each = c(.48, .48, .04), 
                                conditions = c("male", "female","other")))

So say I fit a model with two interacting random effects using brms:

mod1 <- brm(y ~  0 + (1 | treatment) + (1 | gender) + (1 | treatment:gender), data = dat)

Now I want get get the means for treatment across each gender using tidybayes:

mod1 %>%
  tidybayes::spread_draws(r_treatment[condition, ], r_gender[gender, ], `r_treatment:gender`[condition_gender, ]) 

Here's what the output looks like:

Screen Shot 2020-05-28 at 10 29 54 AM

What I'm finding surprising is the r_treatment:gender column. The first row looks right because it's a draw for when condition == control and gender == female. However, the second row doesn't look right to me. While the main effects are the same (condition == control and gender == female), the interaction draw is for when gender == male.

I could definitely be missing something, but my first impression was that this might not be right so I wanted to flag as a potential issue.

mjskay commented 4 years ago

Ah yeah --- the thing about spread_draws is it doesn't know anything about the model structure or the relationship between variables. It's kind of "dumb" and only knows what you tell it explicitly --- it is just doing a join across indices that have the same name. So, because condition_gender is a combined column, it does not have the same name as either the condition or gender column, so spread_draws just gives you all combinations of the levels of those indices. Thus, in this case, you might get things that don't make sense --- the expectation is that you would either filter to relevant rows afterwards or change the spec in spread_draws to give it more information about what columns are related, so it does the join you need.

For an example of the latter, I'd start with the list of variable names to get a sense of the existing structure:

get_variables(mod1)
 [1] "sd_gender__Intercept"                         "sd_treatment__Intercept"                     
 [3] "sd_treatment:gender__Intercept"               "sigma"                                       
 [5] "r_gender[male,Intercept]"                     "r_gender[female,Intercept]"                  
 [7] "r_gender[other,Intercept]"                    "r_treatment[treat1,Intercept]"               
 [9] "r_treatment[treat2,Intercept]"                "r_treatment[treat3,Intercept]"               
[11] "r_treatment[treat4,Intercept]"                "r_treatment[control,Intercept]"              
[13] "r_treatment:gender[control_female,Intercept]" "r_treatment:gender[control_male,Intercept]"  
[15] "r_treatment:gender[treat1_female,Intercept]"  "r_treatment:gender[treat1_male,Intercept]"   
[17] "r_treatment:gender[treat2_female,Intercept]"  "r_treatment:gender[treat2_male,Intercept]"   
[19] "r_treatment:gender[treat2_other,Intercept]"   "r_treatment:gender[treat3_female,Intercept]" 
[21] "r_treatment:gender[treat3_male,Intercept]"    "r_treatment:gender[treat4_female,Intercept]" 
[23] "r_treatment:gender[treat4_male,Intercept]"    "r_treatment:gender[treat4_other,Intercept]"  
[25] "lp__"                                         "accept_stat__"                               
[27] "stepsize__"                                   "treedepth__"                                 
[29] "n_leapfrog__"                                 "divergent__"                                 
[31] "energy__"      

So the terms for the interaction look like r_treatment:gender[treat4_other,Intercept]. spread_draws is just doing string parsing under the hood, so one approach is to change the separator between indices (which by default is commas and spaces) to be commas and underscores: that would allow it to pull out the condition and gender separately from a string like "treat4_other". Then it will do the join correctly because those columns will have the same names across r_treatment, r_gender, and r_treatment:gender:

mod1 %>%
  tidybayes::spread_draws(r_treatment[condition, ], r_gender[gender, ], `r_treatment:gender`[condition, gender, ], sep = "[,_]")
# A tibble: 56,000 x 8
# Groups:   condition, gender [14]
   condition r_treatment .chain .iteration .draw gender r_gender `r_treatment:gender`
   <chr>           <dbl>  <int>      <int> <int> <chr>     <dbl>                <dbl>
 1 control        0.0105      1          1     1 female   0.0882              0.104  
 2 control        0.0105      1          1     1 male    -0.130              -0.137  
 3 control       -0.0928      1          2     2 female   0.286               0.00337
 4 control       -0.0928      1          2     2 male    -0.399               0.0351 
 5 control        0.0394      1          3     3 female  -0.0291              0.517  
 6 control        0.0394      1          3     3 male    -0.124              -0.259  
 7 control       -0.0174      1          4     4 female   0.194              -0.146  
 8 control       -0.0174      1          4     4 male    -0.342              -0.162  
 9 control       -0.0495      1          5     5 female   0.0343             -0.283  
10 control       -0.0495      1          5     5 male    -0.223              -0.0212 
# ... with 55,990 more rows

On the other hand if all you care about is the means conditional on gender and treatment, I would skip munging the coefficients manually entirely, as it tends to error-prone. Instead I would use add_fitted_draws, which uses brms::fitted under the hood to calculate conditional means for you:

dat %>%
  # need nesting() here because not all combinations were in the original data.
  # the other option is to leave out nesting() and put allow_new_levels = TRUE
  # in add_fitted_draws()
  expand(nesting(treatment, gender)) %>%
  add_fitted_draws(mod1)
# A tibble: 56,000 x 7
# Groups:   treatment, gender, .row [14]
   treatment gender  .row .chain .iteration .draw  .value
   <fct>     <fct>  <int>  <int>      <int> <int>   <dbl>
 1 treat1    male       1     NA         NA     1 -0.119 
 2 treat1    male       1     NA         NA     2 -0.307 
 3 treat1    male       1     NA         NA     3 -0.0206
 4 treat1    male       1     NA         NA     4 -0.198 
 5 treat1    male       1     NA         NA     5 -0.187 
 6 treat1    male       1     NA         NA     6 -0.270 
 7 treat1    male       1     NA         NA     7  0.119 
 8 treat1    male       1     NA         NA     8 -0.122 
 9 treat1    male       1     NA         NA     9 -0.533 
10 treat1    male       1     NA         NA    10 -0.296 
# ... with 55,990 more rows
jamesrrae commented 4 years ago

Ah, this makes sense. I also see that add_fitted_draws makes the task a whole lot easier and less error prone.

Just so I know how to scale this solution, if I want the medians conditional on gender and treatment from a model that includes other covariates (like age - see below), I'd just add_fitted_draws on all predictors, and then just group_by the ones I can about before calculating the medians and qi?

# Generate data that now has age included
dat <- tibble(
  y = rnorm(n = 100, mean = 0, sd = 1),
  treatment = randomizr::simple_ra(N = 100, 
                           prob_each = c(.22, .22, .22, .22, .12), 
                           conditions = c("treat1", "treat2","treat3","treat4","control")),
  gender = randomizr::simple_ra(N = 100, 
                                prob_each = c(.48, .48, .04), 
                                conditions = c("male", "female","other")),
  age = randomizr::simple_ra(N = 100, 
                                prob_each = c(.25, .25, .25, .25),
                                conditions = c("very young", "young","old", "very old")))

# Fit model with factors I care about and ones I don't
mod1 <- brm(y ~  0 + (1 | treatment) + (1 | gender) + (1 | treatment:gender) + (1 | age), data = dat)

# Add fitted draws
dat %>%
  expand(nesting(treatment, gender, age)) %>%. # have to include all predictors or get error
  add_fitted_draws(mod1) %>%
  group_by(treatment, gender) %>% # group by predictors I care about
  median_qi(.value)
# A tibble: 10 x 8
# Groups:   treatment [5]
   treatment gender .value .lower .upper .width .point .interval
   <fct>     <fct>   <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
 1 treat1    male   0.170  -0.470  0.767   0.95 median qi       
 2 treat1    female 0.182  -0.357  0.758   0.95 median qi       
 3 treat2    male   0.0267 -0.544  0.573   0.95 median qi       
 4 treat2    female 0.0928 -0.513  0.675   0.95 median qi       
 5 treat3    male   0.258  -0.282  0.806   0.95 median qi       
 6 treat3    female 0.444  -0.142  1.20    0.95 median qi       
 7 treat4    male   0.375  -0.189  1.05    0.95 median qi       
 8 treat4    female 0.278  -0.365  0.902   0.95 median qi       
 9 control   male   0.142  -0.492  0.748   0.95 median qi       
10 control   female 0.233  -0.383  0.919   0.95 median qi 
mjskay commented 4 years ago

At that point you're starting to talk about marginalization I think --- I wrote something about it here: https://htmlpreview.github.io/?https://github.com/mjskay/uncertainty-examples/blob/master/marginal-effects_categorical-predictor.html

Though if you do it using random effects you could also potentially do the marginalization from within brms, depending on what you want. Andrew MacDonald used to have a nice post on the relevant information for doing that but his website seems to be down :/ https://thestudyofthehousehold.me/2018/02/13/2018-02-13-easily-made-fitted-and-predicted-values-made-easy/