mjskay / tidybayes

Bayesian analysis + tidy data + geoms (R package)
http://mjskay.github.io/tidybayes
GNU General Public License v3.0
710 stars 59 forks source link

Minor inconsistency in tidybayes::add_epred_draws() #292

Closed isabellaghement closed 2 years ago

isabellaghement commented 2 years ago

Hi Matthew,

There seems to be an inconsistency in how tidybayes::add_epred_draws() works for a mixed effects model with nested random grouping factors. I found some data on the Internet which I will use to illustrate the problem: classes nested in schools. In this context, I am primarily interested in inference on the effect of a school-level variable.

As per https://lme4.r-forge.r-project.org/slides/2009-07-01-Lausanne/3SimpleD.pdf:

"The classroom data are a cross-section of students within classes within schools. The mathgain variable is the difference in mathematics achievement scores in grade 1 and kindergarten."

### import classroom data in R 
classroom <- read.csv("http://www-personal.umich.edu/~bwest/classroom.csv")
classroom <- dplyr::arrange(classroom, classid, schoolid)

### examine structure of classroom dataset 
str(classroom)

### convert classid and schoolid to factors (as they will be treated as random grouping factors) 
classroom$classid <- factor(classroom$classid)
classroom$schoolid <- factor(classroom$schoolid)

### need a binary school-level variable; focus will be on conducting inference on the effect of this variable
range(classroom$housepov)
classroom$housepov <- ifelse(classroom$housepov > 0.250, 1, 0)
table(classroom$housepov)

### fit brms model 
model <- brm(mathgain ~ 1 + housepov + (1|schoolid/classid),
            data = classroom)

### compare expected values of mathgain response 
### for "typical" school where housepov = 1 versus 
###     "typical" school where housepov = 0
### ignoring the within-school variability between classes 

epred_typical_school <-  tidybayes::add_epred_draws( 
  model, 
  newdata = data.frame(housepov = c(0, 1), 
                       schoolid = NA, 
                       classid = NA), 
  re_formula = ~ (1|schoolid))   

epred_typical_school 

### compare expected values of mathgain response  
### for "typical" class within typical school with housepov = 1 versus 
###     "typical" class within typical school with housepov = 0 
### accounting for the within-school variability between classes 

epred_typical_class_typical_school <-  tidybayes::add_epred_draws( 
  model, 
  newdata = data.frame(housepov = c(0, 1), 
                       schoolid = NA, 
                       classid = NA),
  re_formula = ~ (1|schoolid/classid))   

The inconsistency that I get comes from the re_formula being specified as re_formula = ~ (1|schoolid/classid) when both schoolid and classid are specified as NA in newdata. This throws R off and yields the following error message:

_Error: Levels 'NA_NA' of grouping factor 'schoolid:classid' cannot be found in the fitted model. Consider setting argument 'allow_newlevels' to TRUE.

It seems to me that, when schoolid and classid are both set to NA in newdata:

R should NOT complain when encountering re_formula = ~ (1|schoolid/classid) if it does NOT complain when encountering re_formula = ~ (1|schoolid)

OR

R should complain when encountering re_formula = ~ (1|schoolid/classid) but should also complain when encountering re_formula = ~ (1|schoolid)

Can you please look into this inconsistency for me and see if it's something that can be (hopefully easily) fixed to avoid confusing the users of tidybayes?

I haven't checked if this consistency is also an issue for brms::posterior_epred().

Thanks very much!

Isabella

mjskay commented 2 years ago

Thanks Isabella --- this looks like an issue in brms::posterior_epred(). If you change your call to brms::posterior_epred() the same error occurs:

epred_typical_class_typical_school <-  brms::posterior_epred( 
  model, 
  newdata = data.frame(housepov = c(0, 1), 
                       schoolid = NA, 
                       classid = NA),
  re_formula = ~ (1|schoolid/classid))  
## Error: Levels 'NA_NA' of grouping factor 'schoolid:classid' cannot be found in the fitted model. 
## Consider setting argument 'allow_new_levels' to TRUE.

If you file an issue on the brms repository they should be better able to help. Currently add_epred_draws() doesn't modify anything related to re_formula or random effects; it just passes that stuff down to posterior_epred() and relies on its behavior.

isabellaghement commented 2 years ago

Thanks very much for your prompt response, Matthew! It's good to know what is going on - I will file this as an issue on the brms repository as you advised. If you leave this issue open for now, I will report back here to confirm if Paul was able to fix this inconsistency.

Isabella

P.S. The brms issue is now filed here:

https://discourse.mc-stan.org/t/posterior-epred-is-inconsistent-in-how-it-handles-re-formula-when-random-grouping-factors-are-set-to-na-in-newdata/25421

mjskay commented 2 years ago

Looks like this is fixed now, closing.