rmcelreath / rethinking

Statistical Rethinking course and book package
2.16k stars 605 forks source link

NA handling by glimmer #94

Open pboesu opened 7 years ago

pboesu commented 7 years ago

I'm not entirely sure whether this is a bug or a feature, given the scope of glimmer, but I noticed that glimmer fails cryptically when data contains any NA values.

library(rethinking)
data(UCBadmit)

#introduce missingness
UCBadmit$applications[3:5] <- NA 

# varying intercepts
f3 <- cbind(admit,reject) ~ (1|dept) + applicant.gender + applications
m3 <- glimmer( f3 , UCBadmit , binomial )
# fails with 
# Error in `[[<-.data.frame`(`*tmp*`, pf$yname, value = c(512L, 89L, 353L,  : 
# replacement has 12 rows, data has 11

I've debugged this, and traced it to the default NA handling in the calls to model.matrix and model.frame in xparse_glimmer_formula (https://github.com/rmcelreath/rethinking/blob/master/R/glimmer.R#L60 and https://github.com/rmcelreath/rethinking/blob/master/R/glimmer.R#L64)

model.frame (which is implicitly called by passing data to model.matrix) by default resorts to na.omit, thereby truncating the data.frame, which in turn triggers the error message quoted in the above example when glimmer is building pf$dat (https://github.com/rmcelreath/rethinking/blob/master/R/glimmer.R#L243)

The quick solution to this is to change the global options:

options(na.action = na.pass)
m3 <- glimmer( f3 , UCBadmit , binomial )
options(na.action = na.omit)

which will create the map2stan code and then only trigger an error downstream when executing map2stan

m3 <- map2stan(m3$f, data=m3$d )
#Error in map2stan(m3$f, data = m3$d) : 
# Variable 'applications' has missing values (NA) in:
#Intercept + b_applicant_gendermale * applicant_gendermale[i] + b_applications *      applications + #v_Intercept[dept]
#Either remove cases with NA or declare a distribution to use for imputation.

unless the model code returned from glimmer is amended to handle the imputation, e.g. (excuse the non-sensical imputation prior)

m3impute <- map2stan(
  alist(
    admit ~ dbinom( admit_size , p ),
    logit(p) <- Intercept +
      b_applicant_gendermale*applicant_gendermale +
      b_applications*applications +
      v_Intercept[dept],
    applications ~ dnorm(mu_N,sigma_N),
    mu_N ~ dnorm(0,10),
    sigma_N ~ dcauchy(0,1),
    Intercept ~ dnorm(0,10),
    b_applicant_gendermale ~ dnorm(0,10),
    b_applications ~ dnorm(0,10),
    v_Intercept[dept] ~ dnorm(0,sigma_dept),
    sigma_dept ~ dcauchy(0,2)
  ), data = m3$d)

I appreciate that glimmer is not intended to offer thought-free model conversion, but it would be helpful to either (a) mention the default na.action behaviour in the documentation, (b) have glimmer fail with an informative message (like map2stan does with NAs), or (c) allow to pass through an na.action argument to model.frame.

I'm happy to draft a pull request for either option if you'd let me know which one you'd prefer.

Thanks for the great book & package!

Click to expand sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] rethinking_1.59 rstan_2.15.1 StanHeaders_2.15.0-1 ggplot2_2.2.1 loaded via a namespace (and not attached): [1] Rcpp_0.12.10 codetools_0.2-14 matrixStats_0.52.2 mvtnorm_1.0-6 lattice_0.20-33 assertthat_0.1 [7] MASS_7.3-44 grid_3.3.2 plyr_1.8.4 gtable_0.2.0 stats4_3.3.2 coda_0.19-1 [13] scales_0.4.1 lazyeval_0.2.0 tools_3.3.2 loo_1.1.0 munsell_0.4.3 inline_0.3.14 [19] colorspace_1.3-2 gridExtra_2.2.1 tibble_1.2
rmcelreath commented 7 years ago

Thanks. Seems like the easiest thing is to check for any NAs and then stop with a message.

Allowing glimmer to handle imputation is actually a great idea. So I'll mark this issue as "enhancement" and "bug".