stan-dev / rstanarm

rstanarm R package for Bayesian applied regression modeling
https://mc-stan.org/rstanarm
GNU General Public License v3.0
388 stars 133 forks source link

Passing empty tibble to posterior_predict() when using intercept-only models #491

Closed davidkane9 closed 3 years ago

davidkane9 commented 3 years ago

Summary:

posterior_predict() (and related functions like posterior_epred()) should accept an empty tibble as the newdata argument when working with a model whose only right-hand side variable is the intercept.

Reproducible Steps:

suppressPackageStartupMessages(library(rstanarm))
suppressPackageStartupMessages(library(tidyverse))

fit_1 <- stan_glm(formula = mpg ~ 1,
                  data = mtcars,
                  refresh = 0)

# This works even though we don't "need" newobs to contain a value for cyl.

newobs <- tibble(cyl = 5)

pp <- posterior_predict(fit_1, newdata = newobs)

# This fails even though it should produce the same result as the first example.

newobs <- tibble()

pp <- posterior_predict(fit_1, newdata = newobs)
#> Error in .subset(x, j): invalid subscript type 'list'

Created on 2020-12-09 by the reprex package (v0.3.0)

RStanARM Version: 2.21.1

R Version: 4.0.3

Operating System: MacOS Catalina: 10.15.7

jgabry commented 3 years ago

Thanks for reporting this. Yeah, even though in this case the newdata argument isn't necessary at all I agree that there shouldn't be an error with an empty data frame (tibble or regular data frame).

davidkane9 commented 3 years ago

Thanks for taking a look at this.

the newdata argument isn't necessary

Well, I certainly find it highly desirable for teaching. An intercept-only model is one of the first things I show students with stan_glm() and it would be very nice to have posterior_predict() return a matrix with one column when given an empty tibble, as opposed to the default of one column for every data point, which is an object which new students find very confusing.

jgabry commented 3 years ago

Oh I see what you mean. Forget what I said about it being unnecessary. When there are additional RHS variables the number of observations to predict is specified by the number of rows in newdata, so to be consistent with that I would think we'd allow tibbles with no columns but there would still have to be a positive number of rows. So tibble(.rows = 1), although uglier than tibble(), would be more consistent with the behavior of newdata in general because tibble() gives you 0 rows.

jgabry commented 3 years ago

And the more I think about this the more surprised I am that this hasn’t come up before because intercept only models are definitely a natural place to start teaching. Glad you noticed this!

davidkane9 commented 3 years ago

So tibble(.rows = 1)

That makes sense.

hasn’t come up before

Few teachers are as crazy as I am, teaching stan_glm() in a beginners' statistics class . . . .

jgabry commented 3 years ago

This turned out to be really easy to fix (I just opened PR #492). It was already set up to work properly but one of our checks validating the newdata argument was too strict, resulting in an error with 0 columns when that was unnecessary. If I remove that check when there are 0 columns then this works:

fit <- stan_glm(mpg ~ 1, data = mtcars)
posterior_predict(fit, newdata = tibble::tibble(.rows = 1)) 
posterior_epred(fit, newdata = tibble::tibble(.rows = 1))