Updates to rstanarm tidiers

benmarwick commented 8 years ago

@jgabry I've been using rstanarm::stan_lm a bit using tidyverse methods, and it seems like we could make broom more useful for rstanarm users.

I wonder if it would be possible to update along these line:

the tidy method to include 5% and 95% estimate values (since we have no p-value in the stan_lm output, these estimate values are useful to see if the distribution includes zero or not, and a rough equivalent to decide if the predictor is significant or not)
the glance method to include the R2 value for a model. To help those weaning themselves from lm() ;)
the augment method to get the pointwise output from loo::loo so we can have expected log pointwise predictive density (and standard error), estimated effective number of parameters and standard error, the LOO information criterion (and standard error) for each data point. So we can have a rough analogue for Cook's Distance to measure the influence of individual data points.

What do you think?

I've posted a question to Stackoverflow about getting influence measures for individual points using rstanarm. Since posting that Q and looking into the literature a bit, I reckon that an update to the augment method in broom would be an ideal solution to this problem, and maybe improve the accessibility of rstanarm to others using the tidyverse.

jgabry commented 8 years ago

Hey Ben, yeah there's definitely room for improvement. With the initial implementation I just wanted to get it working and provide some minimal useful functionality, but there's much more we can do. It's really helpful to get feedback like this because otherwise we don't know how people are using it and what would be helpful to add. So thanks!

A few comments/questions in response to your particular points:

the tidy method to include 5% and 95% estimate values (since we have no p-value in the stan_lm output, these estimate values are useful to see if the distribution includes zero or not, and a rough equivalent to decide if the predictor is significant or not)

Setting intervals=TRUE and prob=p when you call tidy should give you the lower and upper bounds of the p% posterior interval. So for a 90% interval (lower=5%, upper=95%) you would do

tidy(fit, intervals = TRUE, prob = 0.9)

Is that what you're looking for or are you referring to something different?

the glance method to include the R2 value for a model. To help those weaning themselves from lm() ;)

In the stan_lm model we get a posterior distribution for R^2 and I don't think it would be good practice to only report a point estimate in glance. This might be different from how glance for lm works, but glance for lm spews out point estimates of all sorts of junk that are just begging to be over-interpreted. That said, the tidy method currently does not give you a summary of the posterior for R^2 the way it does for other parameters, so I do think we should allow getting the estimate and p% posterior interval for R^2 from tidy, but not glance.

Currently tidy.stanreg accepts a parameters argument that can be one of "non-varying", "varying", or "hierarchical". How about we add an option like "auxiliary" or "aux" that returns summaries of parameters like R^2 for stan_lm, the shape parameter for stan_glm when family = "gamma", etc?

the augment method to get the pointwise output from loo::loo so we can have expected log pointwise predictive density (and standard error), estimated effective number of parameters and standard error, the LOO information criterion (and standard error) for each data point.

This is a good idea and shouldn't be hard to implement.

Thanks again for the feedback.

benmarwick commented 8 years ago

Hi Jonah,

Thanks for your detailed reply, I must have glossed over the option to use intervals = TRUE, prob = 0.9 when reading the docs, sorry about that! And thanks for pointing them out so clearly.

Your suggestion for R2, sounds great, much better that what I proposed. The idea of an aux argument that gives a summary of the posterior for R2 would be wonderful.

Glad that my loo idea makes sense, thanks for being open to that! Which of those pointwise variables is the most useful for spotting data points with outside influence on the model?

Looking at this example, it seems that both elpd_loo and looic are quite informative, I'm not really sure what's the preferred one.

# get data:
require(foreign)
require(MASS)
cdata <- read.dta("http://www.ats.ucla.edu/stat/data/crime.dta")
summary(cdata)

# generate model
summary(ols <- lm(crime ~ poverty + single, data = cdata))

# explore diagnostics
library(ggfortify)
autoplot(ols, 
         which = 1:6, 
         ncol = 2, 
         label.size = 3) 
# high influence points are: 9, 25, 51

ols

# simple Bayesian model
require(rstanarm)
summary(bols <- stan_lm(crime ~ poverty + single, 
                        data = cdata,
                        prior = NULL, 
                        chains = 1, 
                        cores = 2, 
                        seed = 1))

# explore diagnostics for Bayesian model
require(loo)
loo_out <- loo(bols, k_threshold = 0.7)
loo_out_df <- data.frame(loo_out$pointwise)

library(ggplot2)
n <-  3 # how many points to label
ggplot(loo_out_df, 
       aes(elpd_loo,
           looic)) +
  geom_point(aes(size = p_loo)) +
  geom_text(data = loo_out_df[order(loo_out_df$elpd_loo)[1:n],] ,
            aes(label = row.names(loo_out_df[order(loo_out_df$elpd_loo)[1:n],])),
                nudge_x = 0.35) +
  xlab("Expected log pointwise predictive density") +
  ylab("LOO information criterion") +
  scale_size_area(name = "Estimated \neffective\nnumber \nof \nparameters")

loo

jgabry commented 8 years ago

On Monday, September 19, 2016, Ben Marwick notifications@github.com wrote:

Hi Jonah,

Thanks for your detailed reply, I must have glossed over the option to use intervals = TRUE, prob = 0.9 when reading the docs, sorry about that! And thanks for pointing them out so clearly.

Cool, no problem. Glad that works.

Your suggestion for R2, sounds great, much better that what I proposed. The idea of an aux argument that gives a summary of the posterior for R2 would be wonderful.

Ok, that's now officially on the to-do list. Hopefully I can make a pull request for that soon. In the meantime you can of course get that info in various way using the functions in the rstanarm package (summary, posterior_interval, etc.).

Glad that my loo idea makes sense, thanks for being open to that! Which of those pointwise variables is the most useful for spotting data points with outside influence on the model?

Looking at this example, it seems that both elpd_loo and looic are quite informative, I'm not really sure what's the preferred one.

Those contain exactly the same information. They just differ by a constant factor of -2, that is, looic is elpd_loo converted to the deviance scale. Multiplying by -2 adds no extra information but people are used to it for historical reasons so we report both in the loo package.

In terms of identifying influential data points, one thing you can do is look at the estimates of the Pareto shape parameter k that are stored in the object returned by loo (and can also be plotted by calling plot on that object). We discuss the interpretation of k in theory and in practice in our forthcoming paper (preprint available here https://arxiv.org/abs/1507.04544).

Roughly speaking, if k_i > 0.7 or so (in practice, 0.5 in theory) then the full posterior distribution does not well approximate the leave one out posterior for the ith data point, so the ith data point can in that sense be considered influential.

Hope that helps,

Jonah

benmarwick commented 8 years ago

Thank you very much! Your explanations are very helpful, I'm a beginner with these methods. and you've really helped me understand a lot here. I will look into the Pareto shape parameter k. Thanks again!

jgabry commented 8 years ago

No problem. These are good questions! And feel free to follow up, I guess here if it relates to rstanarm functionality in broom or over at the Stan users Google group (or rstanarm or loo GitHub repos) if it relates to rstanarm itself or loo or general modeling questions.

Jonah

On Monday, September 19, 2016, Ben Marwick notifications@github.com wrote:

Thank you very much! Your explanations are very helpful, I'm a beginner with these methods. and you've really helped me understand a lot here. I will look into the Pareto shape parameter k. Thanks again!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dgrtwo/broom/issues/162#issuecomment-248151553, or mute the thread https://github.com/notifications/unsubscribe-auth/AHb4QybH-4jdnXaP3DGkrYWKhXde2PK6ks5qrxFmgaJpZM4KAyjm .

Emaasit commented 7 years ago

I am just seeing this. This is exactly what I was looking for. Thanks @benmarwick for pointing me here.

jgabry commented 7 years ago

Hi Daniel, are you referring to the proposed additions to the current broom package functionality for rstanarm models or to the info about elpd, looic, and k, from the loo package?

It's helpful to know precisely what features users are looking for and also which topics we need to document better in our packages (and papers).

Thanks!

Jonah

On Sunday, September 25, 2016, Daniel Emaasit (PhD Student) < notifications@github.com> wrote:

I am just seeing this. This is exactly what I was looking for. Thanks @benmarwick https://github.com/benmarwick for pointing me here.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dgrtwo/broom/issues/162#issuecomment-249455925, or mute the thread https://github.com/notifications/unsubscribe-auth/AHb4Q3L8IinWEh73-Ua96Iqq_nzFm_PSks5qtw5ygaJpZM4KAyjm .

jgabry commented 7 years ago

@benmarwick Sorry for the delay, but finally got a PR submitted (#177) that let's you access parameters like R2.

Emaasit commented 7 years ago

@jgabry Could this additonal functionality in PR https://github.com/tidyverse/broom/pull/177 be implemented for Stanfit objects too? For example, I would like to access these parameters in the figure below in a tidy dataframe.

stanfit-broom

jgabry commented 7 years ago

Hmm, I don' think so because rstanarm knows exactly what all these parameters are called depending on the user's stanreg object whereas rstan doesn't know which parameter names correspond to these special parameters (if they exist) in a stanfit object.

For example, when writing your own Stan program you could call the residual sd in a linear regression "tomato" if you really wanted to and rstan won't care. But rstan also won't know that the parameter "tomato" represents the residual sd. It doesn't even "know" that the model is a linear regression model. It doesn't need to know. But in rstanarm the residual sd in a linear regression is always just "sigma", so it can easily be extracted from any linear regression model.

That said, for regular stanfit objects these special parameters aren't really special, they're just regular parameters. So you should already be able to select them by name using the "pars" argument to the tidy method for stanfit objects. In the example above I suppose you would do tidy(stanfit, pars = "tomato").

On Mon, Dec 12, 2016 at 4:07 PM Daniel Emaasit (PhD Student) < notifications@github.com> wrote:

@jgabry https://github.com/jgabry Could this additonal functionality in PR #177 https://github.com/tidyverse/broom/pull/177 be implemented for Stanfit objects too?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tidyverse/broom/issues/162#issuecomment-266553091, or mute the thread https://github.com/notifications/unsubscribe-auth/AHb4Qxv-wVHeILs4quoVFC3GjCutvaRQks5rHbergaJpZM4KAyjm .

jgabry commented 7 years ago

@Emaasit, actually I hadn't seen the image you included so I may have misunderstood your question. I think you should be able to get a lot of that output already using the tidy method for stanfit objects. I didn't do the initial implementation of tidy.stanfit (otherwise I never would have named an argument conf.int!) but it looks like the arguments it takes should be enough to mostly recreate the print output from rstan. If not, then can you be more specific about what's missing?

Emaasit commented 7 years ago

@jgabry Initially when I did tidy(my_stanfit_object), I only got three variables in the resulting dataframe. But I just found out that has more arguments to it. ?broom::tidy.stanfit.

## S3 method for class 'stanfit'
tidy(x, pars, estimate.method = "mean", conf.int = FALSE,
  conf.level = 0.95, conf.method = "quantile", droppars = "lp__",
  rhat = FALSE, ess = FALSE, ...)

bbolker commented 6 years ago

can this be closed now? Ideally, someone would check the development version of broom.mixed and see if all of the relevant material made it over.

FWIW/in my defense, I think I implemented tidy.stanfit, and I didn't invent the conf.int argument - I carried it over from elsewhere in broom (I don't like it either!)

alexpghayes commented 6 years ago

Yes. We're currently doing a bit of maintenance catchup first but the plan for the summer is to sort out the big picture of where tidiers should live.

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

tidymodels / broom

Updates to rstanarm tidiers #162