various extensions, esp for mixed models

bbolker commented 9 years ago

broom looks really great overall. I have been thinking for a long time about updating my coefplot2 package (on R-forge, but somewhat defunct there), and it looks like broom could serve as the back-end, but I have a bunch of questions and ideas about extensions. Sorry that the following is so long, but I hope I'm starting a useful discussion. Please feel free to ignore these ideas if they're too awkward or can't be implemented without breaking compatibility too badly, but I do think they could be useful things to think about.

Model types

arm::coefplot knows about bugs, glm, lm, and polr objects (showMethods("coefplot"))

coefplot2 uses a coeftab() method to extract coefficient tables: in addition to the classes that coefplot handles, coefplot2 knows about glmmadmb, glmmML, mcmc, MCMCglmm, rjags, merMod, and mer (old lme4). Right now it extracts lower/upper confidence intervals and standard deviations if they're available; it only gets the Wald intervals for merMod objects, leaving the interval/uncertainty estimates out for the random effects parameters. It uses

ctype: (character) one of "quad" (quadratic/Wald), "profile" (likelihood profile), "quantile" (quantiles of posterior density), "HPDinterval" (highest posterior density)

ptype: parameter type: one or more of "fixef" (default: fixed effect parameters), "ranef" (posterior modes of random effects), "vcov" (parameters of random effects variance-covariance matrices)

If some of the interface issues can be sorted out I'd be happy to contribute methods for all of these.

Interface issues

In general there are a lot of different kinds of coefficients one might want to retrieve from a model. I'm going to focus on two issues, parameter types and confidence intervals.

Parameter types

At the moment, the tidy method for merMod objects offers the choice of "fixed" or "random". The "fixed" case makes sense, but the other possibilities get a little bit interesting. My own most common use cases for random-effects parameters is to want to retrieve the standard deviations and correlations of the random effects, not the estimated values of the coefficients for each level of the grouping variable(s). Thus I would implement something like

library("lme4")
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
aa <- as.data.frame(VarCorr(fm1))
aa2 <- aa[1:(nrow(aa)-1),]  ## skip Residual variance
## lots of choices here; lme4:::tnames could be exported if necessary
termnames <- lme4:::tnames(fm1,old=FALSE,prefix=c("sd","cor"))
data.frame(terms,estimate=aa2[,"sdcor"],std.error=NA,statistic=NA)
##
##                    terms    estimate std.error statistic
##      Subject.(Intercept) 24.74044761        NA        NA
##             Subject.Days  5.92213324        NA        NA
## Subject.(Intercept).Days  0.06555134        NA        NA

There are more decisions/possible user inputs here:

is residual variance estimate included (if it exists -- it doesn't for most GLMMs)?
there are three possible parameterizations: variance/covariance, standard deviation/correlation, or Cholesky decomposition scale
it would be nice to allow the user to specify a vector of parameter types to extract (although this will obviously make it harder to check the types -- match.arg() won't work any more
some users might also want to get conditional modes/BLUPs
it would sure be nice if the interface for coef() was more uniform across packages and model types,

Another useful place to look is at the coef() method for hurdle and zeroinfl models from the pscl package: from ?predict.hurdle,

coef(object, model = c("full", "count", "zero"), ...) (where these refer to parameters determining the probability of a structural zero, the expected number of counts in non-structural-zero cases, or both)

I guess the question is how much can be passed through tidy as optional arguments to coef vs. trying to give the tidy methods as unified interface ...

Confidence intervals etc.

While many model types have well-defined and useful standard errors, others don't. In particular the Wald standard errors for GLMs can be very unreliable; profile confidence intervals are more reliable. The same is often true for random effects parameters. It would be nice to have the option to incorporate lower and upper confidence intervals in a tidy data frame, although it's a little hard to know how to get this -- would you pass a confidence-interval data frame or a likelihood profile? (For most models including glm objects it's easy and lightweight to recompute the profile confidence intervals via confint(), but for merMod objects this can be an expensive operation ...)

One thing to think about here is that the default column names that R uses for confidence intervals (2.5 %, 97.5 %) are a nuisance -- I usually use lwr and upr, although these don't specify the level -- maybe lwr_0.025, upr_0.975?

There are sometimes additional decisions: for Bayesian models, should highest posterior density (coda::HPDinterval) or quantiles of the marginal posterior be used?

Wish lists

Some things that I've wanted coefplot2 to be able to do:

automatically rescale parameters, i.e. take a model that was fitted without centering and scaling parameters and adjust the parameters and SEs accordingly. This is fairly straightforward if the means and standard deviations of the original predictor variables are known (so is the inverse transformation). arm::coefplot does this, I think, but only by re-fitting the model, which is an unnecessary expense.
automatically change scale of parameter/confidence intervals; for example, Wald estimates of variance parameters are much better on the log scale, and this conversion is easy [i.e. if the std dev estimate is b and its standard error is s, then the confidence intervals based on a log scale are exp(log(b) +/- 1.96*s/b]

bbolker commented 9 years ago

bump?

dgrtwo commented 9 years ago

Sorry, I've been unavailable! This work and ideas are terrific and as soon as I can in the next two days I'm going to do them justice. I was hoping for exactly this kind of exploration of models from someone.

ashander commented 9 years ago

Great topic. I wanted to take a look at integrating coefplot2 capabilities to extend broom to get (at least) Wald CIs. For full integration there is a license issue -- both coefplot2 and arm are GPL-equivalent as far as I can tell. Importing coefplot2::coeftab doesn't seem wise as coefplot2 not on cran..

Thoughts?

bbolker commented 9 years ago

If broom's interface can be extended appropriately, I have no problem ditching coefplot2 entirely (if one is willing to use ggplot, 99% of the effort of producing the coefficient plot is in tidying the coefficient tables; if desired, it would be pretty easy to write a basic-plot plot method for broom objects). I also don't have a problem copying code over, as long as I can make sure it's code I wrote myself (and hence have the right to re-license) as opposed to copied from arm. Frankly, though, nothing in arm::coefplot or coefplot2::coefplot is that hard to replicate.

Alternatively, if broom does all of the back-end stuff that coefplot2::coeftab does, I might gut coefplot2 and make it a thin layer on top of broom.

Can you give a more specific use case?

dgrtwo commented 9 years ago

I apologize for the late response, I have been distracted (mostly by finishing my PhD thesis). Your ideas and potential contributions could be one of the most important things to happen to broom and I'm very thankful!

Some responses:

Model types

broom can handle lm, glm, and merMod, but not bugs, polr, mcmc, glmmadmb, glmmML, mcmc, MCMCglmm, or rjags. I would love for these to be added to broom and will do everything to help. (Of course you would get co-authorship on the package).

Interface issues

parameter types

At the moment, the tidy method for merMod objects offers the choice of "fixed" or "random". The "fixed" case makes sense, but the other possibilities get a little bit interesting. My own most common use cases for random-effects parameters is to want to retrieve the standard deviations and correlations of the random effects, not the estimated values of the coefficients for each level of the grouping variable(s).

I think I thought about this when I was looking into tidy.merMod output, but this is the area of modeling I am least familiar with. It sounds to me like there are two "modes" of tidying: tidying the per-level estimates, or tidying the standard deviations and correlations of random effects. Perhaps a mode argument- I'm fine with the latter being the default.

If you opened up an issue or a fork we could talk about this one more.

confidence intervals etc

While many model types have well-defined and useful standard errors, others don't. In particular the Wald standard errors for GLMs can be very unreliable; profile confidence intervals are more reliable. The same is often true for random effects parameters. It would be nice to have the option to incorporate lower and upper confidence intervals in a tidy data frame, although it's a little hard to know how to get this -- would you pass a confidence-interval data frame or a likelihood profile? (For most models including glm objects it's easy and lightweight to recompute the profile confidence intervals via confint(), but for merMod objects this can be an expensive operation ...)

One thing to think about here is that the default column names that R uses for confidence intervals (2.5 %, 97.5 %) are a nuisance -- I usually use lwr and upr, although these don't specify the level -- maybe lwr_0.025, upr_0.975?

Right now a number of tidying methods take an argument conf.int that optionally returns the confidence interval, which then leads to conf.low and conf.high columns being added.

In my view we actually shouldn't include the confidence level in the column names. Instead, if you stored multiple confidence levels in the same object, there should be a third column: conf.level- that contains the level. That would allow, for example, a ggplotted curve to have both a "thin" confidence interval and a "fat" one.

I would leave the conf.level to be created by the user, though, in the case that she is combining multiple confidence levels. By convention I avoid having broom return values that describe what was passed to the model rather than what is returned (for example, glance.glm does not return a column with family).

There are sometimes additional decisions: for Bayesian models, should highest posterior density (coda::HPDinterval) or quantiles of the marginal posterior be used?

That's a good question- probably set as an option to the tidy function.

Wish Lists

automatically rescale parameters, i.e. take a model that was fitted without centering and scaling parameters and adjust the parameters and SEs accordingly. This is fairly straightforward if the means and standard deviations of the original predictor variables are known (so is the inverse transformation). arm::coefplot does this, I think, but only by re-fitting the model, which is an unnecessary expense.

Interesting- I'm unfamiliar with this process. Could a rescale_model function work directly on the tidied output and the original data? How much would it need to know about the model?

automatically change scale of parameter/confidence intervals; for example, Wald estimates of variance parameters are much better on the log scale, and this conversion is easy [i.e. if the std dev estimate is b and its standard error is s, then the confidence intervals based on a log scale are exp(log(b) +/- 1.96*s/b]

This sounds like it works well on tidy output. We already give this option (exponentiate = TRUE) for tidy.lm and a few other tidying methods. It could be an option for any tidy. method.

Alternatively, if broom does all of the back-end stuff that coefplot2::coeftab does, I might gut coefplot2 and make it a thin layer on top of broom.

I think this sounds like a great idea. I've been interested in general methods for plotting standard tidy outputs and could try to contribute my own as well.

ashander commented 9 years ago

@bbolker just thinking of a subset of your use cases (getting confidence intervals in the output of tidy with effects="fixed" for easy plotting). The point you raise with the current output of tidy for random effects is a good one. Ultimately it might make sense for tidy.merMod and other mixed models to tidy both fixed and random into one data.frame, or have the option to output only fixed or only random.

I also didn't mean to that arm::coeftab or coefplot2::coeftab would be super hard to reproduce, but noting that actually integrating their code might raise a license issue. A quick and dirty approach would be to import the methods and use them within tidy methods for the object types that work with the two coefplot. For long term, this isn't desirable. For coefplot2 because it isn't on CRAN, and more generally for maintainability. It might be a good strategy though, for developing tests against which broom-specific methods can be written. These could also take the conf.int arg mentioned.

Both your notions -- of writing plot methods for broom objects, or making coeplot2 a thin wrapper if broom can handle all tidying -- seem great. The strength of broom is its clear goal and well defined outputs (data.frame) but lightweight plotting seems like a natural addition.

bbolker commented 9 years ago

haven't had a chance to absorb everything here but just wanted to point out that I have started to tackle some of it on my own broom fork; will presumably submit a pull request eventually, but feel free to drop in and comment on what I'm trying there.

RMHogervorst commented 7 years ago

I would still like the extensions for polr etc.

two questions:

Is there a manual or something to help extending broom?
Since the output of summary ( polr ) looks just like normal models, is there a way to force broom to use it default lm method?

bbolker commented 7 years ago

It's actually reasonably easy to get tidy.lm working with polr objects - onejust has to define a family function for polr objects, then the rest of the machinery works.

library(MASS)
library(broom)

family.polr <- function(object,...) NULL
tidy.polr <- broom:::tidy.lm

house.plr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
tidy(house.plr)
##            term   estimate std.error statistic
## 1    InflMedium  0.5663937 0.1046528  5.412123
## 2      InflHigh  1.2888191 0.1271561 10.135720
## 3 TypeApartment -0.5723501 0.1192380 -4.800064
## 4    TypeAtrium -0.3661866 0.1551733 -2.359855
## 5   TypeTerrace -1.0910149 0.1514860 -7.202083
## 6      ContHigh  0.3602841 0.0955358  3.771195
## 7    Low|Medium -0.4961353 0.1248472 -3.973939
## 8   Medium|High  0.6907083 0.1254719  5.504883

alexpghayes commented 6 years ago

@bbolker Are there lingering issues in this thread or is it safe to close?

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

tidymodels / broom