Closed bbolker closed 6 years ago
bump?
Sorry, I've been unavailable! This work and ideas are terrific and as soon as I can in the next two days I'm going to do them justice. I was hoping for exactly this kind of exploration of models from someone.
Great topic. I wanted to take a look at integrating coefplot2
capabilities to extend broom
to get (at least) Wald CIs. For full integration there is a license issue -- both coefplot2
and arm
are GPL-equivalent as far as I can tell. Importing coefplot2::coeftab
doesn't seem wise as coefplot2
not on cran..
Thoughts?
If broom
's interface can be extended appropriately, I have no problem ditching coefplot2
entirely (if one is willing to use ggplot
, 99% of the effort of producing the coefficient plot is in tidying the coefficient tables; if desired, it would be pretty easy to write a basic-plot plot
method for broom
objects). I also don't have a problem copying code over, as long as I can make sure it's code I wrote myself (and hence have the right to re-license) as opposed to copied from arm
. Frankly, though, nothing in arm::coefplot
or coefplot2::coefplot
is that hard to replicate.
Alternatively, if broom
does all of the back-end stuff that coefplot2::coeftab
does, I might gut coefplot2
and make it a thin layer on top of broom
.
Can you give a more specific use case?
I apologize for the late response, I have been distracted (mostly by finishing my PhD thesis). Your ideas and potential contributions could be one of the most important things to happen to broom and I'm very thankful!
Some responses:
broom
can handle lm
, glm
, and merMod
, but not bugs
, polr
, mcmc
, glmmadmb
, glmmML
, mcmc
, MCMCglmm
, or rjags
. I would love for these to be added to broom and will do everything to help. (Of course you would get co-authorship on the package).
At the moment, the tidy method for merMod objects offers the choice of "fixed" or "random". The "fixed" case makes sense, but the other possibilities get a little bit interesting. My own most common use cases for random-effects parameters is to want to retrieve the standard deviations and correlations of the random effects, not the estimated values of the coefficients for each level of the grouping variable(s).
I think I thought about this when I was looking into tidy.merMod
output, but this is the area of modeling I am least familiar with. It sounds to me like there are two "modes" of tidying: tidying the per-level estimates, or tidying the standard deviations and correlations of random effects. Perhaps a mode
argument- I'm fine with the latter being the default.
If you opened up an issue or a fork we could talk about this one more.
While many model types have well-defined and useful standard errors, others don't. In particular the Wald standard errors for GLMs can be very unreliable; profile confidence intervals are more reliable. The same is often true for random effects parameters. It would be nice to have the option to incorporate lower and upper confidence intervals in a tidy data frame, although it's a little hard to know how to get this -- would you pass a confidence-interval data frame or a likelihood profile? (For most models including glm objects it's easy and lightweight to recompute the profile confidence intervals via confint(), but for merMod objects this can be an expensive operation ...)
One thing to think about here is that the default column names that R uses for confidence intervals (2.5 %, 97.5 %) are a nuisance -- I usually use lwr and upr, although these don't specify the level -- maybe lwr_0.025, upr_0.975?
Right now a number of tidying methods take an argument conf.int
that optionally returns the confidence interval, which then leads to conf.low
and conf.high
columns being added.
In my view we actually shouldn't include the confidence level in the column names. Instead, if you stored multiple confidence levels in the same object, there should be a third column: conf.level
- that contains the level. That would allow, for example, a ggplotted curve to have both a "thin" confidence interval and a "fat" one.
I would leave the conf.level
to be created by the user, though, in the case that she is combining multiple confidence levels. By convention I avoid having broom return values that describe what was passed to the model rather than what is returned (for example, glance.glm
does not return a column with family
).
There are sometimes additional decisions: for Bayesian models, should highest posterior density (coda::HPDinterval) or quantiles of the marginal posterior be used?
That's a good question- probably set as an option to the tidy
function.
- automatically rescale parameters, i.e. take a model that was fitted without centering and scaling parameters and adjust the parameters and SEs accordingly. This is fairly straightforward if the means and standard deviations of the original predictor variables are known (so is the inverse transformation). arm::coefplot does this, I think, but only by re-fitting the model, which is an unnecessary expense.
Interesting- I'm unfamiliar with this process. Could a rescale_model
function work directly on the tidied output and the original data? How much would it need to know about the model?
automatically change scale of parameter/confidence intervals; for example, Wald estimates of variance parameters are much better on the log scale, and this conversion is easy [i.e. if the std dev estimate is b and its standard error is s, then the confidence intervals based on a log scale are exp(log(b) +/- 1.96*s/b]
This sounds like it works well on tidy output. We already give this option (exponentiate = TRUE
) for tidy.lm
and a few other tidying methods. It could be an option for any tidy.
method.
Alternatively, if broom does all of the back-end stuff that coefplot2::coeftab does, I might gut coefplot2 and make it a thin layer on top of broom.
I think this sounds like a great idea. I've been interested in general methods for plotting standard tidy
outputs and could try to contribute my own as well.
@bbolker just thinking of a subset of your use cases (getting confidence intervals in the output of tidy
with effects="fixed"
for easy plotting). The point you raise with the current output of tidy
for random effects is a good one. Ultimately it might make sense for tidy.merMod
and other mixed models to tidy both fixed and random into one data.frame, or have the option to output only fixed or only random.
I also didn't mean to that arm::coeftab
or coefplot2::coeftab
would be super hard to reproduce, but noting that actually integrating their code might raise a license issue. A quick and dirty approach would be to import the methods and use them within tidy methods for the object types that work with the two coefplot
. For long term, this isn't desirable. For coefplot2
because it isn't on CRAN, and more generally for maintainability. It might be a good strategy though, for developing tests against which broom-specific methods can be written. These could also take the conf.int
arg mentioned.
Both your notions -- of writing plot methods for broom objects, or making coeplot2 a thin wrapper if broom can handle all tidying -- seem great. The strength of broom
is its clear goal and well defined outputs (data.frame
) but lightweight plotting seems like a natural addition.
haven't had a chance to absorb everything here but just wanted to point out that I have started to tackle some of it on my own broom fork; will presumably submit a pull request eventually, but feel free to drop in and comment on what I'm trying there.
I would still like the extensions for polr etc.
two questions:
It's actually reasonably easy to get tidy.lm
working with polr
objects - onejust has to define a family
function for polr
objects, then the rest of the machinery works.
library(MASS)
library(broom)
family.polr <- function(object,...) NULL
tidy.polr <- broom:::tidy.lm
house.plr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
tidy(house.plr)
## term estimate std.error statistic
## 1 InflMedium 0.5663937 0.1046528 5.412123
## 2 InflHigh 1.2888191 0.1271561 10.135720
## 3 TypeApartment -0.5723501 0.1192380 -4.800064
## 4 TypeAtrium -0.3661866 0.1551733 -2.359855
## 5 TypeTerrace -1.0910149 0.1514860 -7.202083
## 6 ContHigh 0.3602841 0.0955358 3.771195
## 7 Low|Medium -0.4961353 0.1248472 -3.973939
## 8 Medium|High 0.6907083 0.1254719 5.504883
@bbolker Are there lingering issues in this thread or is it safe to close?
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
broom
looks really great overall. I have been thinking for a long time about updating mycoefplot2
package (on R-forge, but somewhat defunct there), and it looks likebroom
could serve as the back-end, but I have a bunch of questions and ideas about extensions. Sorry that the following is so long, but I hope I'm starting a useful discussion. Please feel free to ignore these ideas if they're too awkward or can't be implemented without breaking compatibility too badly, but I do think they could be useful things to think about.Model types
arm::coefplot
knows aboutbugs
,glm
,lm
, andpolr
objects (showMethods("coefplot")
)coefplot2
uses acoeftab()
method to extract coefficient tables: in addition to the classes thatcoefplot
handles,coefplot2
knows aboutglmmadmb
,glmmML
,mcmc
,MCMCglmm
,rjags
,merMod
, andmer
(oldlme4
). Right now it extracts lower/upper confidence intervals and standard deviations if they're available; it only gets the Wald intervals formerMod
objects, leaving the interval/uncertainty estimates out for the random effects parameters. It usesIf some of the interface issues can be sorted out I'd be happy to contribute methods for all of these.
Interface issues
In general there are a lot of different kinds of coefficients one might want to retrieve from a model. I'm going to focus on two issues, parameter types and confidence intervals.
Parameter types
At the moment, the
tidy
method formerMod
objects offers the choice of "fixed" or "random". The "fixed" case makes sense, but the other possibilities get a little bit interesting. My own most common use cases for random-effects parameters is to want to retrieve the standard deviations and correlations of the random effects, not the estimated values of the coefficients for each level of the grouping variable(s). Thus I would implement something likeThere are more decisions/possible user inputs here:
match.arg()
won't work any morecoef()
was more uniform across packages and model types,Another useful place to look is at the
coef()
method forhurdle
andzeroinfl
models from thepscl
package: from?predict.hurdle
,I guess the question is how much can be passed through
tidy
as optional arguments tocoef
vs. trying to give thetidy
methods as unified interface ...Confidence intervals etc.
While many model types have well-defined and useful standard errors, others don't. In particular the Wald standard errors for GLMs can be very unreliable; profile confidence intervals are more reliable. The same is often true for random effects parameters. It would be nice to have the option to incorporate lower and upper confidence intervals in a tidy data frame, although it's a little hard to know how to get this -- would you pass a confidence-interval data frame or a likelihood profile? (For most models including
glm
objects it's easy and lightweight to recompute the profile confidence intervals viaconfint()
, but formerMod
objects this can be an expensive operation ...)One thing to think about here is that the default column names that R uses for confidence intervals (
2.5 %
,97.5 %
) are a nuisance -- I usually uselwr
andupr
, although these don't specify the level -- maybelwr_0.025
,upr_0.975
?There are sometimes additional decisions: for Bayesian models, should highest posterior density (
coda::HPDinterval
) or quantiles of the marginal posterior be used?Wish lists
Some things that I've wanted
coefplot2
to be able to do:arm::coefplot
does this, I think, but only by re-fitting the model, which is an unnecessary expense.