Closed iago-pssjd closed 3 years ago
modelsummary_wide
(will be renamed to modelsummary_group
in the next version) can handle models with groups of coefficients like this. You just need to specify the coef_group
argument with the name of the column.
What would you have expected modelsummary
to produce? The problem is that models are extremely inconsistent. gamlss
includes a parameter
column. nnet::multinom
includes a y.level
column. Other packages include a reponse
column. I'm not sure what is the most sensible approach here. Maybe just print out a warning pointing users to modelsummary_group
when I detect a known grouped parameter name?
Thank you @vincentarelbundock . I will try modelsummary_wide
Now that you say that I see that it is a big problem how to do the merge when a nnet::multinom
model is involved. In that case a possible solution I think could be summarising all y.levels
in one observation (like a list,as does tidyr::chop
) and, when printing, specifying the y.levels
in the same row (something like ylevel1: estimate1; y.level2: estimate2; ...
).
For the case of gamlss
models a possibility is something like the one for multinom
, but what I expected before thinking in this is just giving the mu
parameter.
Hmm, interesting.
FWIW, modelsummary_wide
shows each group in different columns, which is more or less the standard approach for multinomial models (in my experience).
That's a good option, but the issue I think with it, it is that the tables can become very wide. In fact my previous suggestion for printing should be changed to the (language correspondent of) ylevel1: estimate1\ny.level2: estimate2\n...
I don't have a great idea for what this would look like in terms of user interface or internal code, but I included a "case study" on the website to show how you can achieve all of this with minimal manual coding:
Actually, I think I do have any idea now. Stay tuned ;)
@iago-pssjd if you have a chance to try it out, I'd really like to get some feedback on the user-interface of the new group
and group_map
arguments. See here:
https://vincentarelbundock.github.io/modelsummary/articles/modelsummary.html#group-1
My hope is that this will make modelsummary_wide
obsolete, and make the whole thing much more flexible. I'm especially curious to know:
gamlss
models, the groupid is called component
)Note that this is only available in the Github version so far.
Thanks @vincentarelbundock I will try now.
Beyond the new issues commented in https://github.com/vincentarelbundock/modelsummary/issues/238#issuecomment-810958550, (now I use options(modelsummary_get = "broom")
to avoid them), in this case I have found another couple of issues.
GA
and GAGLM
we see Model 1
and Model 2
(if the group
formula was ~ model + something
model name could be pasted with such something
, for example GA sigma
).conditional
component
of GA
is the equivalent (almost same intercept and parameters) of the "standard" estimates of glm
, so they should be in the same rows.glm
, I would say that it is not necessary.dat <- rgamma(100, shape=1, scale=10)
library(gamlss)
models <- list()
models[["GA"]] <- gamlss(dat ~ 1, family = GA, trace = FALSE)
models[["GAGLM"]] <- glm(dat ~ 1, family = Gamma("log"))
modelsummary(models[c("GA", "GAGLM")], output = "markdown")
| | GA | GAGLM |
|:-----------|:--------:|:--------:|
|(Intercept) | -0.015 | 2.318 |
| | 2.318 | 2.318 |
| | (0.062) | (0.101) |
| | (0.098) | (0.101) |
|Num.Obs. | | 100 |
|AIC | 667.6 | 668.8 |
|BIC | 672.8 | 674.1 |
|Log.Lik. | -331.818 | -332.424 |
modelsummary(models[c("GA", "GAGLM")], output = "markdown", group = component + term ~ model)
| | | Model 1 | Model 2 |
|:-----------|:-----------|:--------:|:--------:|
|conditional |(Intercept) | 2.318 | |
| | | (0.098) | |
|sigma | | -0.015 | |
| | | (0.062) | |
| | | | 2.318 |
| | | | (0.101) |
| |Num.Obs. | | 100 |
| |AIC | 667.6 | 668.8 |
| |BIC | 672.8 | 674.1 |
| |Log.Lik. | -331.818 | -332.424 |
Warning message:
In format_estimates(est = msl[[i]]$tidy, fmt = fmt, estimate = estimate[[i]], :
Group name "component" was not found in the extracted data. The "group" argument must be a column name in the data.frame produced by `get_estimates(model)`
Anyway, it is being a great work, thanks a lot! If I found anything more I'll tell you.
Thanks!
I fixed the model name preservation issue.
Matching coefficients for grouped and non-grouped models is actually a very difficult problem to solve. You and I know that this specific intercept matches across the GAMLSS and GLM models because we have external statistical knowledge. But modelsummary
doesn't have any statistical knowledge per se, so I would have to encode matches for every possible pair of model types that users might want to combine. What if I want to match "grouped" multinomial logit to "ungrouped" OLS? That would require I specify custom matches for 10s of thousands of pairs of models. Not doable, unfortunately.
I updated the website to suggest a semi-manual strategy that allows users to specify how they want to align parameters across grouped and non-grouped model types:
https://vincentarelbundock.github.io/modelsummary/articles/modelsummary.html#group-1
I also augmented the warning to make it more useful. I anticipate that quite a few people will run into this issue, so I prefer to leave a more informative warning in, otherwise, I might have to answer dozens of support emails ;)
Thank you for the answer and the fixes @vincentarelbundock .
Regarding to the matching coefficients I agree with you. I also thought the case for the multinomial logit which I would not expect to be matched with any other models. However, for the case of GAMLSS I would say it is not necessary to encode matches for every possible pair of model types. In the component
column, the mu
value would be changed to the default value which would be merged with all other defaults (I do not know now how is done the merge, but I imagine all the models listed would have attached a component
column with a default value [not in the case of nnet::multinom
where component = y.level
]; let me know if the process is done in another way).
I recognise this wouldn't be a definitive solution, since there could be other models with other group
columns which I do not know and which should be treated separately, but it could be used a set of standardised values for this group/component/y.level
column in such a way it would be enough to encode special model types, instead of pairs of those.
I will try the strategy you suggest.
Right. So maybe I was exaggerating a bit. But a key design decision of modelsummary
is that it does not hold information about specific models, and that it delegates information extract to broom
.
If you look at the broom
repository, there are dozens of individual contributors. Keeping track of various model types and their special feature is a very labour intensive task. Now it sounds like an easy fix: just one column for gamlss
. But there will be other models to keep track of in the future. I unfortunately don't have the time to do this. For this reason, I prefer to provide a general and solid infrastructre to produce tables, that works in most cases, even if people sometimes have to write an extra line or two of code to handle special cases.
If you look at the example on the website, it would be super easy to wrap in function that you could apply to all GLM models. So that's pretty easy on the user end, I think.
Thanks @vincentarelbundock .
I understand what you say. It's fine. Further, I've just tried the strategy for GAMLSS that you provided in the vignette, and it's really easy. I just add a bit more complex example for the case you want to add to the vignette. Note also in the current example in the vignette, that in the model gamlss(dat ~ 1, trace = FALSE)
lacks family = GA
, so then the GLM intercept coincides with the GAMLSS "mu" intercept. But I was testing a more complex example as follows (not sure if the lapply part could be simplified):
library(modelsummary)
library(gamlss)
#> Loading required package: splines
#> Loading required package: gamlss.data
#>
#> Attaching package: 'gamlss.data'
#> The following object is masked from 'package:datasets':
#>
#> sleep
#> Loading required package: gamlss.dist
#> Loading required package: MASS
#> Loading required package: nlme
#> Loading required package: parallel
#> ********** GAMLSS Version 5.3-2 **********
#> For more on GAMLSS look at https://www.gamlss.com/
#> Type gamlssNews() to see new features/changes/bug fixes.
data(trees)
dat <- rgamma(100, shape=1, scale=10)
models <- list()
models[['Bivariate']] <- lm(Girth ~ Height, data = trees)
models[['Multivariate']] <- lm(Girth ~ Height + Volume, data = trees)
models[["BCT"]] <- gamlss(y~pb(x),sigma.fo=~pb(x),family=BCT, data=abdom, method=mixed(1,20), trace=FALSE)
models[["GA"]] <- gamlss(dat ~ 1, family = GA, trace = FALSE)
models[["GLM"]] <- glm(dat ~ 1, family = Gamma("log"))
msl <- modelsummary(models, output = "modelsummary_list")
msl2 <- msl
msl2[c("Multivariate", "Bivariate", "GLM")] <- lapply(msl[c("Multivariate", "Bivariate", "GLM")],
function(x) structure(list(tidy = transform(x[["tidy"]], parameter = "mu"),
glance = x[["glance"]]),
class = c("modelsummary_list", "list")))
modelsummary(msl2, group = parameter + term ~ model, output = "markdown")
Bivariate | Multivariate | BCT | GA | GLM | ||
---|---|---|---|---|---|---|
mu | (Intercept) | -6.188 | 10.816 | -64.443 | 2.231 | 2.231 |
(5.960) | (1.973) | (1.329) | (0.097) | (0.100) | ||
Height | 0.256 | -0.045 | ||||
(0.078) | (0.028) | |||||
Volume | 0.195 | |||||
(0.011) | ||||||
pb(x) | 10.695 | |||||
(0.058) | ||||||
sigma | (Intercept) | -2.650 | -0.027 | |||
(0.108) | (0.063) | |||||
pb(x) | -0.010 | |||||
(0.004) | ||||||
nu | (Intercept) | -0.107 | ||||
(0.557) | ||||||
tau | 2.495 | |||||
(0.301) | ||||||
Num.Obs. | 31 | 31 | 610 | 100 | 100 | |
R2 | 0.270 | 0.941 | ||||
R2 Adj. | 0.244 | 0.937 | ||||
AIC | 154.1 | 78.2 | 4794.5 | 649.9 | 651.1 | |
BIC | 158.4 | 84.0 | 4846.4 | 655.1 | 656.3 | |
Log.Lik. | -74.061 | -35.116 | -2385.497 | -322.959 | -323.540 | |
F | 10.707 | 222.471 |
Created on 2021-03-31 by the reprex package (v1.0.0)
Another example Updated:
library(modelsummary)
options(modelsummary_get = "all")
library(gamlss)
#> Loading required package: splines
#> Loading required package: gamlss.data
#>
#> Attaching package: 'gamlss.data'
#> The following object is masked from 'package:datasets':
#>
#> sleep
#> Loading required package: gamlss.dist
#> Loading required package: MASS
#> Loading required package: nlme
#> Loading required package: parallel
#> ********** GAMLSS Version 5.3-4 **********
#> For more on GAMLSS look at https://www.gamlss.com/
#> Type gamlssNews() to see new features/changes/bug fixes.
library(broom.mixed)
#> Registered S3 method overwritten by 'broom.mixed':
#> method from
#> tidy.gamlss broom
data(trees)
dat <- rgamma(100, shape=1, scale=10)
models <- list(
'Bivariate' = lm(Girth ~ Height, data = trees),
'Multivariate' = lm(Girth ~ Height + Volume, data = trees),
'GAMLSS' = gamlss(y~pb(x),sigma.fo=~pb(x),family=BCT, data=abdom, method=mixed(1,20), trace=FALSE),
'GA' = gamlss(dat ~ 1, family = GA, trace = FALSE)
'GLM' = glm(dat ~ 1, family = Gamma("log")))
# exponentiate = TRUE must be specified here; also confidence intervals must be done explicit
msl <- modelsummary(models, estimate = "{estimate} ({conf.low}, {conf.high})", output = "modelsummary_list", exponentiate = TRUE)
f <- function(x) {
x$tidy$component <- "conditional"
x
}
msl[c("Multivariate", "Bivariate", "GLM")] <- lapply(msl[c("Multivariate", "Bivariate", "GLM")], f)
modelsummary(msl, group = component + term ~ model, estimate = "{estimate} ({conf.low}, {conf.high})",
statistic = "S.E. = {std.error}, p = {p.value}{stars} ",
stars = c('*' = .05, '**' = .01, '***' = .001), output = "markdown")
Bivariate | Multivariate | GAMLSS | GA | GLM | ||
---|---|---|---|---|---|---|
conditional | (Intercept) | -6.188 (-18.378, 6.002) | 10.816 (6.774, 14.858) | 0.000 (0.000, 0.000) | 10.604 (8.628, 13.033) | 10.604 (8.452, 13.552) |
S.E. = 5.960, p = 0.308 | S.E. = 1.973, p = 0.000*** | S.E. = 0.000, p = 0.000*** | S.E. = 1.116, p = 0.000*** | S.E. = 0.120, p = 0.000*** | ||
Height | 0.256 (0.096, 0.416) | -0.045 (-0.103, 0.012) | ||||
S.E. = 0.078, p = 0.003** | S.E. = 0.028, p = 0.119 | |||||
Volume | 0.195 (0.173, 0.218) | |||||
S.E. = 0.011, p = 0.000*** | ||||||
pb(x) | 44118.541 (39387.768, 49417.517) | |||||
S.E. = 2553.178, p = 0.000*** | ||||||
sigma | (Intercept) | 0.071 (0.057, 0.087) | 1.052 (0.932, 1.188) | |||
S.E. = 0.008, p = 0.000*** | S.E. = 0.065, p = 0.411 | |||||
pb(x) | 0.990 (0.983, 0.997) | |||||
S.E. = 0.004, p = 0.009** | ||||||
nu | (Intercept) | 0.898 (0.262, 3.086) | ||||
S.E. = 0.566, p = 0.865 | ||||||
tau | 12.120 (5.258, 27.937) | |||||
S.E. = 5.164, p = 0.000*** | ||||||
Num.Obs. | 31 | 31 | 610 | 100 | 100 | |
R2 | 0.270 | 0.941 | ||||
R2 Adj. | 0.244 | 0.937 | ||||
AIC | 154.1 | 78.2 | 4794.5 | 675.5 | 677.1 | |
BIC | 158.4 | 84.0 | 4846.4 | 680.8 | 682.3 | |
Log.Lik. | -74.061 | -35.116 | -2385.497 | -335.771 | -336.536 | |
F | 10.707 | 222.471 | ||||
RMSE | 2.64 | 0.75 | 1.00 | 1.00 | 12.69 |
Note: ^^ * p < 0.05, ** p < 0.01, *** p < 0.001
Created on 2021-04-01 by the reprex package (v1.0.0)
I got rid of the weird error message. Here's how I would re-write your code in a slightly more verbose but (I think) clearer way:
library(gamlss)
library(modelsummary)
options(modelsummary_get = "parameters")
data(trees)
dat <- rgamma(100, shape=1, scale=10)
models <- list(
'Bivariate' = lm(Girth ~ Height, data = trees),
'Multivariate' = lm(Girth ~ Height + Volume, data = trees),
'GAMLSS' = gamlss(y~pb(x),sigma.fo=~pb(x),family=BCT, data=abdom, method=mixed(1,20), trace=FALSE),
'GA' = gamlss(dat ~ 1, trace = FALSE),
'GLM' = glm(dat ~ 1, family = Gamma("log")))
msl <- modelsummary(models, output = "modelsummary_list")
f <- function(x) {
x$tidy$parameter <- "mu"
x
}
msl[c(1:2, 5)] <- lapply(msl[c(1:2, 5)], f)
modelsummary(msl, group = parameter + term ~ model)
Just a last comment relative to this issue and the last comments in #238.
If options(modelsummary_get = "all")
in the last example, the specification of confidence intervals in either estimate
or statistic
(like in my example estimate = "{estimate} ({conf.low}, {conf.high})"
) has to be done both (first) when
msl <- modelsummary(models, estimate = "{estimate} ({conf.low}, {conf.high})", output = "modelsummary_list")
(in order to get computed the confidence intervals by broom
) and (second) when
modelsummary(msl2, group = component + term ~ model, estimate = "{estimate} ({conf.low}, {conf.high})", output = "markdown")
to format.
Good point. Since we don't store the full model object, it is impossible to retrieve the confidence interval unless we asked for it explicitly on the initial run.
I added a note to the documentation.
Hi,
gamlss
models have distinct intercepts (and possibly slopes) for possible parameters for the distribution chosen, which can bemu, sigma, tau, nu
, of which the main ismu
(as can be seen in the examples,mu
parameter is shown as fixed effects byparameters::model_parameters
). When the models are merged bymodelsummary
, the intercept rows are multiplied in all possible combinations, producing multiple non necessary rows, mixingmu
intercepts withsigma
and the other intercepts as can be seen below.For the second model, I paste the
data.frame
coercion of the original function to show how the various parameters are saved there.Created on 2021-03-27 by the reprex package (v1.0.0)
Thank you!