Closed nicolas-robette closed 5 months ago
Thanks for the report. I will explore it later this week
The order of the factor levels are not respected by marginaleffects::avg_predictions()
.
library(tidyverse)
#> Warning: le package 'ggplot2' a été compilé avec la version R 4.3.3
#> Warning: le package 'tidyr' a été compilé avec la version R 4.3.3
library(labelled)
library(gtsummary)
library(marginaleffects)
data(hdv2003, package = "questionr")
d <-
hdv2003 |>
mutate(
sexe = sexe |> fct_relevel("Femme"),
groupe_ages = age |>
cut(
c(18, 25, 45, 65, 99),
right = FALSE,
include.lowest = TRUE,
labels = c("18-24 ans", "25-44 ans",
"45-64 ans", "65 ans et plus")
),
etudes = nivetud |>
fct_recode(
"Primaire" = "N'a jamais fait d'etudes",
"Primaire" = "A arrete ses etudes, avant la derniere annee d'etudes primaires",
"Primaire" = "Derniere annee d'etudes primaires",
"Secondaire" = "1er cycle",
"Secondaire" = "2eme cycle",
"Technique / Professionnel" = "Enseignement technique ou professionnel court",
"Technique / Professionnel" = "Enseignement technique ou professionnel long",
"Supérieur" = "Enseignement superieur y compris technique superieur"
) |>
fct_na_value_to_level("Non documenté")
) |>
set_variable_labels(
sport = "Pratique un sport ?",
sexe = "Sexe",
groupe_ages = "Groupe d'âges",
etudes = "Niveau d'études",
heures.tv = "Heures de télévision / jour"
)
mod <- glm(
sport ~ sexe + groupe_ages + etudes + heures.tv,
family = binomial,
data = d
)
mod |>
tbl_regression(
tidy_fun = broom.helpers::tidy_marginal_predictions,
type = "response",
estimate_fun = scales::label_percent(accuracy = 0.1)
) |>
as_kable()
Characteristic | Average Marginal Predictions | 95% CI | p-value |
---|---|---|---|
Sexe | |||
Femme | 32.5% | 29.9%, 35.0% | \<0.001 |
Homme | 40.4% | 37.5%, 43.2% | \<0.001 |
Groupe d’âges | |||
25-44 ans | 42.7% | 39.3%, 46.2% | \<0.001 |
18-24 ans | 51.2% | 42.2%, 60.1% | \<0.001 |
45-64 ans | 29.9% | 26.6%, 33.2% | \<0.001 |
65 ans et plus | 24.9% | 19.7%, 30.0% | \<0.001 |
Niveau d’études | |||
Supérieur | 53.2% | 48.4%, 57.9% | \<0.001 |
Non documenté | 59.2% | 47.0%, 71.5% | \<0.001 |
Primaire | 16.1% | 11.9%, 20.4% | \<0.001 |
Technique / Professionnel | 34.0% | 30.3%, 37.7% | \<0.001 |
Secondaire | 31.8% | 27.2%, 36.4% | \<0.001 |
Heures de télévision / jour | |||
0 | 41.0% | 37.6%, 44.3% | \<0.001 |
1 | 38.6% | 36.2%, 41.0% | \<0.001 |
2 | 36.3% | 34.3%, 38.2% | \<0.001 |
3 | 34.0% | 31.7%, 36.2% | \<0.001 |
12 | 16.8% | 8.6%, 25.1% | \<0.001 |
levels(mod$data$etudes)
#> [1] "Primaire" "Secondaire"
#> [3] "Technique / Professionnel" "Supérieur"
#> [5] "Non documenté"
avg_predictions(mod, variables = "etudes", by = "etudes")
#>
#> etudes Estimate Pr(>|z|) S 2.5 % 97.5 %
#> Supérieur 0.534 0.193 2.4 0.483 0.584
#> Non documenté 0.599 0.144 2.8 0.466 0.718
#> Primaire 0.149 <0.001 90.5 0.113 0.193
#> Technique / Professionnel 0.329 <0.001 46.8 0.291 0.370
#> Secondaire 0.307 <0.001 39.9 0.262 0.357
#>
#> Columns: etudes, estimate, p.value, s.value, conf.low, conf.high
#> Type: invlink(link)
Created on 2024-03-19 with reprex v2.1.0
It seems related to https://github.com/vincentarelbundock/marginaleffects/issues/1039
FYI, the order of categories in the output of avg_predictions()
respects the order in which the levels appear in the dataset. The issue linked sounds related, but it is not actually related (it is about the datagrid()
function which plays no role in marginalization).
The logic for respecting the order of appearance in avg_predictions
is to be consistent with most aggregation functions in R
. For example:
dat <- transform(mtcars, cyl = factor(cyl, c(4, 6, 8)))
data.table::data.table(dat)[, .(mean(mpg)), by = cyl]
cyl V1
<fctr> <num>
1: 6 19.74286
2: 4 26.66364
3: 8 15.10000
dat |> dplyr::summarize(mpg_avg = mean(mpg), .by = cyl)
cyl mpg_avg
1 6 19.74286
2 4 26.66364
3 8 15.10000
# but not this one
aggregate(mpg ~ cyl, FUN = mean, data = dat)
cyl mpg
1 4 26.66364
2 6 19.74286
3 8 15.10000
One user-level solution is to sort the dataset beforehand.
The linked issue will not solve the present issue.
Thank you very much @vincentarelbundock for your feedback.
Knowing that, I will try to sort the result of avg_predictions() then.
Hi @nicolas-robette
could you have a look at #246 ?
library(tidyverse)
library(labelled)
library(gtsummary)
library(marginaleffects)
data(hdv2003, package = "questionr")
d <-
hdv2003 |>
mutate(
sexe = sexe |> fct_relevel("Femme"),
groupe_ages = age |>
cut(
c(18, 25, 45, 65, 99),
right = FALSE,
include.lowest = TRUE,
labels = c("18-24 ans", "25-44 ans",
"45-64 ans", "65 ans et plus")
),
etudes = nivetud |>
fct_recode(
"Primaire" = "N'a jamais fait d'etudes",
"Primaire" = "A arrete ses etudes, avant la derniere annee d'etudes primaires",
"Primaire" = "Derniere annee d'etudes primaires",
"Secondaire" = "1er cycle",
"Secondaire" = "2eme cycle",
"Technique / Professionnel" = "Enseignement technique ou professionnel court",
"Technique / Professionnel" = "Enseignement technique ou professionnel long",
"Supérieur" = "Enseignement superieur y compris technique superieur"
) |>
fct_na_value_to_level("Non documenté")
) |>
set_variable_labels(
sport = "Pratique un sport ?",
sexe = "Sexe",
groupe_ages = "Groupe d'âges",
etudes = "Niveau d'études",
heures.tv = "Heures de télévision / jour"
)
mod <- glm(
sport ~ sexe + groupe_ages + etudes + heures.tv,
family = binomial,
data = d
)
mod |>
tbl_regression(
tidy_fun = broom.helpers::tidy_marginal_predictions,
type = "response",
estimate_fun = scales::label_percent(accuracy = 0.1)
) |>
as_kable()
Characteristic | Average Marginal Predictions | 95% CI | p-value |
---|---|---|---|
Sexe | |||
Femme | 32.5% | 29.9%, 35.0% | \<0.001 |
Homme | 40.4% | 37.5%, 43.2% | \<0.001 |
Groupe d’âges | |||
18-24 ans | 51.2% | 42.2%, 60.1% | \<0.001 |
25-44 ans | 42.7% | 39.3%, 46.2% | \<0.001 |
45-64 ans | 29.9% | 26.6%, 33.2% | \<0.001 |
65 ans et plus | 24.9% | 19.7%, 30.0% | \<0.001 |
Niveau d’études | |||
Primaire | 16.1% | 11.9%, 20.4% | \<0.001 |
Secondaire | 31.8% | 27.2%, 36.4% | \<0.001 |
Technique / Professionnel | 34.0% | 30.3%, 37.7% | \<0.001 |
Supérieur | 53.2% | 48.4%, 57.9% | \<0.001 |
Non documenté | 59.2% | 47.0%, 71.5% | \<0.001 |
Heures de télévision / jour | |||
0 | 41.0% | 37.6%, 44.3% | \<0.001 |
1 | 38.6% | 36.2%, 41.0% | \<0.001 |
2 | 36.3% | 34.3%, 38.2% | \<0.001 |
3 | 34.0% | 31.7%, 36.2% | \<0.001 |
12 | 16.8% | 8.6%, 25.1% | \<0.001 |
Created on 2024-04-05 with reprex v2.1.0
This is great, thanks for the fix ! Bests, Nicolas
Dear Joseph, I've just started using {gtsummary} together with {broom.helpers} for regressions, which is really great !
I have noticed that the order of the levels of categorical variables does not respect the order of the levels of the factors in the dataset, when using marginal tidying functions. This is then reflected in the output of tbl_regression() function.
Is there a way to fix this ? I managed to do so by manually changing the "table_body" element of the tbl_regression() output, but there may be a better way.
Here is an example, taken from 24th chapter of guide-R. The "groupe_ages" and "etudes" variables have their levels a bit messed up.
Best regards, Nicolas