larmarange / broom.helpers

A set of functions to facilitate manipulation of tibbles produced by broom
https://larmarange.github.io/broom.helpers/
GNU General Public License v3.0
21 stars 8 forks source link

order of variable levels with marginal tidiers #245

Closed nicolas-robette closed 5 months ago

nicolas-robette commented 5 months ago

Dear Joseph, I've just started using {gtsummary} together with {broom.helpers} for regressions, which is really great !

I have noticed that the order of the levels of categorical variables does not respect the order of the levels of the factors in the dataset, when using marginal tidying functions. This is then reflected in the output of tbl_regression() function.

Is there a way to fix this ? I managed to do so by manually changing the "table_body" element of the tbl_regression() output, but there may be a better way.

Here is an example, taken from 24th chapter of guide-R. The "groupe_ages" and "etudes" variables have their levels a bit messed up.

library(tidyverse)
library(labelled)
library(gtsummary)

data(hdv2003, package = "questionr")

d <-
  hdv2003 |> 
  mutate(
    sexe = sexe |> fct_relevel("Femme"),
    groupe_ages = age |>
      cut(
        c(18, 25, 45, 65, 99),
        right = FALSE,
        include.lowest = TRUE,
        labels = c("18-24 ans", "25-44 ans",
                   "45-64 ans", "65 ans et plus")
      ),
    etudes = nivetud |> 
      fct_recode(
        "Primaire" = "N'a jamais fait d'etudes",
        "Primaire" = "A arrete ses etudes, avant la derniere annee d'etudes primaires",
        "Primaire" = "Derniere annee d'etudes primaires",
        "Secondaire" = "1er cycle",
        "Secondaire" = "2eme cycle",
        "Technique / Professionnel" = "Enseignement technique ou professionnel court",
        "Technique / Professionnel" = "Enseignement technique ou professionnel long",
        "Supérieur" = "Enseignement superieur y compris technique superieur"
    ) |> 
    fct_na_value_to_level("Non documenté")  
  ) |> 
  set_variable_labels(
    sport = "Pratique un sport ?",
    sexe = "Sexe",
    groupe_ages = "Groupe d'âges",
    etudes = "Niveau d'études",
    heures.tv = "Heures de télévision / jour"
  )

mod <- glm(
  sport ~ sexe + groupe_ages + etudes + heures.tv,
  family = binomial,
  data = d
)

mod |> 
  tbl_regression(
    tidy_fun = broom.helpers::tidy_marginal_predictions,
    type = "response",
    estimate_fun = scales::label_percent(accuracy = 0.1)
  ) 

Best regards, Nicolas

larmarange commented 5 months ago

Thanks for the report. I will explore it later this week

larmarange commented 5 months ago

The order of the factor levels are not respected by marginaleffects::avg_predictions().

library(tidyverse)
#> Warning: le package 'ggplot2' a été compilé avec la version R 4.3.3
#> Warning: le package 'tidyr' a été compilé avec la version R 4.3.3
library(labelled)
library(gtsummary)
library(marginaleffects)

data(hdv2003, package = "questionr")

d <-
  hdv2003 |> 
  mutate(
    sexe = sexe |> fct_relevel("Femme"),
    groupe_ages = age |>
      cut(
        c(18, 25, 45, 65, 99),
        right = FALSE,
        include.lowest = TRUE,
        labels = c("18-24 ans", "25-44 ans",
                   "45-64 ans", "65 ans et plus")
      ),
    etudes = nivetud |> 
      fct_recode(
        "Primaire" = "N'a jamais fait d'etudes",
        "Primaire" = "A arrete ses etudes, avant la derniere annee d'etudes primaires",
        "Primaire" = "Derniere annee d'etudes primaires",
        "Secondaire" = "1er cycle",
        "Secondaire" = "2eme cycle",
        "Technique / Professionnel" = "Enseignement technique ou professionnel court",
        "Technique / Professionnel" = "Enseignement technique ou professionnel long",
        "Supérieur" = "Enseignement superieur y compris technique superieur"
      ) |> 
      fct_na_value_to_level("Non documenté")  
  ) |> 
  set_variable_labels(
    sport = "Pratique un sport ?",
    sexe = "Sexe",
    groupe_ages = "Groupe d'âges",
    etudes = "Niveau d'études",
    heures.tv = "Heures de télévision / jour"
  )

mod <- glm(
  sport ~ sexe + groupe_ages + etudes + heures.tv,
  family = binomial,
  data = d
)

mod |> 
  tbl_regression(
    tidy_fun = broom.helpers::tidy_marginal_predictions,
    type = "response",
    estimate_fun = scales::label_percent(accuracy = 0.1)
  ) |> 
  as_kable()
Characteristic Average Marginal Predictions 95% CI p-value
Sexe
Femme 32.5% 29.9%, 35.0% \<0.001
Homme 40.4% 37.5%, 43.2% \<0.001
Groupe d’âges
25-44 ans 42.7% 39.3%, 46.2% \<0.001
18-24 ans 51.2% 42.2%, 60.1% \<0.001
45-64 ans 29.9% 26.6%, 33.2% \<0.001
65 ans et plus 24.9% 19.7%, 30.0% \<0.001
Niveau d’études
Supérieur 53.2% 48.4%, 57.9% \<0.001
Non documenté 59.2% 47.0%, 71.5% \<0.001
Primaire 16.1% 11.9%, 20.4% \<0.001
Technique / Professionnel 34.0% 30.3%, 37.7% \<0.001
Secondaire 31.8% 27.2%, 36.4% \<0.001
Heures de télévision / jour
0 41.0% 37.6%, 44.3% \<0.001
1 38.6% 36.2%, 41.0% \<0.001
2 36.3% 34.3%, 38.2% \<0.001
3 34.0% 31.7%, 36.2% \<0.001
12 16.8% 8.6%, 25.1% \<0.001

levels(mod$data$etudes)
#> [1] "Primaire"                  "Secondaire"               
#> [3] "Technique / Professionnel" "Supérieur"                
#> [5] "Non documenté"

avg_predictions(mod, variables = "etudes", by = "etudes")
#> 
#>                     etudes Estimate Pr(>|z|)    S 2.5 % 97.5 %
#>  Supérieur                    0.534    0.193  2.4 0.483  0.584
#>  Non documenté                0.599    0.144  2.8 0.466  0.718
#>  Primaire                     0.149   <0.001 90.5 0.113  0.193
#>  Technique / Professionnel    0.329   <0.001 46.8 0.291  0.370
#>  Secondaire                   0.307   <0.001 39.9 0.262  0.357
#> 
#> Columns: etudes, estimate, p.value, s.value, conf.low, conf.high 
#> Type:  invlink(link)

Created on 2024-03-19 with reprex v2.1.0

larmarange commented 5 months ago

It seems related to https://github.com/vincentarelbundock/marginaleffects/issues/1039

vincentarelbundock commented 5 months ago

FYI, the order of categories in the output of avg_predictions() respects the order in which the levels appear in the dataset. The issue linked sounds related, but it is not actually related (it is about the datagrid() function which plays no role in marginalization).

The logic for respecting the order of appearance in avg_predictions is to be consistent with most aggregation functions in R. For example:

dat <- transform(mtcars, cyl = factor(cyl, c(4, 6, 8)))

data.table::data.table(dat)[, .(mean(mpg)), by = cyl]

          cyl       V1
       <fctr>    <num>
    1:      6 19.74286
    2:      4 26.66364
    3:      8 15.10000

dat |> dplyr::summarize(mpg_avg = mean(mpg), .by = cyl)

      cyl  mpg_avg
    1   6 19.74286
    2   4 26.66364
    3   8 15.10000

# but not this one
aggregate(mpg ~ cyl, FUN = mean, data = dat)

      cyl      mpg
    1   4 26.66364
    2   6 19.74286
    3   8 15.10000

One user-level solution is to sort the dataset beforehand.

The linked issue will not solve the present issue.

larmarange commented 5 months ago

Thank you very much @vincentarelbundock for your feedback.

Knowing that, I will try to sort the result of avg_predictions() then.

larmarange commented 5 months ago

Hi @nicolas-robette

could you have a look at #246 ?

larmarange commented 5 months ago
library(tidyverse)
library(labelled)
library(gtsummary)
library(marginaleffects)

data(hdv2003, package = "questionr")

d <-
  hdv2003 |> 
  mutate(
    sexe = sexe |> fct_relevel("Femme"),
    groupe_ages = age |>
      cut(
        c(18, 25, 45, 65, 99),
        right = FALSE,
        include.lowest = TRUE,
        labels = c("18-24 ans", "25-44 ans",
                   "45-64 ans", "65 ans et plus")
      ),
    etudes = nivetud |> 
      fct_recode(
        "Primaire" = "N'a jamais fait d'etudes",
        "Primaire" = "A arrete ses etudes, avant la derniere annee d'etudes primaires",
        "Primaire" = "Derniere annee d'etudes primaires",
        "Secondaire" = "1er cycle",
        "Secondaire" = "2eme cycle",
        "Technique / Professionnel" = "Enseignement technique ou professionnel court",
        "Technique / Professionnel" = "Enseignement technique ou professionnel long",
        "Supérieur" = "Enseignement superieur y compris technique superieur"
      ) |> 
      fct_na_value_to_level("Non documenté")  
  ) |> 
  set_variable_labels(
    sport = "Pratique un sport ?",
    sexe = "Sexe",
    groupe_ages = "Groupe d'âges",
    etudes = "Niveau d'études",
    heures.tv = "Heures de télévision / jour"
  ) 

mod <- glm(
  sport ~ sexe + groupe_ages + etudes + heures.tv,
  family = binomial,
  data = d
)

mod |> 
  tbl_regression(
    tidy_fun = broom.helpers::tidy_marginal_predictions,
    type = "response",
    estimate_fun = scales::label_percent(accuracy = 0.1)
  ) |> 
  as_kable()
Characteristic Average Marginal Predictions 95% CI p-value
Sexe
Femme 32.5% 29.9%, 35.0% \<0.001
Homme 40.4% 37.5%, 43.2% \<0.001
Groupe d’âges
18-24 ans 51.2% 42.2%, 60.1% \<0.001
25-44 ans 42.7% 39.3%, 46.2% \<0.001
45-64 ans 29.9% 26.6%, 33.2% \<0.001
65 ans et plus 24.9% 19.7%, 30.0% \<0.001
Niveau d’études
Primaire 16.1% 11.9%, 20.4% \<0.001
Secondaire 31.8% 27.2%, 36.4% \<0.001
Technique / Professionnel 34.0% 30.3%, 37.7% \<0.001
Supérieur 53.2% 48.4%, 57.9% \<0.001
Non documenté 59.2% 47.0%, 71.5% \<0.001
Heures de télévision / jour
0 41.0% 37.6%, 44.3% \<0.001
1 38.6% 36.2%, 41.0% \<0.001
2 36.3% 34.3%, 38.2% \<0.001
3 34.0% 31.7%, 36.2% \<0.001
12 16.8% 8.6%, 25.1% \<0.001

Created on 2024-04-05 with reprex v2.1.0

nicolas-robette commented 5 months ago

This is great, thanks for the fix ! Bests, Nicolas