rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
364 stars 32 forks source link

invert OR option #507

Closed consultfiv closed 1 month ago

consultfiv commented 2 months ago

Describe the bug

This it is not a bug itself, it is just a recommendation for an improvement in the emmeans function, in order to make improved console's outputs. I tried and achieved to get an improve output in the console that I think it provides easier and faster interpretations of the results.

Expected behaviour

This R script (requires packages "emmeans" and "dplyr") creates a function that performs several key steps for analyzing and transforming the results of a generalized linear model (GLM) using the emmeans package. Here’s a breakdown of the process:

After the emmeans function calculates the estimated marginal means for the Feature1 levels, and pairwise comparisons between these levels are performed, the results are converted to a tibble and numeric values are rounded to three decimal places for clarity. For odds ratios less than 1, their inverse is calculated, along with inverses for the corresponding confidence intervals (LCL and UCL). The contrasts are also reversed. The dataframe is updated to include the transformed odds ratios and confidence intervals, and variables are renamed to reflect these changes. The dataframe is ordered by the odds ratio in descending order to highlight the most significant effects. This script ensures that the analysis correctly handles and interprets odds ratios and their confidence intervals, especially when dealing with ratios less than 1, and organizes the results for easy interpretation.

To reproduce

(To facilitate its use, I converted it directly to a function. I also deleted comments, if you want them I can also provide comments for each action)

library(emmeans)
library(dplyr)

invert_OR <- function(emmeans_obj) {
  emmeans_df <- as_tibble(emmeans_obj)

  emmeans_df$contrast <- as.character(emmeans_df$contrast)

  emmeans_df <- emmeans_df %>%
    dplyr::mutate(
      new_odds.ratio = ifelse(odds.ratio < 1, 1 / odds.ratio, odds.ratio),
      new_asymp.LCL = ifelse(odds.ratio < 1, 1 / asymp.UCL, asymp.LCL),
      new_asymp.UCL = ifelse(odds.ratio < 1, 1 / asymp.LCL, asymp.UCL),
      new_contrast = ifelse(odds.ratio < 1, 
                            sapply(strsplit(as.character(contrast), " / "), 
                                   function(x) paste(rev(x), collapse = " / ")), 
                            contrast)
    ) %>%
    dplyr::select(new_contrast, new_odds.ratio, SE, df, new_asymp.LCL, new_asymp.UCL, null, z.ratio, p.value) %>%
    dplyr::rename(
      contrast = new_contrast, 
      odds.ratio = new_odds.ratio, 
      asymp.LCL = new_asymp.LCL, 
      asymp.UCL = new_asymp.UCL
    )

  emmeans_df$contrast <- factor(emmeans_df$contrast, levels = unique(emmeans_df$contrast))

  emmeans_df <- emmeans_df %>%
    arrange(desc(odds.ratio))

 options(scipen = 999)
  emmeans_df[] <- lapply(emmeans_df, function(x) {
    if (is.numeric(x)) {
      round(x, 3)
    } else {
      x
    }
  })

  return(emmeans_df)
}

example:

set.seed(135)
n <- 100
Feature1 <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
Class <- factor(sample(c("Class1", "Class2"), n, replace = TRUE))
dataset <- data.frame(
  Feature1 = Feature1,
  Class = Class
)
glm.1 <- glm(Class ~ ., data = dataset, family = family.glm)
emmeans1 <- emmeans(glm.1, pairwise ~ Feature1, type = "response") %>%
  pairs(reverse = F, infer = T)

invert_OR(emmeans1)

Show the actual output

# A tibble: 6 × 9
  contrast odds.ratio    SE    df asymp.LCL asymp.UCL  null z.ratio p.value
  <fct>         <dbl> <dbl> <dbl>     <dbl>     <dbl> <dbl>   <dbl>   <dbl>
1 D / C          2.42 0.239   Inf     0.548     10.8      1  -1.53    0.42 
2 D / B          2.22 0.264   Inf     0.494     10        1  -1.36    0.522
3 D / A          1.57 0.338   Inf     0.400      6.21     1  -0.853   0.829
4 A / C          1.54 0.894   Inf     0.345      6.85     1   0.741   0.881
5 A / B          1.41 0.829   Inf     0.311      6.39     1   0.585   0.937
6 B / C          1.09 0.687   Inf     0.216      5.50     1   0.138   0.999

Additional context

As you can see, with the functionality of invert_OR( ), which it would be desireble to be as an aditional option included in the original function, every relationship is given in odds ratio bigger than 1, facilitating the interpretation of the relationship between each term of each pair.

Emmeans regular output look like this:

> emmeans1
 contrast odds.ratio    SE  df asymp.LCL asymp.UCL null z.ratio p.value
 A / B         1.410 0.829 Inf    0.3114      6.39    1   0.585  0.9367
 A / C         1.538 0.894 Inf    0.3455      6.85    1   0.741  0.8805
 A / D         0.635 0.338 Inf    0.1612      2.50    1  -0.853  0.8292
 B / C         1.091 0.687 Inf    0.2164      5.50    1   0.138  0.9991
 B / D         0.450 0.264 Inf    0.1000      2.03    1  -1.364  0.5223
 C / D         0.412 0.239 Inf    0.0932      1.83    1  -1.529  0.4199

Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 4 estimates 
Intervals are back-transformed from the log odds ratio scale 
P value adjustment: tukey method for comparing a family of 4 estimates 
Tests are performed on the log odds ratio scale 
rvlenth commented 2 months ago

Thanks for your suggestion. I went through and edited your comment. adding Markdown markup for code, so you can see the indentation and alignment. I also removed the "bug" tag, since this isn't a bug.

I guess this illustrates that different people have different preferences in how to present statistical results. But I personally do not regard this as an improvement at all. The fact that in the original contrast output, some odds ratios are greater than 1 and others are less than 1 is important information that is immediately apparent, and I think reversing the direction of some ratios adds confusion.

In addition, the original output contains annotations that inform us about P-value adjustments, etc. that are not present in the tibble version. These annotations are important enough to me that I have a special startup message warning people not to suppress them (see ? untidy). I also am not myself a fan of tibbles; yes, data frames are idiosyncratic, and tibbles "solve" some of them; but they do so by creating other idiosyncrasies and add advertising ("a tibble"), unneeded row numbers, and distracting information about storage types. If we are going to add extra lines to the output, I'd rather show information that is actually useful. I realize that is just my opinion.

There are other packages that re-package emmeans's output in various ways to suit the preferences of particular user communities. I do recognize that different users have different needs, but I think the invert_OR() function would better be incorporated in a separate package, rather than in emmeans.

rvlenth commented 2 months ago

By the way, your specification pairwise ~ Feature1 is unneeded; you'll get the same results using ~ Feature1 while avoiding computing the pairwise comparisons twice.

rvlenth commented 2 months ago

Also, a simple way to obtain the same ordering without reversing any odds ratios is to sort by p.value

consultfiv commented 2 months ago

Thank you for your attention and for your responses. I understand all you said. Anyway, congrats for building such a great package which I have been enjoying and recommending since the day I discovered it.

rvlenth commented 1 month ago

I'm closing this issue, as it is resolved.