insightsengineering / tern

Table, Listings, and Graphs (TLG) library for common outputs used in clinical trials
https://insightsengineering.github.io/tern/
Other
76 stars 21 forks source link

[Bug]: s_odds_ratio not work with large N #1314

Open lchensigma opened 3 days ago

lchensigma commented 3 days ago

What happened?

A bug happened!

changing nex <- 200 makes code work.


nex <- 2000 # Number of example rows
dta <- data.frame(
  "rsp" = sample(c(TRUE, FALSE), nex, TRUE),
  "grp" = sample(c("A", "B"), nex, TRUE),
  "f1" = sample(c("a1", "a2"), nex, TRUE),
  "f2" = sample(c("x", "y", "z"), nex, TRUE),
  strata = factor(sample(c("C", "D"), nex, TRUE)),
  stringsAsFactors = TRUE
)

s_odds_ratio(
  df = subset(dta, grp == "A"),
  .var = "rsp",
  .ref_group = subset(dta, grp == "B"),
  .in_ref_col = FALSE,
  .df_row = dta,
  variables = list(arm = "grp", strata = "strata")
)

sessionInfo()

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=C                            
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] optimx_2023-10.21   emmeans_1.10.4      forestploter_1.1.2  tern.mmrm_0.3.1    
 [5] scda_0.1.6          tern_0.9.5          haven_2.5.4         rtables_0.6.9      
 [9] magrittr_2.0.3      formatters_0.5.9    rstan_2.32.6        StanHeaders_2.32.10
[13] tidybayes_3.0.7     bayesplot_1.11.1    brms_2.21.0         Rcpp_1.0.13        
[17] dreamer_3.1.0       DoseFinding_1.2-1   labelled_2.13.0     gt_0.11.0          
[21] flextable_0.9.6     broom_1.0.6         readxl_1.4.3        boxr_0.3.6         
[25] pander_0.6.5        patchwork_1.3.0     ggprism_1.0.5       gtsummary_2.0.2    
[29] ggthemes_5.1.0      plyr_1.8.9          knitr_1.48          lubridate_1.9.3    
[33] forcats_1.0.0       stringr_1.5.1       dplyr_1.1.4         purrr_1.0.2        
[37] readr_2.1.5         tidyr_1.3.1         tibble_3.2.1        ggplot2_3.5.1      
[41] tidyverse_2.0.0    

loaded via a namespace (and not attached):
  [1] tensorA_0.36.2.1        rstudioapi_0.16.0       jsonlite_1.8.8          TH.data_1.1-2          
  [5] estimability_1.5.1      nloptr_2.1.1            farver_2.1.2            rmarkdown_2.28         
  [9] ragg_1.3.3              vctrs_0.6.5             askpass_1.2.0           htmltools_0.5.8.1      
 [13] distributional_0.5.0    curl_5.2.2              cellranger_1.1.0        pracma_2.4.4           
 [17] parallelly_1.38.0       sandwich_3.1-1          zoo_1.8-12              uuid_1.2-1             
 [21] lifecycle_1.0.4         pkgconfig_2.0.3         Matrix_1.6-5            R6_2.5.1               
 [25] fastmap_1.2.0           rbibutils_2.2.16        numDeriv_2016.8-1.1     digest_0.6.37          
 [29] colorspace_2.1-1        rprojroot_2.0.4         textshaping_0.4.0       fansi_1.0.6            
 [33] timechange_0.3.0        abind_1.4-8             compiler_4.3.1          here_1.0.1             
 [37] fontquiver_0.2.1        withr_3.0.1             backports_1.5.0         inline_0.3.19          
 [41] QuickJSR_1.3.1          pkgbuild_1.4.4          MASS_7.3-60.0.1         openssl_2.2.1          
 [45] equatags_0.2.1          loo_2.8.0               tools_4.3.1             zip_2.3.1              
 [49] glue_1.7.0              mmrm_0.3.12             nlme_3.1-166            rsconnect_1.3.1        
 [53] grid_4.3.1              checkmate_2.3.2         generics_0.1.3          xslt_1.4.5             
 [57] gtable_0.3.5            tzdb_0.4.0              data.table_1.16.0       hms_1.1.3              
 [61] katex_1.4.1             xml2_1.3.6              utf8_1.2.4              pillar_1.9.0           
 [65] ggdist_3.3.2            posterior_1.6.0         splines_4.3.1           lattice_0.22-6         
 [69] survival_3.7-0          tidyselect_1.2.1        fontLiberation_0.1.0    fontBitstreamVera_0.1.1
 [73] arrayhelpers_1.1-0      gridExtra_2.3           V8_5.0.0                stats4_4.3.1           
 [77] xfun_0.47               bridgesampling_1.1-2    matrixStats_1.4.1       stringi_1.8.4          
 [81] yaml_2.3.10             evaluate_1.0.0          codetools_0.2-19        officer_0.6.6          
 [85] gdtools_0.4.0           cli_3.6.3               RcppParallel_5.1.9      xtable_1.8-4           
 [89] systemfonts_1.1.0       Rdpack_2.6.1            munsell_0.5.1           coda_0.19-4.1          
 [93] svUnit_1.0.6            parallel_4.3.1          rstantools_2.4.0        Brobdingnag_1.2-9      
 [97] mvtnorm_1.3-1           scales_1.3.0            rlang_1.1.4             cowplot_1.1.3          
[101] multcomp_1.4-26

Relevant log output

$or_ci
est lcl ucl 
 NA  NA  NA 
attr(,"label")
[1] "Odds Ratio (95% CI)"

$n_tot
n_tot 
 2000 
attr(,"label")
[1] "Total n"

Code of Conduct

Contribution Guidelines

Security Policy

Melkiades commented 2 days ago

For me, the code works. Try to install the most recent versions of the package and let us know! Remember also to use reprex::reprex({#your code}) so we now also what is the expected bug/output

lchensigma commented 2 days ago

Update to newest version.

With sample size of 2000. Not work, Odds ratio and CI are NAs.

library(tern)
#> Loading required package: rtables
#> Loading required package: formatters
#> Loading required package: magrittr
#> 
#> Attaching package: 'rtables'
#> The following object is masked from 'package:utils':
#> 
#>     str
#> Registered S3 method overwritten by 'tern':
#>   method   from 
#>   tidy.glm broom
library(tidyverse)
nex <- 2000 # Number of example rows
dta <- data.frame(
  "rsp" = sample(c(TRUE, FALSE), nex, TRUE),
  "grp" = sample(c("A", "B"), nex, TRUE),
  "f1" = sample(c("a1", "a2"), nex, TRUE),
  "f2" = sample(c("x", "y", "z"), nex, TRUE),
  strata = factor(sample(c("C", "D"), nex, TRUE)),
  stringsAsFactors = TRUE
)

s_odds_ratio(
  df = subset(dta, grp == "A"),
  .var = "rsp",
  .ref_group = subset(dta, grp == "B"),
  .in_ref_col = FALSE,
  .df_row = dta,
  variables = list(arm = "grp", strata = "strata")
)
#> $or_ci
#> est lcl ucl 
#>  NA  NA  NA 
#> attr(,"label")
#> [1] "Odds Ratio (95% CI)"
#> 
#> $n_tot
#> n_tot 
#>  2000 
#> attr(,"label")
#> [1] "Total n"

Created on 2024-09-30 with reprex v2.0.2

with sample size of 200. Works now. OR and CI were populated.

library(tern)
#> Loading required package: rtables
#> Loading required package: formatters
#> Loading required package: magrittr
#> 
#> Attaching package: 'rtables'
#> The following object is masked from 'package:utils':
#> 
#>     str
#> Registered S3 method overwritten by 'tern':
#>   method   from 
#>   tidy.glm broom
library(tidyverse)
nex <- 200 # Number of example rows
dta <- data.frame(
  "rsp" = sample(c(TRUE, FALSE), nex, TRUE),
  "grp" = sample(c("A", "B"), nex, TRUE),
  "f1" = sample(c("a1", "a2"), nex, TRUE),
  "f2" = sample(c("x", "y", "z"), nex, TRUE),
  strata = factor(sample(c("C", "D"), nex, TRUE)),
  stringsAsFactors = TRUE
)

s_odds_ratio(
  df = subset(dta, grp == "A"),
  .var = "rsp",
  .ref_group = subset(dta, grp == "B"),
  .in_ref_col = FALSE,
  .df_row = dta,
  variables = list(arm = "grp", strata = "strata")
)
#> $or_ci
#>       est       lcl       ucl 
#> 0.7336370 0.4203395 1.2804489 
#> attr(,"label")
#> [1] "Odds Ratio (95% CI)"
#> 
#> $n_tot
#> n_tot 
#>   200 
#> attr(,"label")
#> [1] "Total n"

Created on 2024-09-30 with reprex v2.0.2

Melkiades commented 10 hours ago

Thanks for confirming this. I run different cases and there seems to be a mysterious cutoff at nex > 1950. The following is summarizing my findings with examples.


Issue Description

In the first example using the survival::clogit() function as it is used in the {tern} package, when the number of rows exceeds 1950, the model produces NA values in the output. However, the second example from the clogit() function documentation works without issue, even with larger datasets. Below is an explanation and possible solutions.

Reproduction Steps

Example 1: {tern} package

The following example, derived from the {tern} package, fails when nex exceeds 1950 (roughly):

library(tibble)
library(survival)
nex <- 2000
dta <- data.frame(
  "rsp" = sample(c(TRUE, FALSE), size = nex, replace = TRUE),
  "grp" = sample(c("A", "B"), nex, TRUE),
  "f1" = sample(c("a1", "a2"), nex, TRUE),
  "f2" = sample(c("x", "y", "z"), nex, TRUE),
  strata = factor(sample(c("C", "D"), nex, TRUE)),
  stringsAsFactors = TRUE
)

df = subset(dta, grp == "A")
.var = "rsp"
.ref_group = subset(dta, grp == "B")
.df_row = dta
variables = list(arm = "grp", strata = "strata")

ref_grp <- as.character(unique(.ref_group[[variables$arm]]))
trt_grp <- as.character(unique(df[[variables$arm]]))
grp <- stats::relevel(factor(.df_row[[variables$arm]]), ref = ref_grp)

data <- data.frame(
  rsp = .df_row[[.var]],
  grp = grp,
  strata = interaction(.df_row[variables$strata])
) |> as_tibble()

formula <- stats::as.formula("rsp ~ grp + strata(strata)")
survival::clogit(formula = formula, data = data)

table(data)

Example 2: {survival} package example from clogit()

This example from the clogit() function documentation works even with larger datasets:

# Example from clogit() documentation
resp <- levels(logan$occupation)
n <- nrow(logan)
indx <- rep(1:n, length(resp))
logan2 <- data.frame(logan[indx,],
                     id = indx,
                     tocc = factor(rep(resp, each=n)))
logan2$case <- (logan2$occupation == logan2$tocc)

# This works as expected
clogit(case ~ tocc + tocc:education + strata(id), logan2)

# Check data structure
table(logan2)

Cause of the Issue

  1. Sparse Data and Stratification:
    The issue in the first example likely arises from how the strata are defined. With more rows, the interaction(.df_row[variables$strata]) may result in many strata combinations, some of which may have very few or even no events (cases), leading to perfect separation or non-identifiable parameters. This makes the conditional likelihood impossible to estimate, leading to NA values in the model output.

  2. Small or Imbalanced Strata:
    If the response variable (rsp) is perfectly predicted within certain strata (e.g., all TRUE or all FALSE), the model cannot compute meaningful estimates, as the likelihood function becomes undefined for such strata.

  3. Differences Between Examples:
    The second example works because the strata (strata(id)) are well-structured, and each group has sufficient variability within the outcome to avoid these problems. By ensuring a balanced design with repeated measures (via indx), the likelihood estimates remain well-behaved even with large datasets.

Debugging Steps

You can diagnose potential problematic strata by examining the distribution of the response variable within each strata:

# Check the distribution of responses within strata
table(data$strata, data$rsp)

# Check the number of observations per strata
table(data$strata)

Potential Solutions

  1. Reduce the Number of Strata:
    Consider simplifying or collapsing strata to ensure more variability within each stratum, reducing the likelihood of perfect separation or overfitting.

  2. Balance the Dataset:
    Ensure that each stratum has both cases and controls. This can be done by adjusting how the strata are constructed, either by increasing data or reducing the number of strata-defining variables.

  3. Add the method = "approximate" Argument:
    You can try using the method = "approximate" argument in the clogit() function, which uses an approximation method instead of the exact likelihood approach. This can help prevent issues related to perfect separation and sparse strata.

Here is how you can apply this change:

# Modify the clogit() call to use approximate method
survival::clogit(formula = formula, data = data, method = "approximate")

This method may help resolve issues related to convergence and NA values in the output when the dataset becomes large and the stratification leads to small or unbalanced groups.


If you think adding a way to manage the likelihood ties from {tern} function is a good solution for you, I will add it ;) Let me know what you think.