Closed lchensigma closed 1 month ago
For me, the code works. Try installing the most recent versions of the packages and let us know! Remember also to use reprex::reprex({#your code})
so we also know what the expected bug/output is.
Updated to the newest version.
With a sample size of 2000 it does not work: the odds ratio and CI are NAs.
library(tern)
#> Loading required package: rtables
#> Loading required package: formatters
#> Loading required package: magrittr
#>
#> Attaching package: 'rtables'
#> The following object is masked from 'package:utils':
#>
#> str
#> Registered S3 method overwritten by 'tern':
#> method from
#> tidy.glm broom
library(tidyverse)
nex <- 2000 # Number of example rows
dta <- data.frame(
"rsp" = sample(c(TRUE, FALSE), nex, TRUE),
"grp" = sample(c("A", "B"), nex, TRUE),
"f1" = sample(c("a1", "a2"), nex, TRUE),
"f2" = sample(c("x", "y", "z"), nex, TRUE),
strata = factor(sample(c("C", "D"), nex, TRUE)),
stringsAsFactors = TRUE
)
s_odds_ratio(
df = subset(dta, grp == "A"),
.var = "rsp",
.ref_group = subset(dta, grp == "B"),
.in_ref_col = FALSE,
.df_row = dta,
variables = list(arm = "grp", strata = "strata")
)
#> $or_ci
#> est lcl ucl
#> NA NA NA
#> attr(,"label")
#> [1] "Odds Ratio (95% CI)"
#>
#> $n_tot
#> n_tot
#> 2000
#> attr(,"label")
#> [1] "Total n"
Created on 2024-09-30 with reprex v2.0.2
With a sample size of 200 it works: the OR and CI are populated.
library(tern)
#> Loading required package: rtables
#> Loading required package: formatters
#> Loading required package: magrittr
#>
#> Attaching package: 'rtables'
#> The following object is masked from 'package:utils':
#>
#> str
#> Registered S3 method overwritten by 'tern':
#> method from
#> tidy.glm broom
library(tidyverse)
nex <- 200 # Number of example rows
dta <- data.frame(
"rsp" = sample(c(TRUE, FALSE), nex, TRUE),
"grp" = sample(c("A", "B"), nex, TRUE),
"f1" = sample(c("a1", "a2"), nex, TRUE),
"f2" = sample(c("x", "y", "z"), nex, TRUE),
strata = factor(sample(c("C", "D"), nex, TRUE)),
stringsAsFactors = TRUE
)
s_odds_ratio(
df = subset(dta, grp == "A"),
.var = "rsp",
.ref_group = subset(dta, grp == "B"),
.in_ref_col = FALSE,
.df_row = dta,
variables = list(arm = "grp", strata = "strata")
)
#> $or_ci
#> est lcl ucl
#> 0.7336370 0.4203395 1.2804489
#> attr(,"label")
#> [1] "Odds Ratio (95% CI)"
#>
#> $n_tot
#> n_tot
#> 200
#> attr(,"label")
#> [1] "Total n"
Created on 2024-09-30 with reprex v2.0.2
Thanks for confirming this. I ran several cases and there seems to be a mysterious cutoff at around nex > 1950. Below is a summary of my findings with examples.
In the first example, which uses survival::clogit() the way the {tern} package does, the model produces NA values in the output when the number of rows exceeds about 1950. However, the second example, taken from the clogit() function documentation, works without issue, even with larger datasets. Below is an explanation and possible solutions.
{tern} package
The following example, derived from the {tern} package, fails when nex exceeds roughly 1950:
library(tibble)
library(survival)
nex <- 2000
dta <- data.frame(
"rsp" = sample(c(TRUE, FALSE), size = nex, replace = TRUE),
"grp" = sample(c("A", "B"), nex, TRUE),
"f1" = sample(c("a1", "a2"), nex, TRUE),
"f2" = sample(c("x", "y", "z"), nex, TRUE),
strata = factor(sample(c("C", "D"), nex, TRUE)),
stringsAsFactors = TRUE
)
df = subset(dta, grp == "A")
.var = "rsp"
.ref_group = subset(dta, grp == "B")
.df_row = dta
variables = list(arm = "grp", strata = "strata")
ref_grp <- as.character(unique(.ref_group[[variables$arm]]))
trt_grp <- as.character(unique(df[[variables$arm]]))
grp <- stats::relevel(factor(.df_row[[variables$arm]]), ref = ref_grp)
data <- data.frame(
rsp = .df_row[[.var]],
grp = grp,
strata = interaction(.df_row[variables$strata])
) |> as_tibble()
formula <- stats::as.formula("rsp ~ grp + strata(strata)")
survival::clogit(formula = formula, data = data)
table(data)
{survival} package example from clogit()
This example from the clogit() function documentation works even with larger datasets:
# Example from clogit() documentation
resp <- levels(logan$occupation)
n <- nrow(logan)
indx <- rep(1:n, length(resp))
logan2 <- data.frame(logan[indx,],
id = indx,
tocc = factor(rep(resp, each=n)))
logan2$case <- (logan2$occupation == logan2$tocc)
# This works as expected
clogit(case ~ tocc + tocc:education + strata(id), logan2)
# Check data structure
table(logan2)
Sparse Data and Stratification:
The issue in the first example likely arises from how the strata are defined. With more rows, interaction(.df_row[variables$strata]) may result in many strata combinations, some of which may have very few or even no events (cases), leading to perfect separation or non-identifiable parameters. This makes the conditional likelihood impossible to estimate, which produces NA values in the model output.
Small or Imbalanced Strata:
If the response variable (rsp) is perfectly predicted within certain strata (e.g., all TRUE or all FALSE), the model cannot compute meaningful estimates, as the likelihood function becomes undefined for such strata.
Differences Between Examples:
The second example works because the strata (strata(id)) are well-structured, and each group has sufficient variability within the outcome to avoid these problems. By ensuring a balanced design with repeated measures (via indx), the likelihood estimates remain well-behaved even with large datasets.
You can diagnose potentially problematic strata by examining the distribution of the response variable within each stratum:
# Check the distribution of responses within strata
table(data$strata, data$rsp)
# Check the number of observations per strata
table(data$strata)
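The two tables above can be combined into a single per-stratum summary. The following is a minimal sketch, assuming a logical response and a factor of strata; check_strata() is a hypothetical helper name, not a function from {tern} or {survival}. Note that the simulated data in this thread has only two strata (C and D), so a check like this can help confirm or rule out sparsity as the cause there.

```r
# Hypothetical helper (not part of {tern} or {survival}): summarise each
# stratum and flag those that contain only one outcome class.
check_strata <- function(rsp, strata) {
  # Fix the column order so both outcome classes always appear
  tab <- table(strata, factor(rsp, levels = c(FALSE, TRUE)))
  data.frame(
    stratum = rownames(tab),
    n = as.vector(rowSums(tab)),
    n_false = as.vector(tab[, "FALSE"]),
    n_true = as.vector(tab[, "TRUE"]),
    one_class_only = as.vector(tab[, "FALSE"] == 0 | tab[, "TRUE"] == 0)
  )
}

# Toy data: stratum C has both outcomes, stratum D has only TRUE
toy <- data.frame(
  rsp = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  strata = factor(c("C", "C", "C", "D", "D"))
)
res <- check_strata(toy$rsp, toy$strata)
print(res)
```

Strata flagged with one_class_only = TRUE contain only cases or only controls, which is the situation described above.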
Reduce the Number of Strata:
Consider simplifying or collapsing strata to ensure more variability within each stratum, reducing the likelihood of perfect separation or overfitting.
Balance the Dataset:
Ensure that each stratum has both cases and controls. This can be done by adjusting how the strata are constructed, either by increasing data or reducing the number of strata-defining variables.
Add the method = "approximate" Argument:
You can try using the method = "approximate" argument in the clogit() function, which uses an approximation instead of the exact likelihood approach. This can help prevent issues related to perfect separation and sparse strata.
Here is how you can apply this change:
# Modify the clogit() call to use approximate method
survival::clogit(formula = formula, data = data, method = "approximate")
This method may help resolve issues related to convergence and NA values in the output when the dataset becomes large and the stratification leads to small or unbalanced groups.
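One way this could be wired up is to try the exact method first and refit only when it fails. The sketch below is an assumption, not the {tern} implementation: clogit_with_fallback() is a hypothetical wrapper, and it assumes that a failed exact fit returns NA coefficients rather than raising an error. The toy data uses 200 rows, where the exact method is reported to work in this thread.

```r
library(survival)

# Hypothetical wrapper (an assumption, not the {tern} implementation):
# fit with the exact method first and refit with method = "approximate"
# if any coefficient comes back as NA.
clogit_with_fallback <- function(formula, data) {
  fit <- survival::clogit(formula = formula, data = data, method = "exact")
  if (anyNA(stats::coef(fit))) {
    fit <- survival::clogit(formula = formula, data = data, method = "approximate")
  }
  fit
}

set.seed(123)
n <- 200
toy <- data.frame(
  rsp = sample(c(TRUE, FALSE), n, TRUE),
  grp = factor(sample(c("A", "B"), n, TRUE)),
  strata = factor(sample(c("C", "D"), n, TRUE))
)

fit <- clogit_with_fallback(rsp ~ grp + strata(strata), toy)
exp(coef(fit)) # estimated odds ratio for grp B relative to grp A
```

Whether a silent fallback is desirable is a design choice; returning the method that was actually used alongside the fit would make the result easier to audit.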
If you think adding a way to control how likelihood ties are handled from the {tern} function is a good solution for you, I will add it ;) Let me know what you think.
Thank you. I think it may be due to too many ties, and the approximate method can solve the issue here.
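To give a rough sense of why many ties are costly for the exact method: for a stratum with n subjects and d events, the exact conditional likelihood has a denominator that sums over every way of choosing d subjects out of n. The sketch below only illustrates how quickly that count grows; n_terms() and log10_terms() are hypothetical helper names, and this is not a confirmed explanation of the cutoff near 1950 rows.

```r
# Number of subsets in the exact conditional likelihood denominator for a
# stratum with n subjects and d events.
n_terms <- function(n, d) choose(n, d)

# Same count on the log10 scale, which stays finite for large strata
log10_terms <- function(n, d) lchoose(n, d) / log(10)

n_terms(20, 10)        # 184756 terms for a small stratum
log10_terms(100, 50)   # order of magnitude with about 100 rows per stratum
log10_terms(1000, 500) # order of magnitude with about 1000 rows per stratum
```

With two strata and a roughly 50% response rate, nex <- 2000 gives strata of about 1000 rows with about 500 events each, so the count is close to the largest value a double can represent.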
What happened?
A bug happened!
Changing nex <- 200 makes the code work.
sessionInfo()
Relevant log output