Simulating survival data with exogenous, time-varying covariates

ellessenne commented 2 years ago

Hi Sam,

I hope all is well!

I was wondering, can {simsurv} can simulate survival data with a covariate that is exogenous and time-varying (such a deterministic treatment), something that could be analysed with a Cox model that uses start-stop notation (as in here)?

I tried the following and it did not work:

library(simsurv)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(flexsurv)
#> Loading required package: survival
library(survival)

set.seed(9911)
covs <- expand.grid(id = seq(500), t0 = 0:10)
covs[["t"]] <- covs[["t0"]] + 1
covs[["exo"]] <- stats::rnorm(nrow(covs))
covs <- covs[order(covs$id, covs$t), ]

s1 <- simsurv(
  lambdas = 0.5, gammas = 1, betas = c(exo = -1),
  x = select(covs, id, exo), maxt = 15, ids = unique(covs$id), idvar = "id"
)

df <- merge(covs, s1) %>%
  filter(t0 <= eventtime) %>%
  group_by(id) %>%
  mutate(n = row_number(t0)) %>%
  mutate(last = (n == max(n))) %>%
  ungroup() %>%
  mutate(
    status = ifelse(last, status, 0),
    t = ifelse(last, eventtime, t)
  ) %>%
  select(-last, -n)

coxph(formula = Surv(time = t0, time2 = t, event = status) ~ exo, data = df)
#> Call:
#> coxph(formula = Surv(time = t0, time2 = t, event = status) ~ 
#>     exo, data = df)
#> 
#>         coef exp(coef) se(coef)      z      p
#> exo -0.42799   0.65182  0.05097 -8.397 <2e-16
#> 
#> Likelihood ratio test=69.8  on 1 df, p=< 2.2e-16
#> n= 2266, number of events= 377
flexsurvreg(formula = Surv(time = t0, time2 = t, event = status) ~ exo, data = df, dist = "weibull")
#> Call:
#> flexsurvreg(formula = Surv(time = t0, time2 = t, event = status) ~ 
#>     exo, data = df, dist = "weibull")
#> 
#> Estimates: 
#>        data mean  est     L95%    U95%    se      exp(est)  L95%    U95%  
#> shape      NA     0.3599  0.3307  0.3916  0.0155      NA        NA      NA
#> scale      NA     4.6714  3.5065  6.2234  0.6837      NA        NA      NA
#> exo    0.2235     1.2109  0.9280  1.4939  0.1444  3.3566    2.5294  4.4543
#> 
#> N = 2266,  Events: 377,  Censored: 1889
#> Total time at risk: 2523.275
#> Log-likelihood = -602.6366, df = 3
#> AIC = 1211.273

^{Created on 2022-02-11 by the reprex package (v2.0.1)}

Do you think it would work if I define the hazard function by hand, passing then data frames for betas and x?

Many thanks in advance,

Alessandro

sambrilleman commented 2 years ago

Hi Alessandro

Great to hear from you! Doing well thanks. Hope all is going well at your end!

Hmm, yeah I don't think that will work, since I think from distant memory that it will try and integrate each row as starting from time 0.

I think you'd require the numeric integration, which would only happen in two situations I think:

when you specify tde / tdefunction
when you specify your own user-defined hazard function, cum hazard, etc.

Usually 1. is for a time-dependent effect, since it interacts your coefficient with a function of t. I wondered if there is a way for you to potentially trick that tdefunction interaction into getting the value you need for your time-varying covariate (rather than time varying coefficient), since the interaction term is just of the form tde_beta * x * tde_function. But unfortunately I don't think it's flexible for your use case. I think tdefunction is constrained to one function at the most, and it only takes one argument t, so probably not flexible enough for you to generate a time-varying covariate using it (i.e. interacted with an x or set of x's).

So, yeah, I think you'd have to go the user-defined (cumulative) hazard function route. Which is a bit more work, but should do what you need if defined in the right way.

HTH.

Cheers, Sam.

ellessenne commented 2 years ago

Hi Sam, All is well here too!

First of all, thanks for your detailed answer! I did suspect it was not going to be straightforward...

I found a paper that derives closed-form solutions for specific patterns of time-varying covariates, one of which fits with the scenario I am trying to simulate, so I think that'll be straightforward!

The next step is to include all of that in a joint modelling setting, but for that, I think combining numerical integration and root finding with the analytical results above should do. Fitting that joint model with a time-varying endogenous covariate + the biomarker will be a different story 😅

Anyway, massive thanks again for the feedback! Cheers,

Alessandro

sambrilleman / simsurv

Simulating survival data with exogenous, time-varying covariates #15