sambrilleman / simsurv

Simulate Survival Data
GNU General Public License v3.0
22 stars 8 forks source link

Insufficient nodes for simulating from complex hazards #20

Open irtimmins opened 7 months ago

irtimmins commented 7 months ago

Hi, I have problems using simsurv for more complex hazard functions (which I need for cases where the cumulative hazard is not analytically tractable).

A reproducible example is given below. The figure shows an example where the KM survival estimates from simsurv, which are generated from specifying the hazard function, deviate significantly from the true model.

Here, we take a bell shaped hazard function with sharp spike at 2 year timepoint, and we can see that simsurv, which is using numerical integration of the hazard function, does not reproduce the survival function at a large sample of N = 10,000.

In contrast, as shown in the figure, using simsurv and specifying the cumulative hazard (which exists for this example) does reproduce the survival function accurately.

This is due to having too few nodes limited at 15 for the Gauss-Kronrod quadrature. The Stata version of this in survsim does not run into this problem, as nodes can be set e.g. to 100 or 200, which overcomes the problem in this case (though Stata also struggles for nodes = 15).

Can greater flexibility be added to the number of nodes in simsurv?

library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
library(survminer)
library(simsurv)

set.seed(1024771823)

N <- 1e4 # sample for simsurv datasets.
x_cov <- tibble(id = 1:N)

# set parameters to define survival model.
mu <- 2
sd <- 0.1

haz_norm <- function(t, x = NULL, betas = NULL){
  res <- dnorm(x=t, mean = mu, sd = sd)+0.2
  res
}
cumhaz_norm <- function(t, x = NULL, betas = NULL){
  res <- pnorm(q=t, mean = mu, sd = sd)+0.2*t
  res
}

# simulate data using hazard function
sim_data_h <- simsurv(x = x_cov,
                      hazard = haz_norm,
                      nodes = 15)
# extract Kaplan-Meier estimates
km_h <- survfit(Surv(eventtime, status) ~ 1, data=sim_data_h )
km_h_plot <- ggsurvplot(km_h,data=sim_data_h)$data.survplot %>% 
  select(time, surv) %>%
  mutate(simsurv_scale = "hazard",
         alpha = 1,
         linetype = 1)

# simulate data using cumulative hazard function
sim_data_ch <- simsurv(x = x_cov,
                       cumhazard = cumhaz_norm)
# extract Kaplan-Meier estimates
km_ch <- survfit(Surv(eventtime, status) ~ 1, data=sim_data_ch)
km_ch_plot <- ggsurvplot(km_ch,data=sim_data_ch)$data.survplot %>% 
  select(time, surv) %>%
  mutate(simsurv_scale = "cumhazard",
         alpha = 1,
         linetype = 1)

# generate true values of survival function.
true_surv_plot <- tibble(time = seq(from = 0, to = 20, length.out = 1e4)) %>%
  mutate(surv = exp(-(pnorm(q=time, mean = mu, sd = sd)+0.2*time))) %>%
  mutate(simsurv_scale = "true model",
         alpha = 1,
         linetype = 2)

# combine values together into tidy format tibble
plot_df <- rbind(true_surv_plot, km_h_plot, km_ch_plot) %>%
  filter(time <= 20) %>%
  mutate(linetype = as.factor(linetype))

ggplot(aes(x = time, y= surv, 
           colour = simsurv_scale, 
           linetype = simsurv_scale), data = plot_df) +
  theme_classic()+
  geom_step()+
  xlim(c(0,20))+
  ylim(c(0,1))+
  scale_colour_manual(name = "Simsurv scale", values = c("#c51b7d","#3288bd", "#1a1a1a"))+
  scale_linetype_manual(name = "Simsurv scale", values = c("solid","solid","dashed"))+
  ylab("Survival")

simsurv_nodes