therneau / survival

Survival package for R
381 stars 104 forks source link

error when tstart != 0 #260

Closed juliannaharwood closed 3 months ago

juliannaharwood commented 3 months ago

Hello,

In the latest version there was an error handler added to predict.coxph that seems to be thrown whenever tstart != 0. I am using time dependent covariates in my model training, and am also interested in making predictions like "what is the probability of survival at time Y given the observation has survived until time X." I now get errors for these use cases. Here is a reproducible example where the model can't make predictions for the data it was trained on:

# Generate sample dataset
set.seed(123)
n <- 100000  # Number of observations
time <- rexp(n, rate = 1)  # Generate survival times from exponential distribution
status <- sample(0:1, n, replace = TRUE)  # Generate censoring status (0 = censored, 1 = event occurred)
covariate1 <- rnorm(n)  # Generate a continuous covariate
covariate2 <- c(rep("Group A", n/4), rep("Group B", n/4), rep("Group C", n/4), rep("Group D", n/4)) # Generate factor covariate
covariate3 <- runif(n, min = min(time), max = max(time)) # Generate another continuous covariate to be used to create a tdc covariate

# Create train data
train_data <- data.frame(time = time, status = status, covariate1 = covariate1,
                         covariate2 = covariate2, covariate3 = covariate3)

# Add tdc variable
train_data <- train_data %>% mutate(id = 1:nrow(train_data))
train_data <- tmerge(data1 = train_data, data2 = train_data, id = id, tstop = time)
train_data <- tmerge(
  data1 = train_data, data2 = train_data, id = id, covariate3_ended =
    tdc(covariate3, rep(1, length(train_data$id)), 0)
)
train_data$status <- train_data$status & (train_data$time == train_data$tstop)

# Check head of train data
head(train_data)

# Fit Cox PH model
cox_model <- coxph(Surv(tstart, tstop, status) ~ covariate1 + covariate2 + covariate3 + covariate3_ended, data = train_data, model = TRUE)

# Summary of the model
summary(cox_model)

# Generate new predictions
new_prediction <- predict(cox_model, train_data, type = "survival", se.fit = TRUE)

Error:

Error in predict.coxph(cox_model, train_data, type = "survival", se.fit = TRUE) : 
  predicted survival must be from the start of the curve

And session info:

R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.5

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.1    survival_3.7-0

loaded via a namespace (and not attached):
 [1] lattice_0.20-45  here_1.0.1       fansi_1.0.3      utf8_1.2.2       rprojroot_2.0.3  grid_4.2.1       R6_2.5.1         lifecycle_1.0.4 
 [9] magrittr_2.0.3   pillar_1.8.1     rlang_1.1.2      cli_3.6.1        rstudioapi_0.13  Matrix_1.4-1     vctrs_0.6.5      generics_0.1.3  
[17] splines_4.2.1    tools_4.2.1      glue_1.6.2       compiler_4.2.1   pkgconfig_2.0.3  tidyselect_1.2.0 tibble_3.2.1    
therneau commented 3 months ago

The prediction routine had updates due to an error in some cases, including addition of a further set of tests (tests/predsurv.R) to further verify correctness. You are correct that I added the new check, but I had not thought through the case of conditional survival. You can get what you want via exp(predict(..., type="expected")).

I have a solid use case for per-subject expected hazard, each id with unique covariates and time point (it is the 'E' part of the martingale residual) and also for per subject expected and survival with per id covariates and a single common time point tau for all. But I have never found one for assessing each subject's predicted survival at a different time point for each. I would be very curious to learn the back story behind your example. Assuming that there was no use case is part of the genesis of that error message.

therneau commented 3 months ago

The error message has been removed, and more detail added to the help file

juliannaharwood commented 2 months ago

Thank you for updating to accommodate this use case! The example code I gave above wasn't reflective of my actual use case, it was just the easiest way I could think to generate the error. My use case involves a subscription product where I want to measure a subscriber's likelihood of unsubscribing at a given time T. I don't always observe a subscriber at T = 0, so if I observe them at t > 0, I want to know p(unsubscribed at T | subscribed at t). Is that a valid use case for this package?

therneau commented 2 months ago

Yes, individual prediction is a valid use case. I was thinking of model validation, where the "bulk" results from everyone at individual times is not interesting. Thanks for the reply.