Standard errors for new data exhibit unexpected behavior

juliannaharwood commented 4 months ago

When I use a fitted model to predict survival for duplicate rows of new data, the survival prediction stays the same, but the se.fit varies for each observation:

survival <- predict(model, sample, type = "survival", se.fit = TRUE)
survival$fit
[1] 0.9787033 0.9787033 0.9787033
survival$se.fit
[1] 6.449212e-06 4.529473e-03 5.883216e-06

When I used a fitted model that was trained on data of size m x n to predict new data of size m2 x n2, if m2 > m then the se.fit field of the prediction object is null after m values (meaning there are only m standard errors ever returned for new data).

I'm wondering if this has something to do with this line in the source code of predict.coxph:

se[indx2] <- sqrt(v2-v1)* risk[indx2]

It seems that the risk which is based off of the training data (and thus the length of the training data) is being used to calculate the standard error, which will limit it to the length of the training data.

Thanks in advance!

therneau commented 4 months ago

You have given me nothing to work with. What is "model", what is "sample"? How could I reproduce your issue?

juliannaharwood commented 4 months ago

Hi, thanks for getting back to me. Below is reproducible code to help troubleshoot:

# Generate sample dataset
set.seed(123)
n <- 100000  # Number of observations
time <- rexp(n, rate = 1)  # Generate survival times from exponential distribution
status <- sample(0:1, n, replace = TRUE)  # Generate censoring status (0 = censored, 1 = event occurred)
covariate1 <- rnorm(n)  # Generate a continuous covariate
covariate2 <- c(rep("Group A", n/4), rep("Group B", n/4), rep("Group C", n/4), rep("Group D", n/4)) # Generate factor covariate
covariate3 <- runif(n, min = min(time), max = max(time)) # Generate another continuous covariate to be used to create a tdc covariate

# Create train data
train_data <- data.frame(time = time, status = status, covariate1 = covariate1,
                   covariate2 = covariate2, covariate3 = covariate3)

# Add tdc variable
train_data <- train_data %>% mutate(id = 1:nrow(train_data))
train_data <- tmerge(data1 = train_data, data2 = train_data, id = id, tstop = time)
train_data <- tmerge(
  data1 = train_data, data2 = train_data, id = id, covariate3_ended =
    tdc(covariate3, rep(1, length(train_data$id)), 0)
)
train_data$status <- train_data$status & (train_data$time == train_data$tstop)

# Check head of train data
head(train_data)

# Fit Cox PH model
cox_model <- coxph(Surv(tstart, tstop, status) ~ covariate1 + covariate2 + covariate3 + covariate3_ended, data = train_data, model = TRUE)

# Summary of the model
summary(cox_model)

# Create new data to predict that has all the same covariates and twice the # of observations as train data 
new_data <- data.frame(tstart = rep(0, n*2), tstop = rep(mean(time), n*2), status = rep(TRUE, n*2),
                       covariate1 = rep(mean(covariate1), n*2),
                       covariate2 = rep("Group A", n*2), covariate3 = rep(mean(covariate3), n*2),
                       covariate3_ended = rep(1, n*2))
head(new_data)

# Generate new predictions
new_prediction <- predict(cox_model, new_data, type = "survival", se.fit = TRUE)

# check if predictions and standard errors are the same
head(new_prediction$fit)
head(new_prediction$se.fit)

# check how many standard errors are missing
table(is.na(new_prediction$se.fit))

And sample output from the last 3 lines to demonstrate the unexpected behaviors 1 and 2 from my original message:

> # check if predictions and standard errors are the same
> head(new_prediction$fit)
[1] 0.6039441 0.6039441 0.6039441 0.6039441 0.6039441 0.6039441
> head(new_prediction$se.fit)
[1] 0.005645550 0.005655281 0.005700433 0.005662153 0.005694242 0.005648238
> 
> # check how many standard errors are missing
> table(is.na(new_prediction$se.fit))

 FALSE   TRUE 
109208  90792

therneau commented 4 months ago

Now repaired. Finishing up a new test file.

therneau / survival

Standard errors for new data exhibit unexpected behavior #255