Closed juliannaharwood closed 4 months ago
You have given me nothing to work with. What is "model", what is "sample"? How could I reproduce your issue?
Hi, thanks for getting back to me. Below is reproducible code to help troubleshoot:
# Generate sample dataset
set.seed(123)
n <- 100000 # Number of observations
time <- rexp(n, rate = 1) # Generate survival times from exponential distribution
status <- sample(0:1, n, replace = TRUE) # Generate censoring status (0 = censored, 1 = event occurred)
covariate1 <- rnorm(n) # Generate a continuous covariate
covariate2 <- c(rep("Group A", n/4), rep("Group B", n/4), rep("Group C", n/4), rep("Group D", n/4)) # Generate factor covariate
covariate3 <- runif(n, min = min(time), max = max(time)) # Generate another continuous covariate to be used to create a tdc covariate
# Create train data
train_data <- data.frame(time = time, status = status, covariate1 = covariate1,
covariate2 = covariate2, covariate3 = covariate3)
# Add tdc variable
train_data <- train_data %>% mutate(id = 1:nrow(train_data))
train_data <- tmerge(data1 = train_data, data2 = train_data, id = id, tstop = time)
train_data <- tmerge(
data1 = train_data, data2 = train_data, id = id, covariate3_ended =
tdc(covariate3, rep(1, length(train_data$id)), 0)
)
train_data$status <- train_data$status & (train_data$time == train_data$tstop)
# Check head of train data
head(train_data)
# Fit Cox PH model
cox_model <- coxph(Surv(tstart, tstop, status) ~ covariate1 + covariate2 + covariate3 + covariate3_ended, data = train_data, model = TRUE)
# Summary of the model
summary(cox_model)
# Create new data to predict that has all the same covariates and twice the # of observations as train data
new_data <- data.frame(tstart = rep(0, n*2), tstop = rep(mean(time), n*2), status = rep(TRUE, n*2),
covariate1 = rep(mean(covariate1), n*2),
covariate2 = rep("Group A", n*2), covariate3 = rep(mean(covariate3), n*2),
covariate3_ended = rep(1, n*2))
head(new_data)
# Generate new predictions
new_prediction <- predict(cox_model, new_data, type = "survival", se.fit = TRUE)
# check if predictions and standard errors are the same
head(new_prediction$fit)
head(new_prediction$se.fit)
# check how many standard errors are missing
table(is.na(new_prediction$se.fit))
And sample output from the last 3 lines to demonstrate the unexpected behaviors 1 and 2 from my original message:
> # check if predictions and standard errors are the same
> head(new_prediction$fit)
[1] 0.6039441 0.6039441 0.6039441 0.6039441 0.6039441 0.6039441
> head(new_prediction$se.fit)
[1] 0.005645550 0.005655281 0.005700433 0.005662153 0.005694242 0.005648238
>
> # check how many standard errors are missing
> table(is.na(new_prediction$se.fit))
FALSE TRUE
109208 90792
Now repaired. Finishing up a new test file.
se.fit
varies for each observation:m x n
to predict new data of sizem2 x n2
, ifm2 > m
then these.fit
field of the prediction object is null afterm
values (meaning there are onlym
standard errors ever returned for new data).I'm wondering if this has something to do with this line in the source code of
predict.coxph
:It seems that the
risk
which is based off of the training data (and thus the length of the training data) is being used to calculate the standard error, which will limit it to the length of the training data.Thanks in advance!