therneau / survival

Survival package for R
394 stars 106 forks source link

interval 2 kaplan meir returns strange confidence intervals #242

Closed Dpananos closed 9 months ago

Dpananos commented 9 months ago

I'm simulating an interval censored regression problem with the following code

library(tidyverse) 
library(survival)
n <-  10000

rate <- 1/10
raw_time <- rexp(n, rate=rate)

interval_start <- ifelse(floor(raw_time)>=21, 21, floor(raw_time))
interval_end <- ifelse(ceiling(raw_time) > 21, Inf, ceiling(raw_time))

d <- tibble(interval_start, interval_end) %>% 
  count(across(everything()), name='wt')

s <- survfit(Surv(time=interval_start, time2=interval_end, type='interval2') ~ 1, data=d, weights = wt)
plot(s)

Created on 2024-01-26 with reprex v2.0.2

In this particular example, the confidence intervals look sensible. However, rerunning this code often returns confidence intervals which look too wide. See below.

Created on 2024-01-26 with reprex v2.0.2

In fact, the much wider confidence intervals are the norm rather than the exception. Why might this be?

therneau commented 9 months ago

Interval censoring uses the Turnbull algorithm, which by default uses a robust variance. This was incorrect in the help file, I will fix that. With a robust variance the code assumes you have 21 observations (rows in the data set), not 10 thousand (sum of weights).

ghost commented 9 months ago

Not sure I understand. So this is expected behaviour? Would it be better not to use the weights then?

therneau commented 9 months ago

There are 2 kinds of weights. A replication weight of '5' means that there are 5 actual observations, all the same, and we have collapsed them into a single row of data to save space. When I started my career computers were much smaller and replication weights were a common trick to fit things into memory. (The first machine I used seriously was an IBM 1130; it had 512 K of disk space, and 2 kilobytes of total memory that had to hold your program, data, and the operating system). I haven't seen replication weights, in real data, in at least 2 decades. Sampling weights state state the influence that a given observation should have, but there is only one obs. They arise out of survey sampling, inverse probability of censoring weights, etc. and are usually not integers. The robust variance computation assumes sampling weights. The KM, for interval censored data, uses the robust variance by default. Why it does so is a statistical discussion and could be argued either way. You can override this by explicitly adding robust=FALSE to the call.