Survival tmle is not clear what it is doing with targeted_times

jlstiles commented 2 years ago

Thanks for a swift reply in advance!!!!!

This concerns Chapter 11 of the tlverse handbook, on survival. It is not clear which time points are being targeted. In essence, the grid used for estimating the hazard needs to be distinguished from the time points of interest. I also think this leads to incorrect hazard loss minimization, because you don't want to penalize the hazard at time points later than the last time point of interest. For example, if I want survival at times 1:5, I will only use data from times 1:5, not fit the hazard on times after time 5. Maybe you'd disagree, but survival at time t depends only on the conditional hazard at times up to t.
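To make that last point concrete, here is a minimal base-R sketch, assuming a discrete-time hazard. The hazard values are made up for illustration and are not estimates from the veteran data. It only shows that survival at times 1:5 does not change when the hazard after time 5 changes; it says nothing about how tmle3 fits the hazard internally.

```r
# Hypothetical discrete-time conditional hazards at times 1:10 (made-up values)
hazard <- c(0.10, 0.08, 0.12, 0.05, 0.07, 0.20, 0.15, 0.30, 0.25, 0.40)

# Discrete-time survival: S(t) = prod over s <= t of (1 - hazard(s))
surv_from_hazard <- function(h) cumprod(1 - h)

surv_all <- surv_from_hazard(hazard)

# Perturb the hazard only at times after 5
hazard_perturbed <- hazard
hazard_perturbed[6:10] <- 0.5
surv_perturbed <- surv_from_hazard(hazard_perturbed)

# Survival at times 1:5 is the same under both hazards
same_up_to_5 <- identical(surv_all[1:5], surv_perturbed[1:5])

# Survival at times 6:10 differs
same_after_5 <- isTRUE(all.equal(surv_all[6:10], surv_perturbed[6:10]))

print(same_up_to_5)
print(same_after_5)
```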

library(tmle3)
library(sl3)
vet_data <- read.csv(
  paste0(
    "https://raw.githubusercontent.com/tlverse/deming2019-workshop/",
    "master/data/veteran.csv"
  )
)
vet_data$trt <- vet_data$trt - 1
# make fewer times for illustration
vet_data$time <- ceiling(vet_data$time / 20)

head(vet_data)
var_types <- list(
  T_tilde = Variable_Type$new("continuous"),
  t = Variable_Type$new("continuous"),
  Delta = Variable_Type$new("binomial")
)

Note: the testsurv function below runs the chapter's code with a given target_times and, depending on cut, either truncates the long data to the target times or leaves it as is.

testsurv = function(target_times, cut) {
  survival_spec <- tmle_survival(
    treatment_level = 1, control_level = 0,
    target_times = target_times,
    variable_types = var_types
  )
  node_list <- list(
    W = c("celltype", "karno", "diagtime", "age", "prior"),
    A = "trt", T_tilde = "time", Delta = "status", id = "X"
  )

  long_data_tuple <- survival_spec$transform_data(vet_data, node_list)
  df_long <- long_data_tuple$long_data

  # optionally drop rows for times after the last target time
  if (cut) df_long <- df_long[df_long$t <= max(target_times), ]

  long_node_list <- long_data_tuple$long_node_list
  lrnr_mean <- make_learner(Lrnr_mean)
  lrnr_glm <- make_learner(Lrnr_glm)
  lrnr_gam <- make_learner(Lrnr_gam)
  sl_A <- Lrnr_sl$new(learners = list(lrnr_mean, lrnr_glm, lrnr_gam))
  learner_list <- list(A = sl_A, N = sl_A, A_c = sl_A)
  tmle_task <- survival_spec$make_tmle_task(df_long, long_node_list)

  set.seed(100)

  initial_likelihood <- survival_spec$make_initial_likelihood(
    tmle_task,
    learner_list
  )

  set.seed(101)

  up <- tmle3_Update_survival$new(
    maxit = 3e1,
    cvtmle = TRUE,
    convergence_type = "scaled_var",
    delta_epsilon = 1e-2,
    fit_method = "l2",
    use_best = TRUE,
    verbose = FALSE
  )

  targeted_likelihood <- Targeted_Likelihood$new(initial_likelihood,
                                                 updater = up
  )
  tmle_params <- survival_spec$make_params(tmle_task, targeted_likelihood)
  tmle_fit_manual <- fit_tmle3(
    tmle_task, targeted_likelihood, tmle_params,
    targeted_likelihood$updater
  )

  list(tmle_fit_manual=tmle_fit_manual, tmle_params=tmle_params)

}
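The cut step in testsurv restricts the long data to times no later than the last target time. Below is a minimal base-R sketch of that idea on a toy table. The three subjects, their values, and the one-row-per-subject-per-time layout are assumptions for illustration, not the actual output of transform_data.

```r
# Toy data: 3 hypothetical subjects with observed time and event indicator
toy <- data.frame(id = 1:3, T_tilde = c(2, 7, 4), Delta = c(1, 0, 1))
max_time <- max(toy$T_tilde)

# Assumed long layout: one row per subject per grid time 1:max_time
grid <- expand.grid(id = toy$id, t = seq_len(max_time))
toy_long <- merge(toy, grid, by = "id")
toy_long <- toy_long[order(toy_long$id, toy_long$t), ]

# Keep only rows at or before the last target time
target_times <- 1:5
toy_cut <- toy_long[toy_long$t <= max(target_times), ]

print(nrow(toy_long))
print(nrow(toy_cut))
```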

First, try to estimate at target times 1:5 while passing in the entire df_long, which holds rows for every patient at times 1:50. This appears to give targeted estimates for all times 1:50, as tmle_params shows (apparently that method ignores the self$options$target_times argument).

ex1 <- testsurv(1:5, cut = FALSE)
# this should just be 5 params according to target_times
ex1$tmle_params

This gives parameter estimates for times 1:50, which is not what was requested.

# it is giving parameters no one asked for
ex1[[1]]$initial_psi
ex1$tmle_fit_manual$estimates[[1]]$psi[1:25]

Next, do the same as before but cut the data so it only contains rows for times 1:5. This should yield answers identical to ex1, but it doesn't, unless we actually want to fit on times after the last time point of interest.

ex2 <- testsurv(1:5, cut = TRUE)
ex2[[1]]$initial_psi
ex2$tmle_params

If ex1 is actually ignoring target_times and just targeting all time points, then ex3 below should yield the same answer as ex1, but it doesn't. The initial estimates match but the targeted estimates do not. So maybe ex1 is only targeting times 1:5 even though the tmle_params are the same? What is this code doing?

ex3 <- testsurv(1:50, cut = FALSE)
ex3[[1]]$initial_psi
ex3$tmle_fit_manual$estimates[[1]]$psi[1:25]
ex3$tmle_params

jlstiles commented 2 years ago

Another observation: what exactly is ex1$tmle_fit_manual$estimates[[1]]$IC? The IC should have n rows (for n = 137 independent patients), but instead it has n * (# time points) = 137 * 50 = 6850 rows. This IC matrix also has 50 columns, one for each of the 50 time points, as opposed to just 5 columns for times 1:5 as target_times specifies. I'm not sure what this IC is, because for time point 1 (assuming its IC is the first column of the matrix) there ought to be only 137 nonzero values, one per patient, for the hazard residuals from time 0 to 1. How could that column have 6850 nonzero entries?

Update: after checking further, it appears this IC matrix is 50 identical ICs stacked on top of each other, one stack for each time point.
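A check along these lines can be sketched in base R. The matrix here is synthetic (random numbers, not output from tmle3), and the block-wise stacking order, with all n patients for the first copy followed by all n for the second and so on, is an assumption about the layout.

```r
set.seed(1)
n <- 137        # independent patients, as in the veteran data
n_times <- 50   # time points on the grid

# Synthetic stand-in for an n x n_times influence-curve matrix
ic_block <- matrix(rnorm(n * n_times), nrow = n, ncol = n_times)

# Mimic the observed shape: the same block stacked once per time point
ic_stacked <- do.call(rbind, replicate(n_times, ic_block, simplify = FALSE))

# Split the rows into consecutive blocks of n and compare each to the first
block_id <- rep(seq_len(n_times), each = n)
first_block <- ic_stacked[block_id == 1, , drop = FALSE]
blocks_identical <- vapply(
  seq_len(n_times),
  function(k) identical(ic_stacked[block_id == k, , drop = FALSE], first_block),
  logical(1)
)

# If every block matches, the first n rows carry all the information
if (all(blocks_identical)) {
  ic_collapsed <- ic_stacked[seq_len(n), , drop = FALSE]
}

print(dim(ic_stacked))
print(dim(ic_collapsed))
```

With a real fit, ic_stacked would be replaced by the IC matrix from the estimates object, and the collapsed matrix would be the one with n rows that I expected in the first place.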

jlstiles commented 2 years ago

One more comment on this: how do we specify a single-epsilon iterative update (as in clfm) versus the multi-epsilon ridge update specified in the updater below versus a recursive one-step update? The ridge idea is very cool, but it is not needed for only a few parameters. Thanks!

up <- tmle3_Update_survival$new(
  maxit = 3e1,
  cvtmle = TRUE,
  convergence_type = "scaled_var",
  delta_epsilon = 1e-2,
  fit_method = "l2",
  use_best = TRUE,
  verbose = FALSE
)

tlverse / tmle3

Survival tmle is not clear what it is doing with targeted_times #82