therneau / survival

Survival package for R
388 stars 104 forks source link

Scaled Schoenfeld residuals not affected by robust variance estimation #161

Open Irina-Sv opened 3 years ago

Irina-Sv commented 3 years ago

Good day!

I have noticed that scaled Schoenfeld residuals and variance estimates from cox.zph differ between version 3.1-11 and 2.41-3 of Survival and that it seem to depend on the use of robust estimation of variance.

Is it that in version 3.1-11 and 3.2-13 of survival, cox.zph does not scale the Schoenfeld residuals by robust variance estimation of the parameters? We would be very grateful for any feedback about this.

As an example, consider the following example data:

  # --- Install Survival version 2.41-3 (latest version in our environment is 3.1-11) ---
  # library(devtools)
  # 
  # #install_version("survival", version = "2.41-3", lib = "999_temp_pkg", repos = "http://cran.us.r-project.org")
  # 

  #--- Load the diabetic data (can be found in survival version 3.1-11) ---
  data(diabetic, package = "survival")
  juvenile <- 1*(diabetic$age < 20)

  #--- Load Survival version 3.1-11 ---
  library(survival) 

  # Model with robust variance estimation 
  fit_hat_newer <- coxph(Surv(time, status) ~ trt + juvenile, cluster = id, data= diabetic)
  A_hat_newer <- cox.zph(fit_hat_newer, terms = F)
  # Model without robust variance estimation 
  fit_newer <- coxph(Surv(time, status) ~ trt + juvenile, data= diabetic)
  A_newer <- cox.zph(fit_newer, terms = F)

  #--- Load Survival version 2.41-3 ---
  detach("package:survival", unload = TRUE)  
  library(survival, lib.loc = "999_temp_pkg") # Version 2.41-3

  # Model with robust variance estimation 
  fit_hat_older <- coxph(Surv(time, status) ~ trt + juvenile + cluster(id), data= diabetic)
  A_hat_older <- cox.zph(fit_hat_older)
  # Model without robust variance estimation 
  fit_older <- coxph(Surv(time, status) ~ trt + juvenile, data= diabetic)
  A_older <- cox.zph(fit_older)

  #--- Compare estimates ---
  # A_hat_newer: Returned output from the newer version of cox.zph with robust variance estimation
  # A_hat_older: Returned output from the older version of cox.zph with robust variance estimation
  # A_newer: Returned output from the newer version of cox.zph without robust variance estimation
  # A_older: Returned output from the newer version of cox.zph without robust variance estimation

  print(max(abs(A_hat_newer$y - A_newer$y))) # =0.00 Same estimates
  print(max(abs(A_hat_newer$var - A_newer$var)))  #=0.00 Same estimates
  # For the newer version, Scaled Schoenfeld residuals are the same when:
  # robust variance estimation is performed
  # robust variance estimation is not performed

  print(max(abs(A_hat_older$y - A_older$y))) #=0.99 Different estimates
  print(max(abs(A_hat_older$var - A_older$var))) #=0.01 Different estimates
  # For the older version, Scaled Schoenfeld residuals are different when:
  # robust variance estimation is performed
  # robust variance estimation is not performed

  print(max(abs(A_hat_newer$y - A_older$y))) #=1e-14 Almost the same estimates
  # The Scaled Schoenfeld residuals are similar between:
  # estimates from the newer version of survival with robust estimation
  # estimates from the older version of survival without robust estimation  
therneau commented 3 years ago

You are correct, and this is a problem.
Older versions of cox.zph used an approximate score test, which is quite good 99% of the time, but could give seriously biased answers if there was a strong strata* covariate interplay, e.g., a covariate with variance V in one strata and 0 in another strata. Think of an analysis stratified by institution, a factor variable "X" as a covariate, and some of the institutions only enrolled a subset of the levels; some of the dummy variables for X will have this pattern. This had been noticed and critiqued many years ago, and I finally addressed and added some serious test cases. To get both correctness and speed it uses C code.

What I didn't think through was that the approximate method 'inherited' a robust variance, automatically, but I now need to handle this explicitly. Damn. My test suite didn't have zph + robust variance case. First I need to work out exactly what the proper mathematics is, however.

Irina-Sv commented 3 years ago

Thank you so much for the quick and clear reply, and thank you for maintaining such an amazing R package!!

therneau commented 2 years ago

I have been very slow to attach this issue, but I've finally rolled up my sleeves and carefully looked at it. That started with adding all the math detail to the vignette. I now see that I need to create special code for the dfbeta residuals. I've started on it.