therneau / survival

Survival package for R
386 stars 104 forks source link

Bug when using the 'weights' and 'formula' argument inside a function #172

Closed RobinDenz1 closed 2 years ago

RobinDenz1 commented 2 years ago

Dear @therneau,

First of all, thank you very much for all of the work you did on survival analysis and especially for this package.

While conducting a simulation study, I have encountered a bug which I sadly can't seem to fix myself. The problem is as follows: Whenever I pass both weights and a formula to the coxph function inside of another function, I get an error message. I have written a few examples for this.

When I call the coxph function normally, it works fine:

library(survival)

# this works
coxph(Surv(futime, death) ~ age + sex,
      data=mgus,
      x=TRUE,
      weights=runif(n=nrow(mgus)))

Doing everything inside of a function works fine as well:

# this also works
test_fun_1 <- function(data) {

  ps_weights <- runif(n=nrow(data))

  mod <- coxph(Surv(futime, death) ~ age + sex,
               data=data,
               x=TRUE,
               weights=ps_weights)

  return(mod)
}
test_fun_1(data=mgus)

However passing both the formula and the weights does not work (resulting in the error message: Error in eval(extras, data, env) : Object 'ps_weights' not found):

# this does not work
test_fun_2 <- function(cox_formula, data, ps_weights) {

  mod <- coxph(cox_formula,
               data=data,
               x=TRUE,
               weights=ps_weights)

  return(mod)
}
test_fun_2(cox_formula=Surv(futime, death) ~ age + sex,
           data=mgus,
           ps_weights=runif(n=nrow(mgus)))

Passing only the formula and estimating the weights within the function does not work either (also resulting in the error message: Error in eval(extras, data, env) : Object 'ps_weights' not found):

# this does not work
test_fun_3 <- function(cox_formula, data) {

  ps_weights <- runif(n=nrow(data))

  mod <- coxph(cox_formula,
               data=data,
               x=TRUE,
               weights=ps_weights)

  return(mod)
}
test_fun_3(cox_formula=Surv(futime, death) ~ age + sex, data=mgus)

I am not sure what exactly is going on here. The obvious solution would be to not do it inside a function, but this is sadly not an option for me as I need to use parallel processing to make my study feasible computationally (I am using different formulas and weights in each simulation run).

Here is the output from a call to sessionInfo():

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] survival_3.2-13

loaded via a namespace (and not attached):
 [1] magrittr_2.0.1   splines_4.0.3    tidyselect_1.1.0 lattice_0.20-41  R6_2.5.0         rlang_0.4.10    
 [7] foreach_1.5.1    fansi_0.4.2      dplyr_1.0.5      tools_4.0.3      parallel_4.0.3   grid_4.0.3      
[13] snow_0.4-3       utf8_1.2.1       DBI_1.1.1        iterators_1.0.13 ellipsis_0.3.2   assertthat_0.2.1
[19] tibble_3.1.4     lifecycle_1.0.0  crayon_1.4.1     doSNOW_1.0.19    Matrix_1.2-18    purrr_0.3.4     
[25] vctrs_0.3.8      codetools_0.2-16 glue_1.4.2       compiler_4.0.3   pillar_1.6.2     generics_0.1.0  
[31] pkgconfig_2.0.3 

Thank you very much for your help.

With kind regards, Robin

therneau commented 2 years ago

You have been caught by the aggressive weirdness of formulas; the coxph function is just an innocent bystander. (You will find exactly the same behavior if you replace coxph by lm in your examples above.)

You are expecting that when you call coxph inside the function, that the variables in the formula will be found first in the data= argument, then in the set of nested evaluation frames (local function, parent of function, parent of parent, ........ base). That's what any sensible person would expect from a functional language. But 1. formulas fix their worldview at the moment they are created and 2. the model.frame() function treats formulas special: it looks first in the data frame, then in the formula's universe, and No Where Else. Since all modeling functions call model.frame() under the covers, you are hosed: in the formula's world, nothing that was not present at it's own creation even exists.

It reminds me of lines from The Big Chill. Michael : Harold, don't you have any other music , you know, from this century?
Harold : There is no other music, not in my house.

Believe me, I have tried

RobinDenz1 commented 2 years ago

Thank you very much for the detailed explanation. I am sorry to have bothered you with this since it has nothing to do with your R-Package. It is always very interesting to learn some specific details like this. Thanks!