therneau / survival

Survival package for R
381 stars 104 forks source link

model.frame and model.matrix are not available when a formula is passed to coxph (or survreg) #231

Closed philsf closed 4 months ago

philsf commented 1 year ago

Hello! I came across what appears to be an idiosyncrasy of survival model functions (both coxph and survreg, but I only show it for the previous for simplicity) when compared with base functions like lm and glm.

It appears that coxph (and survreg FTM) do not include model.frame and model.matrix in the same (predictable?) way as lm and glm. I'm not sure of these differences are inherent to how survival models differ from linear and generalized models. Would it be possible to implement those outputs in a way similar to how the base functions work? If so, would you consider including this for a future milestone?

First of all, the issue only appears when I pass a formula to the model function (as opposed to specifying it explicitly in the formula).

Reproducible code:

library(tidyverse)
library(survival)

> f <- Surv(time, status) ~ factor(sex)
> cancer %>% coxph(f, data=.) %>% is.list() # model works
[1] TRUE

> cancer %>% coxph(f, data=.) %>% model.frame() # no model.frame
Error in eval(temp, environment(formula$terms), parent.frame()) : 
  object '.' not found

> cancer %>% coxph(f, data=.) %>% model.matrix() # no model.matrix
Error in eval(temp, environment(formula$terms), parent.frame()) : 
  object '.' not found

When the formula is passed directly to the function (as below), everything works as expected. This is how I will work around the issue for the time being for the analysis I am currently working on.

> cancer %>% coxph(Surv(time, status) ~ factor(sex), data=.) %>% model.frame() %>% is.data.frame()
[1] TRUE
> cancer %>% coxph(Surv(time, status) ~ factor(sex), data=.) %>% model.matrix() %>% is.matrix()
[1] TRUE

Why it appears to me as an idiosyncrasy? Because it works for lm()...

# lm
> f1 <- time ~ factor(sex)
> cancer %>% lm(f1, data=.) %>% model.frame() %>% is.data.frame()
[1] TRUE
> cancer %>% lm(f1, data=.) %>% model.matrix() %>% is.matrix()
[1] TRUE

and for glm().

# glm
> f2 <- status-1 ~ factor(sex)
> cancer %>% glm(f2, family = binomial, data=.) %>% model.frame() %>% is.data.frame()
[1] TRUE
> cancer %>% glm(f2, family = binomial, data=.) %>% model.matrix() %>% is.matrix()
[1] TRUE

Context I came across this issue while fitting many Cox models with purrr::map() and trying to pass them to gtsummary::tbl_regression() for formatting. That didn't work and while investigating why, I found an old issue reported in that package, which prompted me to investigate what was being passed to model.frame() and model.matrix(). So here I am :)

PS: I don't think this is a duplicate of #11, but feel free to correct me if I'm wrong.

philsf commented 1 year ago

Correction I was having trouble using cox.zph inside map calls (as opposed to calling tbl_regression).

Error in `mutate()`:
ℹ In argument: `res = map(model, cox.zph)`.
ℹ In group 1: `dataset = "cc"`.
Caused by error in `map()`:
ℹ In index: 1.
Caused by error:
! object '.x' not found
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `mutate()`:
ℹ In argument: `res = map(model, cox.zph)`.
ℹ In group 1: `dataset = "cc"`.
Caused by error in `map()`:
ℹ In index: 1.
Caused by error:
! object '.x' not found
---
Backtrace:
     ▆
  1. ├─... %>% mutate(summ = map(model, glance))
  2. ├─dplyr::mutate(., summ = map(model, glance))
  3. ├─dplyr::mutate(., res = map(model, cox.zph))
  4. ├─dplyr:::mutate.data.frame(., res = map(model, cox.zph))
  5. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  6. │   ├─base::withCallingHandlers(...)
  7. │   └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
  8. │     └─mask$eval_all_mutate(quo)
  9. │       └─dplyr (local) eval()
 10. ├─purrr::map(model, cox.zph)
 11. │ └─purrr:::map_("list", .x, .f, ..., .progress = .progress)
 12. │   ├─purrr:::with_indexed_errors(...)
 13. │   │ └─base::withCallingHandlers(...)
 14. │   ├─purrr:::call_with_cleanup(...)
 15. │   └─survival (local) .f(.x[[i]], ...)
 16. │     └─survival:::coxph.getdata(...)
 17. │       ├─stats::model.frame(fit)
 18. │       └─survival:::model.frame.coxph(fit)
 19. │         └─base::eval(temp, environment(formula$terms), parent.frame())
 20. │           └─base::eval(temp, environment(formula$terms), parent.frame())
 21. ├─stats::model.frame(...)
 22. └─stats::model.frame.default(...)
 23.   └─base::is.data.frame(data)

After I moved the formula to the coxph function explicitly, the map call works as expected.

therneau commented 4 months ago

The simple answer is to add model=TRUE to the coxph call. In Splus the default for nearly all function was model=TRUE. That changed in R at some point (though not right away). It is one aspect of the debate about what to include and not include in a fit result, vs reproduce later. Do you carry around lots of stuff that we "might" need? I still vote for the smaller output, but increases in memory and storage are making that less compelling. If you call functions inside of functions then the reconstruction of the model frame can get confused. The exact conditions under which that happens would make a small book, the short answer is 'has the call chain has sufficiently modified the calling tree". (I sometimes think that the tidyverse treats such manipulations as a virtue. ) One day another will inherit the maintainance of survival, and the default may change.