therneau / survival

Survival package for R
survfit.coxph and missing values in newdata #137

Closed topepo closed 3 years ago

topepo commented 3 years ago

It looks like na.exclude() doesn't pad the results of survfit.coxph() with NA values (in the resulting matrix). This happens with or without strata.

library(survival)
# survival    * 3.2-7      2020-09-28 [1] CRAN (R 4.0.3) 

mod <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung,
             na.action = na.exclude)

# lung$ph.ecog[15] is NA
new_x <- lung[1:15, c("ph.ecog", "age")]

length(predict(mod, new_x, na.action = na.exclude))
#> [1] 15

surv_estimates <- survfit(mod, newdata = new_x,
                          na.action = na.exclude)
dim(surv_estimates$surv)
#> [1] 185  14

Created on 2021-03-09 by the reprex package (v1.0.0.9000)

therneau commented 3 years ago

Why would it? Survfit returns a survival curve, not a per-subject value.

topepo commented 3 years ago

It's very difficult to pad the values so that the expected dimensions are correct (regardless of the output type/format).

I think it is reasonable to expect results (of any kind) for all 15 samples when that na.action is used. At a minimum, the current behavior is unexpected given what na.exclude does (in spirit or literally).

Would you be open to a PR that implements this behavior?

therneau commented 3 years ago

I still have no idea what you are talking about. For operations that return a value per subject, na.action plays a role. But that is not what a survival curve is. You need to create a concrete example of what you want to do.

topepo commented 3 years ago

In the example above, I'd like na.exclude() to pad the results with missing values as it does in the other situations. In other words, the results for surv_estimates would be 185x15 with a column of NA values for the row of newdata that had the missing value.
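
For readers who need the padded shape today, the requested layout can be approximated on the caller's side. The following is only a sketch under stated assumptions, not behavior of the survival package: `ok` and `padded` are hypothetical names introduced here, and the NA column is built by hand rather than by any na.action.

```r
library(survival)

mod <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung,
             na.action = na.exclude)
new_x <- lung[1:15, c("ph.ecog", "age")]

# Rows of newdata whose predictors are all observed
ok <- complete.cases(new_x)

# Curves are computed only for the complete rows
fit <- survfit(mod, newdata = new_x[ok, , drop = FALSE])

# Hand-built padding (an assumption of this sketch, not a package feature):
# one column per row of newdata, all-NA where a predictor was missing
padded <- matrix(NA_real_, nrow = nrow(fit$surv), ncol = nrow(new_x))
padded[, ok] <- fit$surv

dim(padded)
```

The result is a plain numeric matrix, not a survfit object, so it cannot be passed to functions that expect a survival curve.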

therneau commented 3 years ago

When the result of a prediction is a double, we can pad it with NA since R has an appropriate missing value for double. You can tuck it into the vector of doubles and it is still a vector of doubles. Every downstream function that accepts doubles needs to be aware of this and deal with it.
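
The point about doubles can be illustrated in base R. This is a minimal sketch with made-up values: the NA sits inside the vector without changing its type, and each downstream call has to decide how to handle it.

```r
# Made-up values for illustration only
x <- c(0.42, NA, 1.7)

typeof(x)              # the vector is still of type double
mean(x)                # the NA propagates unless the function is told otherwise
mean(x, na.rm = TRUE)  # the caller opts in to dropping the missing value
```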

There is no "missing" of type survfit. There never has been one. When you call survfit on a coxph fit with a newdata argument, it returns a set of survival curves. Adding an NA to that set involves creating a new NA type and, much more to the point, updating every routine in my package that accepts a survival curve as input, and those in other packages, to properly deal with this new feature. Example: ggsurvplot

What is the compelling use case that would justify months of work? You would need to start by finding all the downstream routines that would be affected, map out what they should do, decide whether they now need an na.omit option, then document it and test it. The use case would need to be very strong.