tidymodels / censored

Parsnip wrappers for survival models
https://censored.tidymodels.org/
Other
123 stars 12 forks source link

aorsf - accelerated oblique random survival forest #211

Closed bcjaeger closed 2 years ago

bcjaeger commented 2 years ago

Hello! Following up on #172, aorsf has made it to CRAN and its API is stable.1

One possible concern is that aorsf fit and predict functions are designed to throw hard errors when the input data have missing values. This may not be in line with the design of parsnip predict functions, which propagate rows with missing data into the returned object as a missing value for the prediction. Would you like me to write some additional code into the aorsf wrappers for prediction to make it more consistent with the parsnip design?

A second possible concern is that aorsf predict functions don't let users compute predictions for times that are greater than the maximum observed time in the model's training data. E.g., for the lung data, the max time is 1022, so trying to predict a survival probability at time of 1023 or higher would trigger an error. Would you like me to write some additional code to prevent that sort of error from triggering?

1In #172, we agreed to consider this pull request when aorsf was on CRAN and the rOpenSci review was done. Technically, my review is still open at rOpenSci, but the review is finished in the sense that all updates to aorsf based on the reviewer feedback are complete. The issue is still open but the editor has let me know that it will be wrapped up in a week or so.

bcjaeger commented 2 years ago

Just wanted to share an update on the rOpenSci review. I did think it was about to be closed but yesterday I was assigned two more reviewers. My bad - it looks like the rOpenSci review will continue for at least another few weeks.

I am very excited to see what the new reviewers think and I will do everything I can to incorporate their feedback in a way that is compatible with the version of aorsf on CRAN. I'll let you know asap if breaking changes come up and I'll update this PR accordingly if they do.

bcjaeger commented 2 years ago

Thanks for the great review! The changes you described will definitely make aorsf more user-friendly, and I'm happy to do that. I will get things updated in aorsf first and then work on the comments above.

bcjaeger commented 2 years ago

Happy to share updates in aorsf based on your thoughts above. One follow-up question:

An option to allow NA values for out-of-range predictions instead of an error.

Were you referring to the case where times in predict() has at least one value over the maximum observed follow-up time? If so, great! That is allowed now with check_boundaries set to FALSE. The returned values are currently not NA (explanation in reprex comments).

Thank you again for the thoughts you shared - the changes definitely make the API friendlier. I'm happy to make more modifications in aorsf if that is helpful. Otherwise, I will work on updates in the censored code.

suppressPackageStartupMessages({
 library(aorsf)
 library(tidyverse)
})

# Allowing the standard flexibility of Surv() wrt the values of the 
# event indicator.

pbc_lgl <- mutate(pbc_orsf, status = as.logical(status))
pbc_12 <- mutate(pbc_orsf, status = status + 1)

set.seed(329)
fit_standard <- orsf(pbc_orsf, Surv(time, status) ~ . - id)
set.seed(329)
fit_lgl <- orsf(pbc_lgl, formula = Surv(time, status) ~ . - id)
set.seed(329)
fit_12 <- orsf(pbc_12, formula = Surv(time, status) ~ . - id)

all.equal(fit_standard$forest, fit_lgl$forest)
#> [1] TRUE
all.equal(fit_standard$forest, fit_12$forest)
#> [1] TRUE

# An option for how missing values are handled for predictions: For example, 
# predict.lm() has an argument na.action which determines what happens with 
# missing values. aorsf could still default to an error but in censored we 
# could request that it pads the results with NA. See also ?na.fail.

pbc_first10 <- pbc_na <- slice(pbc_orsf, 1:10)
pbc_na$age[c(1,5,10)] <- NA

# consistent w/tidymodels
predict(fit_standard, pbc_na, na_action = 'pass')
#>             [,1]
#>  [1,]         NA
#>  [2,] 0.10615858
#>  [3,] 0.77370799
#>  [4,] 0.43724275
#>  [5,]         NA
#>  [6,] 0.08677136
#>  [7,] 0.06302206
#>  [8,] 0.25354362
#>  [9,] 0.96007034
#> [10,]         NA
# consistent w/base R
predict(fit_standard, pbc_na, na_action = 'omit')
#>            [,1]
#> [1,] 0.10615858
#> [2,] 0.77370799
#> [3,] 0.43724275
#> [4,] 0.08677136
#> [5,] 0.06302206
#> [6,] 0.25354362
#> [7,] 0.96007034

# An option to allow NA values for out-of-range predictions 
# instead of an error.

## I allow out of range predictions but, instead of returning NA when
## boundary_checks is FALSE, return prediction at the max observed time
## (notice column 2 (t = 10,000) is the same as 3 (t = 100,000))
## I did this to be consistent with the way I implemented boundary_checks
## in the orsf_pd functions and also just in case a user really wants
## to make predictions this way.
predict(fit_standard, 
        new_data = pbc_first10, 
        boundary_checks = FALSE,
        pred_horizon = c(1000, 10000, 100000))
#>             [,1]      [,2]      [,3]
#>  [1,] 0.91966955 0.9870598 0.9870598
#>  [2,] 0.04281353 0.5127637 0.5127637
#>  [3,] 0.40236198 0.9077745 0.9077745
#>  [4,] 0.22127622 0.8731020 0.8731020
#>  [5,] 0.09070765 0.7597322 0.7597322
#>  [6,] 0.02118349 0.4610088 0.4610088
#>  [7,] 0.02227554 0.5664215 0.5664215
#>  [8,] 0.09007197 0.8394871 0.8394871
#>  [9,] 0.88790002 0.9849913 0.9849913
#> [10,] 0.04862403 0.6593526 0.6593526

# Allowing any order of time points for prediction, and sorting those 
# internally to orsf().
predict(fit_standard, 
        new_data = pbc_first10, 
        pred_horizon = c(1000, 500, 100))
#>             [,1]        [,2]        [,3]
#>  [1,] 0.91966955 0.785869003 0.142670246
#>  [2,] 0.04281353 0.010758594 0.000000000
#>  [3,] 0.40236198 0.135624711 0.016068543
#>  [4,] 0.22127622 0.095856957 0.014200541
#>  [5,] 0.09070765 0.027139137 0.001000000
#>  [6,] 0.02118349 0.005302408 0.000000000
#>  [7,] 0.02227554 0.010475093 0.000000000
#>  [8,] 0.09007197 0.025564204 0.001844517
#>  [9,] 0.88790002 0.744081557 0.270957115
#> [10,] 0.04862403 0.017911853 0.001200000

Created on 2022-09-04 by the reprex package (v2.0.1)

hfrick commented 2 years ago

todo for myself: add tests on case weights

bcjaeger commented 2 years ago

Thank you! I will review the examples you shared and open a PR on parsnip.

I noticed that my use of deparse1 causes our tests to fail on oldrel-3 since deparse1 was introduced in R 4.0.0. I think as.character can be used instead of deparse1, and will make sure this fix is included in 0.0.3.

hfrick commented 2 years ago

Great, thanks!

re deparse1(): if you'd like some inspiration, this is how it's being dealt with in testthat: https://github.com/r-lib/testthat/blob/426eb523f6772c76175d6e8d7ca04355b1ee7ad5/R/utils.R#L53

and there is also https://github.com/r-lib/backports

bcjaeger commented 2 years ago

Howdy! I just want to make sure I have not forgotten anything on my list. I have

There is no rush to complete this, I just want to make sure I'm not holding it up. If there is anything else I can do, I would be happy to help.

hfrick commented 2 years ago

Awesome, thank you! I'll get to this soon and also resolve the merge conflicts πŸ‘

bcjaeger commented 2 years ago

Great =] Thank you!

hfrick commented 2 years ago

Getting closer @bcjaeger ! Once the parsnip PR is good to go, this one is as well πŸ‘

bcjaeger commented 2 years ago

Howdy! Just wanted to let you know aorsf 0.0.4 is on its way to CRAN and should be available soon. (It includes support for Surv objects in the formula.)

bcjaeger commented 2 years ago

Thank you!

hfrick commented 2 years ago

Thank you for the great PR! πŸ™Œ

github-actions[bot] commented 2 years ago

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.