aorsf; accelerated oblique random survival forests

bcjaeger commented 2 years ago

Date accepted: 2022-09-22

Submitting Author Name: Byron C Jaeger Submitting Author Github Handle: !--author1-->@bcjaeger@tdhock<!--end-editor-- Reviewers: @chjackson, @mnwright, @jemus42

Due date for @chjackson: 2022-07-29 Due date for @mnwright: 2022-09-21 Due date for @jemus42: 2022-09-21

Archive: TBD Version accepted: TBD Language: en

Paste the full DESCRIPTION file inside a code block below:

Package: aorsf
Title: Accelerated Oblique Random Survival Forests
Version: 0.0.0.9000
Authors@R: c(
    person(given = "Byron",
           family = "Jaeger",
           role = c("aut", "cre"),
           email = "bjaeger@wakehealth.edu",
           comment = c(ORCID = "0000-0001-7399-2299")),
    person(given = "Nicholas",  family = "Pajewski", role = "ctb"),
    person(given = "Sawyer", family = "Welden", role = "ctb", email = "swelden@wakehealth.edu")
    )
Description: Fit, interpret, and make predictions with oblique random
    survival forests. Oblique decision trees are notoriously slow compared
    to their axis based counterparts, but 'aorsf' runs as fast or faster than 
    axis-based decision tree algorithms for right-censored time-to-event 
    outcomes.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE, roclets = c ("namespace", "rd", "srr::srr_stats_roclet"))
RoxygenNote: 7.1.2
LinkingTo: 
    Rcpp,
    RcppArmadillo
Imports: 
    table.glue,
    Rcpp,
    data.table
URL: https://github.com/bcjaeger/aorsf,
    https://bcjaeger.github.io/aorsf
BugReports: https://github.com/bcjaeger/aorsf/issues
Depends: 
    R (>= 3.6)
Suggests: 
    survival,
    survivalROC,
    ggplot2,
    testthat (>= 3.0.0),
    knitr,
    rmarkdown,
    glmnet,
    covr,
    units
Config/testthat/edition: 3
VignetteBuilder: knitr

Pre-submission Inquiry

[X] A pre-submission inquiry has been approved in issue#525

General Information

Who is the target audience and what are scientific applications of this package?

Target audience: people who want to fit and interpret a risk prediction model, i.e., a prediction model for right-censored time-to-event outcomes.

Applications: fit an oblique random survival forest, compute predicted risk at a given time, estimate the importance of individual variables, and compute partial dependence to depict relationships between specific predictors and predicted risk.

Paste your responses to our General Standard G1.1 here, describing whether your software is:
- An improvement on other implementations of similar algorithms in R.
Please include hyperlinked references to all other relevant software. The obliqueRSF package was the original R package for oblique random forests. I wrote it and it is very slow. I wrote aorsf because I had multiple ideas about how to make obliqueRSF faster, specifically using a partial Newton Raphson algorithm instead of using penalized regression to derive linear combinations of variables in decision nodes. It would have been possible to rewrite obliqueRSF, but it would have been difficult to make the re-write backwards compatible with the version of obliqueRSF on CRAN.
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

Not applicable

Badging

What grade of badge are you aiming for? (bronze, silver, gold)

Gold

If aiming for silver or gold, describe which of the four aspects listed in the Guide for Authors chapter the package fulfils (at least one aspect for silver; three for gold)

Compliance with a good number of standards beyond those identified as minimally necessary. aorsf complies with over 100 combined standards in the general and ML categories.
Demonstrating excellence in compliance with multiple standards from at least two broad sub-categories. See 1. above
Internal aspects of package structure and design. aorsf uses an optimized routine to partially complete Newton Raphson scoring for the Cox proportional hazards model and also an optimized routine to compute likelihood ratio tests. Both of these routines are heavily used when fitting oblique random survival forests, and both demonstrate the exact same answers as corresponding functions in the survival package (see tests in aorsf) while running at least twice as fast (thanks to Rcpparmadillo).

Technical checks

Confirm each of the following by checking the box.

[X] I have read the rOpenSci packaging guide.
[X] I have read the author guide and I expect to maintain this package for at least 2 years or have another maintainer identified.
[X] I/we have read the Statistical Software Peer Review Guide for Authors.
[X] I/we have run autotest checks on the package, and ensured no tests fail.
[X] The srr_stats_pre_submit() function confirms this package may be submitted.
[X] The pkgcheck() function confirms this package may be submitted - alternatively, please explain reasons for any checks which your package is unable to pass.

I think aorsf is passing autotest and srr_stats_pre_submit(). I am having some issues running these on R 4.2. Currently, autotest is returning NULL, which I understand to be a good thing, and srr_stats_pre_submit is not able to run (not sure why; but it was fine before I updated to R 4.2).

This package:

[X] does not violate the Terms of Service of any service it interacts with.
[X] has a CRAN and OSI accepted license.
[X] contains a README with instructions for installing the development version.

Publication options

[X] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?

Code of conduct

[X] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

ropensci-review-bot commented 2 years ago

:calendar: @chjackson you have 2 days left before the due date for your review (2022-07-29).

tdhock commented 2 years ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/532#issuecomment-1187723800 time 5

ropensci-review-bot commented 2 years ago

Logged review for chjackson (hours: 5)

tdhock commented 2 years ago

In response to that timing figure, that looks like a good start. Is the Y axis seconds? It would be helpful to add error bands to see the variance in timings (geom_ribbon with ymin=min or lower quartile over 5 or 10 timings, etc). On the log-log plot the different asymptotic complexity classes show up as different slopes on the right, for large n_obs values. Looks like party and aorsf have the same complexity class (linear?). For the others, perhaps your n_obs is not large enough to see the asymptotic complexity class clearly, so you may try increasing the max n_obs to see that.

bcjaeger commented 2 years ago

Sorry for the delay - I am happy to share an update:

increased the max sample size to be slightly over 10k (100k caused rstudio to crash, likely b/c of memory)
reduced the number of trees per benchmark run to 1 (down from 50) in order to make memory crashes less likely.
increased the number of times we replicated the benchmark to 50 (up from 10)
increased the number of sample sizes we examined (now showing 8 different sample sizes: round(10^seq(from = 2, to = 4+1/3, by = 1/3)))
ribbons based on the 2.5th and 97.5th percentile of the run times and better labels for R packages, panels, and axes.

As far as my interpretation goes, I think ranger has different computational engines that get used depending on the dimension of the input data. aorsf, randomForestSRC, and party have pretty consistent patterns, and aorsf does well overall.

What do you think?

temp

tdhock commented 2 years ago

Hi, thanks for sharing. That figure/result looks very interesting, good job. Based on the figure, which has same slopes for each line on that log log plot, it looks like all of the packages have the same asymptotic complexity class, which should be linear, right?

bcjaeger commented 2 years ago

Yes, I think so! Thank you very much for your help improving the figure. I intend to include this work in the paper describing aorsf's methods and efficiency. I have two questions:

would it be reasonable/helpful for me to include a note to you or to rOpenSci in the acknowledgments section of the paper I have referred to above (particularly for your help guiding the timing benchmark above)?
do you think this review of aorsf is complete?

I am very happy with the feedback I've gotten and feel ready to proceed with submissions to JOSS and CRAN.

bcjaeger commented 2 years ago

Hi @tdhock, I am not sure if it's my job or yours to say when the review is complete. Do you know?

Also, the application form mentions a badge for the submission:

What grade of badge are you aiming for? (bronze, silver, gold)

I made a case for aorsf to receive the gold badge in my application. Do you have the final say on the badge it receives?

tdhock commented 2 years ago

Sorry for the delay. I will get to this in the next week or so.

bcjaeger commented 2 years ago

No trouble at all - thanks for the update!

danielskatz commented 2 years ago

👋 @tdhock - please ping me when this is complete so I can also complete the corresponding JOSS acceptance and publication (in https://github.com/openjournals/joss-reviews/issues/4705)

tdhock commented 2 years ago

@ropensci-review-bot add @mnwright to reviewers

ropensci-review-bot commented 2 years ago

@mnwright added to the reviewers list. Review due date is 2022-09-21. Thanks @mnwright for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 2 years ago

@mnwright: If you haven't done so, please fill this form for us to update our reviewers records.

tdhock commented 2 years ago

@ropensci-review-bot add @jemus42 to reviewers

ropensci-review-bot commented 2 years ago

@jemus42 added to the reviewers list. Review due date is 2022-09-21. Thanks @jemus42 for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 2 years ago

@jemus42: If you haven't done so, please fill this form for us to update our reviewers records.

tdhock commented 2 years ago

hi @bcjaeger at least two reviews are typically required, and luckily Marvin and Lukas have agreed to review, so I will wait to hear from them before official acceptance.

bcjaeger commented 2 years ago

Cool, thanks @jemus42 and @mnwright for agreeing to review! In case it is helpful, https://github.com/bcjaeger/aorsf-bench contains the paper and code to evaluate aorsf in terms of prediction accuracy, variable importance, and efficiency (paper is here: https://github.com/bcjaeger/aorsf-bench/blob/main/paper/jmlr/main.pdf)

Unfortunately, aorsf is now on CRAN. I am extremely grateful for your willingness to review aorsf, and I will try my best to incorporate all of your recommendations in a way that is compatible with the version of aorsf that is on CRAN. If you feel strongly about making a change that would not be compatible with the version of aorsf on CRAN, I am happy to discuss it.

mnwright commented 2 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.

None.

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

I couldn't find contribution guidelines.

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 3

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Thanks for the great package. Every package should be so well implemented, organized and documented! In particular, the documentation is outstanding.

I only have a few questions/comments, mostly minor:

?orsf contains startup messages from randomForestSRC and riskRegression, maybe suppress them?
The function oobag_c_harrell() is repeated several times in the tests (presumable they are identical). Maybe reduce redundancy here?
In the OOB vignette, the example for the user-supplied OOB function is incorrect (as explained in the text). In my experience, some people copy&paste such things anyway, so I would try to avoid incorrect examples.
The changelog (https://bcjaeger.github.io/aorsf/news/index.html) starts with the oldest changes. Isn't that usually the other way around?
I think orsf can only be called with the formula parameter. Does this work for high dimensional data, e.g. p>10,000 or is it slow/failing?
The C++ code is lacking documentation, i.e., it is unclear what the functions in the C++ code are doing. I think all exported Rcpp functions are internal (not directly called by the user), so I won't expect documentation on the same levels as for exported R functions. Still, some documentation would help contributors work with the C++ code.
All C++ code is in one large file with many global objects. It is common practice in C++ to avoid global objects where possible. Would it make sense to use some OOP instead?

And one thing, which is beyond the scope of this review but still interesting: What about oblique RF for classification and regression? Do other packages implement the speedups used here? If not, why not have a fast oblique RF package for all outcomes instead of a specific one for survival?

In summary, this is a great submission, ready for publication. I think most of the issues above are quite easy to fix. The C++ OOP question would be a major re-design and is not necessary for acceptance of the submission.

mnwright commented 2 years ago

And a few notes on the runtime comparisons above:

ranger does not reduce the number of unique time points and that's what makes it slow and increases the memory usage. The bottleneck not the sample size but that the CHF is estimated at each unique time point. rfsrc for example uses a grid (ntime parameter). See also here https://github.com/imbs-hl/ranger/issues/410. I re-ran that experiment with a grid in ranger and then it was fastest (and didn't show the strange behavior shown in the plots above).
Growing only one tree might give misleading results because of computation overhead, e.g. with Rcpp.
Formula interfaces are quite slow for many features (I guess for all packages?). That might be a problem for p=1000 (see comment above).

bcjaeger commented 2 years ago

@mnwright, thank you so much for your review! The suggestions you've made to improve aorsf are incredibly helpful. I've reviewed each of your comments and written up a short description of actions I will take to incorporate your suggestions. If you think any of the actions I am proposing should be modified, just let me know, I am happy to coordinate a better approach.

Once I have finished all of the updates, I will post and tag you.

?orsf contains startup messages from randomForestSRC and riskRegression, maybe suppress them?

Whoops, sorry I missed that. I will update with suppressPackageStartupMessages.

The function oobag_c_harrell() is repeated several times in the tests (presumable they are identical). Maybe reduce redundancy here?

I like this idea. I think making oobag_c_harrell() an unexported R function in aorsf will resolve this.

In the OOB vignette, the example for the user-supplied OOB function is incorrect (as explained in the text). In my experience, some people copy&paste such things anyway, so I would try to avoid incorrect examples.

Agreed - I think using SurvMetrics::Brier instead of writing a simple (incorrect) Brier score function is a good fix for this.

The changelog (https://bcjaeger.github.io/aorsf/news/index.html) starts with the oldest changes. Isn't that usually the other way around?

Whoops - yes, that needs to be reversed. Thank you! 😅

I think orsf can only be called with the formula parameter. Does this work for high dimensional data, e.g. p>10,000 or is it slow/failing?

That is a great question - I think I need to include a high-dimensional case in the benchmark above to answer it. I'd also like to experiment with adding an x and y input for orsf, similar to ranger, to see how much using x and y rather than formula impacts aorsf's efficiency in high-dimensional settings.

The C++ code is lacking documentation, i.e., it is unclear what the functions in the C++ code are doing. I think all exported Rcpp functions are internal (not directly called by the user), so I won't expect documentation on the same levels as for exported R functions. Still, some documentation would help contributors work with the C++ code.

Agreed, this code needs better documentation. I will go through each function and add comments that include description of the function's purpose and a description of each input for the function, including global variables that the function depends on.

All C++ code is in one large file with many global objects. It is common practice in C++ to avoid global objects where possible. Would it make sense to use some OOP instead?

Yes, using OOP would be a much better design and would make generalizations to oblique forests for regression and classification much cleaner. The long term plan for aorsf absolutely includes a more modular re-write of the C++ file. The only reason aorsf didn't have this design from the beginning is that I am pretty inexperienced when it comes to performant C++ code. I am slowly getting there.

What about oblique RF for classification and regression? Do other packages implement the speedups used here? If not, why not have a fast oblique RF package for all outcomes instead of a specific one for survival?

Agreed! I focused on survival initially because I wasn't aware of any R packages for oblique survival trees (the RLT package fits oblique trees for regression and classification, and obliqueRF can fit oblique trees for classification). Developing aorsf into a more general package that fits all three types of trees will be a big undertaking, but should be much more feasible after updating the C++ code as described above.

ranger does not reduce the number of unique time points and that's what makes it slow and increases the memory usage. The bottleneck not the sample size but that the CHF is estimated at each unique time point. rfsrc for example uses a grid (ntime parameter). See also here https://github.com/imbs-hl/ranger/issues/410. I re-ran that experiment with a grid in ranger and then it was fastest (and didn't show the strange behavior shown in the plots above).

Ahh, I understand. Sorry I did not control for this in my initial benchmark. I'd like to update my benchmark code so that ranger can use a grid of timepoints for the CHF as you described. Would you be willing to share your code for this? I can also manually restrict the number of unique event times in the benchmark data if you think that would be a better approach.

Growing only one tree might give misleading results because of computation overhead, e.g. with Rcpp.

Agreed - I think I can revert to using 500 trees assuming the memory issue is fixed by restricting the number of unique event times.

Formula interfaces are quite slow for many features (I guess for all packages?). That might be a problem for p=1000 (see comment above).

Good point - if orsf() crashes due to its formula input with p = 1000, I will prioritize incorporating the x and y inputs into orsf() and reassess.

In summary, this is a great submission, ready for publication. I think most of the issues above are quite easy to fix.

Thank you! I sincerely appreciate the time you invested in reviewing aorsf and your thoughtful feedback. It will be my honor to include you as a package reviewer in the DESCRIPTION file.

bcjaeger commented 2 years ago

I am happy to share an updated response to the excellent review from @mnwright:

Examples in orsf() now suppress package start up messages (see here)
oobag_c_harrell() is now an unexported function in aorsf to avoid repeatedly defining it in testing scripts.
SurvMetrics::Brier is now used in place of the incorrect Brier score in the out-of-bag vignette (see here)
NEWS.md now shows updates in the usual order (see here).
C++ functions in src/orsf.cpp have documentation outlining the purpose of the function and its inputs (see here).

I am committed to re-writing the orsf.cpp functions with modular, object oriented design, using the ranger files as a guideline. This will, among other things, make it far more feasible to efficiently fit oblique regression and classification forests with this package. I'll be submitting a grant in the Spring and will hopefully get funding to work through this.

The benchmark of aorsf, ranger, and randomForestSRC has also undergone some updates:

500 trees are now used instead of 1.
the number of unique event times is now fixed at 100, 1000, or 10000 by modifying the time values in the data.
the number of predictors is now fixed at 100, 1000, or 10000 (it was originally 10, 100, or 1000)
To avoid high memory usage from having many leaves with many unique event times, I increase the minimal terminal node size as a function of the sample size, i.e., node_size = n_obs / 10.

With these updates, it is clear that aorsf runs slow compared to ranger and randomForestSRC in cases where the number of unique event times is 100 or 1000. It is also clear that aorsf does not scale as well to high dimensional settings - it does best relative to the other packages when the number of predictors is 100 and the number of unique event times is 10000 (i.e., each observation has a unique event time). In my prediction benchmark, the number of predictors is often < 100 and there are few instances of non-unique event times, so I can understand why benchmark results seem to favor aorsf in that comparison.

If you aggregate over all 9 panels (see second figure), ranger has the fastest expected time and all three methods appear to have a linear asymptotic complexity.

temp

temp2

jemus42 commented 2 years ago

Package Review

Briefly describe any working relationship you have (had) with the package authors. None.
[X] As the reviewer I confirm that there are no conflicts of interest for me to review this work

Documentation

The package includes all the following forms of documentation:

[X] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
- From my understanding the "gist" is "obliqueRSF, but fast", so it inherits its usefulness/audience from obliqueRSF.
- It clearly has a place between existing methods and implementations.
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[X] Vignette(s): demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all exported functions
- The documentation is comprehensive and useful for both novice and experienced users.
[X] Examples: (that run successfully locally) for all exported functions
[X] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
- I found contribution guidelines at .github/CONTRIBUTING.md.

Functionality

[X] Installation: Installation succeeds as documented.
[X] Functionality: Any functional claims of the software have been confirmed.
[X] Performance: Any performance claims of the software have been confirmed.
[X] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[X] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 4

[X] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

The majority of my review are based on the state of the package as of commit 2251066.

Overall the package features comprehensive documentation and examples.
Function names are consistent and discoverable as well as tab-completable.
User-supplied arguments are checked consistently via internal functions defined in R/check.R. I would like to suggest the checkmate package for argument checks, which could greatly reduce complexity of internal functions/checks. I do concede that now that the work is already done it's likely not worth re-implementing checks again. Also, the author wants to keep dependencies to a minimum, so I understand the manual implementation of argument checks.
The check_* functions mentioned above and internal roxy_* functions for documentation snippets both reduce code duplication and ensure documentation is easier to keep consistent and up to date.
- Some of the roxy_* documentation helpers may be exchangeable for roxygen2 @importFrom tags or templates etc., but that's merely a gut feeling about idiomatic use of roxygen2.
I have not found any instances of @srr tags/comments I would challenge.
I could not find any reference to parallel processing, i.e. using multiple CPU cores/threads to speed up training. That's not a direct issue I think, but maybe something to look into in the future? Especially given the application on large datasets.
The test suite seems comprehensive and checks for correctness against well-established methods, but I'm wondering if it would be useful to add an automated regression test against obliqueRSF (at another location, to avoid the Suggests:-dependency on it).

Overall my impression of the package is very positive!

Performance

I have not conducted my own exhaustive experiments and have no additional notes regarding the public benchmark at aorsf-bench.

The author also kindly contributed to mlr3extralearners, which simplifies the algorithm's inclusion in other benchmark experiments.
The initiative to connect to existing frameworks like mlr3 and tidymodels is very much appreciated and allowed me to run a quick (naive) benchmark against ranger and rfrsc, showing comparable performance regarding C-index and brier score, as well as significantly shorter run-time for low-dimensional tasks (caveat: no tuning and default parameters were used for all algorithms, it was merely a quick test-run of the mlr3 learner).

A very minor nitpicky comment regarding code efficiency:
fctr.R#17 creates an empty vector (c()) and grows it with each character variable. For small datasets this is likely irrelevant, but imagining a scenario with many character variables supplied intentionally or accidentally by the user, it may be more appropriate to use a vapply construct like in this example:

# Example data
data <- as.data.frame(matrix(rnorm(100), ncol = 10))
data$V4 <- as.character(data$V4)
data$V8 <- as.character(data$V8)
data$V10 <- as.character(data$V10)

# Recreating .names argument
.names <- names(data)

# character detection without a for loop and a growing vector
which_char <- vapply(data, is.character, logical(1), USE.NAMES = FALSE)
chrs <- .names[which(which_char)]
chrs
#> [1] "V4"  "V8"  "V10"

Also note that the output will likely be excessive if a user were to accidentally supply a very large dataset with thousands of character variables, but that's a small edge case with little impact I think.

Methods

The inclusion of multiple easy to discover/use variable importance methods is very nice to see!

Partial dependence plots: Due to their known limitations/assumptions, it may be appropriate to a) note those limitations in the documentation to ensure users unfamiliar with the technique can make an informed choice b) consider additionally supporting accumulated local effects, a "faster and unbiased alternative" to PDPs, if possible.

Permutation feature importance: This method has drawbacks as well, see Hooker et. al., so the previous comment applies with the exception that I cannot suggest a direct alternative.

I was not familiar with the ANOVA- and negation-based importance measures, and I suspect many users may feel the same way. In that regard I think it would be useful to elaborate on these methods slightly, e.g. in vignettes/aorsf.Rmd where they are mentioned next to the aforementioned methods, or in the help page at ?orsf_vi in the Details section which already outlines the methods. Reading the paper/preprint in aorsf-bench was helpful, but I think some rough guidance regarding pros/cons in the documentation would be beneficial for users, if one can even be provided.
Choosing interpretability methods is of course not trivial, and I don't expect the author to provide a conclusive heuristic here.

tdhock commented 2 years ago

Thanks for your review @jemus42! You are right it is not good practice for large data to grow vectors by over-writing the variable in a for loop. vapply is one option, and another option is to use a for loop, but write to a list element in each iteration (constant time operation), instead of over-writing the variable in each iteration (linear time operation), see https://tdhock.github.io/blog/2017/rbind-inside-outside/ for an example involving data frames, where the difference is noticeable even when there are only a few dozen things to combine.

bcjaeger commented 2 years ago

@jemus42, thank you! I will write a more complete response to your feedback and a list of proposed actions in the next few days, but for now I just wanted to say how much I appreciate your review. You found a very nice solution to speed up fctr_check and I've just pushed a commit to aorsf with your suggested use of vapply . As expected, it's much faster. =]

n <- 500
p <- 1000

data_check <- list()

characters <- letters[1:5]

for( i in seq(p) ) {
 data_check[[i]] <- sample(characters, size = n, replace = T)
 names(data_check)[i] <- paste('x', i, sep = '_')
}

data_check <- as.data.frame(data_check)

c_loop <- function(data, .names){

 chrs <- c()

 for( .name in .names ){

  if(is.character(data[[.name]])){
   chrs <- c(chrs, .name)
  }

 }

 chrs

}

v_apply <- function(data, .names){

 chr_index <- which(
  vapply(data[, .names], is.character, logical(1), USE.NAMES = FALSE)
 )

 .names[chr_index]

}

testthat::expect_equal(
 c_loop(data_check, names(data_check)),
 v_apply(data_check, names(data_check))
)

microbenchmark::microbenchmark(
 c_loop = c_loop(data_check, names(data_check)),
 v_apply = v_apply(data_check, names(data_check))
)
#> Unit: microseconds
#>     expr      min       lq    mean    median       uq       max neval cld
#>   c_loop 8948.500 9234.001 9747.49 9380.5010 9530.401 13457.202   100   b
#>  v_apply  328.001  339.701  484.56  346.6005  385.701  5247.201   100  a

^{Created on 2022-09-16 by the reprex package (v2.0.1)}

bcjaeger commented 2 years ago

@jemus42, here is the more complete response to your comments, with planned action items. If you feel like any of the proposed actions could be improved, or if I didn't address all of your comments in my response, just let me know. Otherwise, I should have updates in aorsf based on this response completed in the next couple of days.

Overall the package features comprehensive documentation and examples. Function names are consistent and discoverable as well as tab-completable.

Thank you! I will continue to prioritize clear documentation and tab-friendly names in future updates.

I would like to suggest the checkmate package for argument checks, which could greatly reduce complexity of internal functions/checks. I do concede that now that the work is already done it's likely not worth re-implementing checks again. Also, the author wants to keep dependencies to a minimum, so I understand the manual implementation of argument checks.

I agree regarding checkmate, it would have made the checks more routine (I found out checkmate existed shortly after I had already written a substantial amount of code in check.R). For now, I'd like to keep the checks as they are, with the acknowledgment that if the checks fail or need a substantial update in the future I will incorporate checkmate.

The check* functions mentioned above and internal roxy functions for documentation snippets both reduce code duplication and ensure documentation is easier to keep consistent and up to date. Some of the roxy_ documentation helpers may be exchangeable for roxygen2 @importFrom tags or templates etc., but that's merely a gut feeling about idiomatic use of roxygen2.

There probably is something I could do here but I also am not sure how to use roxygen2 optimally for the things that roxy_ functions currently do. I have learned through working with the mlr3extralearners package how to manage bibliographies better, so I think I could remove the roxy_cite code if I went that route. I am not sure if this is a high priority update though, since the roxy_cite functions appear to be working fine.

I have not found any instances of @srr tags/comments I would challenge.

Excellent! Thank you for checking these.

I could not find any reference to parallel processing, i.e. using multiple CPU cores/threads to speed up training. That's not a direct issue I think, but maybe something to look into in the future? Especially given the application on large datasets.

Good point - aorsf doesn't currently support parallel processing. I have initiated a branch (see https://github.com/bcjaeger/aorsf/tree/oop) where I am attempting to re-write aorsf's C++ in an object-oriented framework (thanks to Marvin's suggestion above), and I see parallel processing becoming far more feasible with this framework. I've designated parallel processing as a long-term objective tied to this branch (i.e., I don't think it will be done while this review is open, but it will be done).

The test suite seems comprehensive and checks for correctness against well-established methods, but I'm wondering if it would be useful to add an automated regression test against obliqueRSF (at another location, to avoid the Suggests:-dependency on it).

I like this idea. Would the test be something like graphing the predictions from aorsf in the x-axis against the predictions from obliqueRSF on the y-axis and expecting their correlation to be high? I'd propose to add a section to the aorsf::orsf() examples called "What about obliqueRSF?" and show there that aorsf is faster and provides very similar predictions using the pbc_orsf data. This shouldn't require adding obliqueRSF to Suggests: thanks to roxygen2's .Rmd options for examples.

The author also kindly contributed to https://github.com/mlr-org/mlr3extralearners/pull/233, which simplifies the algorithm's inclusion in other benchmark experiments. The initiative to connect to existing frameworks like mlr3 and tidymodels is very much appreciated and allowed me to run a quick (naive) benchmark against ranger and rfrsc, showing comparable performance regarding C-index and brier score, as well as significantly shorter run-time for low-dimensional tasks (caveat: no tuning and default parameters were used for all algorithms, it was merely a quick test-run of the mlr3 learner).

I'm glad to hear the mlr3 learner ran as expected in the experiment! Would you be willing to share your code? I would like to add a section into the examples of aorsf::orsf() that demonstrates a simple use case of aorsf in mlr3. A pull request would of course be welcome, but that seems like a bigger ask.

A very minor nitpicky comment regarding code efficiency: fctr.R#17 creates an empty vector (c()) and grows it with each character variable. For small datasets this is likely irrelevant, but imagining a scenario with many character variables supplied intentionally or accidentally by the user, it may be more appropriate to use a vapply construct.

Thank you! I found this very helpful and have implemented your suggestion.

Partial dependence plots: Due to their known limitations/assumptions, it may be appropriate to a) note those limitations in the documentation to ensure users unfamiliar with the technique can make an informed choice b) consider additionally supporting accumulated local effects, a "faster and unbiased alternative" to PDPs, if possible.

This is a good point. I will add notes in the documentation about the limitations of PD and will recommend users engage with the iml package to create ALE curves using aorsf. I was happy to write PD functions for aorsf because PD is relatively straightforward to compute, but I think ALE may be a little more complex so I'd prefer to lean on the iml package for ALE curves.

Permutation feature importance: This method has drawbacks as well, see Hooker et. al., so the previous comment applies with the exception that I cannot suggest a direct alternative.

I agree - I will add text to my documentation highlighting this and refer to the paper you linked as well as to the disadvantages of PD section in the iml book

I was not familiar with the ANOVA- and negation-based importance measures, and I suspect many users may feel the same way. In that regard I think it would be useful to elaborate on these methods slightly, e.g. in vignettes/aorsf.Rmd where they are mentioned next to the aforementioned methods, or in the help page at ?orsf_vi in the Details section which already outlines the methods. Reading the paper/preprint in aorsf-bench was helpful, but I think some rough guidance regarding pros/cons in the documentation would be beneficial for users, if one can even be provided. Choosing interpretability methods is of course not trivial, and I don't expect the author to provide a conclusive heuristic here.

That is a good point. ANOVA importance was developed and evaluated by Menze et al for oblique classification random forests. Negation importance is developed and evaluated in the ArXiv paper you mentioned. According to the simulation study in the ArXiv paper, all three methods (negation, ANOVA, and permutation) appear to be valid. It seems as if negation has an edge in the general comparison based on our simulation study, but that isn't quite enough evidence for me to recommend negation as a default approach. I will add some notes to the documentation about the origin of these three methods and will add pros/cons for each one.

Overall my impression of the package is very positive!

I am thrilled to hear that. =] Thank you for reviewing closely and providing so many useful ideas to improve aorsf!

jemus42 commented 2 years ago

roxy_*-functions: Sure, they serve their purpose and I don't see any reason to forcibly substitute all of them :) Might as well be consistent.
Parallel processing: Neat! Always nice to have some future enhancements already in the works.
Regression test: Yes, a scatterplot against obliqueRSF predictions would be a good way to check consistency. Wouldn't obliqueRSF then still be a build-time dependency for the docs though? I haven't had a chance to experiment with roxygen2's .Rmd option you're referring to, but I mean it has to be installed at some point to render it? Wouldn't CRAN require it as a dependency then?
Interpretability methods: All good, some documentation notes and leaning on existing implementations is fine :)

Regarding the mlr3extralearners test-run I did, my code is fairly bare-bones, and unfortunately due to mlr3proba being in a somewhat ill-defined state after its CRAN removal I'm not sure how to communicate that to users. Technically it works fine, just with an asterisk or two.

library(mlr3verse)
#> Loading required package: mlr3
library(mlr3proba)
library(mlr3extralearners)
library(mlr3viz)
library(aorsf)
library(ggplot2)

# Less logging during training
lgr::get_logger("mlr3")$set_threshold("warn")

# Some example tasks without missing values
tasks <- tsks(c("actg", "rats"))

# Learners with default parameters
learners <- lrns(
  c("surv.ranger", "surv.rfsrc", "surv.aorsf"),
  predict_sets = c("train", "test")
)

# Brier (Graf) score, c-index and training time as measures
measures <- msrs(c("surv.graf", "surv.cindex", "time_train"))

# Benchmark with 10-fold CV
design <- benchmark_grid(
  tasks = tasks,
  learners = learners,
  resamplings = rsmps("cv", folds = 10)
)

# Parallelization (optional)
future::plan("multisession", workers = 5)
benchmark_result <- benchmark(design)

bm_scores <- benchmark_result$score(measures, predict_sets = "test")

bm_scores |>
  dplyr::select(task_id, learner_id, surv.graf, surv.cindex, time_train) |>
  tidyr::pivot_longer(cols = surv.graf:time_train, names_to = "measure") |>
  ggplot(aes(x = learner_id, y = value)) +
  facet_wrap(vars(measure), nrow = 1, scales = "free_y") +
  geom_violin(draw_quantiles = c(.25, .5, .75)) +
  theme_minimal()


bm_summary <- benchmark_result$aggregate(measures)
bm_summary[, c("task_id", "learner_id", "surv.graf", "surv.cindex")]
#>    task_id  learner_id  surv.graf surv.cindex
#> 1:    actg surv.ranger 0.05933293   0.7265903
#> 2:    actg  surv.rfsrc 0.05776890   0.7373370
#> 3:    actg  surv.aorsf 0.05803357   0.7282324
#> 4:    rats surv.ranger 0.07233919   0.7992290
#> 5:    rats  surv.rfsrc 0.08087658   0.7636673
#> 6:    rats  surv.aorsf 0.08338740   0.7874255

^{Created on 2022-09-19 with reprex v2.0.2}

ropensci-review-bot commented 2 years ago

:calendar: @mnwright you have 2 days left before the due date for your review (2022-09-21).

ropensci-review-bot commented 2 years ago

:calendar: @jemus42 you have 2 days left before the due date for your review (2022-09-21).

tdhock commented 2 years ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/532#issuecomment-1249184517 time 4

ropensci-review-bot commented 2 years ago

Logged review for jemus42 (hours: 4)

tdhock commented 2 years ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/532#issuecomment-1235275312 time 3

ropensci-review-bot commented 2 years ago

Logged review for mnwright (hours: 3)

bcjaeger commented 2 years ago

Regression test: Yes, a scatterplot against obliqueRSF predictions would be a good way to check consistency. Wouldn't obliqueRSF then still be a build-time dependency for the docs though? I haven't had a chance to experiment with roxygen2's .Rmd option you're referring to, but I mean it has to be installed at some point to render it? Wouldn't CRAN require it as a dependency then?

The @includeRmd option in roxygen2 seems to be a valid backdoor to include packages in your examples without making them dependencies or suggested packages. I think this is because the .Rmd files are knitted prior to sending the package out, so the code is only run locally.

I am happy to share some updates!

I've added the mlr3 example to my examples in orsf.R (see here).
I've added documentation on limitations of partial dependence (see here and here).
I have placed the regression test for obliqueRSF into a testing file, and plan to run the test locally to prevent adding obliqueRSF as a suggested package. I am a little uncertain regarding the minimal correlation that would be acceptable, but I feel comfortable with 0.97 (i.e., the current correlation from the reprex below).

# Similar to obliqueRSF?
suppressPackageStartupMessages({
 library(aorsf)
 library(obliqueRSF)
})

set.seed(50)

fit_aorsf <- orsf(pbc_orsf, 
                  formula = Surv(time, status) ~ . - id,
                  n_tree = 100)
fit_obliqueRSF <- ORSF(pbc_orsf, ntree = 100, verbose = FALSE)

risk_aorsf <- predict(fit_aorsf, new_data = pbc_orsf, pred_horizon = 3500)
risk_obliqueRSF <- 1-predict(fit_obliqueRSF, newdata = pbc_orsf, times = 3500)

cor(risk_obliqueRSF, risk_aorsf)
#>           [,1]
#> [1,] 0.9747177
plot(risk_obliqueRSF, risk_aorsf)

^{Created on 2022-09-20 by the reprex package (v2.0.1)}

@jemus42, do you feel these updates (in addition to the earlier update using vapply instead of c()) have successfully incorporated the feedback from your review into aorsf?

jemus42 commented 2 years ago

Yes, I have nothing more to add I think :)

Regarding the regression test, I don't think there's a meaningful cutoff one could define, but if for example you were to introduce a bug on the aorsf C++ code that resulted in nonsense predictions you would most likely immediately recognize them in that plot, and that's the main appeal :)

bcjaeger commented 2 years ago

Excellent! Totally agree with you on the regression test. Thanks again for the suggestion!

tdhock commented 2 years ago

Since you have sufficiently addressed the reviewer comments, this package can now be accepted.

tdhock commented 2 years ago

@ropensci-review-bot approve aorsf

ropensci-review-bot commented 2 years ago

Approved! Thanks @bcjaeger for submitting and @chjackson, @mnwright, @jemus42 for your reviews! :grin:

To-dos:

[ ] Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You will need to enable two-factor authentication for your GitHub account. This invitation will expire after one week. If it happens write a comment @ropensci-review-bot invite me to ropensci/<package-name> which will re-send an invitation.
[ ] After transfer write a comment @ropensci-review-bot finalize transfer of <package-name> where <package-name> is the repo/package name. This will give you admin access back.
[ ] Fix all links to the GitHub repo to point to the repo under the ropensci organization.
[ ] Delete your current code of conduct file if you had one since rOpenSci's default one will apply, see https://devguide.ropensci.org/collaboration.html#coc-file
[ ] If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding,
- deactivate the automatic deployment you might have set up
- remove styling tweaks from your pkgdown config but keep that config file
- replace the whole current pkgdown website with a redirecting page
- replace your package docs URL with https://docs.ropensci.org/package_name
- In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar, https://github.com/ropensci/foobar
[ ] Fix any links in badges for CI and coverage to point to the new repository URL.
[ ] Increment the package version to reflect the changes you made during review. In NEWS.md, add a heading for the new version and one bullet for each user-facing change, and each developer-facing change that you think is relevant.
[ ] We're starting to roll out software metadata files to all rOpenSci packages via the Codemeta initiative, see https://docs.ropensci.org/codemetar/ for how to include it in your package, after installing the package - should be easy as running codemetar::write_codemeta() in the root of your package.
[ ] You can add this installation method to your package README install.packages("<package-name>", repos = "https://ropensci.r-universe.dev") thanks to R-universe.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent).

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @ropensci/blog-editors in your reply. They will get in touch about timing and can answer any questions.

We maintain an online book with our best practice and tips, this chapter starts the 3d section that's about guidance for after onboarding (with advice on releases, package marketing, GitHub grooming); the guide also feature CRAN gotchas. Please tell us what could be improved.

Last but not least, you can volunteer as a reviewer via filling a short form.

bcjaeger commented 2 years ago

@ropensci-review-bot finalize transfer of bcjaeger/aorsf

ropensci-review-bot commented 2 years ago

Can't find repository ropensci/bcjaeger/aorsf, have you forgotten to transfer it first?

bcjaeger commented 2 years ago

@ropensci-review-bot finalize transfer of aorsf

ropensci-review-bot commented 2 years ago

Transfer completed. The aorsf team is now owner of the repository and the author has been invited to the team

bcjaeger commented 2 years ago

Thanks everyone for your help improving aorsf! =]

ropensci / software-review