ropensci / software-review

rOpenSci Software Peer Review.

aorsf; accelerated oblique random survival forests #532

Closed bcjaeger closed 2 years ago

bcjaeger commented 2 years ago

Date accepted: 2022-09-22

Submitting Author Name: Byron C Jaeger
Submitting Author Github Handle: @bcjaeger
Other Package Authors Github handles: @nmpieyeskey, @sawyerWeld
Repository: https://github.com/bcjaeger/aorsf
Version submitted: 0.0.0.9000
Submission type: Stats
Badge grade: gold
Editor: @tdhock
Reviewers: @chjackson, @mnwright, @jemus42

Due date for @chjackson: 2022-07-29
Due date for @mnwright: 2022-09-21
Due date for @jemus42: 2022-09-21

Archive: TBD
Version accepted: TBD
Language: en

Package: aorsf
Title: Accelerated Oblique Random Survival Forests
Version: 0.0.0.9000
Authors@R: c(
    person(given = "Byron",
           family = "Jaeger",
           role = c("aut", "cre"),
           email = "bjaeger@wakehealth.edu",
           comment = c(ORCID = "0000-0001-7399-2299")),
    person(given = "Nicholas",  family = "Pajewski", role = "ctb"),
    person(given = "Sawyer", family = "Welden", role = "ctb", email = "swelden@wakehealth.edu")
    )
Description: Fit, interpret, and make predictions with oblique random
    survival forests. Oblique decision trees are notoriously slow compared
    to their axis-based counterparts, but 'aorsf' runs as fast or faster than 
    axis-based decision tree algorithms for right-censored time-to-event 
    outcomes.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE, roclets = c ("namespace", "rd", "srr::srr_stats_roclet"))
RoxygenNote: 7.1.2
LinkingTo: 
    Rcpp,
    RcppArmadillo
Imports: 
    table.glue,
    Rcpp,
    data.table
URL: https://github.com/bcjaeger/aorsf,
    https://bcjaeger.github.io/aorsf
BugReports: https://github.com/bcjaeger/aorsf/issues
Depends: 
    R (>= 3.6)
Suggests: 
    survival,
    survivalROC,
    ggplot2,
    testthat (>= 3.0.0),
    knitr,
    rmarkdown,
    glmnet,
    covr,
    units
Config/testthat/edition: 3
VignetteBuilder: knitr

Pre-submission Inquiry

General Information

Target audience: people who want to fit and interpret a risk prediction model, i.e., a prediction model for right-censored time-to-event outcomes.

Applications: fit an oblique random survival forest, compute predicted risk at a given time, estimate the importance of individual variables, and compute partial dependence to depict relationships between specific predictors and predicted risk.

Not applicable

Badging

Gold

  1. Compliance with a good number of standards beyond those identified as minimally necessary. aorsf complies with over 100 combined standards in the general and ML categories.
  2. Demonstrating excellence in compliance with multiple standards from at least two broad sub-categories. See 1. above
  3. Internal aspects of package structure and design. aorsf uses an optimized routine to partially complete Newton-Raphson scoring for the Cox proportional hazards model and also an optimized routine to compute likelihood ratio tests. Both of these routines are heavily used when fitting oblique random survival forests, and both produce exactly the same answers as the corresponding functions in the survival package (see tests in aorsf) while running at least twice as fast (thanks to RcppArmadillo).

Technical checks

Confirm each of the following by checking the box.

I think aorsf is passing autotest and srr_stats_pre_submit(). I am having some issues running these on R 4.2. Currently, autotest is returning NULL, which I understand to be a good thing, and srr_stats_pre_submit is not able to run (not sure why; but it was fine before I updated to R 4.2).
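For reference, this is roughly how the two pre-submission checks referred to above are run from within a local clone of the package (a minimal sketch; exact arguments may differ by package version):

# Pre-submission checks mentioned above, from the rOpenSci autotest and srr packages.
autotest::autotest_package(test = TRUE)  # a NULL result is interpreted above as no issues flagged
srr::srr_stats_pre_submit()              # checks that the documented standards compliance is complete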

This package:

Publication options

Code of conduct

ropensci-review-bot commented 2 years ago

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot commented 2 years ago

:rocket:

The following problem was found in your submission template:

  • 'statsgrade' variable must be one of [bronze, silver, gold]

Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.

:wave:

bcjaeger commented 2 years ago

🚀

The following problem was found in your submission template:

  • 'statsgrade' variable must be one of [bronze, silver, gold]

Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.

👋

Just updated to fix this. I'm not sure if reviewers will think that 'gold' is the right statsgrade, but might as well aim high.

ropensci-review-bot commented 2 years ago

Checks for aorsf (v0.0.0.9000)

git hash: c73bb98c

Package License: MIT + file LICENSE


1. rOpenSci Statistical Standards (srr package)

This package is in the following category:

  • Machine Learning

:heavy_check_mark: All applicable standards [v0.1.0] have been documented in this package (102 complied with; 56 N/A standards)

Click to see the report of author-reported standards compliance of the package with links to associated lines of code, which can be re-generated locally by running the srr_report() function from within a local clone of the repository.


2. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself) and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

|type       |package       | ncalls|
|:----------|:-------------|------:|
|internal   |base          |    341|
|internal   |aorsf         |    152|
|internal   |utils         |     13|
|internal   |stats         |     11|
|internal   |methods       |      1|
|imports    |table.glue    |      3|
|imports    |Rcpp          |     NA|
|imports    |data.table    |     NA|
|suggests   |glmnet        |      1|
|suggests   |survival      |     NA|
|suggests   |survivalROC   |     NA|
|suggests   |ggplot2       |     NA|
|suggests   |testthat      |     NA|
|suggests   |knitr         |     NA|
|suggests   |rmarkdown     |     NA|
|suggests   |covr          |     NA|
|suggests   |units         |     NA|
|linking_to |Rcpp          |     NA|
|linking_to |RcppArmadillo |     NA|

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats()' and examining the 'external_calls' table.

base

attr (40), names (35), for (23), length (21), c (20), paste (16), list (14), paste0 (12), mode (10), seq_along (10), vector (10), which (9), as.matrix (8), rep (7), as.integer (6), drop (6), seq (6), switch (6), order (5), match (4), min (4), setdiff (4), colnames (3), inherits (3), ncol (3), Sys.time (3), all (2), any (2), as.factor (2), cbind (2), data.frame (2), grepl (2), lapply (2), levels (2), matrix (2), nchar (2), nrow (2), rev (2), rle (2), row.names (2), sum (2), suppressWarnings (2), try (2), all.vars (1), as.data.frame (1), class (1), deparse (1), formals (1), grep (1), if (1), ifelse (1), intersect (1), is.na (1), max (1), mean (1), print (1), return (1), rownames (1), sapply (1), t (1), typeof (1), unique (1)

aorsf

paste_collapse (8), fctr_info (5), get_fctr_info (5), get_names_x (5), f_oobag_eval (3), get_n_obs (3), get_n_tree (3), get_names_y (3), get_numeric_bounds (3), last_value (3), orsf_fit (3), ref_code (3), unit_info (3), check_var_types (2), get_f_oobag_eval (2), get_importance (2), get_leaf_min_events (2), get_leaf_min_obs (2), get_max_time (2), get_mtry (2), get_n_events (2), get_n_leaves_mean (2), get_oobag_eval_every (2), get_type_oobag_eval (2), get_unit_info (2), is_empty (2), list_init (2), orsf_control_net (2), orsf_pd_summary (2), orsf_train_ (2), select_cols (2), check_arg_bound (1), check_arg_gt (1), check_arg_gteq (1), check_arg_is (1), check_arg_is_integer (1), check_arg_is_valid (1), check_arg_length (1), check_arg_lt (1), check_arg_lteq (1), check_arg_type (1), check_arg_uni (1), check_control_cph (1), check_control_net (1), check_new_data_fctrs (1), check_new_data_names (1), check_new_data_types (1), check_oobag_fun (1), check_orsf_inputs (1), check_pd_inputs (1), check_predict (1), check_units (1), contains_oobag (1), contains_vi (1), f_beta (1), fctr_check (1), fctr_check_levels (1), fctr_id_check (1), get_cph_do_scale (1), get_cph_eps (1), get_cph_iter_max (1), get_cph_method (1), get_cph_pval_max (1), get_f_beta (1), get_n_retry (1), get_n_split (1), get_net_alpha (1), get_net_df_target (1), get_oobag_pred (1), get_oobag_time (1), get_orsf_type (1), get_split_min_events (1), get_split_min_obs (1), get_tree_seeds (1), get_types_x (1), insert_vals (1), is_aorsf (1), is_error (1), is_trained (1), leaf_kaplan_testthat (1), lrt_multi_testthat (1), newtraph_cph_testthat (1), oobag_c_harrell_testthat (1), orsf (1), orsf_control_cph (1), orsf_oob_vi (1), orsf_pd_ (1), orsf_pd_ice (1), orsf_pred_multi (1), orsf_pred_uni (1), orsf_scale_cph (1), orsf_summarize_uni (1), orsf_time_to_train (1), orsf_train (1), orsf_unscale_cph (1), orsf_vi_ (1), x_node_scale_exported (1)

utils

data (13)

stats

formula (4), dt (2), terms (2), family (1), time (1), weights (1)

table.glue

round_spec (1), round_using_magnitude (1), table_value (1)

glmnet

glmnet (1)

methods

new (1)


3. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in C++ (48% in 2 files) and R (52% in 21 files)
- 1 authors
- 3 vignettes
- 1 internal data file
- 3 imported packages
- 15 exported functions (median 15 lines of code)
- 216 non-exported functions in R (median 3 lines of code)
- 48 R functions (median 38 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by [the `checks_to_markdown()` function](https://docs.ropensci.org/pkgcheck/reference/checks_to_markdown.html). The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure                  | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R                  |    21|       82.3|           |
|files_src                |     2|       79.1|           |
|files_vignettes          |     3|       92.4|           |
|files_tests              |    18|       95.7|           |
|loc_R                    |  2139|       85.4|           |
|loc_src                  |  1982|       75.6|           |
|loc_vignettes            |   342|       67.9|           |
|loc_tests                |  1532|       91.7|           |
|num_vignettes            |     3|       94.2|           |
|data_size_total          |  9034|       70.4|           |
|data_size_median         |  9034|       78.5|           |
|n_fns_r                  |   231|       91.6|           |
|n_fns_r_exported         |    15|       58.5|           |
|n_fns_r_not_exported     |   216|       94.0|           |
|n_fns_src                |    48|       66.0|           |
|n_fns_per_file_r         |     7|       77.6|           |
|n_fns_per_file_src       |    24|       94.8|           |
|num_params_per_fn        |     3|       33.6|           |
|loc_per_fn_r             |     3|        1.1|TRUE       |
|loc_per_fn_r_exp         |    15|       35.6|           |
|loc_per_fn_r_not_exp     |     3|        1.5|TRUE       |
|loc_per_fn_src           |    38|       88.9|           |
|rel_whitespace_R         |    50|       96.1|TRUE       |
|rel_whitespace_src       |    58|       89.6|           |
|rel_whitespace_vignettes |    56|       83.5|           |
|rel_whitespace_tests     |    45|       96.8|TRUE       |
|doclines_per_fn_exp      |    46|       58.1|           |
|doclines_per_fn_not_exp  |     0|        0.0|TRUE       |
|fn_call_network_size     |   364|       93.5|           |

3a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


4. goodpractice and other checks

Details of goodpractice and other checks (click to open)

#### 4a. Continuous Integration Badges

[![R-CMD-check](https://github.com/bcjaeger/aorsf/workflows/R-CMD-check/badge.svg)](https://github.com/bcjaeger/aorsf/actions)
[![pkgcheck](https://github.com/bcjaeger/aorsf/workflows/pkgcheck/badge.svg)](https://github.com/bcjaeger/aorsf/actions)

**GitHub Workflow Results**

|name                       |conclusion |sha    |date       |
|:--------------------------|:----------|:------|:----------|
|Commands                   |skipped    |069021 |2022-04-22 |
|pages build and deployment |success    |c2efe9 |2022-04-28 |
|pkgcheck                   |success    |c73bb9 |2022-04-28 |
|pkgdown                    |success    |c73bb9 |2022-04-28 |
|R-CMD-check                |success    |c73bb9 |2022-04-28 |
|test-coverage              |success    |c73bb9 |2022-04-28 |

---

#### 4b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following note:

1. checking installed package size ... NOTE
   installed size is 6.9Mb
   sub-directories of 1Mb or more: libs 6.0Mb

R CMD check generated the following check_fails:

1. no_import_package_as_a_whole
2. rcmdcheck_reasonable_installed_size

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 97.13

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following functions have cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
orsf | 33
check_orsf_inputs | 28
orsf_pd_ | 22
ref_code | 17
check_new_data_names | 15
check_predict | 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 223 potential issues:

message | number of times
--- | ---
Avoid using sapply, consider vapply instead, that's type safe | 10
Lines should not be more than 80 characters. | 182
Use <-, not =, for assignment. | 31


Package Versions

|package  |version   |
|:--------|:---------|
|pkgstats |0.0.4.30  |
|pkgcheck |0.0.3.11  |
|srr      |0.0.1.149 |


Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

jooolia commented 2 years ago

@ropensci-review-bot assign @tdhock as editor

ropensci-review-bot commented 2 years ago

Assigned! @tdhock is now the editor

tdhock commented 2 years ago

@ropensci-review-bot seeking reviewers

ropensci-review-bot commented 2 years ago

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/532_status.svg)](https://github.com/ropensci/software-review/issues/532)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

bcjaeger commented 2 years ago

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/532_status.svg)](https://github.com/ropensci/software-review/issues/532)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

Done! Looking forward to the review.

bcjaeger commented 2 years ago

Hi @tdhock,

Thank you for acting as the editor of this submission. I have recently added a paper.md file to the repository in hopes that aorsf can be eligible for an expedited review at the Journal of Open Source Software after the ropensci review. Can you let me know if any other files are needed or if other actions are needed from me to qualify for an expedited review at JOSS? Thank you!

tdhock commented 2 years ago

I'm not sure about the expedited JOSS review. You may ask @noamross or @mpadge

noamross commented 2 years ago

@bcjaeger The paper.md file should be sufficient. For JOSS submission, after RO review is complete you should submit to JOSS and link to this review thread in the pre-review JOSS thread. JOSS editors will review the paper.md and can opt to use RO's software reviews rather than finding additional reviewers.

bcjaeger commented 2 years ago

Thanks, @noamross! That makes sense.

bcjaeger commented 2 years ago

Hi @tdhock, I see we are still seeking reviewers. Is there anything I can do to help find reviewers for this submission?

tdhock commented 2 years ago

I have asked a few people to review but no one has agreed yet, do you have any idea for potential reviewers to ask?

bcjaeger commented 2 years ago

Thanks! Assuming folks in the rOpenSci circle have been asked already, I'll offer some names (with GitHub usernames) that may not have been asked yet but would be good reviewers for aorsf.

Terry Therneau (therneau) would be a good reviewer - a lot of the C code in aorsf is based on his survival package coxph routine.

Torsten Hothorn (thothorn) would also be a good reviewer - Torsten is the author of the party package and I'd like aorsf to look like the party package in 10 years.

Hannah Frick (hfrick), Emil Hvitfeldt (EmilHvitfeldt), Max Kuhn (topepo), Davis Vaughan (DavisVaughan), and Julia Silge (juliasilge) would all be good reviewers - they are all developers/contributors to the censored package, and I'd like aorsf to contribute to that package.

Raphael Sonabend (RaphaelS1), Andreas Bender (adibender), Michel Lang (mllg), and Patrick Schratz (pat-s) would all be good reviewers - they are developers/contributors to the mlr3-proba package, and I'd like aorsf to contribute to that package.

bcjaeger commented 2 years ago

Hi @tdhock, I am thinking about reaching out to the folks I listed in my last comment. Before I try to reach them, I was wondering if you had already contacted them? If you have, I will not bother them again with a review request.

tdhock commented 2 years ago

Hi @bcjaeger, actually I have not asked any of those reviewers yet, sorry! I just got back to the office from traveling. Yes, it would be helpful if you could ask them to review (although it is suggested to not use more than one of them: https://devguide.ropensci.org/editorguide.html?q=reviewers#where-to-look-for-reviewers).

bcjaeger commented 2 years ago

Hi @tdhock, thanks! I will reach out to the folks in my earlier post and let you know if anyone is willing to review aorsf. Hope you had a great trip!

tdhock commented 2 years ago

@ropensci-review-bot add @chjackson to reviewers

ropensci-review-bot commented 2 years ago

@chjackson added to the reviewers list. Review due date is 2022-07-29. Thanks @chjackson for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 2 years ago

@chjackson: If you haven't done so, please fill this form for us to update our reviewers records.

chjackson commented 2 years ago

@tdhock Is this the right link for the reviewer guide: https://devguide.ropensci.org/reviewerguide.html? The link in the bot post gives a 404.

tdhock commented 2 years ago

@nicholasjhorton yes thank you very much for agreeing to review.

bcjaeger commented 2 years ago

@nicholasjhorton and @chjackson, thank you for agreeing to review!! I am thrilled and looking forward to talking about aorsf with you.

nicholasjhorton commented 2 years ago

I'm not sure that I was asked to review. Can you please resend the invite @tdhock?

tdhock commented 2 years ago

@ropensci-review-bot add @nicholasjhorton to reviewers

ropensci-review-bot commented 2 years ago

@nicholasjhorton added to the reviewers list. Review due date is 2022-08-01. Thanks @nicholasjhorton for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 2 years ago

@nicholasjhorton: If you haven't done so, please fill this form for us to update our reviewers records.

chjackson commented 2 years ago

Package Review

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 5


Review Comments

While I have a lot of experience with parametric statistical modelling of survival data, I am not knowledgeable about machine learning, so I can't go into depth on the ML methods side of things. That said, the vignettes explained clearly what the method is supposed to do - I felt like I learnt some useful things from them.

As far as I can tell, this is a polished, easily accessible package that implements a widely useful method. The differences and advances from other packages and methods are explained concisely.

I tried the package out by using it on a survival dataset that I had been using to test some of my own work, the colon data from the survival package. This is a simple randomised trial that compares time to cancer recurrence or death for three treatment groups, and I considered treatment as the only predictor. This may not be an ideal application for a random forest model, but hopefully it is helpful to see how the method does on this case.

library(survival)
library(dplyr)
library(aorsf)
colonc <- colon |>
  filter(etype == 1) |>
  mutate(lev = factor(rx == "Lev"),
         lev5fu = factor(rx == "Lev+5FU"))

When using orsf with a single factor predictor, I got "Error: formula must have at least 2 predictors". Perhaps it would be helpful to explain the restriction to 2+ predictors somewhere, though I guess that it is rare that people will want to use a machine learning method with only one predictor.

fit <- orsf(data_train = colonc, formula = Surv(time, status) ~ rx)

When I recoded the data so the formula included two binary predictors, it worked. Though I realise that in this case the data contain only three out of the four possible combinations of the two binary variables, so it is impossible to identify whether there is an interaction.

I then calculated predicted survival at 2000 and 3000 days, under the two methods, for each of the four combinations of predictors.

library(aorsf)
fit <- orsf(data_train = colonc, formula = Surv(time, status) ~ lev + lev5fu)
fit <- orsf(data_train = colonc, formula = Surv(time, status) ~ lev + lev5fu, n_tree = 100, n_split=3)
pd_spec <- list(lev=c("FALSE","TRUE"), 
                lev5fu=c("FALSE","TRUE"))
pd_data <- orsf_pd_summary(object = fit, pd_spec = pd_spec,
                           pred_horizon = c(2000, 3000),
                           oobag = TRUE, expand_grid = TRUE, 
                           risk=FALSE)
pd_data

##    pred_horizon   lev lev5fu      mean       lwr      medn       upr
## 1:         2000 FALSE  FALSE 0.4511464 0.4457626 0.4510625 0.4572469
## 2:         2000  TRUE  FALSE 0.4496973 0.4451570 0.4496190 0.4549601
## 3:         2000 FALSE   TRUE 0.6138173 0.6072743 0.6138223 0.6205446
## 4:         2000  TRUE   TRUE 0.5574858 0.5362698 0.5575599 0.5803773
## 5:         3000 FALSE  FALSE 0.4239031 0.4180159 0.4237805 0.4305380
## 6:         3000  TRUE  FALSE 0.4224044 0.4174938 0.4224400 0.4274585
## 7:         3000 FALSE   TRUE 0.6016161 0.5949703 0.6015791 0.6084788
## 8:         3000  TRUE   TRUE 0.5405302 0.5183706 0.5406791 0.5655176

I compared the orsf model to a flexible parametric, spline-based, non-proportional hazards survival model from the flexsurv package.

library(flexsurv)
spl <- flexsurvspline(Surv(time, status) ~ lev + lev5fu + gamma1(lev) + gamma1(lev5fu), data=colonc, k=5)
summary(spl, newdata=expand.grid(pd_spec), t=c(2000,3000), tidy=TRUE) |> 
 arrange(time, lev5fu, lev)

##   time       est       lcl       ucl   lev lev5fu
## 1 2000 0.4418226 0.3930134 0.4876635 FALSE  FALSE
## 2 2000 0.4495771 0.3991124 0.4943800  TRUE  FALSE
## 3 2000 0.6143594 0.5584856 0.6625904 FALSE   TRUE
## 4 2000 0.6207677 0.5397314 0.6971653  TRUE   TRUE
## 5 3000 0.4084440 0.3582102 0.4573942 FALSE  FALSE
## 6 3000 0.4186286 0.3673111 0.4650971  TRUE  FALSE
## 7 3000 0.5871624 0.5340537 0.6394870 FALSE   TRUE
## 8 3000 0.5958252 0.5095980 0.6783956  TRUE   TRUE

The predictions agreed for each combination except the one with both factor levels TRUE. This is understandable because these combinations do not appear in the data, and the two methods will be relying on different assumptions to extrapolate to this combination.

At this point I wondered how the intervals in the output were determined, and what they mean. Do they reflect uncertainty about an expected value, variability in observed values, or something else? The help page (argument prob_values) explains them as quantiles, but quantiles of what? It can't be quantiles of some subset of the data with that combination of predictors, as there isn't any such data in the training set. Perhaps this is obvious to someone with more experience with random forests or related ML methods.

Observations from the Introduction vignette

bcjaeger commented 2 years ago

@chjackson, thank you! All of the suggestions from your review make sense and are actionable. I plan to make changes to aorsf's documentation based on your feedback.

Regarding performance, I decided to put almost all of the benchmarking results in a separate project: https://github.com/bcjaeger/aorsf-bench. If you would like to review the performance claims, I recommend checking this repo out and in particular paper/jmlr/main.pdf, which is the first draft of a paper introducing the methods (not the software) involved in this R package. The results in this paper verify claims about performance in the aorsf documentation.

nicholasjhorton commented 2 years ago

Dear @tdhock: I'm sorry but I'm feeling quite confused. What did I sign up for? I seem to have been added but don't know what I was asked to do. Can you please resend the invitation?

tdhock commented 2 years ago

Sorry for the confusion. Did you not agree to review the aorsf package? If you can write the review, please see the instructions at https://devguide.ropensci.org/reviewerguide.html. Otherwise, please tell me if you would like me to remove you from the reviewer list.

nicholasjhorton commented 2 years ago

I didn't agree. The first I heard of the request was https://github.com/ropensci/software-review/issues/532#issuecomment-1179231500.

Unfortunately, I'm not able to review at this time. Can you please remove me from the reviewer list?

tdhock commented 2 years ago

@ropensci-review-bot remove @nicholasjhorton from reviewers

ropensci-review-bot commented 2 years ago

@nicholasjhorton removed from the reviewers list!

tdhock commented 2 years ago

Sorry for the trouble!

chjackson commented 2 years ago

Thanks, @bcjaeger - I'm happy with the performance claims from a quick scan of that paper.

bcjaeger commented 2 years ago

Thank you! I pushed some changes to the main branch of aorsf today based on the feedback from your review. I really appreciate your time and thoughts on the package. You are more than welcome to review the added clarifications to the documentation and let me know if there is anything you'd recommend modifying. Seeing your response above, I will add you as 'rev' to the DESCRIPTION file.

chjackson commented 2 years ago

That's fine to add me, thanks. The doc changes look good.

Is it worth adding to the help pages what the functions do if you omit pred_horizon, and adding the specific prediction time used to the corresponding output data frame? Even if it's sensible to use an explicit prediction time, perhaps someone would have a genuine reason for using a default based on a summary of the data, or they may leave pred_horizon out by accident and then be curious about what the output means.

For my own interest, I'm still not sure what the intervals mean. When you say "The quantiles are calculated based on predictions from object at each set of values indicated by pd_spec.", does that mean that the predicted risk for a given set of covariate values is stochastic? Then how is the point prediction defined, and does this stochastic variation describe uncertainty about the expected risk (which can be reduced by collecting more data), or variation in observed outcomes (which is a characteristic of the population that the data are drawn from)?

bcjaeger commented 2 years ago

Thanks! I think it would be worth noting that pred_horizon plays a role in all prediction functions. Perhaps I could write that statement into the details or directly into the description of the pred_horizon input?

It is also worth mentioning that pred_horizon can be pre-determined by the object returned by orsf(). When you run orsf(), you can specify an input called oobag_time that will set pred_horizon for your out-of-bag predictions.

Possible API change: I am wondering if it would make sense to change the name of the input oobag_time in orsf() to oobag_pred_horizon, for consistency within the package. Do you think that would be helpful?

As far as partial dependence goes, suppose we have an outcome Y modeled as a function of X and we want to compute partial dependence of Y at X = 1. We first set X equal to 1 for all rows in the training data (or whatever data we are using to compute partial dependence) and then compute the predicted value of Y for each row in the modified data, leaving us with N predictions if we have N observations. The point estimate for partial dependence is the mean of the predictions we compute, but you could also compute the 25th percentile of those predictions, the median, the 75th percentile, or whatever summary you like. I definitely wouldn't think of those percentiles as confidence bounds, but they do a nice job of quantifying how much the predictions vary across the training data when X is fixed at 1. Does that make sense? Perhaps we could add a definition like this into the partial dependence vignette?
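To make that definition concrete, here is a minimal sketch of partial dependence computed by hand; fit, train, and predict_risk() are placeholders for illustration rather than aorsf functions:

# Hand-rolled sketch of the partial dependence definition described above.
# `fit`, `train`, and `predict_risk()` are hypothetical placeholders.
partial_dependence <- function(fit, train, predictor, value,
                               predict_risk, probs = c(0.25, 0.50, 0.75)) {
  new_data <- train
  new_data[[predictor]] <- value        # set X = value for every row
  preds <- predict_risk(fit, new_data)  # one predicted value of Y per row
  c(mean = mean(preds),                 # the partial dependence point estimate
    quantile(preds, probs = probs))     # percentiles summarise spread across rows
}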

chjackson commented 2 years ago

Thanks - I was forgetting that the partial dependencies on one predictor were marginalised over the other predictors. So I'd say those intervals are describing variability in the predicted outcome for a specific (or "focused"?) predictor value, due to variation in the other predictors. Whereas the "partial dependence value" is the predicted outcome for a specific predictor value, averaged over values of the other predictors.

Naming consistency is usually good!

bcjaeger commented 2 years ago

Excellent! Just pushed the API changes to the main branch. In addition to changing oobag_time to oobag_pred_horizon, I also changed data_train to data in orsf(). The second change was done to have consistency between orsf() and other modeling functions, most of which denote training data as data in their inputs.
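For reference, a call using the renamed arguments would look roughly like this (a sketch based on the reviewer's earlier example; the values are illustrative):

# Sketch of the post-review interface: `data` replaces `data_train`, and
# `oobag_pred_horizon` replaces `oobag_time`.
fit <- orsf(data = colonc,
            formula = Surv(time, status) ~ lev + lev5fu,
            n_tree = 100,
            oobag_pred_horizon = 2000)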

Are there any other changes we should consider?

bcjaeger commented 2 years ago

Hi @chjackson, I checked on some other submissions and it looks like final approval is given by each reviewer before a submission is closed. Do you approve of {aorsf}'s current state? If not, I am happy to make additional changes.

tdhock commented 2 years ago

I'm not sure if this is in scope for the package review, but I had a look at your aorsf-bench slides/figures, and I saw one figure that compared the computation time of various packages/algos, I think for a single data set/size. That is a good start, but I was wondering if you have done any analysis of the computation time of the packages/algos as a function of data size? (plot time vs number of features and number of observations, is it expected to scale linear for each algo, or..?)

bcjaeger commented 2 years ago

Hi @tdhock - thank you for checking that out!

The figure you are referring to is https://bcjaeger.github.io/aorsf-bench/#32, right? The slide does not make this clear, but the times in the figure for each algo are aggregated across all of the risk prediction tasks on this slide: https://bcjaeger.github.io/aorsf-bench/#29. So I'd say the timing estimates are more general than timing estimates on one dataset, but they still do not show patterns on how each algo scales with more observations or more features.

I very much like the idea of showing computation time as a function of the number of observations and as a function of the number of features. Would it make sense to do this by focusing on one of the larger benchmark datasets, and then using subsets of that data with progressively higher dimensions? e.g., suppose one dataset has 100 features and 10,000 rows. We could run:

  1. using the first 1000 observations and all features
  2. using the first 2500 observations and all features
  3. using the first 5000 observations and all features
  4. using the first 7500 observations and all features
  5. using the first 10000 observations and all features

And then do a similar routine for features, holding the number of observations fixed.

tdhock commented 2 years ago

Right, https://bcjaeger.github.io/aorsf-bench/#32 is the figure (which shows combined time for training and prediction). I would suggest doing separate timings of the training and prediction steps. I suppose training is usually the bottleneck/slower step, which would be more important to analyze/prove you are fast enough, right?

Hi @bcjaeger, yes, that sounds good, but I would suggest avoiding linear scales (2500, 5000, 7500). Typically for this kind of analysis I would suggest varying the number of rows/columns on a log scale, like (100, 1000, 10000), or N = as.integer(10^seq(2, 4, by=0.5)). For an example of the kind of figure I am thinking of, see Fig 7 of https://jmlr.org/papers/volume21/18-843/18-843.pdf
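Something along these lines could work (a rough sketch, not the actual aorsf-bench code; the simulated data, the orsf() arguments, and the predict() call with pred_horizon are assumptions that may need adjusting to the installed aorsf version):

# Rough sketch of a scaling benchmark with log-scaled n and separate
# train/predict timings, using simulated right-censored data.
library(aorsf)
library(survival)

sim_surv <- function(n, p) {
  x <- matrix(rnorm(n * p), nrow = n,
              dimnames = list(NULL, paste0("x", seq_len(p))))
  data.frame(time = rexp(n, rate = exp(x[, 1] / 10)),
             status = rbinom(n, size = 1, prob = 0.7),
             x)
}

n_obs <- as.integer(10^seq(2, 4, by = 0.5))  # 100, 316, 1000, 3162, 10000

timings <- do.call(rbind, lapply(n_obs, function(n) {
  d <- sim_surv(n, p = 10)
  f <- as.formula(paste("Surv(time, status) ~",
                        paste(setdiff(names(d), c("time", "status")),
                              collapse = " + ")))
  t_train <- system.time(fit <- orsf(data = d, formula = f, n_tree = 500))[["elapsed"]]
  t_pred  <- system.time(predict(fit, new_data = d,
                                 pred_horizon = median(d$time)))[["elapsed"]]
  data.frame(n = n, train_seconds = t_train, predict_seconds = t_pred)
}))

timings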

bcjaeger commented 2 years ago

Ahh, that makes perfect sense. I will add this to the paper I am working on in the aorsf-bench repo.

chjackson commented 2 years ago

Hi all, yes, happy to approve the current version of aorsf.

bcjaeger commented 2 years ago

So it will take me a while to get this result written into the paper, but here is a very rough figure to be responsive to @tdhock's earlier suggestion. Each facet in the figure has the number of predictor variables printed in the top strip (10, 100, or 1,000). Each point is the mean time to fit the corresponding model, taken over 10 independent runs and allowing up to 4 CPUs for parallel computing.

Overall, the ranger package tends to do best with lower values of n, but its computing time scales up very fast. This may be due to memory usage (I noticed my working memory was completely used up while ranger was running). The aorsf package tends to be similar to randomForestSRC (rsf_rfsrc in the figure), with aorsf having a narrow advantage when p = 10 or 100 and randomForestSRC winning out when p = 1000. The party package scales consistently but is generally slower than ranger, randomForestSRC, and aorsf. I am puzzled by the way randomForestSRC scales better when p is 1000 but not when p is 100 or 10. Perhaps a special computational routine is activated in randomForestSRC when p is very large? It seems like a different computation may be taking place in randomForestSRC with n > 1000 as well.

The data used here were simulated, and all predictors were continuous.

[Figure: model fitting time versus number of observations, faceted by number of predictors (p = 10, 100, 1,000).]