Closed bcjaeger closed 2 years ago
Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help
for help.
👋
🚀
The following problem was found in your submission template:
- 'statsgrade' variable must be one of [bronze, silver, gold]

Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.
Just updated to fix this. I'm not sure if reviewers will think that 'gold' is the right statsgrade, but might as well aim high.
git hash: c73bb98c
Package License: MIT + file LICENSE
srr
This package is in the following category:
:heavy_check_mark: All applicable standards [v0.1.0] have been documented in this package (102 complied with; 56 N/A standards)
Click to see the report of author-reported standards compliance of the package with links to associated lines of code, which can be re-generated locally by running the `srr_report()` function from within a local clone of the repository.
The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.
|type |package | ncalls|
|:----------|:-------------|------:|
|internal |base | 341|
|internal |aorsf | 152|
|internal |utils | 13|
|internal |stats | 11|
|internal |methods | 1|
|imports |table.glue | 3|
|imports |Rcpp | NA|
|imports |data.table | NA|
|suggests |glmnet | 1|
|suggests |survival | NA|
|suggests |survivalROC | NA|
|suggests |ggplot2 | NA|
|suggests |testthat | NA|
|suggests |knitr | NA|
|suggests |rmarkdown | NA|
|suggests |covr | NA|
|suggests |units | NA|
|linking_to |Rcpp | NA|
|linking_to |RcppArmadillo | NA|
Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(
base

attr (40), names (35), for (23), length (21), c (20), paste (16), list (14), paste0 (12), mode (10), seq_along (10), vector (10), which (9), as.matrix (8), rep (7), as.integer (6), drop (6), seq (6), switch (6), order (5), match (4), min (4), setdiff (4), colnames (3), inherits (3), ncol (3), Sys.time (3), all (2), any (2), as.factor (2), cbind (2), data.frame (2), grepl (2), lapply (2), levels (2), matrix (2), nchar (2), nrow (2), rev (2), rle (2), row.names (2), sum (2), suppressWarnings (2), try (2), all.vars (1), as.data.frame (1), class (1), deparse (1), formals (1), grep (1), if (1), ifelse (1), intersect (1), is.na (1), max (1), mean (1), print (1), return (1), rownames (1), sapply (1), t (1), typeof (1), unique (1)

aorsf

paste_collapse (8), fctr_info (5), get_fctr_info (5), get_names_x (5), f_oobag_eval (3), get_n_obs (3), get_n_tree (3), get_names_y (3), get_numeric_bounds (3), last_value (3), orsf_fit (3), ref_code (3), unit_info (3), check_var_types (2), get_f_oobag_eval (2), get_importance (2), get_leaf_min_events (2), get_leaf_min_obs (2), get_max_time (2), get_mtry (2), get_n_events (2), get_n_leaves_mean (2), get_oobag_eval_every (2), get_type_oobag_eval (2), get_unit_info (2), is_empty (2), list_init (2), orsf_control_net (2), orsf_pd_summary (2), orsf_train_ (2), select_cols (2), check_arg_bound (1), check_arg_gt (1), check_arg_gteq (1), check_arg_is (1), check_arg_is_integer (1), check_arg_is_valid (1), check_arg_length (1), check_arg_lt (1), check_arg_lteq (1), check_arg_type (1), check_arg_uni (1), check_control_cph (1), check_control_net (1), check_new_data_fctrs (1), check_new_data_names (1), check_new_data_types (1), check_oobag_fun (1), check_orsf_inputs (1), check_pd_inputs (1), check_predict (1), check_units (1), contains_oobag (1), contains_vi (1), f_beta (1), fctr_check (1), fctr_check_levels (1), fctr_id_check (1), get_cph_do_scale (1), get_cph_eps (1), get_cph_iter_max (1), get_cph_method (1), get_cph_pval_max (1), get_f_beta (1), get_n_retry (1), get_n_split (1), get_net_alpha (1), get_net_df_target (1), get_oobag_pred (1), get_oobag_time (1), get_orsf_type (1), get_split_min_events (1), get_split_min_obs (1), get_tree_seeds (1), get_types_x (1), insert_vals (1), is_aorsf (1), is_error (1), is_trained (1), leaf_kaplan_testthat (1), lrt_multi_testthat (1), newtraph_cph_testthat (1), oobag_c_harrell_testthat (1), orsf (1), orsf_control_cph (1), orsf_oob_vi (1), orsf_pd_ (1), orsf_pd_ice (1), orsf_pred_multi (1), orsf_pred_uni (1), orsf_scale_cph (1), orsf_summarize_uni (1), orsf_time_to_train (1), orsf_train (1), orsf_unscale_cph (1), orsf_vi_ (1), x_node_scale_exported (1)

utils

data (13)

stats

formula (4), dt (2), terms (2), family (1), time (1), weights (1)

table.glue

round_spec (1), round_using_magnitude (1), table_value (1)

glmnet

glmnet (1)

methods

new (1)
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
The package has:

- code in C++ (48% in 2 files) and R (52% in 21 files)
- 1 author
- 3 vignettes
- 1 internal data file
- 3 imported packages
- 15 exported functions (median 15 lines of code)
- 216 non-exported functions in R (median 3 lines of code)
- 48 C++ functions (median 38 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by [the `checks_to_markdown()` function](https://docs.ropensci.org/pkgcheck/reference/checks_to_markdown.html). The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure                  | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R                  |    21|       82.3|           |
|files_src                |     2|       79.1|           |
|files_vignettes          |     3|       92.4|           |
|files_tests              |    18|       95.7|           |
|loc_R                    |  2139|       85.4|           |
|loc_src                  |  1982|       75.6|           |
|loc_vignettes            |   342|       67.9|           |
|loc_tests                |  1532|       91.7|           |
|num_vignettes            |     3|       94.2|           |
|data_size_total          |  9034|       70.4|           |
|data_size_median         |  9034|       78.5|           |
|n_fns_r                  |   231|       91.6|           |
|n_fns_r_exported         |    15|       58.5|           |
|n_fns_r_not_exported     |   216|       94.0|           |
|n_fns_src                |    48|       66.0|           |
|n_fns_per_file_r         |     7|       77.6|           |
|n_fns_per_file_src       |    24|       94.8|           |
|num_params_per_fn        |     3|       33.6|           |
|loc_per_fn_r             |     3|        1.1|TRUE       |
|loc_per_fn_r_exp         |    15|       35.6|           |
|loc_per_fn_r_not_exp     |     3|        1.5|TRUE       |
|loc_per_fn_src           |    38|       88.9|           |
|rel_whitespace_R         |    50|       96.1|TRUE       |
|rel_whitespace_src       |    58|       89.6|           |
|rel_whitespace_vignettes |    56|       83.5|           |
|rel_whitespace_tests     |    45|       96.8|TRUE       |
|doclines_per_fn_exp      |    46|       58.1|           |
|doclines_per_fn_not_exp  |     0|        0.0|TRUE       |
|fn_call_network_size     |   364|       93.5|           |

---
Click to see the interactive network visualisation of calls between objects in package
### 3. `goodpractice` and other checks

#### 3a. Continuous Integration Badges

[![R-CMD-check](https://github.com/bcjaeger/aorsf/workflows/R-CMD-check/badge.svg)](https://github.com/bcjaeger/aorsf/actions)
[![pkgcheck](https://github.com/bcjaeger/aorsf/workflows/pkgcheck/badge.svg)](https://github.com/bcjaeger/aorsf/actions)

**GitHub Workflow Results**

|name                       |conclusion |sha    |date       |
|:--------------------------|:----------|:------|:----------|
|Commands                   |skipped    |069021 |2022-04-22 |
|pages build and deployment |success    |c2efe9 |2022-04-28 |
|pkgcheck                   |success    |c73bb9 |2022-04-28 |
|pkgdown                    |success    |c73bb9 |2022-04-28 |
|R-CMD-check                |success    |c73bb9 |2022-04-28 |
|test-coverage              |success    |c73bb9 |2022-04-28 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following note:

1. checking installed package size ... NOTE
   installed size is 6.9Mb
   sub-directories of 1Mb or more:
     libs  6.0Mb

R CMD check generated the following check_fails:

1. no_import_package_as_a_whole
2. rcmdcheck_reasonable_installed_size

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 97.13

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following functions have cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
orsf | 33
check_orsf_inputs | 28
orsf_pd_ | 22
ref_code | 17
check_new_data_names | 15
check_predict | 15

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 223 potential issues:

message | number of times
--- | ---
Avoid using sapply, consider vapply instead, that's type safe | 10
Lines should not be more than 80 characters. | 182
Use <-, not =, for assignment. | 31
|package  |version   |
|:--------|:---------|
|pkgstats |0.0.4.30  |
|pkgcheck |0.0.3.11  |
|srr      |0.0.1.149 |
This package is in top shape and may be passed on to a handling editor
@ropensci-review-bot assign @tdhock as editor
Assigned! @tdhock is now the editor
@ropensci-review-bot seeking reviewers
Please add this badge to the README of your package repository:
[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/532_status.svg)](https://github.com/ropensci/software-review/issues/532)
Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news
Done! Looking forward to the review.
Hi @tdhock,
Thank you for acting as the editor of this submission. I have recently added a paper.md file to the repository in hopes that `aorsf` can be eligible for an expedited review at the Journal of Open Source Software after the rOpenSci review. Can you let me know if any other files are needed or if other actions are needed from me to qualify for an expedited review at JOSS? Thank you!
I'm not sure about the expedited JOSS review. You may ask @noamross or @mpadge
@bcjaeger The paper.md file should be sufficient. For JOSS submission, after RO review is complete you should submit to JOSS and link to this review thread in the pre-review JOSS thread. JOSS editors will review the paper.md and can opt to use RO's software reviews rather than finding additional reviewers.
Thanks, @noamross! That makes sense.
Hi @tdhock, I see we are still seeking reviewers. Is there anything I can do to help find reviewers for this submission?
I have asked a few people to review but no one has agreed yet, do you have any idea for potential reviewers to ask?
Thanks! Assuming folks in the rOpenSci circle have been asked already, I'll offer some names (username) that may not have been asked yet but would be good reviewers for `aorsf`.

- Terry Therneau (therneau) would be a good reviewer - a lot of the C code in `aorsf` is based on his `survival` package `coxph` routine.
- Torsten Hothorn (thothorn) would also be a good reviewer - Torsten is the author of the `party` package and I'd like `aorsf` to look like the `party` package in 10 years.
- Hannah Frick (hfrick), Emil Hvitfeldt (EmilHvitfeldt), Max Kuhn (topepo), Davis Vaughan (DavisVaughan), and Julia Silge (juliasilge) would all be good reviewers - they are all developers/contributors to the `censored` package, and I'd like `aorsf` to contribute to that package.
- Raphael Sonabend (RaphaelS1), Andreas Bender (adibender), Michel Lang (mllg), and Patrick Schratz (pat-s) would all be good reviewers - they are developers/contributors to the `mlr3proba` package, and I'd like `aorsf` to contribute to that package.
Hi @tdhock, I am thinking about reaching out to the folks I listed in my last comment. Before I try to reach them, I was wondering if you had already contacted them? If you have, I will not bother them again with a review request.
Hi @bcjaeger actually I have not asked any of those reviewers yet, sorry! I just got back to the office from traveling. Yes that would be helpful if you could ask them to review. (although it is suggested to not use more than one of them, https://devguide.ropensci.org/editorguide.html?q=reviewers#where-to-look-for-reviewers)
Hi @tdhock, thanks! I will reach out to the folks in my earlier post and let you know if anyone is willing to review `aorsf`. Hope you had a great trip!
@ropensci-review-bot add @chjackson to reviewers
@chjackson added to the reviewers list. Review due date is 2022-07-29. Thanks @chjackson for accepting to review! Please refer to our reviewer guide.
rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.
@chjackson: If you haven't done so, please fill this form for us to update our reviewers records.
@tdhock Is this the right link for the reviewer guide: (https://devguide.ropensci.org/reviewerguide.html). The link in the bot post gives a 404.
@nicholasjhorton yes thank you very much for agreeing to review.
@nicholasjhorton and @chjackson, thank you for agreeing to review!! I am thrilled and looking forward to talking about `aorsf` with you.
I'm not sure that I was asked to review. Can you please resend the invite @tdhock?
@ropensci-review-bot add @nicholasjhorton to reviewers
@nicholasjhorton added to the reviewers list. Review due date is 2022-08-01. Thanks @nicholasjhorton for accepting to review! Please refer to our reviewer guide.
rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.
@nicholasjhorton: If you haven't done so, please fill this form for us to update our reviewers records.
Briefly describe any working relationship you have (had) with the package authors.
☒ As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).
The package includes all the following forms of documentation:
☒ A statement of need: clearly stating problems the software is designed to solve and its target audience in README
☒ Installation instructions: for the development version of package and any non-standard dependencies in README
☒ Vignette(s): demonstrating major functionality that runs successfully locally
☒ Function Documentation: for all exported functions in `help(package="aorsf")`. I couldn't see a `help("aorsf-package")` from R; is this because it is labelled with keyword "internal"? I don't know if a package overview help page is strictly necessary given the DESCRIPTION and README files, but I sometimes find them helpful (is this the "multiple points of entry" principle?).
☒ Examples: (that run successfully locally) for all exported functions
☒ Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).
☒ Installation: Installation succeeds as documented.
☒ Functionality: Any functional claims of the software have been confirmed.
☐ Performance: Any performance claims of the software have been confirmed.
☒ Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
☒ Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.
Estimated hours spent reviewing: 5
While I have a lot of experience with parametric statistical modelling of survival data, I am not knowledgeable about machine learning. So I can’t go into depth on the ML methods side of things. Though the vignettes explained clearly what the method is supposed to do - I felt like I learnt some useful things from them.
As far as I can tell, this is a polished, easily accessible package that implements a widely useful method. The differences and advances from other packages and methods are explained concisely.
I tried the package out by using it on a survival dataset that I had been using to test some of my own work, the `colon` data from the `survival` package. This is a simple randomised trial that compares time to cancer recurrence or death for three treatment groups, and I considered treatment as the only predictor. This may not be an ideal application for a random forest model, but hopefully it is helpful to see how the method does on this case.
library(survival)
library(dplyr)

# keep the recurrence records (etype == 1) and create binary treatment indicators
colonc <- colon |>
  filter(etype == 1) |>
  mutate(lev = factor(rx == "Lev"),
         lev5fu = factor(rx == "Lev+5FU"))
When using `orsf` with a single factor predictor, I got "Error: formula must have at least 2 predictors". Perhaps it would be helpful to explain the restriction to 2+ predictors somewhere, though I guess that it is rare that people will want to use a machine learning method with only one predictor.
fit <- orsf(data_train = colonc, formula = Surv(time, status) ~ rx)
When I recoded the data so the formula included two binary predictors, it worked. Though I realise that in this case the data contain only three out of the four possible combinations of the two binary variables, so it is impossible to identify whether there is an interaction.
I then calculated predicted survival at 2000 and 3000 days, under the two methods, for each of the four combinations of predictors.
library(aorsf)
fit <- orsf(data_train = colonc, formula = Surv(time, status) ~ lev + lev5fu)
fit <- orsf(data_train = colonc, formula = Surv(time, status) ~ lev + lev5fu, n_tree = 100, n_split=3)
pd_spec <- list(lev = c("FALSE", "TRUE"),
                lev5fu = c("FALSE", "TRUE"))
pd_data <- orsf_pd_summary(object = fit, pd_spec = pd_spec,
                           pred_horizon = c(2000, 3000),
                           oobag = TRUE, expand_grid = TRUE,
                           risk = FALSE)
pd_data
## pred_horizon lev lev5fu mean lwr medn upr
## 1: 2000 FALSE FALSE 0.4511464 0.4457626 0.4510625 0.4572469
## 2: 2000 TRUE FALSE 0.4496973 0.4451570 0.4496190 0.4549601
## 3: 2000 FALSE TRUE 0.6138173 0.6072743 0.6138223 0.6205446
## 4: 2000 TRUE TRUE 0.5574858 0.5362698 0.5575599 0.5803773
## 5: 3000 FALSE FALSE 0.4239031 0.4180159 0.4237805 0.4305380
## 6: 3000 TRUE FALSE 0.4224044 0.4174938 0.4224400 0.4274585
## 7: 3000 FALSE TRUE 0.6016161 0.5949703 0.6015791 0.6084788
## 8: 3000 TRUE TRUE 0.5405302 0.5183706 0.5406791 0.5655176
I compared the `orsf` model to a flexible parametric spline-based, non-proportional hazards survival model from the `flexsurv` package.
library(flexsurv)
spl <- flexsurvspline(Surv(time, status) ~ lev + lev5fu + gamma1(lev) + gamma1(lev5fu), data=colonc, k=5)
summary(spl, newdata=expand.grid(pd_spec), t=c(2000,3000), tidy=TRUE) |>
arrange(time, lev5fu, lev)
## time est lcl ucl lev lev5fu
## 1 2000 0.4418226 0.3930134 0.4876635 FALSE FALSE
## 2 2000 0.4495771 0.3991124 0.4943800 TRUE FALSE
## 3 2000 0.6143594 0.5584856 0.6625904 FALSE TRUE
## 4 2000 0.6207677 0.5397314 0.6971653 TRUE TRUE
## 5 3000 0.4084440 0.3582102 0.4573942 FALSE FALSE
## 6 3000 0.4186286 0.3673111 0.4650971 TRUE FALSE
## 7 3000 0.5871624 0.5340537 0.6394870 FALSE TRUE
## 8 3000 0.5958252 0.5095980 0.6783956 TRUE TRUE
The predictions agreed for each combination except the one with both factor levels TRUE. This is understandable because these combinations do not appear in the data, and the two methods will be relying on different assumptions to extrapolate to this combination.

At this point I wondered how the intervals in the output were determined, and what do they mean? Do they reflect uncertainty about an expected value, or variability in observed values, or something else? The help page (argument `prob_values`) explains them as quantiles, but of what? It can't be quantiles of some subset of the data with that combination of predictors, as there isn't any such data in the training set. Perhaps this is obvious to someone with more experience with random forests or related ML methods.
The y axis for “partial dependence of bilirubin and edema” should start at 0. I’d also end it at 1, but this is perhaps only necessary if the graph is intended to be compared with other graphs of predicted risk.
It wasn't clear what the package was doing when the `pred_horizon` argument was not supplied to the functions that did prediction. What horizon is being used? I found that `orsf_pd_summary` didn't show in the output what prediction time was being used, in cases where the prediction time wasn't supplied by the user. Though from `orsf_summarize_uni`, I deduced this was the median follow-up time in the data. I'd recommend showing good practice in the worked examples, e.g. using a meaningful prediction time that someone doing the analysis might use (e.g. 5 years). As a developer, I've found that it's common for users to do silly things by copying from worked examples in manuals without thinking.
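To make that recommendation concrete, here is a hedged sketch reusing the `fit` and `pd_spec` objects from the code above; the 5-year horizon is only an example value:

```r
# request partial dependence at an explicit, meaningful prediction time
# (5 years expressed in days, since follow-up in the colon data is recorded in days)
pd_5yr <- orsf_pd_summary(object = fit, pd_spec = pd_spec,
                          pred_horizon = 365 * 5)
```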
@chjackson, thank you! All of the suggestions from your review make sense and are actionable. I plan to make changes to `aorsf`'s documentation based on your feedback.

Regarding performance, I decided to put almost all of the benchmarking results in a separate project: https://github.com/bcjaeger/aorsf-bench. If you would like to review the performance claims, I recommend checking this repo out and in particular paper/jmlr/main.pdf, which is the first draft of a paper introducing the methods (not the software) involved in this R package. The results in this paper verify claims about performance in the `aorsf` documentation.
Dear @tdhock: I'm sorry but I'm feeling quite confused. What did I sign up for? I seem to have been added but don't know what I was asked to do. Can you please resend the invitation?
Sorry for the confusion. You did not agree to review the aorsf package? If you can write the review, please see the instructions: https://devguide.ropensci.org/reviewerguide.html. Otherwise, please tell me if you would like me to remove you from the reviewer list.
I didn't agree. The first I heard of the request was https://github.com/ropensci/software-review/issues/532#issuecomment-1179231500.
Unfortunately, I'm not able to review at this time. Can you please remove me from the reviewer list?
@ropensci-review-bot remove @nicholasjhorton from reviewers
@nicholasjhorton removed from the reviewers list!
Sorry for the trouble!
Thanks, @bcjaeger - I'm happy with the performance claims from a quick scan of that paper.
Thank you! I pushed some changes to the main branch of `aorsf` today based on the feedback from your review. I really appreciate your time and thoughts on the package. You are more than welcome to review the added clarifications to the documentation and let me know if there is anything you'd recommend modifying. Seeing your response above, I will add you as 'rev' to the DESCRIPTION file.
That's fine to add me, thanks. The doc changes look good.
Is it worth adding to the help pages what the functions do if you omit `pred_horizon`, and adding to the corresponding output data frame the specific prediction time being used? Even if it's sensible to use an explicit prediction time, perhaps someone would have a genuine reason for using a default based on a summary of the data, or they may leave `pred_horizon` out by accident and then be curious about what the output means.
For my own interest, I'm still not sure what the intervals mean. When you say "The quantiles are calculated based on predictions from object at each set of values indicated by pd_spec.", does that mean that the predicted risk for a given set of covariate values is stochastic? Then how is the point prediction defined, and does this stochastic variation describe uncertainty about the expected risk (which can be reduced by collecting more data), or variation in observed outcomes (which is a characteristic of the population that the data are drawn from)?
Thanks! I think it would be worth noting that `pred_horizon` plays a role in all prediction functions. Perhaps I could write that statement into the details or directly into the description of the `pred_horizon` input?

It is also worth mentioning that `pred_horizon` can be pre-determined by the object returned by `orsf()`. When you run `orsf()`, you can specify an input called `oobag_time` that will set `pred_horizon` for your out-of-bag predictions.

**Possible API change:** I am wondering if it would make sense to change the name of the input `oobag_time` in `orsf()` to `oobag_pred_horizon` for consistency within the package. Do you think that would be helpful?
As far as partial dependence goes, suppose we have an outcome Y modeled as a function of X and we want to compute partial dependence of Y at X = 1. We first set X equal to 1 for all rows in the training data (or whatever data we are using to compute partial dependence) and then compute the predicted value of Y for each row in the modified data, leaving us with N predictions if we have N observations. The point estimate for partial dependence is the mean of the predictions we compute, but you could also compute the 25th percentile of those predictions, the median, the 75th percentile, or whatever summary you like. I definitely wouldn't think of those percentiles as confidence bounds, but they do a nice job of quantifying how much the predictions vary across the training data when X is fixed at 1. Does that make sense? Perhaps we could add a definition like this into the partial dependence vignette?
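A minimal sketch of that computation, assuming only a generic fitted model `fit` with a `predict()` method that returns one prediction per row of `data`; the function and argument names below are illustrative and are not part of `aorsf`:

```r
pd_at_value <- function(fit, data, x_name, x_value,
                        probs = c(0.25, 0.50, 0.75)) {
  modified <- data
  modified[[x_name]] <- x_value               # fix X at the chosen value for every row
  preds <- predict(fit, newdata = modified)   # N predictions, one per observation
  c(mean = mean(preds),                       # point estimate of partial dependence
    quantile(preds, probs = probs))           # spread of predictions across the data
}
```

The returned quantiles summarize how much the predictions vary across the data when X is held fixed, which is the sense in which they are not confidence bounds.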
Thanks - I was forgetting that the partial dependencies on one predictor were marginalised over the other predictors. So I'd say those intervals are describing variability in the predicted outcome for a specific (or "focused"?) predictor value, due to variation in the other predictors. Whereas the "partial dependence value" is the predicted outcome for a specific predictor value, averaged over values of the other predictors.
Naming consistency is usually good!
Excellent! Just pushed the API changes to the main branch. In addition to changing `oobag_time` to `oobag_pred_horizon`, I also changed `data_train` to `data` in `orsf()`. The second change was done to have consistency between `orsf()` and other modeling functions, most of which denote training data as `data` in their inputs.
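For reference, a hedged sketch of a call that uses the renamed arguments (reusing the reviewer's `colonc` data from earlier in the thread; the horizon value is only an example):

```r
fit <- orsf(data = colonc,                                # formerly data_train
            formula = Surv(time, status) ~ lev + lev5fu,
            oobag_pred_horizon = 2000)                    # formerly oobag_time
```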
Are there any other changes we should consider?
Hi @chjackson, I checked on some other submissions and it looks like final approval is given by each reviewer before a submission is closed. Do you approve of {aorsf}'s current state? If not, I am happy to make additional changes.
I'm not sure if this is in scope for the package review, but I had a look at your aorsf-bench slides/figures, and I saw one figure that compared the computation time of various packages/algos, I think for a single data set/size. That is a good start, but I was wondering if you have done any analysis of the computation time of the packages/algos as a function of data size? (plot time vs number of features and number of observations, is it expected to scale linear for each algo, or..?)
Hi @tdhock - thank you for checking that out!
The figure you are referring to is https://bcjaeger.github.io/aorsf-bench/#32, right? The slide does not make this clear, but the times in the figure for each algo are aggregated across all of the risk prediction tasks on this slide: https://bcjaeger.github.io/aorsf-bench/#29. So I'd say the timing estimates are more general than timing estimates on one dataset, but they still do not show patterns on how each algo scales with more observations or more features.
I very much like the idea of showing computation time as a function of the number of observations and as a function of the number of features. Would it make sense to do this by focusing on one of the larger benchmark datasets, and then using subsets of that data with progressively higher dimensions? E.g., suppose one dataset has 100 features and 10,000 rows. We could run the model on subsets of 2,500, 5,000, and 7,500 rows, up to the full 10,000, and then do a similar routine for features, holding the number of observations fixed.
Right, https://bcjaeger.github.io/aorsf-bench/#32 is the figure (which shows combined time for training and prediction). I would suggest doing separate timings of the training and prediction steps. I suppose training is usually the bottleneck/slower step which would be more important to analyze/prove you are fast enough, right?
Hi @bcjaeger yes that sounds good, but I would suggest avoiding linear scales (2500, 5000, 7500). Typically for this kind of analysis I would suggest varying the number of rows/columns on a log scale, like (100, 1000, 10000) or `N = as.integer(10^seq(2, 4, by = 0.5))`.
For an example of the kind of figure I am thinking of, see Fig 7 of https://jmlr.org/papers/volume21/18-843/18-843.pdf.
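A rough sketch of that kind of scaling experiment, assuming the `aorsf` and `survival` packages are installed; the simulated data and settings below are illustrative and are not the `aorsf-bench` code:

```r
library(survival)
library(aorsf)

# simulate right-censored data with n rows and p continuous predictors
sim_surv <- function(n, p) {
  x <- matrix(rnorm(n * p), nrow = n,
              dimnames = list(NULL, paste0("x", seq_len(p))))
  data.frame(time = rexp(n, rate = 1 / 10),
             status = rbinom(n, size = 1, prob = 0.7),
             x)
}

# vary the number of rows on a log scale, as suggested above
n_values <- as.integer(10^seq(2, 4, by = 0.5))

train_seconds <- sapply(n_values, function(n) {
  d <- sim_surv(n, p = 10)
  # time the training step only; prediction could be timed separately
  system.time(orsf(data = d, formula = Surv(time, status) ~ .))[["elapsed"]]
})

data.frame(n = n_values, seconds = train_seconds)
```

A similar loop over the number of predictors, holding n fixed, would give the corresponding column-scaling curve.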
Ahh, that makes perfect sense. I will add this to the paper I am working on in the aorsf-bench repo.
Hi all, yes, happy to approve the current version of aorsf.
So it will take me awhile to get this result written into the paper, but here is a very rough figure to be responsive to @tdhock's earlier suggestion. Each facet in the figure has the number of predictor variables printed at the top strip (10, 100, or 1,000). Each point is the mean value of time to fit the corresponding model, taken over 10 independent runs and allowing up to 4 CPUs for parallel computing.

Overall, the `ranger` package tends to do best with lower values of n, but its computing time scales up very fast. This may be due to memory usage (I noticed my working memory was completely used up while ranger was running). The `aorsf` package tends to be similar to `randomForestSRC` (rsf_rfsrc in the figure), with `aorsf` having a narrow advantage when p = 10 or 100 and `randomForestSRC` winning out when p = 1000. The `party` package scales consistently but is generally slower than `ranger`, `randomForestSRC`, and `aorsf`. I am puzzled by the way that `randomForestSRC` scales better when p is 1000 but not when p is 100 or 10. Perhaps a special computational routine is activated in `randomForestSRC` when p is very large? It seems like a different computation may be taking place in `randomForestSRC` with n > 1000 as well.
The data used here were simulated, and all predictors were continuous.
Date accepted: 2022-09-22
Submitting Author Name: Byron C Jaeger
Submitting Author Github Handle: @bcjaeger
Other Package Authors Github handles: @nmpieyeskey, @sawyerWeld
Repository: https://github.com/bcjaeger/aorsf
Version submitted: 0.0.0.9000
Submission type: Stats
Badge grade: gold
Editor: @tdhock
Reviewers: @chjackson, @mnwright, @jemus42
Due date for @chjackson: 2022-07-29
Due date for @mnwright: 2022-09-21
Due date for @jemus42: 2022-09-21
Archive: TBD
Version accepted: TBD
Language: en
Pre-submission Inquiry
General Information
Target audience: people who want to fit and interpret a risk prediction model, i.e., a prediction model for right-censored time-to-event outcomes.
Applications: fit an oblique random survival forest, compute predicted risk at a given time, estimate the importance of individual variables, and compute partial dependence to depict relationships between specific predictors and predicted risk.
Paste your responses to our General Standard G1.1 here, describing whether your software is:
Please include hyperlinked references to all other relevant software. The obliqueRSF package was the original R package for oblique random forests. I wrote it and it is very slow. I wrote `aorsf` because I had multiple ideas about how to make `obliqueRSF` faster, specifically using a partial Newton Raphson algorithm instead of using penalized regression to derive linear combinations of variables in decision nodes. It would have been possible to rewrite `obliqueRSF`, but it would have been difficult to make the re-write backwards compatible with the version of `obliqueRSF` on CRAN.

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
Not applicable
Badging
Gold
`aorsf` complies with over 100 combined standards in the general and ML categories. `aorsf` uses an optimized routine to partially complete Newton Raphson scoring for the Cox proportional hazards model and also an optimized routine to compute likelihood ratio tests. Both of these routines are heavily used when fitting oblique random survival forests, and both demonstrate the exact same answers as corresponding functions in the `survival` package (see tests in `aorsf`) while running at least twice as fast (thanks to RcppArmadillo).

Technical checks
Confirm each of the following by checking the box.

- I have run `autotest` checks on the package, and ensured no tests fail.
- The `srr_stats_pre_submit()` function confirms this package may be submitted.
- The `pkgcheck()` function confirms this package may be submitted - alternatively, please explain reasons for any checks which your package is unable to pass.

I think `aorsf` is passing `autotest` and `srr_stats_pre_submit()`. I am having some issues running these on R 4.2. Currently, autotest is returning NULL, which I understand to be a good thing, and srr_stats_pre_submit is not able to run (not sure why; but it was fine before I updated to R 4.2).

This package:
Publication options
Code of conduct