[REVIEW]: robustHD: An R package for robust regression with high-dimensional data

whedon commented 3 years ago

Submitting author: @aalfons (Andreas Alfons) Repository: https://github.com/aalfons/robustHD/ Version: v0.7.1 Editor: @mikldk Reviewer: @valentint, @msalibian Archive: 10.25397/eur.16802596.v1

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/3929d5aaa6df61cea6e470253df4a258"><img src="https://joss.theoj.org/papers/3929d5aaa6df61cea6e470253df4a258/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/3929d5aaa6df61cea6e470253df4a258/status.svg)](https://joss.theoj.org/papers/3929d5aaa6df61cea6e470253df4a258)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@valentint & @msalibian, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @mikldk know.

✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨

Review checklist for @valentint

✨ Important: Please do not use the Convert to issue functionality when working through this checklist, instead, please open any new issues associated with your review in the software repository associated with the submission. ✨

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@aalfons) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @msalibian

✨ Important: Please do not use the Convert to issue functionality when working through this checklist, instead, please open any new issues associated with your review in the software repository associated with the submission. ✨

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@aalfons) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

whedon commented 3 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @valentint, @msalibian it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 3 years ago

Wordcount for paper.md is 1546

whedon commented 3 years ago

Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.08 s (1004.9 files/s, 158530.3 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
R                               56            583           3734           3863
C++                              8            139            699           1261
SVG                              2              0              0           1053
Markdown                         2             86              0            460
TeX                              1             14              0            137
C/C++ Header                     8             46             60             95
C                                1              3              1             40
Rmd                              1             67             85             37
-------------------------------------------------------------------------------
SUM:                            79            938           4579           6946
-------------------------------------------------------------------------------

Statistical information for the repository '97cbb9e386fb99318e2109db' was
gathered on 2021/10/01.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Andreas Alfons                  45          4408           2075          100.00

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Andreas Alfons             2344           53.2         92.2               34.56

whedon commented 3 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1198/016214507000000950 is OK
- 10.1214/12-AOAS575 is OK
- 10.1016/j.csda.2015.02.007 is OK
- 10.1111/j.2517-6161.1996.tb02080.x is OK
- 10.1016/j.csda.2017.02.002 is OK
- 10.1016/j.chemolab.2017.11.017 is OK
- 10.1080/00401706.2017.1305299 is OK
- 10.1214/19-AOAS1269 is OK
- 10.1186/s12877-020-01644-2 is OK
- 10.3389/fevo.2020.583831 is OK
- 10.1158/0008-5472.can-12-1370 is OK

MISSING DOIs

- None

INVALID DOIs

- None

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

mikldk commented 3 years ago

@valentint, @msalibian: Thanks for agreeing to review. Please carry out your review in this issue by updating the checklist above and giving feedback in this issue. The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. If possible create issues (and cross-reference) in the submission's repository to avoid too specific discussions in this review thread.

If you have any questions or concerns please let me know.

Please note that I asked about the length of the paper in the pre-review. Both @aalfons and I would like your input in that. Bear in mind the JOSS guidelines of what the paper should contain: https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain.

msalibian commented 3 years ago

@mikldk I'm happy with the submission (documentation & repository are good to go, as far as I can see). For the paper I have a few suggestions below.

The issue of length is a tricky one: yes, the paper is long, but the two examples illustrate different functionality of the software. One way around this could be to reduce the text output of the examples in the paper (e.g. remove the estimated regression coefficients, etc.) and shorten the description of the data, etc. I think Figure 1 may also go, but I'd keep Figure 2 (the main difference between them being that in Fig 1 one does not get the whole 'path of solutions', because either sparseLTS doesn't compute it, or because it's not computed when using prediction error (as opposed to BIC) to select the optimal model, which is something that I'd like @aalfons to clarify, see my comment below).

Specific suggestions:

Summary: I'd avoid saying "cleaning the data" (lines 18-19 of page 1), specially when advocating for the use of robust estimation methods. Such a phrase opens the door to considering cleaning the data and applying "classical" methods as an alternative "robust" methodology, which is well-known to not work well in general;
Second paragraph of "Statement of need", rather than referring to the historical value of the package, I'd focus on what it does: implements regularized trimmed-squares regression estimators, the plug-in LARS and its group-wise variant;
line 57 (page 2), does sparseLTS estimate the smallest lambda that sets all coefficients to zero, or does it compute it?
line 58, what other options are available for the arguments mode and crit?
line 60, presumably the cross-validated prediction error also uses some kind of trimming or robustification, this should be mentioned when discussing how the prediction error is estimated;
line 66, although the argument K of foldControl seems fairly self explanatory, I think it'd be helpful to mention what R is or does;
lines 68-79, it'd also be helpful to say something about what is in the "reweighted" column (as compared with the "raw" one);
line 104, comparing this with the robust lars example, can we also get a full path of solutions with sparseLTS if we use crit = "BIC"? the different default behaviour is a bit intriguing;
lines 130-137: how are groups formed? It does not seem to be specified explicitly in the call to rgrplars below; is there a default behaviour? can the user override it?
lines 135-137: I don't think I'm understanding this sentence correctly: "The robust regression estimator used to estimate the submodels uses an initial subsampling of observations(...)" Should "estimate the submodels" be "estimate the regression coefficients corresponding to each submodel" (or something along those lines)? Or is it really trying to determine which groups of features (and thus submodels) to use?
Also regarding the sentence just above: I seem to remember that the "original" Robust LARS simply run the classical LARS algorithm using robust correlation coefficients, and thus it did not need iterations or initial values, etc. Am I remembering it wrong?

mikldk commented 3 years ago

@valentint, @msalibian: Can you confirm that you have finished the review and recommend that this paper is now published?

@aalfons: Please let me know when you have addressed (in one way or another) @msalibian's suggestions

aalfons commented 3 years ago

@msalibian - Thanks for the review, in particular your comments about how to reduce the length are very helpful. It also seems that some of my efforts to keep the text in the examples short left some clarity on the table, but those are all easy fixes. I can already clarify a few things below, which of course doesn't mean that there won't be changes in the paper as well.

@mikldk - I'll wait for clarification of @valentint before I make changes to the paper.

ad 3) For the classical lasso, we can of course compute the smallest lambda that sets all coefficients to zero in advance. For the sparse LTS, this would only be possible if we knew the optimal subset in advance. So we cannot compute it in advance, but we can get a decent enough estimate.

ad 8) Yes, when crit = "BIC", we always get a full solution path, also for the sparse LTS, since the solution for all values of lambda is computed on the full sample. On the other hand, when crit = "PE", also robust (groupwise) LARS only estimates the optimal submodel on the full sample. This behavior only depends on how we select the final model, not on the modeling technique.

ad 9) The default behavior is that each factor is taken as a group of dummy variables, and all other variables are taken individually. Of course this default can be overriden by the user with argument assign to specify group assignments of the variables.

ad 10) The submodels along the sequence are estimated with lmrob() from package robustbase. As you know (since it is based on your work), this estimator is non-deterministic and requires some initial subsamples, which is why we set the seed of the random number generator. The point is that sequencing the variables is deterministic, but that fitting the submodels along the sequence with lmrob() is not.

ad 11) Robust LARS (not the groupwise version) can run the classical LARS algorithm with robust correlations to determine the sequence of candidate predictors, but it cannot update the coefficients along the path as the classical LARS algorithm does. Also robust LARS needs a second step to fit robust regressions along the sequence in order to estimate the submodels. See Khan, Van Aelst & Zamar (2007, JASA) for details.

valentint commented 3 years ago

@mikldk I have completed the review and can confirm that all requirements for the package, its documentation, and maintenance are fulfilled.

Regarding the length of the paper - the last sentence in Statement of need triggers the addition of six references - may be changing this could help.

@aalfons One minor comment to the installation: when I build from sources on Windows, I get errors about other packages failing to load, e.g. ggplot2. To work around this I added INSTALL_opts=c("--no-multiarch") to the call to devtools::install_github().

@aalfons in the paper - running the sparseLTS() example with cross-validation takes quite some time (five minutes on my laptop), maybe you could mention this for the user.

aalfons commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

aalfons commented 3 years ago

@mikldk @msalibian @valentint - I created a revised version of the manuscript which is reduced in length and fits comfortably on 5 pages. In addition to your suggestions, I also made minor changes here and there to make the writing more concise. Furthermore, I carried over the relevant changes to the examples in the README.

I'll give detailed responses to each reviewer in separate comments below.

aalfons commented 3 years ago

Reply to @msalibian:

Thanks again for your helpful suggestions on reducing the length. I decided to keep the output on the regression coefficients, but for sparseLTS() I excluded the output regarding the cross-validation. This also avoided a discussion on the raw and reweighted estimators (see also my reply to one of the comments below). Moreover, I followed your suggestion and removed Figure 1 but kept Figure 2. With this change, it seemed more appropriate to present the rgrplars() example first, as it is more detailed. Furthermore, I think that in this case the length of the paper can be further reduced by avoiding the discussion on when the coefficient plot is available (see my reply to one of the comments below for details).

ad 1) (avoid saying "cleaning the data") -> Done (see page 1, line 19).

ad 2) (shortening the Statement of Need) -> As suggested, I removed two sentences on the historical value of the package, but I kept the sentence referring to the use of robustHD in the literature, as it underlines its usefulness (page 1, lines 29-32). But I'd be happy for some direction by @mikldk on this (see also my reply below to a related point by @valentint).

ad 3) (estimate or compute smallest lambda that sets everything to zero) -> As clarified this in my earlier reply, the word "estimate" is correct here.

ad 4) (other options for the arguments mode and crit) -> In the spirit of not (unnecessarily) extending the length of the paper, I think it is better not to include a discussion on other options for the arguments mode and crit. Interested readers can find this information in the help file.

ad 5) (cross-validated prediction error) -> I added a sentence that for sparse LTS, the default prediction loss function is the root trimmed mean squared prediction error (page 3, lines 114-115).

ad 6) (clarification of arguments of foldControl()) -> I took greater care to explain the different arguments in the function call, including an explanation that splits = foldControl(K = 5, R = 1) defines 5-fold cross-validation with one replication (page 3, lines 112-114).

ad 7) (clarifying "reweighted" and "raw" columns of output) -> To shorten the paper, I followed your suggestion to remove some of the text output in the sparseLTS() example. The cross-validation results are no longer shown, avoiding a discussion on the "raw" and "reweighted" columns altogether.

ad 8) (whether we can get a full path of sparseLTS() solutions with crit = "BIC") -> As clarified in my earlier reply, this behavior has to do with how the final model is selected, irrespective of the modeling technique. With switching the order of the examples and removing the plots for the sparseLTS() example (see the discussion above), I found it better to avoid this discussion altogether to shorten the paper. Regarding plots of results from sparseLTS(), I now simply write: "Similar plots as in Figure 1 are available to visualize the results." (page 3, lines 140-141) This is a bit of an oversimplification, as the coefficient plot is not available for this specific example, but I think that this simplification acceptable since the plot is indeed available if crit = "BIC" is used in sparseLTS().

ad 9) (how groups are formed in rgrplars()) -> I rewrote the corresponding paragraph. Specifically, I now write: "Through the formula interface, function rgrplars() by default takes each categorical variable (factor) as a group of dummy variables while all remaining variables are taken individually. However, the group assignment can be defined by the user through argument assign." (page 2, lines 53-56)

ad 10) (confusing sentence regarding the robust regression estimator in rgrplars()) -> In line with the clarification from my earlier reply, I now write: "Note that each submodel along the sequence is fitted using a robust regression estimator with a non-deterministic algorithm, hence the seed of the random number generator is supplied for reproducibility." (page 2, lines 58-60)

ad 11) (clarification on robust LARS) -> As already clarified in my earlier reply, also robust LARS uses a two-step strategy where first the variables are sequences, and then submodel along the sequence are estimated with robust regressions.

Thanks again for the helpful comments, and please let me know if anything is still unclear.

aalfons commented 3 years ago

Reply to @valentint:

Regarding the last sentence in the Statement of Need triggering the addition of six references: For now, I kept this sentence as is, as I think that the references demonstrate the usefulness of the package (see also my reply above to a related comment by @msalibian). I could remove some of the references, but this would only reduce the length from slightly under 5 pages to maybe 4.5 pages, and I find this reduction somewhat artificial since it doesn't reduce the main text. @mikldk Could you please provide some guidance whether some of the references should be removed?

Regarding your installation issue: I could not reproduce this issue on my Windows machine, for me installing from source with devtools::install_github("aalfons/robustHD") worked without issues on Windows and Mac. Could this be a local issue on your machine, for example that one of the required packages is only installed for the main architecture? Does the issue persist if you re-install all the required packages as well?

Regarding the computation time of the sparseLTS() example: Five minutes seems indeed excessive to me. On my laptop, which is 2-3 years old and not particularly powerful, it took 30 seconds (including loading the packages and pre-processing), which I find reasonable. @mikldk Could you please provide some guidance whether a warning about the computation time should be included in the paper?

Thanks again for the helpful comments, and please let me know if anything is still unclear.

msalibian commented 3 years ago

Thank you @aalfons for addressing my comments carefully. The only "contentious" one (10 & 11) was really due to my not remembering how robust LARS estimates the regression coefficients of each model once the variables are "sequenced". It is of course the right thing to do to set the RNG seed to get a reproducible result from lmrob.

My experience regarding running times and installation: I ran all my tests on a Surface "tablet" and didn't get any installation errors. The sparseLTS example consistently takes around 60 seconds to run in my 4-5 year old machine.

@mikldk I've looked at the revised version of the manuscript and I'm ready to recommend that it be published.

aalfons commented 3 years ago

Thanks @msalibian - also for reporting on installation and computation time of the sparseLTS() example.

valentint commented 3 years ago

@mikldk First of all I would like to recommend the revised paper for publishing. The technical issues I had mentioned before could be discussed out of this review.

@aalfons Installation: yes, it seems that it is my local problem, but it occurs on all my PCs :) and I still cannot find a way out of it. Since I have had the issue with other R packages, it is clear that it does not concern robustHD.

@aalfons @msalibian The timing: I got curious now :) I am on a relatively new Surface Book, i7 processor, and 8GB RAM. Please find below the script with my approximate timings and let me know if I am doing something wrong.

library("robustHD") library(tictoc)

data("nci60")

y <- protein[, 92] correlations <- apply(gene, 2, corHuber, y) keep <- partialOrder(abs(correlations), 100, decreasing = TRUE) X <- gene[, keep]

lambda <- seq(0.01, 0.5, length.out=10)

First I run sparseLTS() with the sequence of 10 lamdas, as in the example on GitHub - it takes approx 5 minutes.

lambda <- seq(0.01, 0.5, length.out=10) tic() fit <- sparseLTS(X, y, lambda=lambda, mode = "fraction", crit = "PE", splits = foldControl(K = 5, R = 1), seed = 20210507) toc()

319.12 sec elapsed

2. Then I run three times sparseLTS() for three different values of lambda - it is quite fast, less then 10 seconds each

tic() fit <- sparseLTS(X, y, lambda=0.01, mode = "fraction", crit = "PE", splits = foldControl(K = 5, R = 1), seed = 20210507) fit <- sparseLTS(X, y, lambda=0.255, mode = "fraction", crit = "PE", splits = foldControl(K = 5, R = 1), seed = 20210507) fit <- sparseLTS(X, y, lambda=0.05, mode = "fraction", crit = "PE", splits = foldControl(K = 5, R = 1), seed = 20210507) toc()

28.01 sec elapsed

2. And finally I run sparseLTS() with a grid containing the same three lambdas - the elapsed time is more than five times larger

lambda <- c(0.01, 0.255, 0.5) tic() fit <- sparseLTS(X, y, lambda=lambda, mode = "fraction", crit = "PE", splits = foldControl(K = 5, R = 1), seed = 20210507) toc()

186.92 sec elapsed

msalibian commented 3 years ago

@valentint This is intriguing. Below I copy my Surface system information, and here's the output of the first part of your script. I don't know what may explain the difference.

> library("robustHD")
Loading required package: ggplot2
Loading required package: perry
Loading required package: parallel
Loading required package: robustbase
Warning messages:
1: package ‘perry’ was built under R version 4.1.1 
2: package ‘robustbase’ was built under R version 4.1.1 
> library(tictoc)
Warning message:
package ‘tictoc’ was built under R version 4.1.1 
> data("nci60")
> y <- protein[, 92]
> correlations <- apply(gene, 2, corHuber, y)
> keep <- partialOrder(abs(correlations), 100, decreasing = TRUE)
> X <- gene[, keep]
> lambda <- seq(0.01, 0.5, length.out=10)
> tic()
> fit <- sparseLTS(X, y, lambda=lambda, mode = "fraction", crit = "PE", splits = foldControl(K = 5, R = 1), seed = 20210507)
> toc()
28.5 sec elapsed
> 
> system.time(
+   fit <- sparseLTS(X, y, lambda=lambda, mode = "fraction", crit = "PE", splits = foldControl(K = 5, R = 1), seed = 20210507)
+ )
   user  system elapsed 
  26.91    0.01   27.98 
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
[5] LC_TIME=English_Canada.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tictoc_1.0.1      robustHD_0.7.1    robustbase_0.93-9 perry_0.3.0       ggplot2_3.3.5    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       magrittr_2.0.1   MASS_7.3-54      tidyselect_1.1.1 munsell_0.5.0   
 [6] colorspace_2.0-2 R6_2.5.1         rlang_0.4.11     fansi_0.5.0      dplyr_1.0.7     
[11] tools_4.1.0      grid_4.1.0       gtable_0.3.0     utf8_1.2.2       withr_2.4.2     
[16] ellipsis_0.3.2   tibble_3.1.5     lifecycle_1.0.1  crayon_1.4.1     purrr_0.3.4     
[21] vctrs_0.3.8      glue_1.4.2       compiler_4.1.0   DEoptimR_1.0-9   pillar_1.6.3    
[26] generics_0.1.0   scales_1.1.1     pkgconfig_2.0.3

mikldk commented 3 years ago

@aalfons Maybe it is due to underlying BLAS/LAPACK? Anyway, please let me know once the software and paper is ready (you can better judge which of the above issues are resolved or not). Once software and paper is ready, please:

Please have a final read though of the paper, checking language etc.
Have a final check of the proofs with @whedon generate pdf
In general, be sure that all versions are correct and that the repo is in the state that should be published.
Please make a tagged release and archive (e.g. with Zenodo) as described here, and report the version number and archive DOI in this thread. Please verify that the archive deposit has the correct metadata (title and author list), or edit these if that is not the case.

aalfons commented 3 years ago

@valentint Thanks for your reply and for suggesting to look into your technical issues outside of this review. I certainly agree to that.

Could you please open an issue and post the example from the paper with its computation time on your system, as well as your session info? It's certainly possible that this has to do with BLAS/LAPACK, thanks for the suggestion @mikldk

About running sparseLTS() separately with different lambda values: If you call sparseLTS() with only one value of lambda, there's no need to run the cross-validation for selecting the optimal lambda, hence it doesn't. I noticed now that this behavior is not properly documented in the help file, so I'll fix this. Thanks for bringing this to my attention.

Then when you put the three values of lambda together, sparseLTS() runs the fivefold cross-validation (plus an additional fit for the optimal value on the full data), hence the computation time is more than five times larger.

@mikldk - Since @valentint suggested to look at the issue with computation time on his system outside of this review, the software will be ready once I update the documentation regarding the behavior of sparseLTS() with one value of lambda. I'll then have a final read of the paper and proceed with preparing the release and archive as per your instructions.

valentint commented 3 years ago

@msalibian Thanks a lot for repeating the computation and providing your system info. What I see, the only essential difference is your 16GB RAM to mine only 8GB. I have noticed in other cases that in some R applications the RAM is more important than the processor power. I have expected that an R package written in C++ will not be much affected by this, but it seems that Rcpp and RcppArmadillo use the R memory management. To be on par I installed the official R version 4.1.1 and refreshed all packages (while doing this I fixed my problem with no-multiarch :)).

I have at home a desktop with 12GB and will try it on it when I am back home. Meanwhile, I will open an issue and will post the script and the timing.

aalfons commented 3 years ago

@valentint - Thanks. It's my understanding that Rcpp indeed uses R's memory management, as for instance it takes advantage of R's garbage collection.

aalfons commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

aalfons commented 3 years ago

@mikldk - I carefully went through the paper and I changed only one or two words before creating the proofs above. I then carefully read the proofs again and checked all references. From my side, the paper is good to go.

In the documentation of sparseLTS(), the behavior in case of only one value of lambda is now clearly documented (see the discussion above).

The new version of the package is 0.7.1 - it's already on CRAN.

I created a tagged release on GitHub, and I archived this tagged release on my institution's data repository on figshare. Someone from my institution now needs to review this before it will be published on the data repository. I'll share the DOI as soon as it's ready.

aalfons commented 3 years ago

@mikldk - The DOI for the archive is 10.25397/eur.16802596.v1

mikldk commented 2 years ago

@whedon set 10.25397/eur.16802596.v1 as archive

whedon commented 2 years ago

OK. 10.25397/eur.16802596.v1 is the archive.

mikldk commented 2 years ago

@whedon set v0.7.1 as version

whedon commented 2 years ago

OK. v0.7.1 is the version.

mikldk commented 2 years ago

@aalfons I cannot see the version 0.7.1 at 10.25397/eur.16802596.v1 unless I download it. Is that correct? Or am I missing it somewhere?

aalfons commented 2 years ago

@mikldk - Correct. There is no specific metadata field for the version number on my institutions data repository. Is that an issue?

mikldk commented 2 years ago

@whedon check references

whedon commented 2 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1198/016214507000000950 is OK
- 10.1214/12-AOAS575 is OK
- 10.1016/j.csda.2015.02.007 is OK
- 10.1111/j.2517-6161.1996.tb02080.x is OK
- 10.1016/j.csda.2017.02.002 is OK
- 10.1016/j.chemolab.2017.11.017 is OK
- 10.1080/00401706.2017.1305299 is OK
- 10.1214/19-AOAS1269 is OK
- 10.1186/s12877-020-01644-2 is OK
- 10.3389/fevo.2020.583831 is OK
- 10.1158/0008-5472.can-12-1370 is OK

MISSING DOIs

- None

INVALID DOIs

- None

mikldk commented 2 years ago

@whedon generate pdf

whedon commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

mikldk commented 2 years ago

@whedon recommend-accept

whedon commented 2 years ago

Attempting dry run of processing paper acceptance...

whedon commented 2 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1198/016214507000000950 is OK
- 10.1214/12-AOAS575 is OK
- 10.1016/j.csda.2015.02.007 is OK
- 10.1111/j.2517-6161.1996.tb02080.x is OK
- 10.1016/j.csda.2017.02.002 is OK
- 10.1016/j.chemolab.2017.11.017 is OK
- 10.1080/00401706.2017.1305299 is OK
- 10.1214/19-AOAS1269 is OK
- 10.1186/s12877-020-01644-2 is OK
- 10.3389/fevo.2020.583831 is OK
- 10.1158/0008-5472.can-12-1370 is OK

MISSING DOIs

- None

INVALID DOIs

- None

whedon commented 2 years ago

:wave: @openjournals/joss-eics, this paper is ready to be accepted and published.

Check final proof :point_right: https://github.com/openjournals/joss-papers/pull/2709

If the paper PDF and Crossref deposit XML look good in https://github.com/openjournals/joss-papers/pull/2709, then you can now move forward with accepting the submission by compiling again with the flag deposit=true e.g.

@whedon accept deposit=true

kthyng commented 2 years ago

I'm going to help wrap up this submission.

@aalfons I see that your software archive is of the most recent version of your software, but doesn't indicate this. Could you add a little suffix to the title (assuming there isn't a better place to put it) with something like ", v0.7.1" so that people know which version is associated with the JOSS paper?

kthyng commented 2 years ago

Everything else looks good!

aalfons commented 2 years ago

@kthyng - As stated above, the data repository of my institution does not have a metadata field for the version number. I added the following sentence to the description:

This is version 0.7.1 corresponding to the publication in the Journal of Open Source Software.

Someone from my institution now needs to review this change before it will be published on the data repository. It may take a few days before this change is publicly visible.

Also note that this could change the DOI of the archive. The data repository keeps track of each change (even if it's only to the metadata), and I think that each change triggers a separate DOI.

kthyng commented 2 years ago

Ok no problem just let me know when you're ready.

aalfons commented 2 years ago

@kthyng - The change in the metadata of the archive is now publicly visible, including the sentence regarding the version number.

It turns out that the change in the metadata didn't trigger a new DOI after all, the current DOI is still correct. So everything should now be ready to go.

kthyng commented 2 years ago

great!

kthyng commented 2 years ago

@whedon accept deposit=true

whedon commented 2 years ago

Doing it live! Attempting automated processing of paper acceptance...

whedon commented 2 years ago

🐦🐦🐦 👉 Tweet for this paper 👈 🐦🐦🐦

openjournals / joss-reviews