whedon closed this issue 3 years ago.
Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @msalibian, @aalfons it looks like you're currently assigned to review this paper :tada:.
:warning: JOSS reduced service mode :warning:
Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.
:star: Important :star:
If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means that, due to GitHub's default behaviour, you will receive notifications (emails) for all reviews 😿
To fix this do the following two things:
For a list of things I can do to help you, just type:
@whedon commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@whedon generate pdf
Software report (experimental):
github.com/AlDanial/cloc v 1.88  T=0.23 s (190.0 files/s, 37362.7 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
TeX                              4            280            173           3276
C                                8            302            696           1307
R                               15            115            122            850
HTML                             1             95              5            739
Markdown                         3             39              0            139
C/C++ Header                     9             23             17            132
Rmd                              1            119            159             39
YAML                             1              1              0             18
DOS Batch                        1              0              2              2
SVG                              1              0              0              1
-------------------------------------------------------------------------------
SUM:                            44            974           1174           6503
-------------------------------------------------------------------------------
Statistical information for the repository 'a9b8cecb1778454a282290cc' was gathered on 2021/05/04.
The following historical commit information, by author, was found:

Author          Commits   Insertions   Deletions   % of changes
Tobias Schoch        36         6003        3526         100.00

Below are the number of rows from each author that have survived and are still intact in the current revision:

Author          Rows   Stability   Age   % in comments
Tobias Schoch   2477        41.3   2.4           31.09
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1016/S0167-9473(99)00101-2 is OK
- 10.1145/567806.567807 is OK
- 10.1137/1.9780898719604 is OK
- 10.1016/j.csda.2004.09.009 is OK
- 10.1080/01621459.1990.10474920 is OK
- 10.1198/jcgs.2009.0005 is OK
MISSING DOIs
- None
INVALID DOIs
- None
@msalibian and @aalfons - we're now starting the review. You can see above your review checklist. Please check the boxes as you proceed through the review. For any boxes that you can't check right now, please leave comments in this thread. You may also choose to open issues in the package repository. If you do open issues, please be sure to reference them in this thread. Please let me know if you have any questions as you work through the review. Thanks again!
@msalibian and @aalfons - how is the review going? Is there anything that I can assist with? Please feel free to check off the boxes above as you proceed through the review. Thanks!
This is a nice contribution (both package and manuscript). I have some minor suggestions about the text, and noted that some items from the checklist above are missing.
The correct terminology in R is "package" (rather than "library").

@msalibian - Thank you for the very helpful remarks. Before I address the points you raised, I have some questions for the editor.
@fboehm - I have the following questions:
I’m planning to do the following (while working through the issues raised by the reviewers):
“Example usage”: The package has a vignette. In the vignette, the methods are applied to example data. Should I include the examples or some of the examples in the paper as well? Or should I just note in the paper that there is a vignette?
“Community guidelines”: These guidelines can be found in the README.md on https://github.com/tobiasschoch/wbacon. I put them there such that they show up on the GitHub site (this is the place where most people will “land” while browsing/ searching the internet). Should I include the “community guidelines” in the paper as well?
"Performance":
robustX::BACON
; see folder test (focus: does my implementation give the same results?).robustX::BACON
in terms of performance/ speed. I’m planning to add such a benchmark. Then, I note in the paper that such benchmark is available. Is that ok?@tobiasschoch - Thanks for your questions. Below I try to answer them.
Don't worry about making the tagged release or interacting with whedon. I'll guide you through that when it's time. For now, just concentrate on your first bullet point. If you want to see how the pdf looks after your changes are made, you can tell whedon to "generate pdf", but don't create a new release until I ask you to do so.
I'm not sure about this one. I think the paper.md might be better off with a mention of the vignette. You probably don't need to include examples in the paper.md file. @msalibian - do you have thoughts on this?
I think that including guidelines in the README.md is sufficient. However, you might also want to add a CONTRIBUTING.md file to the repository. For example, see this: https://github.com/tidyverse/dplyr/blob/master/.github/CONTRIBUTING.md. You don't need to include community guidelines in the paper.md. There is a function in the R package usethis that can create such a file from a template: usethis::use_tidy_contributing(). You might want to edit the resulting file.
Yes, I think that mentioning in paper.md that you have the tests and benchmarks in the package is sufficient. You might also include unit tests, possibly by using the R package testthat. @msalibian - do you agree? Do you have more specific comments to add on this point?
@aalfons - how is the review going? Please let me know if you encounter any difficulties.
@fboehm @tobiasschoch About the above:
2 - I don't have a strong opinion on this. A reference to the vignette would probably be enough. Although in that case I think it would be helpful to include in the paper a few sentences describing / quantifying the performance gain that can be expected with this implementation. Something like "When the sample size is larger than XXX, using this implementation often results in a speed gain of YYY% compared with such and such..."
4 - Agreed.
@fboehm The only difficulty is finding time. I'll get to it by the end of the week. My apologies for the delay, I was not aware that it needs to be done so quickly.
@fboehm @msalibian @aalfons
@aalfons - I apologize if I implied that I expected the review to be done by now. That wasn't my intention. I merely wanted to ask if you'd gotten started or encountered difficulties. Sometimes reviewers are unsure how to start the review, especially since we use a nontraditional format at JOSS. If any questions arise, please let me know. Thanks again!!
@tobiasschoch - that all sounds good. Please feel free to address the issues that @msalibian raised as the next step. Thanks!!
:wave: @msalibian, please update us on how your review is going (this is an automated reminder).
:wave: @aalfons, please update us on how your review is going (this is an automated reminder).
The package seems to provide useful functionality and is well documented, and the manuscript describes the functionality and the need for it nicely. I still have some comments, some of which I realized afterwards are also brought up by @msalibian.
In file included from fitwls.c:20:
./fitwls.h:5:10: fatal error: 'omp.h' file not found
The latest version of R for Mac now relies only on the compilers shipped with Apple's XCode developer platform, and unfortunately it seems that Apple has disabled OpenMP support there by default; see https://mac.r-project.org/openmp/. I was not aware of that.
I have of course fixed OpenMP support now as per instructions on the above website, and the package installed as per the documentation. But there may be other users who run into the same issue and give up on your package.
I believe the recommended way to conditionally include OpenMP in the C header files is the following:
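A minimal sketch of that conditional include, following the pattern recommended in "Writing R Extensions" (the file name in the comment is taken from the error message above; the exact placement in the package is up to the author):

```c
/* fitwls.h (sketch): include the OpenMP header only when the
 * compiler actually supports OpenMP; compilers without OpenMP
 * support do not define _OPENMP and simply skip the include
 * (and ignore any #pragma omp directives in the source). */
#ifdef _OPENMP
#include <omp.h>
#endif
```

With this guard in place, the package still compiles on toolchains without OpenMP; it merely falls back to serial execution.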
This is a simple fix, so please go ahead and implement it. While users without OpenMP installed then do not have access to parallel computing when using your package, at least they can install and use it.
Performance: line 36: What size of the data set (number of observations, number of variables) is approximately necessary for implementation inefficiencies to matter?
Automated tests: There are some scripts in the tests/ directory that seem to reproduce the results of a paper and contain some speed tests of the computational performance, but I did not find any unit tests that the functions produce the correct output and so on. On the other hand, the documentation contains some examples, and it can be automatically checked whether these produce warnings or errors via R CMD check. So in that sense, there would be some minimal automated testing of the functionality. @fboehm Could you please provide some guidance on this point?
State of the field: The authors mention only other packages that implement the same algorithms. Maybe it would be good to also mention some other popular packages that provide other algorithms for anomaly detection? (Although this could quickly go out of hand, as there are many algorithms for this purpose.)
General comments:
line 12: the statement that the BACON algorithms are "superior" would imply that they are always better than other anomaly detection algorithms. Please rephrase this.
lines 15-17: Even though the function to load a package is called library(), which causes some confusion about the terminology, the preferred term is "package" in R. Please change this throughout the paper.
line 66/67: How should one proceed if the diagnostic tools indicate that the "good" observations violate the structure that is required by the algorithm?
I modified the paper according to @msalibian's suggestions (I'm going to address the points raised by @aalfons later).
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@aalfons - Thank you for the very helpful remarks. Your comments reached me while I was responding to the points raised by @msalibian. I will now address the points you have raised.
Reviewer: Installation instructions: I received the following error message when trying to install the package [...].
Answer: Yes, indeed I forgot to add the (conditional) include guards. This is now fixed. I added the include guards in the C header files.
Reviewer: Performance: line 36: What size of the data set (number of observations, number of variables) is approximately necessary for implementation inefficiencies to matter?
Answer: In the meantime, I have updated the paper. In the "Benchmarking" section, I compare my implementation with the reference implementation robustX in terms of computation time for data sets of various sizes. The benchmark shows that my implementation outperforms the reference on medium to large data sets.
Reviewer: Automated tests: There are some scripts in the tests/ directory that seem to reproduce the results of a paper and contain some speed tests of the computational performance, but I did not find any unit tests that the functions produce the correct output and so on. On the other hand, the documentation contains some examples, and it can be automatically checked whether these produce warnings or errors via R CMD check. So in that sense, there would be some minimal automated testing of the functionality. @fboehm Could you please provide some guidance on this point?
Answer: I am a believer in test-driven development. I distinguish between two (idealized) types of tests: unit tests, which exercise individual functions in isolation, and functional tests, which exercise the behavior of an ensemble of functions.
As a pragmatic programmer I take the following view: Some functions are suitable for unit testing, others less so. Functions that lend themselves to unit testing are wquantile and wselect. Because these functions are also used in other R packages, they reside in their own GitHub repo called "wquantile". There, you can find unit tests based on the C language CHEAT test library of Guillermo Freschi and Sampsa Kiiskinen; see https://github.com/tobiasschoch/wquantile/blob/master/tests/test.c.

The vast majority of code in the "wbacon" repo is tested under the "functional testing" paradigm for the following reason: The BACON algorithms are based on basic "building blocks": mean, covariance matrix, linear regression, etc. In my package, these building blocks are computed with the help of the LAPACK and BLAS subroutines. More precisely:

- dsyrk computes the lower triangular scatter matrix (see wbacon.c);
- dgels computes the least squares solution, and the residuals are computed with dgemv (see fitwls.c).

We could, in principle, follow the unit-test paradigm and write separate unit tests for the covariance matrix, regression, and so on. However, this would ultimately boil down to testing the subroutines in BLAS and LAPACK. Neither is my package the place to test these subroutines, nor would such a testing strategy increase confidence in my functions. Of course, that does not rule out the possibility that I mixed up the order of the arguments when calling the subroutines. But the compiler (and valgrind) would have told me if I had actually done so. So I decided to take a "functional test" approach and test the ensemble of building blocks. In total, the BACON algorithms are tested on 159 different (real) test sets to match the results of the reference implementation in the robustX package.
Reviewer: State of the field: The authors mention only other packages that implement the same algorithms. Maybe it would be good to also mention some other popular packages that provide other algorithms for anomaly detection? (Although this could quickly go out of hand, as there are many algorithms for this purpose.)
Answer: Yes, this could indeed go out of hand quickly... I focused on the BACON algorithms and did not intend to write a review article on multivariate outlier detection. @fboehm: Shall I discuss other methods?
Reviewer: line 12: the statement that the BACON algorithms are "superior" would imply that they are always better than other anomaly detection algorithms. Please rephrase this.
Answer: Yes, this has been fixed; see response to points raised by msalibian.
Reviewer: lines 15-17: Even though the function to load a package is called library(), which causes some confusion about the terminology, the preferred term is "package" in R. Please change this throughout the paper.
Answer: Yes, this has been fixed; see response to points raised by msalibian.
Reviewer: line 66/67: How should one proceed if the diagnostic tools indicate that the "good" observations violate the structure that is required by the algorithm?
Answer: Well... all I can do (see also the vignette) is quote the developers of the BACON algorithms: "Although the algorithms will often do something reasonable even when these assumptions are violated, it is hard to say what the results mean" (Billor et al., 2000, p. 290). It is better to be "safe than sorry" and apply another method. I think I have made it sufficiently clear when the method can and cannot be used; see the vignette.
Thank you for addressing the issues raised by the reviewers. @aalfons and @msalibian - do you feel that the authors have made sufficient changes to meet the requirements? Please feel free to continue to check boxes from the checklist once you're satisfied.
@fboehm Thanks @tobiasschoch for addressing my concerns and suggestions. I have no pending items in the check list. I am satisfied with the new version of the manuscript, and only have a few minor suggestions about writing style, which I leave here in case they are helpful.
- the plot method for lmrob objects in package robustbase;
- wBACON_reg_reg(): should this be wBACON_reg() instead?

@fboehm I'm also satisfied with the new version of the manuscript and have no more unchecked items. Thanks @tobiasschoch for addressing all the comments.
I also have a few minor suggestions regarding the text to add to those of @msalibian which I leave here in the same spirit in case they are helpful:
@fboehm @aalfons @msalibian - Thank you for the very helpful comments (and the typos you spotted). I have followed all your suggestions and modified the paper accordingly.
@msalibian - I have clarified the statement about the "distances / discrepancies". The BACON algorithm for robust regression uses what Billor et al. (2000) call discrepancies: that is, on the set of "good" observations (in the current iteration), the discrepancies are defined as the scaled residuals. For the "bad" observations, the discrepancies are taken to be the scaled (out-of-sample) prediction errors. Then, for the next iteration, all observations whose discrepancies are smaller (in absolute value) than the cutoff value (which is defined as a rather "special" Student t-quantile) are selected into the new set of "good" observations. This procedure is repeated until convergence. From the point of view of rigorous statistical reasoning, the t-quantile cutoff value suggested by Billor et al. (2000) is weird or questionable... but it works well in practice...
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@tobiasschoch - the reviewers have recommended your manuscript for publication. Thank you to @msalibian and @aalfons for thorough reviews.
Before we proceed, I have a few small suggestions for the paper.md.
Once you've addressed these minor points, we will proceed towards publication. Thanks again, @tobiasschoch !
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@fboehm - Thank you for proofreading the paper. Highly appreciated!
See the new pdf above this post. In addition, I updated the package version. Do I need to call [at]whedon set [new version] as version now?
@tobiasschoch - thanks so much! We now need you to archive the package, for example, with zenodo. Please report the archive doi, along with the new version number, here. I'll then tell whedon to set the archive doi and version number. I believe that authors don't have permissions to do so via whedon.
@fboehm - Thank you very much, Frederick! The version is tagged "v0.5", the DOI is 10.5281/zenodo.4895167, and the target URL is https://zenodo.org/badge/latestdoi/258328250
@aalfons and @msalibian Thank you for reviewing the paper!
@whedon set version as v0.5
I'm sorry human, I don't understand that. You can see what commands I support by typing:
@whedon commands
@whedon set v0.5 as version
OK. v0.5 is the version.
@whedon check references
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1016/S0167-9473(99)00101-2 is OK
- 10.1002/0470055464 is OK
- 10.1145/567806.567807 is OK
- 10.1137/1.9780898719604 is OK
- 10.1002/9781119214656 is OK
- 10.1016/j.csda.2004.09.009 is OK
- 10.1080/01621459.1990.10474920 is OK
- 10.1198/jcgs.2009.0005 is OK
MISSING DOIs
- None
INVALID DOIs
- None
@whedon set 10.5281/zenodo.4895167 as archive
OK. 10.5281/zenodo.4895167 is the archive.
@whedon accept
To recommend a paper to be accepted use @whedon recommend-accept
@whedon recommend-accept
Attempting dry run of processing paper acceptance...
:wave: @openjournals/joss-eics, this paper is ready to be accepted and published.
Check final proof :point_right: https://github.com/openjournals/joss-papers/pull/2356
If the paper PDF and Crossref deposit XML look good in https://github.com/openjournals/joss-papers/pull/2356, then you can now move forward with accepting the submission by compiling again with the flag deposit=true
e.g.
@whedon accept deposit=true
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1016/S0167-9473(99)00101-2 is OK
- 10.1002/0470055464 is OK
- 10.1145/567806.567807 is OK
- 10.1137/1.9780898719604 is OK
- 10.1002/9781119214656 is OK
- 10.1016/j.csda.2004.09.009 is OK
- 10.1080/01621459.1990.10474920 is OK
- 10.1198/jcgs.2009.0005 is OK
MISSING DOIs
- None
INVALID DOIs
- None
@tobiasschoch - I just noticed that the title of the archive doesn't match the title of the manuscript. Can you please fix this?
Submitting author: @tobiasschoch (Tobias Schoch) Repository: https://github.com/tobiasschoch/wbacon Version: v0.5 Editor: @fboehm Reviewer: @msalibian, @aalfons Archive: 10.5281/zenodo.4895167
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@msalibian & @aalfons, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @fboehm know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Review checklist for @msalibian
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper
Review checklist for @aalfons
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper