openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
720 stars 38 forks source link

[REVIEW]: wbacon: Weighted BACON algorithms for multivariate outlier nomination (detection) and robust linear regression #3238

Closed whedon closed 3 years ago

whedon commented 3 years ago

Submitting author: @tobiasschoch (Tobias Schoch) Repository: https://github.com/tobiasschoch/wbacon Version: v0.5 Editor: @fboehm Reviewer: @msalibian, @aalfons Archive: 10.5281/zenodo.4895167

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/6218ee213e8b00273645233145e96cfa"><img src="https://joss.theoj.org/papers/6218ee213e8b00273645233145e96cfa/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/6218ee213e8b00273645233145e96cfa/status.svg)](https://joss.theoj.org/papers/6218ee213e8b00273645233145e96cfa)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@msalibian & @aalfons, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @fboehm know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest

Review checklist for @msalibian

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @aalfons

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 3 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @msalibian, @aalfons it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

  1. You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 3 years ago
Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.23 s (190.0 files/s, 37362.7 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
TeX                              4            280            173           3276
C                                8            302            696           1307
R                               15            115            122            850
HTML                             1             95              5            739
Markdown                         3             39              0            139
C/C++ Header                     9             23             17            132
Rmd                              1            119            159             39
YAML                             1              1              0             18
DOS Batch                        1              0              2              2
SVG                              1              0              0              1
-------------------------------------------------------------------------------
SUM:                            44            974           1174           6503
-------------------------------------------------------------------------------

Statistical information for the repository 'a9b8cecb1778454a282290cc' was
gathered on 2021/05/04.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Tobias Schoch                   36          6003           3526          100.00

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Tobias Schoch              2477           41.3          2.4               31.09
whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

whedon commented 3 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1016/S0167-9473(99)00101-2 is OK
- 10.1145/567806.567807 is OK
- 10.1137/1.9780898719604 is OK
- 10.1016/j.csda.2004.09.009 is OK
- 10.1080/01621459.1990.10474920 is OK
- 10.1198/jcgs.2009.0005 is OK

MISSING DOIs

- None

INVALID DOIs

- None
fboehm commented 3 years ago

@msalibian and @aalfons - we're now starting the review. You can see above your review checklist. Please check the boxes as you proceed through the review. For any boxes that you can't check right now, please leave comments in this thread. You may also choose to open issues in the package repository. If you do open issues, please be sure to reference them in this thread. Please let me know if you have any questions as you work through the review. Thanks again!

fboehm commented 3 years ago

@msalibian and @aalfons - how is the review going? Is there anything that I can assist with? Please feel free to check off the boxes above as you proceed through the review. Thanks!

msalibian commented 3 years ago

This is a nice contribution (both package and manuscript). I have some minor suggestions about the text, and noted that some items from the checklist above are missing.

tobiasschoch commented 3 years ago

@msalibian - Thank you for the very helpful remarks. Before I’m going to address the points raised, I have some questions for the editor.

@fboehm - I have the following questions:

  1. I’m planning to do the following (while working through the issues raised by the reviewers):

    • Make changes in the documents/ code (in the repository)
    • Add a new version tag to the repository
    • Call [at]whedon set [new version] as version
    • Call [at]whedon generate pdf
    • Is that right?
  2. “Example usage”: The package has a vignette. In the vignette, the methods are applied to example data. Should I include the examples or some of the examples in the paper as well? Or should I just note in the paper that there is a vignette?

  3. “Community guidelines”: These guidelines can be found in the README.md on https://github.com/tobiasschoch/wbacon. I put them there such that they show up on the GitHub site (this is the place where most people will “land” while browsing/ searching the internet). Should I include the “community guidelines” in the paper as well?

  4. "Performance":

    • I test my implementation against (the existing code) robustX::BACON; see folder test (focus: does my implementation give the same results?).
    • In addition, there are some benchmarks; see test/benchmark (used to “see” the impact of the OpenMP parallelization).
    • However, there are yet no benchmarks, where I compare my implementation against robustX::BACON in terms of performance/ speed. I’m planning to add such a benchmark. Then, I note in the paper that such benchmark is available. Is that ok?
fboehm commented 3 years ago

@tobiasschoch - Thanks for your questions. Below I try to answer them.

  1. Don't worry about making the tagged release or interacting with whedon. I'll guide you through that when it's time. For now, just concentrate on your first bullet point. If you want to see how the pdf looks after your changes are made, you can tell whedon to "generate pdf", but don't create a new release until I ask you to do so.

  2. I'm not sure about this one. I think the paper.md might be better off with a mention of the vignette. You probably don't need to include examples in the paper.md file. @msalibian - do you have thoughts on this?

  3. I think that including guidelines in the README.md is sufficient. However, you might also want to add a CONTRIBUTING.md file to the repository. For example, see this: https://github.com/tidyverse/dplyr/blob/master/.github/CONTRIBUTING.md. You don't need to include community guidelines in the paper.md. There is a function in the R package usethis that can create such a file from a template: usethis::use_tidy_contributing. You might want to edit the resulting file.

  4. Yes, I think that mentioning in paper.md that you have the tests and benchmarks in the package is sufficient. You might also include unit tests, possibly by using the R package testthat. @msalibian - do you agree? Do you have more specific comments to add on this point?

fboehm commented 3 years ago

@aalfons - how is the review going? Please let me know if you encounter any difficulties.

msalibian commented 3 years ago

@fboehm @tobiasschoch About the above:

2 - I don't have a strong opinion on this. A reference to the vignette would probably be enough. Although in that case I think it would be helpful to include in the paper a few sentences describing / quantifying the performance gain that can be expected with this implementation. Something like "When the sample size is larger than XXX, using this implementation often results in a speed gain of YYY% compared with such and such..."

4 - Agreed.

aalfons commented 3 years ago

@fboehm The only difficulty is finding time. I'll get to it by the end of the week. My apologies for the delay, I was not aware that it needs to be done so quickly.

tobiasschoch commented 3 years ago

@fboehm @msalibian @aalfons

fboehm commented 3 years ago

@aalfons - I apologize if I implied that I expected the review to be done by now. That wasn't my intention. I merely wanted to ask if you'd gotten started or encountered difficulties. Sometimes reviewers are unsure how to start the review, especially since we use a nontraditional format at JOSS. If any questions arise, please let me know. Thanks again!!

fboehm commented 3 years ago

@tobiasschoch - that all sounds good. Please feel free to address the issues that @msalibian raised as the next step. Thanks!!

whedon commented 3 years ago

:wave: @msalibian, please update us on how your review is going (this is an automated reminder).

whedon commented 3 years ago

:wave: @aalfons, please update us on how your review is going (this is an automated reminder).

aalfons commented 3 years ago

The package seems to provide useful functionality and is well documented, and the manuscript describes the functionality and the need for it nicely. I still have some comments, some of which I realized afterwards are also brought up by @msalibian.

In file included from fitwls.c:20: ./fitwls.h:5:10: fatal error: 'omp.h' file not found

The latest version of R for Mac now rely only on the compilers shipped with Apple's XCode developer platform, and unfortunately it seems that Apple has disabled OpenMP support there by default, see https://mac.r-project.org/openmp/ . I was not aware of that.

I have of course fixed OpenMP support now as per instructions on the above website, and the package installed as per the documentation. But there may be other users who run into the same issue and give up on your package.

I believe the recommended way to include OpenMP in the C++ header files is the following:

ifdef _OPENMP

include

endif

This is a simple fix, so please go ahead and implement it. While users without OpenMP installed then do not have access to parallel computing when using your package, at least they can install and use it.

General comments:

tobiasschoch commented 3 years ago

I modified the paper according to @msalibian suggestions (I'm going to address the points raised by @aalfons later)

Issue 1

Issue 2

Issue 3

Issue 4

Issue 5

Issue 6

Issue 7

Issue 8

tobiasschoch commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

tobiasschoch commented 3 years ago

@aalfons - Thank you for the very helpful remarks. Your comments reached me while I was responding to the points raised by msalibian. I will now address the points you have raised.

Issue 1

Reviewer: Installation instructions: I received the following error message when trying to install the package [...].

Answer: Yes, indeed I forgot to add the (conditional) include guards. This is now fixed. I added the include guards in the C header files.

Issue 2

Reviewer: Performance: line 36: What size of the data set (number of observations, number of variables) is approximately necessary for implementation inefficiencies to matter?

Answer: In the mean time I have updated the paper. In the "Benchmarking" section, I compare my implementation with the reference implementation robustX in terms of computation time for data sets in various sizes. From the benchmarking, we can learn that my implementation outperforms the reference on medium to large data sets.

Issue 3

Reviewer: Automated tests: There are some scripts in the tests/ directory that seem to reproduce the results of a paper and contain some speed tests of the computational performance, but I did not find any unit tests that the functions produce the correct output and so on. On the other hand, the documentation contains some examples, and it can be automatically checked whether these produce warnings or errors via R CMD check. So in that sense, there would be some minimal automated testing of the functionality. @fboehm Could you please provide some guidance on this point?

Answer: I am a believer in test-driven development. I distinguish between two (idealized) types of tests:

As a pragmatic programmer I take the following view: Some functions are suitable for unit testing, others less so.

We could, in principle, follow the unit-test paradigm and write separate unit tests for the covariance matrix, regression, and so on. However, this would ultimately boil down to testing the subroutines in BLAS and LAPACK. Neither is my package the place to test these subroutines, nor would such a testing strategy increase confidence in my functions. Of course, that does not rule out the possibility that I mixed up the order of the arguments when calling the subroutines. But the compiler (and valgrind) would have told me if I had actually done it. So I decided to take a "functional test" approach and test the ensemble of building blocks. In total, the BACON algorithms are tested on 159 different (real) test sets to match the results of the (reference implementation) in the robustX package.

Issue 4

Reviewer: State of the field: The authors mention only other packages that implement the same algorithms. Maybe it would be good to also mention some other popular packages that provide other algorithms for anomaly detection? (Although this could quickly go out of hand, as there are many algorithms for this purpose.)

Answer: Yes, this could indeed go out of hand quickly... I focused on the BACON algorithms and did not intend to write a review article on multivariate outlier detection. @fboehm: Shall I discuss other methods?

Issue 5

Reviewer: line 12: the statement that the BACON algorithms are "superior" would imply that they are always better than other anomaly detection algorithms. Please rephrase this.

Answer: Yes, this has been fixed; see response to points raised by msalibian.

Issue 6

Reviewer: lines 15-17: Even though the function to load a package is called library(), which causes some confusion about the terminology, the preferred term is "package" in R. Please change this throughout the paper.

Answer: Yes, this has been fixed; see response to points raised by msalibian.

Issue 7

Reviewer: line 66/67: How should one proceed if the diagnostic tools indicate that the "good" observations violate the structure that is required by the algorithm?

Answer: Well... all I can do (see also vignette) is to quote the developpers of the BACON algorithms: “Although the algorithms will often do something reasonable even when these assumptions are violated, it is hard to say what the results mean.” Billor et al. (2000, p. 290). It is better to be "safe than sorry" and apply another method. I think I have made it sufficiently clear when the method can and cannot be used; see vignette.

fboehm commented 3 years ago

Thank you for addressing the issues raised by the reviewers. @aalfons and @msalibian - do you feel that the authors have made sufficient changes to meet the requirements? Please feel free to continue to check boxes from the checklist once you're satisfied.

msalibian commented 3 years ago

@fboehm Thanks @tobiasschoch for addressing my concerns and suggestions. I have no pending items in the check list. I am satisfied with the new version of the manuscript, and only have a few minor suggestions about writing style, which I leave here in case they are helpful.

aalfons commented 3 years ago

@fboehm I'm also satisfied with the new version of the manuscript and have no more unchecked items. Thanks @tobiasschoch for addressing all the comments.

I also have a few minor suggestions regarding the text to add to those of @msalibian which I leave here in the same spirit in case they are helpful:

tobiasschoch commented 3 years ago

@fboehm @aalfons @msalibian - Thank you for the very helpful comments (and the typos you spotted). I have followed all your suggestions and modified the paper accordingly.

@msalibian - I have clarified the statement about the "distances / discrepancies". The BACON algorithm for robust regression uses what Billor et al. (2000) call discrepancies: that is, on the set of "good" observations (in the current iteration), the discrepancies are defined as the scaled residuals. For the "bad" observations, the discrepancies are taken to be the scaled (out-of-sample) prediction error. Then, for the next iteration, all observations whose discrepancies are smaller (in absolute value) than the cutoff value (which is defined as a rather "special" Student t-quantile) are selected into the new set of "good" observations. This procedure is repeated until convergence. From the point of rigorous statistical reasoning, the t-quantile cutoff value suggested by Billor et al. (2000) is weird or questionable... but it works well in practice...

tobiasschoch commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

fboehm commented 3 years ago

@tobiasschoch - the reviewers have recommended your manuscript for publication. Thank you to @msalibian and @aalfons for thorough reviews.

fboehm commented 3 years ago

Before we proceed, I have a few small suggestions for the paper.md.

  1. Lines 78-79: This sentence feels awkward. Can you rewrite it to flow a little better while still conveying the message? Perhaps something like this: "We recommend that the user examine both the data structure and the "good" observations to verify that the assumptions hold."
  2. Line 84: Replace "library" with "package".
  3. Section "Community guidelines": I think that this should be removed from the manuscript. I erred in overlooking your above discussion about this. I apologize for that. What you currently have in the README.md is sufficient.
  4. Just a point to consider - it seems that you don't define "breakdown point". Is this something that you want to define in the manuscript?

Once you've addressed these minor points, we will proceed towards publication. Thanks again, @tobiasschoch !

tobiasschoch commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

tobiasschoch commented 3 years ago

@fboehm - Thank you for proofreading the paper. Highly appreciated!

  1. Lines 78-79: This sentence feels awkward. Can you rewrite it to flow a little better while still conveying the message? Perhaps something like this: "We recommend that the user examine both the data structure and the "good" observations to verify that the assumptions hold." Answer: Yes, I have modified the sentence.
  2. Line 84: Replace "library" with "package". Answer: Yes, I have replaced "library" with "package". 3 . Section "Community guidelines": I think that this should be removed from the manuscript. I erred in overlooking your above discussion about this. I apologize for that. What you currently have in the README.md is sufficient. Answer: No problem. I have removed the Section "Community guidelines" from the paper.
  3. Just a point to consider - it seems that you don't define "breakdown point". Is this something that you want to define in the manuscript? Answer: Thank you for pointing this out. I have added a footnote that explains what the breakdown point is.

See new pdf above this post. In addition, I updated the package version. Do I need to call [at]whedon set [new version] as version now?

fboehm commented 3 years ago

@tobiasschoch - thanks so much! We now need you to archive the package, for example, with zenodo. Please report the archive doi, along with the new version number, here. I'll then tell whedon to set the archive doi and version number. I believe that authors don't have permissions to do so via whedon.

tobiasschoch commented 3 years ago

@fboehm - Thank you very much, Frederick! The version is tagged "v0.5" and the DOI is

DOI

and the target URL is https://zenodo.org/badge/latestdoi/258328250

@aalfons and @msalibian Thank you for reviewing the paper!

fboehm commented 3 years ago

@whedon set version as v0.5

whedon commented 3 years ago

I'm sorry human, I don't understand that. You can see what commands I support by typing:

@whedon commands
fboehm commented 3 years ago

@whedon set v0.5 as version

whedon commented 3 years ago

OK. v0.5 is the version.

fboehm commented 3 years ago

@whedon check references

whedon commented 3 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1016/S0167-9473(99)00101-2 is OK
- 10.1002/0470055464 is OK
- 10.1145/567806.567807 is OK
- 10.1137/1.9780898719604 is OK
- 10.1002/9781119214656 is OK
- 10.1016/j.csda.2004.09.009 is OK
- 10.1080/01621459.1990.10474920 is OK
- 10.1198/jcgs.2009.0005 is OK

MISSING DOIs

- None

INVALID DOIs

- None
fboehm commented 3 years ago

@whedon set 10.5281/zenodo.4895167 as archive

whedon commented 3 years ago

OK. 10.5281/zenodo.4895167 is the archive.

fboehm commented 3 years ago

@whedon accept

whedon commented 3 years ago

To recommend a paper to be accepted use @whedon recommend-accept

fboehm commented 3 years ago

@whedon recommend-accept

whedon commented 3 years ago
Attempting dry run of processing paper acceptance...
whedon commented 3 years ago

:wave: @openjournals/joss-eics, this paper is ready to be accepted and published.

Check final proof :point_right: https://github.com/openjournals/joss-papers/pull/2356

If the paper PDF and Crossref deposit XML look good in https://github.com/openjournals/joss-papers/pull/2356, then you can now move forward with accepting the submission by compiling again with the flag deposit=true e.g.

@whedon accept deposit=true
whedon commented 3 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1016/S0167-9473(99)00101-2 is OK
- 10.1002/0470055464 is OK
- 10.1145/567806.567807 is OK
- 10.1137/1.9780898719604 is OK
- 10.1002/9781119214656 is OK
- 10.1016/j.csda.2004.09.009 is OK
- 10.1080/01621459.1990.10474920 is OK
- 10.1198/jcgs.2009.0005 is OK

MISSING DOIs

- None

INVALID DOIs

- None
fboehm commented 3 years ago

@tobiasschoch - I just recognized that the title of the archive doesn't match the title of the manuscript. Can you please fix this?