
[REVIEW]: BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia #2849

Closed whedon closed 3 years ago

whedon commented 3 years ago

Submitting author: @sylvaticus (Antonello Lobianco)
Repository: https://github.com/sylvaticus/BetaML.jl
Version: v0.2.2
Editor: @terrytangyuan
Reviewers: @ablaom, @ppalmes
Archive: 10.5281/zenodo.4730205

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/27dfe1b0d25d1925af5424f0706f728f"><img src="https://joss.theoj.org/papers/27dfe1b0d25d1925af5424f0706f728f/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/27dfe1b0d25d1925af5424f0706f728f/status.svg)](https://joss.theoj.org/papers/27dfe1b0d25d1925af5424f0706f728f)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@ablaom & @ppalmes, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. If you have any questions or concerns, please let @terrytangyuan know.

Please start on your review when you are able, and be sure to complete it within the next six weeks at the very latest.

Review checklist for @ablaom

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @ppalmes

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 3 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @ablaom, @ppalmes it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means that, under GitHub's default behaviour, you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

  2. You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications


For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

whedon commented 3 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1007/s10107-010-0420-4 is OK
- 10.5281/zenodo.3541505 is OK
- 10.21105/joss.00602 is OK

MISSING DOIs

- None

INVALID DOIs

- None
ablaom commented 3 years ago

Okay, here's an update of my review from the [pre-review thread](https://github.com/openjournals/joss-reviews/issues/2512).

What the package provides

The package under review provides pure-julia implementations of two tree-based models, three clustering models, a perceptron model (with 3 variations) and a basic neural network model. In passing, it should be noted that all or almost all of these algorithms have existing julia implementations (e.g., DecisionTree.jl, Clustering.jl, Flux.jl). The package is used in a course on Machine Learning but integration between the package and the course is quite loose, as far as I could ascertain (more on this below).

~~Apart from a library of loss functions, the package provides no other tools.~~ In addition to the models, the package provides a number of loss functions, as well as activation functions for the neural network models, and some tools to rescale data. I did not see tools to automate resampling (such as cross-validation) or hyper-parameter optimization, nor any model composition (pipelining). The quality of the model implementations looks good to me, although the author warns us that "the code is not heavily optimized and GPU [for neural networks] is not supported".

Existing machine learning toolboxes in Julia

For context, consider the following multi-paradigm ML toolboxes written in Julia which are relatively mature, by Julia standards:

package | number of models | resampling | hyper-parameter optimization | composition
--------|------------------|------------|------------------------------|------------
ScikitLearn.jl | > 150 | yes | yes | basic
AutoMLPipeline.jl | > 100 | no | no | medium
MLJ.jl | 151 | yes | yes | advanced

In addition to these are several excellent and mature packages dedicated to neural-networks, the most popular being the AD-driven Flux.jl package. So far, these provide limited meta-functionality, although MLJ now provides an interface to certain classes of Flux models (MLJFlux) and ScikitLearn.jl provides interfaces to python neural network models sufficient for small datasets and pedagogical use.

Disclaimer: I am a designer/contributor to MLJ.

According to the JOSS requirements, Submissions should "Have an obvious research application." In its current state of maturity, BetaML is not a serious competitor to the frameworks above, for contributing directly to research. However, the author argues that it has pedagogical advantages over existing tools.

Value as pedagogical tool

I don't think there are many rigorous machine learning courses or texts closely integrated with models and tools implemented in julia and it would be useful to have more of these. ~~The degree of integration in this case was difficult for me to ascertain because I couldn't see how to access the course notes without formally registering for the course (which is, however, free).~~ I was also disappointed to find only one link from doc-strings to course materials; from this "back door" to the course notes I could find no reference back to the package, however. Perhaps there is better integration in course exercises? I couldn't figure this out.

edit Okay, I see that I missed the link to the course notes, as opposed to the course itself. However the notes make only references to python code and so do not appear to be directly integrated with the package BetaML.

The remaining argument for BetaML's pedagogical value rests on a number of perceived drawbacks of existing toolboxes, for the beginner. Quoting from the JOSS manuscript:

  1. "For example the popular Deep Learning library Flux (Mike Innes, 2018), while extremely performant and flexible, adopts some designing choices that for a beginner could appear odd, for example avoiding the neural network object from the training process, or requiring all parameters to be explicitly defined. In BetaML we made the choice to allow the user to experiment with the hyperparameters of the algorithms learning them one step at the time. Hence for most functions we provide reasonable default parameters that can be overridden when needed."

  2. "To help beginners, many parameters and functions have pretty longer but more explicit names than usual. For example the Dense layer is a DenseLayer, the RBF kernel is radialKernel, etc."

  3. "While avoiding the problem of “reinventing the wheel”, the wrapping level unin- tentionally introduces some complications for the end-user, like the need to load the models and learn MLJ-specific concepts as model or machine. We chose instead to bundle the main ML algorithms directly within the package. This offers a complementary approach that we feel is more beginner-friendly."

Let me respond to these:

  1. These criticisms apply only to dedicated neural network packages, such as Flux.jl; all of the toolboxes listed above provide default hyper-parameters for every model. In the case of neural networks, user-friendly interaction close to the kind sought here is available either by using the MLJFlux.jl models (available directly through MLJ) or by using the Python models provided through ScikitLearn.jl.

  2. Yes, shorter names are obstacles for the beginner but hardly insurmountable. For example, one could provide a cheat sheet summarizing the models and other functionality needed for the machine learning course (and omitting all the rest).

  3. Yes, not needing to load in model code is slightly more friendly. On the other hand, in MLJ for example, one can load and instantiate a model with a single macro. So the main complication is having to ensure relevant libraries are in your environment. But this could be solved easily with a BeginnerPackage which curates all the necessary dependencies. I am not convinced beginners should find the idea of separating hyper-parameters and learned parameters (the "machines" in MLJ) that daunting. I suggest the author's criticism may have more to do with their lack of familiarity than a difficulty for newcomers, who do not have the same preconceptions from using other frameworks. In any case, the point is moot, as one can interact with MLJ models directly via a "model" interface and ignore machines. To see this, I have translated part of a BetaML notebook into MLJ syntax. There's hardly any difference - if anything the presentation is simpler (less hassle when splitting data horizontally and vertically).
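
For readers unfamiliar with the MLJ workflow referenced here, a rough sketch follows (MLJ's API has evolved over time, so the exact `@load` semantics may differ between versions; the data are synthetic):

```julia
using MLJ

# Synthetic data: a table of features and a categorical target
X = (x1 = rand(100), x2 = rand(100))
y = coerce(rand(["a", "b"], 100), Multiclass)

# Load and instantiate a model with a single macro (the package providing the
# model interface must be present in the active environment)
Tree  = @load DecisionTreeClassifier pkg=DecisionTree
model = Tree()                  # sensible default hyper-parameters

mach = machine(model, X, y)     # bind the model (hyper-parameters) to the data
fit!(mach)                      # learned parameters live in the machine
ŷ = predict(mach, X)            # probabilistic predictions
```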

In summary, while existing toolboxes might present a course instructor with a few challenges, these are hardly game-changers. The advantages of introducing a student to a powerful, mature, professional toolbox ab initio far outweigh any drawbacks, in my view.

Conclusions

To meet the requirements of JOSS, I think either: (i) The BetaML package needs to demonstrate tighter integration with ~~easily accessible~~ course materials; or (ii) BetaML needs very substantial enhancements to make it competitive with existing toolboxes.

Frankly, I believe a greater service to the Julia open-source software community would be to integrate the author's course materials with one of the mature ML toolboxes. In the case of MLJ, I would be more than happy to provide guidance for such a project.


Sundry comments

I didn't have too much trouble installing the package or running the demos, except when running a notebook on top of an existing Julia environment (see comment below).

whedon commented 3 years ago

:wave: @ppalmes, please update us on how your review is going.

whedon commented 3 years ago

:wave: @ablaom, please update us on how your review is going.

ablaom commented 3 years ago

I would consider my initial review finished. I have left unchecked "Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines?", as I would prefer the editor make this call, based on my comments. I would say "yes", but according to the guidelines there should be an "obvious research application". If research includes education, then yes, definitely scholarly.

The other minor installation item needs addressing by the author; I have made some suggestions.

ppalmes commented 3 years ago

I'll start my review next week.

sylvaticus commented 3 years ago

Dear all, I am pretty new to this open format of reviewing papers. Please let me know if and when I am supposed to reply, in particular whether I need to wait for the second reviewer and/or the editors. Thank you :-)

ppalmes commented 3 years ago

I would suggest you reply to @ablaom's review questions/comments, as this can help hasten the review process; I can then focus on any issues not covered by your conversation if I still need more clarification.

terrytangyuan commented 3 years ago

Yes, @sylvaticus, please respond to the existing feedback while we are waiting for additional feedback from @ppalmes. Thanks.

sylvaticus commented 3 years ago

Author's response to @ablaom review 1

Above all, I would like to thank the reviewer for having taken the time to provide the review and for the useful suggestions. I have implemented most of them, as they helped improve the software.

My detailed response is below.

Okay, here's an **update** of my review from the [pre-review thread](https://github.com/openjournals/joss-reviews/issues/2512)

## What the package provides

The package under review provides pure-julia implementations of two
tree-based models, three clustering models, a perceptron model (with 3
variations) and a basic neural network model. In passing, it should be
noted that all or almost all of these algorithms have existing julia
implementations (e.g., DecisionTree.jl, Clustering.jl, Flux.jl).

While "most" of the functionality is indeed already present, from the user point of view, they are not necessarily accessed in the same way and for some functionality, like missing imputation using GMM models, I am not aware of implementations in Julia. Also the kind of output is often different from current implementations. For example most classifiers in BetaML report the whole PMF of the various items rather than the mode. Together with the fact that the function accuracy has an extra optional parameter for selecting the range of items to consider the estimate correct, one can train a classifier that is best in returning a correct value for example within the most probable 2 results (rather than the single most probable one). This can be useful in some applications where the second-best is also an acceptable value.

The package
is used in a course on Machine Learning but integration between the
package and the course is quite loose, as far as I could ascertain
(more on this below).

I am sorry for the misunderstanding here. I am not affiliated with that course. The course referenced uses Python to teach the algorithms, while I believe a Julia approach is more appropriate when dealing with the internals of the algorithms (as opposed to "just" using some API); this is why I translated, and generalised, the code into Julia.

~~Apart from a library of loss functions, the package provides no
other tools.~~ In addition to the models the package provides a number
of loss functions, as well as activation functions for the neural
network models, and some tools to rescale data. I did not see tools to
automate resampling (such as cross-validation) or hyper-parameter
optimization, nor any model composition (pipelining). The quality of
the model implementations looks good to me, although the author warns
us that "the code is not heavily optimized and GPU [for neural
networks] is not supported "

While tools for automatic resampling and cross-validation may be in scope for BetaML, I believe that the added value of pipelining in a language like Julia is not as strong as it is for other programming languages. In R and Python, for example, loops are slow, and it definitely helps to have a fast library implementing, say, hyper-parameter tuning. Julia, instead, is highly expressive and has fast loops at the same time. The computational and convenience benefits of using a specific framework to build a chain of models or to tune hyper-parameters are balanced against the flexibility and ease of using just the "core" Julia functionality to do the same, so the advantage is partially diminished and depends on the situation.
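
To illustrate the point about plain Julia loops, a minimal hand-rolled hyper-parameter search might look like this (a sketch only: `train`, `predict` and `loss` are placeholders, not functions of BetaML or any specific package):

```julia
# Plain-loop hyper-parameter search; no dedicated tuning framework needed.
function tune(xtrain, ytrain, xtest, ytest)
    best_loss, best_k = Inf, nothing
    for k in 1:10                                   # candidate hyper-parameter values
        model = train(xtrain, ytrain; k = k)        # `train` is a placeholder
        l     = loss(ytest, predict(model, xtest))  # `predict`/`loss` are placeholders
        if l < best_loss
            best_loss, best_k = l, k
        end
    end
    return best_k, best_loss
end
```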

## Existing machine learning toolboxes in Julia

For context, consider the following multi-paradigm ML
toolboxes written in Julia which are relatively mature, by Julia standards:

package          | number of models | resampling  | hyper-parameter optimization | composition
-----------------|------------------|-------------|------------------------------|-------------
[ScikitLearn.jl](https://github.com/cstjean/ScikitLearn.jl)   | > 150            | yes         | yes                          | basic
[AutoMLPipeline.jl](https://github.com/IBM/AutoMLPipeline.jl)| > 100            | no          | no                           | medium
[MLJ.jl](https://joss.theoj.org/papers/10.21105/joss.02704)           | 151              | yes         | yes                          | advanced

In addition to these are several excellent and mature packages
dedicated to neural-networks, the most popular being the AD-driven
Flux.jl package. So far, these provide limited meta-functionality,
although MLJ now provides an interface to certain classes of Flux
models ([MLJFlux](https://github.com/alan-turing-institute/MLJFlux.jl)) and
ScikitLearn.jl provides interfaces to python neural network models
sufficient for small datasets and pedagogical use.

Disclaimer: I am a designer/contributor to MLJ.

**According to the [JOSS requirements](https://joss.theoj.org/about),
Submissions should "Have an obvious research application."**  In its
current state of maturity, BetaML is not a serious competitor to the
frameworks above, for contributing directly to research. However, the
author argues that it has pedagogical advantages over existing tools.

## Value as pedagogical tool

I don't think there are many rigorous machine learning courses or
texts closely integrated with models and tools implemented in julia
and it would be useful to have more of these. ~~The degree of
integration in this case was difficult for me to ascertain because I
couldn't see how to access the course notes without formally
registering for the course (which is, however, free).~~ I was also
disappointed to find only one link from doc-strings to course
materials; from this "back door" to the course notes I could find no
reference back to the package, however. Perhaps there is better
integration in course exercises? I couldn't figure this out.

**edit** Okay, I see that I missed the link to the course notes, as
opposed to the course itself. However the notes make only references
to python code and so do not appear to be directly integrated with the
package BetaML.

The remaining argument for BetaML's pedagogical value rests on a
number of perceived drawbacks of existing toolboxes, for the
beginner. Quoting from the JOSS manuscript:

1. "For example the popular Deep Learning library Flux (Mike Innes,
   2018), while extremely performant and flexible, adopts some
   designing choices that for a beginner could appear odd, for example
   avoiding the neural network object from the training process, or
   requiring all parameters to be explicitly defined. In BetaML we
   made the choice to allow the user to experiment with the
   hyperparameters of the algorithms learning them one step at the
   time. Hence for most functions we provide reasonable default
   parameters that can be overridden when needed."

2. "To help beginners, many parameters and functions have pretty
   longer but more explicit names than usual. For example the Dense
   layer is a DenseLayer, the RBF kernel is radialKernel, etc."

3. "While avoiding the problem of “reinventing the wheel”, the
   wrapping level unintentionally introduces some complications for
   the end-user, like the need to load the models and learn
   MLJ-specific concepts as model or machine.  We chose instead to
   bundle the main ML algorithms directly within the package. This
   offers a complementary approach that we feel is more
   beginner-friendly."

Let me respond to these:

1. These criticisms apply only to dedicated neural network
   packages, such as Flux.jl; all of the toolboxes listed
   above provide default hyper-parameters for every model. In the case
   of neural networks, user-friendly interaction close to the kind
   sought here is available either by using the MLJFlux.jl models
   (available directly through MLJ) or by using the python models
   provided through ScikitLearn.jl.

2. Yes, shorter names are obstacles for the beginner but hardly
   insurmountable. For example, one could provide a cheat sheet
   summarizing the models and other functionality needed for the
   machine learning course (and omitting all the rest).

3. Yes, not needing to load in model code is slightly more
   friendly. On the other hand, in MLJ for example, one can load and
   instantiate a model with a single macro. So the main complication
   is having to ensure relevant libraries are in your environment. But
   this could be solved easily with a `BeginnerPackage` which curates
   all the necessary dependencies. I am not convinced beginners should
   find the idea of separating hyper-parameters and learned parameters
   (the "machines" in MLJ) that daunting. I suggest the author's
   criticism may have more to do with their lack of familiarity than a
   difficulty for newcomers, who do not have the same preconceptions
   from using other frameworks. In any case, the point is moot, as one
   can interact with MLJ models directly via a "model" interface and
   ignore machines. To see this, I have
   [translated](https://github.com/ablaom/ForBetaMLReview) part of a
   BetaML notebook into MLJ syntax. There's hardly any difference - if
   anything the presentation is simpler (less hassle when splitting
   data horizontally and vertically).

In summary, while existing toolboxes might present a course instructor
with a few challenges, these are hardly game-changers. The advantages of
introducing a student to a powerful, mature, professional toolbox *ab*
*initio* far outweigh any drawbacks, in my view.

I rephrased the README.md of the package, as the project has evolved beyond being a mere "rewriting" of algorithms in Julia. The focus of the package is on accessibility for people with different backgrounds, and consequently different interests, from researchers or practitioners in computer science. The current ML ecosystem in Julia is out of reach for some PhD students and researchers, for example many in my lab. They have different research interests and don't have the time to dive deeply into ML, "just" applying it (often to small datasets) to their concrete problems. So the way the algorithms are accessed is particularly important. This is why, for example, both the decision-tree and GMM algorithms in BetaML accept data with missing values, and why it is not necessary to specify in the decision-tree algorithm the kind of job (regression/classification), as this is automatically inferred from the type of the labels (this is also true for DecisionTree.jl, but through two different APIs: DecisionTreeRegressor/DecisionTreeClassifier on one side and build_tree on the other). This is an example where we explicitly traded efficiency for simplicity, as adding support for missing data directly in the algorithms considerably reduces their performance (and this is the reason, I assume, the leading packages don't implement it).
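
As a toy illustration of that last point (not BetaML's actual internals; the function name is invented):

```julia
# Infer the task from the element type of the labels, so a single entry point can
# serve both regression and classification.
infer_job(y::AbstractVector) = eltype(y) <: Number ? :regression : :classification

infer_job([1.5, 2.0, 3.2])            # :regression
infer_job(["setosa", "virginica"])    # :classification
```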

## Conclusions

To meet the requirements of JOSS, I think either: (i) The BetaML
package needs to demonstrate tighter integration with ~~easily
accessible~~ course materials; or (ii) BetaML needs very substantial
enhancements to make it competitive with existing toolboxes.

Frankly, I believe a greater service to the Julia open-source software
community would be to integrate the author's course materials with one
of the mature ML toolboxes. In the case of MLJ, I would be more than
happy to provide guidance for such a project.

I do appreciate both the reviewer's comments and MLJ as a mature, state-of-the-art framework; I just believe that there is space for a different approach with different use cases.


## Sundry comments

I didn't have too much trouble installing the package or running the
demos, except when running a notebook on top of an existing Julia
environment (see comment below).

- **added** The repository states quite clearly that the primary
  purpose of the package is didactic (for teaching purposes). If this
  is true, the paper should state this clearly in the "Summary" (not
  just that it was developed in response to the course).

As specified in a previous comment, the focus is on usability, whether this matters for didactic or applied-research purposes.

- **added** The authors should reference for comparison the toolboxes
  ScikitLearn.jl and AutoMLPipeline.jl

- The README.md should provide links to the toolboxes listed in
  the table above, for the student who "graduates" from BetaML.

I added an "Alternative packages" section that lists the most relevant and mature Julia packages in the topics covered by BetaML.

- Some or most intended users will be new to Julia, so I suggest
  including with the installation instructions something about how to
  set up a julia environment that includes BetaML. Something like
  [this](https://alan-turing-institute.github.io/MLJ.jl/dev/#Installation-1), for example.
- A cheat-sheet summarizing the model fitting functions and the loss
  functions would be helpful. Or you could have functions `models()` and
  `loss_functions()` that list these.

BetaML being a much smaller package than MLJ, I believe the "Installation" and "Loading the module(s)" sections (for the first point) and the "Usage" section (for the second one) in the documentation do suffice.
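
For reference, a minimal setup along the lines the reviewer suggests is just the standard Pkg workflow (the shared environment name below is arbitrary):

```julia
# Standard Julia package-manager steps; BetaML is a registered package.
using Pkg
Pkg.activate("ml-course"; shared = true)   # optional: a named, reusable environment
Pkg.add("BetaML")
using BetaML
```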

- I found it weird that the front-facing demo is an *unsupervised*
  model. A more "Hello World" example might be to train a Decision
  Tree.

I added a basic Random Forest example to the README.md, so as to provide readers with an overview of different techniques applied to the same dataset (iris).

- The way users load the built-in datasets seems pretty awkward. Maybe
  just define some functions to do this? E.g.,
  `load_bike_sharing()`. Might be instructive to have examples where
  data is pulled in using `RDatasets`, `UrlDownload` or similar?

I now load the data using a path relative to the package base path. In this way the script should load the correct data regardless of the current directory from which the user calls it.
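
For illustration, one way to build such a package-relative path (the directory layout and file name below are assumptions, not necessarily BetaML's actual structure):

```julia
using BetaML, DelimitedFiles

# Resolve the data location from the installed package rather than the current
# working directory.
pkgroot  = dirname(dirname(pathof(BetaML)))                           # package root
datafile = joinpath(pkgroot, "test", "data", "bike_sharing_day.csv")  # illustrative
data     = readdlm(datafile, ',', skipstart = 1)
```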

- I found it pretty annoying to split data by hand the way this is
  done in the notebooks and even beginners might find this
  annoying. One utility function here would go a long way to making
  life easier here (something like the `partition` function in the
  MLJ, which you are welcome to lift).

Thank you. I did indeed add a simple partition function that allows partitioning multiple matrices in one line, e.g. ((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.7,0.3]). Note that a release of the software including the new partition function has yet to be made.
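
A quick usage sketch of the function as quoted above (note the author states it is not yet in a released version of BetaML):

```julia
using BetaML

x = rand(100, 4)    # feature matrix
y = rand(100)       # targets

# Split both matrices consistently in one call, 70% / 30%
((xtrain, xtest), (ytrain, ytest)) = partition([x, y], [0.7, 0.3])

size(xtrain, 1), size(xtest, 1)     # ≈ (70, 30)
```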

- The notebooks are not portable as they do not come with a
  Manifest.toml. One suggestion on how to handle this is
  [here](https://github.com/ablaom/ForBetaMLReview/blob/main/bike_sharing.ipynb)
  but you should add a comment in the notebook explaining that the
  notebook is only valid if it is accompanied by the Manifest.toml. I
  think an even better solution is provided by InstantiateFromUrl.jl
  but I haven't tried this yet.

Having a manifest means that I need to keep it updated and that the user needs to understand what it is. Instead, the notebooks all have a section at the beginning where the required packages are loaded. In this way, even if the user just copies and pastes the code into his/her own IDE, it will likely work.

A related issue is guaranteeing that the notebooks are kept in sync with the code. I noticed that the reviewer uses Literate.jl; I may consider it, as it helps keep the examples under test.

- The name `em` for the expectation-maximization clustering algorithm
  is very terse, and likely to conflict with a user variable.  I admit, I had
  to dig up the doc-string to find out what it was.

I agree and changed the name to gmm.

ablaom commented 3 years ago

Response to author's response to my review

@sylvaticus Thank you for your response and addressing some of my criticisms.

@terrytangyuan The author has not addressed, to my satisfaction, a central objection, which can be rephrased as this: to show the software meets a research need, it needs to be demonstrated that the software is substantially easier to use than the substantially more powerful alternatives (in the demonstrated absence of some other pedagogical value). The author agrees that there are much more powerful alternatives. However, I maintain it is not substantially easier to use or to learn BetaML, as I detail in my rebuttals 1-3 to assertions in the paper.

This said, as an author of one of the alternatives, I naturally find my package easier to use than one with which I am less familiar. It is possible that @sylvaticus feels the same way about BetaML for much the same reason. Perhaps @ppalmes would care to comment specifically on this question (see italics above).

To be clear, I think the software and paper are quality products. I also do not dismiss the possibility that users might prefer a future, enhanced version of BetaML to existing alternatives. I am simply questioning whether BetaML meets the stated requirements of JOSS at this stage of its development.

arfon commented 3 years ago

> I would suggest you reply to @ablaom's review questions/comments, as this can help hasten the review process; I can then focus on any issues not covered by your conversation if I still need more clarification.

:wave: @ppalmes - I think this review could definitely benefit from your input here 🙏

ppalmes commented 3 years ago

My decision is Major Revision.

The main contribution of the package is the reimplementation in pure Julia of various algorithms for supervised and unsupervised learning, for teaching purposes.

I agree with @ablaom that, in terms of usability, other existing toolkits are more straightforward and consistent to use. Among the things that I consider to be a major issue is the absence of a pipeline API. All the related packages mentioned support this API, which is a big factor in usability.

Here is my list of suggestions:

  1. Improve the online documentation. If the target audience is students, the online documentation needs more examples and tutorials. There is only one page showing examples in the online documentation. Notebooks are great, but static documentation (HTML or PDF) is faster to read and has no installation issues.
  2. Include good use cases in the documentation that employ the toolkit to solve real problems, incorporating different strategies from the toolbox. This will add value to the scholarly effort. Perform some benchmarks and discuss the results and implementation choices, including internal data structures.
  3. Implement a pipeline API. It's important for the ML toolkit to make the data preprocessing steps composable, for easier usage and experimentation from the perspective of students.
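
For concreteness, here is a generic sketch of the kind of composable preprocessing a pipeline API provides (illustrative only; it is not the API of BetaML, MLJ or AutoMLPipeline):

```julia
using Statistics

# A pipeline is just an ordered list of data -> data transformations.
struct Pipeline
    steps::Vector{Function}
end
(p::Pipeline)(x) = foldl((acc, f) -> f(acc), p.steps; init = x)

impute(x)      = coalesce.(x, 0.0)                             # replace missing with 0
standardise(x) = (x .- mean(x, dims = 1)) ./ std(x, dims = 1)  # column-wise z-score

pipe  = Pipeline([impute, standardise])
xraw  = [1.0 2.0; missing 4.0; 3.0 6.0]
xprep = pipe(xraw)
```
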
arfon commented 3 years ago

:wave: all, I'm stepping in here to assist @terrytangyuan who is struggling to make time for JOSS editorial work at this time.

Firstly, @ppalmes and @ablaom, many thanks for your careful and constructive reviews. There is some excellent feedback here for @sylvaticus.

I do need to address one aspect of this feedback however, best captured in this comment:

> @terrytangyuan The author has not addressed, to my satisfaction, a central objection, which can be rephrased as this: to show the software meets a research need, it needs to be demonstrated that the software is substantially easier to use than the substantially more powerful alternatives (in the demonstrated absence of some other pedagogical value). The author agrees that there are much more powerful alternatives. However, I maintain it is not substantially easier to use or to learn BetaML, as I detail in my rebuttals 1-3 to assertions in the paper.

I agree this is important but it's not a strict requirement for a work to be published in JOSS. Primarily, the review criteria around [Substantial Scholarly Effort](https://joss.readthedocs.io/en/latest/review_criteria.html#substantial-scholarly-effort) are designed to exclude very minor software contributions which we don't believe add meaningful value for potential users of the tooling. Based on @sylvaticus' responses in this thread I do not believe this work falls into that category.

JOSS' primary mission is to provide a mechanism for authors doing software work to receive career credit for their work, and in borderline situations such as this, we defer to the author's need/ability to be cited for their work. As such, on this criterion of Substantial scholarly effort I am making an editorial decision to allow this submission to move forward.

That said, there is still a reasonable amount of feedback (especially the most recent feedback from @ppalmes) that it would be good to hear your response to, @sylvaticus. Could you please respond here with your thoughts and potential plans to address it?

ppalmes commented 3 years ago

Yeah, I am OK to proceed with publication. My suggestions are meant to make the work more usable to the wider community. It is usable in its current form and I believe that the work will continue to improve.

sylvaticus commented 3 years ago

Yes, as you can see in the commit log, I am actually still implementing the modifications requested by the reviewers. I created an interface from my models to one of the toolboxes cited (this interface has already been pushed but still needs to be included in a release of BetaML), and I am implementing a more detailed set of tutorials. I would still need 1-2 weeks to complete this and update the JOSS paper.

arfon commented 3 years ago

:zap: thanks for the feedback @sylvaticus, looking forward to seeing the updates!

ablaom commented 3 years ago

> I agree this is important but it's not a strict requirement for a work to be published in JOSS. Primarily, the review criteria around [Substantial Scholarly Effort](https://joss.readthedocs.io/en/latest/review_criteria.html#substantial-scholarly-effort) are designed to exclude very minor software contributions which we don't believe add meaningful value for potential users of the tooling. Based on @sylvaticus' responses in this thread I do not believe this work falls into that category.

@arfon Thanks for this clarification! Your statement makes perfect sense to me and I am very happy to see this work acknowledged through publication.

sylvaticus commented 3 years ago

Dear editor and reviewers, I have updated the library and the paper to account for the reviewers' comments:

I am confident that the modifications introduced will help the users of the library, and I thank the reviewers for the time they spent suggesting improvements to the library and for their guidance in implementing them.

sylvaticus commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

sylvaticus commented 3 years ago

@arfon what are the next steps now? Should I create a software deposit on Zenodo?

arfon commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

arfon commented 3 years ago

@sylvaticus - could you please clean up all of the comments in your paper? I was trying to give the paper.md a read, and realized that lots of the content that I was struggling with was actually commented out.

Also, please add more information to your affiliations - I'm not sure what many of them are.

sylvaticus commented 3 years ago

@whedon generate pdf

Done. I have removed the comments and specified the full affiliation names.

I am very sorry for the 6 affiliations (it's crazy, I know..) but that's the way we have been asked to sign our papers :-/ :

1) Everything published in BETA by INRA and AgroParisTech researchers, and everything dealing with forests and wood within BETA, must be signed:

“Université de Lorraine, Université de Strasbourg, AgroParisTech, CNRS, INRA, BETA, 54000, Nancy, France” for those based in Nancy

and

“Université de Strasbourg, Université de Lorraine, AgroParisTech, CNRS, INRA, BETA, 67000, Strasbourg, France” for those based in Strasbourg

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

arfon commented 3 years ago

@sylvaticus - I made a few minor changes to your paper here: https://github.com/sylvaticus/BetaML.jl/pull/23. Once you have merged this PR, could you make a new release of the software that includes the changes that have resulted from this review? Then, please make an archive of the software on Zenodo/figshare/another service and update this thread with the DOI of the archive. For the Zenodo/figshare archive, please make sure that:

I can then move forward with accepting the submission.

sylvaticus commented 3 years ago

Hello, I have created release v0.5.1 of the software which includes the text corrections of @arfon (thank you!) and I have deposited it on Zenodo: https://doi.org/10.5281/zenodo.4730205

sylvaticus commented 3 years ago

@whedon set 10.5281/zenodo.4730205 as archive

whedon commented 3 years ago

I'm sorry @sylvaticus, I'm afraid I can't do that. That's something only editors are allowed to do.

arfon commented 3 years ago

@whedon set 10.5281/zenodo.4730205 as archive

whedon commented 3 years ago

OK. 10.5281/zenodo.4730205 is the archive.

arfon commented 3 years ago

@whedon accept

whedon commented 3 years ago
Attempting dry run of processing paper acceptance...
whedon commented 3 years ago

:wave: @openjournals/joss-eics, this paper is ready to be accepted and published.

Check final proof :point_right: https://github.com/openjournals/joss-papers/pull/2267

If the paper PDF and Crossref deposit XML look good in https://github.com/openjournals/joss-papers/pull/2267, then you can now move forward with accepting the submission by compiling again with the flag deposit=true e.g.

@whedon accept deposit=true
whedon commented 3 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1007/s10107-010-0420-4 is OK
- 10.5281/zenodo.3541505 is OK
- 10.21105/joss.00602 is OK
- 10.21105/joss.01284 is OK
- 10.5281/zenodo.4294939 is OK

MISSING DOIs

- None

INVALID DOIs

- None
arfon commented 3 years ago

@whedon accept deposit=true

whedon commented 3 years ago
Doing it live! Attempting automated processing of paper acceptance...
whedon commented 3 years ago

🐦🐦🐦 👉 Tweet for this paper 👈 🐦🐦🐦

whedon commented 3 years ago

🚨🚨🚨 THIS IS NOT A DRILL, YOU HAVE JUST ACCEPTED A PAPER INTO JOSS! 🚨🚨🚨

Here's what you must now do:

  1. Check final PDF and Crossref metadata that was deposited :point_right: https://github.com/openjournals/joss-papers/pull/2268
  2. Wait a couple of minutes, then verify that the paper DOI resolves https://doi.org/10.21105/joss.02849
  3. If everything looks good, then close this review issue.
  4. Party like you just published a paper! 🎉🌈🦄💃👻🤘

    Any issues? Notify your editorial technical team...

arfon commented 3 years ago

@ablaom, @ppalmes - many thanks for your reviews here, and to @terrytangyuan for editing this submission. JOSS relies upon the volunteer efforts of people like you and we simply wouldn't be able to do this without you ✨

@sylvaticus - your paper is now accepted and published in JOSS :zap::rocket::boom:

whedon commented 3 years ago

:tada::tada::tada: Congratulations on your paper acceptance! :tada::tada::tada:

If you would like to include a link to your paper from your README use the following code snippets:

Markdown:
[![DOI](https://joss.theoj.org/papers/10.21105/joss.02849/status.svg)](https://doi.org/10.21105/joss.02849)

HTML:
<a style="border-width:0" href="https://doi.org/10.21105/joss.02849">
  <img src="https://joss.theoj.org/papers/10.21105/joss.02849/status.svg" alt="DOI badge" >
</a>

reStructuredText:
.. image:: https://joss.theoj.org/papers/10.21105/joss.02849/status.svg
   :target: https://doi.org/10.21105/joss.02849

This is how it will look in your documentation:

DOI

We need your help!

Journal of Open Source Software is a community-run journal and relies upon volunteer effort. If you'd like to support us please consider doing either one (or both) of the following:

sylvaticus commented 3 years ago

Thanks everyone for your precious time and useful suggestions. I ran the cloc utility again and the lines of code went from 2,450 at the start of the review to over 5,000 now, most of which incorporate the reviewers' ideas and suggestions.