whedon commented 4 years ago

Submitting author: @sayanddude (Sayan Putatunda) Repository: https://github.com/daya6489/DriveML Version: v0.1 Editor: @dfm Reviewers: @mirca, @ledell Archive: Pending

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/b2766fa5fd2ee51d2e905a42e1f7ba9d"><img src="https://joss.theoj.org/papers/b2766fa5fd2ee51d2e905a42e1f7ba9d/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/b2766fa5fd2ee51d2e905a42e1f7ba9d/status.svg)](https://joss.theoj.org/papers/b2766fa5fd2ee51d2e905a42e1f7ba9d)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@mirca & @pat-s, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @dfm know.

✨ Please try and complete your review in the next six weeks ✨

Review checklist for @mirca

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@sayanddude) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[ ] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @pat-s

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[ ] Contribution and authorship: Has the submitting author (@sayanddude) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[ ] Functionality: Have the functional claims of the software been confirmed?
[ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[ ] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[ ] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[ ] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[ ] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[ ] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @ledell

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[ ] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@sayanddude) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[ ] Installation: Does installation proceed as outlined in the documentation?
[ ] Functionality: Have the functional claims of the software been confirmed?
[ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[ ] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[ ] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[ ] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[ ] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[ ] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[ ] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[ ] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[ ] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[ ] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[ ] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

whedon commented 4 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @mirca, @pat-s it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 4 years ago

Reference check summary:

OK DOIs

- 10.18637/jss.v077.i01 is OK
- 10.1016/j.orp.2016.09.002 is OK
- 10.18637/jss.v045.i03 is OK
- 10.21105/joss.01509 is OK
- 10.1007/978-981-13-1208-3_1 is OK
- 10.1145/2347736.2347755 is OK
- doi:10.1088/1742-6596/1207/1/012015 is OK
- 10.1007/978-0-387-21706-2 is OK

MISSING DOIs

- https://doi.org/10.18637/jss.v033.i01 may be missing for title: Regularization Paths for Generalized Linear Models via Coordinate Descent

INVALID DOIs

- None

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

dfm commented 4 years ago

@mirca, @pat-s: Thanks again for agreeing to review! The checklists above talk you through the process and there are various other useful links and bits of info in the main issue body. Please don't hesitate to ask questions as we proceed. We aim for a quick turn around here, but please take all the time that you need (especially given the state of the world) and let me know if there's anything I can do to help.

mirca commented 4 years ago

Dear @sayanddude, thank you for submitting your paper to JOSS. I've opened several issues on the repository https://github.com/daya6489/DriveML regarding my review of the manuscript. Please take a look at your convenience.

sayanddude commented 4 years ago

Dear @mirca Thank you for your comments! We will start working on them and will get back to you with the updated paper as soon as possible.

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

sayanddude commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

sayanddude commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

sayanddude commented 4 years ago

Hi, sorry for generating the PDF multiple times. I am struggling with the dimensions of a table in the markdown file, hence the repeated attempts.

sayanddude commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

sayanddude commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

sayanddude commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

sayanddude commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

dfm commented 4 years ago

@sayanddude You can use http://whedon.theoj.org/ to test your paper compilation.

sayanddude commented 4 years ago

> @sayanddude You can use http://whedon.theoj.org/ to test your paper compilation.

@dfm Thanks a lot!

sayanddude commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

sayanddude commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

sayanddude commented 4 years ago

Hi @mirca, thanks for your comments! I have addressed all of them, updated the manuscript, and generated the latest PDF above. Please find my response to each comment below. I have also responded separately to the comments posted as GitHub issues at https://github.com/daya6489/DriveML/issues

1) License file is inconsistent with the license reported in the paper Response- Thank you for pointing this out! We have now updated the Github repo with a GNU General Public License (GPL-3) and have mentioned the same in the paper as well.

2) Installation- The installation method reported is conditioned on the acceptance of the package into the Comprehensive R Archive Network (CRAN). Therefore, the package………….. Response- Thanks for your comment! The DriveML package is now accepted in CRAN and it is available at https://cran.r-project.org/web/packages/DriveML/index.html. So, we have updated the “Installation” section in the Github repo with both ways to install the package i.e. via CRAN and devtools.

3) Affiliation Response- Thanks for pointing this out! EDA AA & DS CoE stands for Enterprise Data Analytics- Advanced Analytics and Data Sciences Center of Excellence. We have updated the paper with the full form.

4) Comments on the Summary section Response- Thanks for your comments! We have provided an appropriate reference i.e. (Tuggener et al., 2019) for AutoML as suggested. We have also worked on the other two comments and have rewritten the specific lines.

5) Comments on Key Functionality section- "Figure 2" is more like a table, I would suggest to refer to it as "Table 1". Response- Thanks for your comment! We have replaced Figure 2 (of the earlier version) with Table 1 in the latest version of the paper.

6) Comments on Illustration section Response- Thanks for your comments! We agree about your specific comments for the illustration section and so we have rewritten this section entirely. The focus is now not on comparison of methods but how to practically use the DriveML package. Here we focus on the functionalities of DriveML viz., exploratory overview of the data, automated data preparation, automated machine learning model building and generating report in HTML. We use a publicly available dataset and write the corresponding R codes as well.

Regarding your comments on the unique features of DriveML mentioned in Table 1- "identifying pattern of missing values" and "HTML report using rmarkdown", we have updated the “Key Functionality” section of the paper where we have explained how the “autoMAR” function works and have provided a sample HTML report/vignette generated by DriveML that is published in CRAN.

7) Comments on Quality of Writing- There are several grammatical issues with the manuscript, please take your time reading the manuscript thoroughly. Response- Thanks for your comment! We have thoroughly proofread the paper and also used tools like Grammarly. We have corrected all the grammatical and typographical errors in the updated version of the manuscript.

Please let me know in case of any issues. Thanks!

pat-s commented 4 years ago

Review of DriveML

The package is practically a bundle of several wrapper scripts around existing functionality, conducting opinionated preprocessing and model selection steps. Further, it has questionable value for other users because the motivations for many choices are missing. The main motivation, proving that this package performs better than other packages, is flawed because the scope of the packages listed differs from that of {DriveML}.

Also, leaving choices for intermediate steps to the user is usually done by design by the underlying packages (like mlr or caret) and not a feature that is missing, hence the usefulness of such an automation can be questioned per se. I haven't opened any issues in the repo so far because many of my points listed below are on the general side and will interact with each other, possibly invalidating others.

I'd rate the R package implementation on the technical side as "basic". The test suite is very minimal. The package comes with a long list of (recursive) dependencies, which can be seen as critical. R packages should aim to be minimal, especially when they are meant to be used by others. In addition, the package makes use of two deprecated ML frameworks (caret & mlr). For both, successors (tidymodels & mlr3) have existed for quite some time.

As pointed out by @mirca, there are also several grammar and wording issues that would need to be addressed to avoid changing the meaning of certain sentences/paragraphs to something undesired.

In summary, I do not think that this package adds any novelty or sophisticated approaches to automate (parts) of a machine learning workflow effectively. In contrast, I fear that such low-quality, opinionated automation packages published in a journal will result in questionable ML studies/inferences by others with the claim of using a package that is a "reviewed" journal paper. Hence, my suggestion would be to reject it. However, as JOSS has the policy of not rejecting submissions per se, I would then label it as "major revisions". I do not think that one round of revisions would be enough to address all points in a sufficient manner. Hence, I would like to opt-out / resign from a possible multi-round review of this submission. Nevertheless, thanks for submitting this work and I hope that my points listed below are still considered as helpful by the authors.

Detailed comments below

Details

## Package Structure

- Roxygen version is outdated
- Does the "heart" dataset necessarily need to come with this package or is it already available in other packages?
- Why was the "heart" dataset chosen instead of one of the common ML datasets in R available in various packages?
- Almost no git history; it seems the package was uploaded to git right before submission

### Authors

The DESCRIPTION file lists way more authors than there are contributions listed in the GitHub repo.

## Installation

The install command of the dev version could be improved:

- devtools should be replaced by remotes as the former just wraps the latter but comes with way more dependencies
- No need to set `ref = master` as this is already the default
- The user should not be forced to install the vignettes via `build_vignettes = TRUE`

### README

- It is good practice to put a vignette into a pkgdown page or keep it in a .Rmd file rather than putting (parts?) of it into the README. This creates unneeded redundancy.
- It is not good practice to put `install.packages()` calls into a vignette/README (e.g. `install.packages("DriveML")`) as these force a reinstallation of packages every time, even though this is often not necessary.
- desease → disease

### Dependencies

- Heavy dependencies with recursive heavy dependencies:
  - DriveML → no source code available?
  - SmartEDA

# Functions

- `ExpData(data = heart, type = 1)`
  - Categorizes variables into groups. Some of these are R-specific and not general variable groups. For example, only in R there are "factor" vectors. However, these are just nominal/distinct variables in a ML sense. Hence, "text", "factor" and "logical" variables are all the same. Applying this categorization might cause users to assume that there are fundamental implications for ML models (which there are not). I suggest removing those special categories.
  - What are "unique variables"?
  - `%. of Variables having >90% missing cases 0% (0)` → why 90%?
  - `11 %. of Variables having <50% missing cases 0% (0)` `12 %. of Variables having >50% missing cases 0% (0)` How can both be zero here? They should sum up to 100%.
  - The column name "Obs" is misleading as observations are rows in a dataset. Here they are used for counts of any kind. Suggesting to rename the column name.
- `ExpData(data = heart, type = 2)`
  - categorizes between integers and numerics now, whereas `type = 1` does not
  - The first column name is undescriptive ("S.no")
- The help file of `ExpData` needs a spell check
- `ExpNumStat()`
  - Returned DF has different col names than `ExpData` for the same content ("Per_of_Missing" vs. "`% of Missing`")
  - The argument descriptions randomly start upper case or lower case; the same goes for the variable names
  - The author name listed in the help page should be a fully qualified person name or removed (currently "dubrangala")
- `ExpNumViz()`: the code in the README does not run. It contains arguments that are not (anymore) present in the function
  - opened 17(!) external graphic windows which need to be closed individually. This is very tedious and I also do not see any value in this function. These cannot be compared to each other in different graphic windows, the scales are different, etc.
  - The function does not mention how the plots are combined and composed internally. The user can guess that ggplot2 is used but is somewhat surprised by the result
- `autoDataprep()`
  - returns hundreds of lines. At least the printer (list of 12) needs to be adjusted. There is a dedicated printer function `printautoDataprep()` which should be the S3 printing method for class `autoDataprep` (which has an inconsistent camel case naming scheme)
  - The arguments have types here whereas other functions have not
  - Presumably the user should continue with `foo$master_data` - this is not clear from the documentation
- `autoMLmodel()`
  - The default settings in the README example take very long time (100 iters, 20 varImp iterations). Examples should
  - The printer needs to be adjusted
- `autoMLReport()` errors

# Wording

- "most difficult machine learning functions" → there is no ML function, these are algorithms. None is "difficult", "complex" would be a more appropriate wording. If "function" is meant as "tasks" or "steps" then this should also be renamed.
- "DriveML R package has four unique functionalities" → unique is the wrong word here
- The grammar and wording feel very sloppy, no grammar or spell check was done. The same applies to the help pages.

## Tests

- Test suite is minimal, no test coverage is provided

## Manuscript

- Review unnecessary uses of "the"
- Feature engineering is not a necessary step in ML, purely optional
- Data Preparation and Feature Engineering can be summarised as "Data preprocessing"
- "DriveML performs the best across different parameters." → which parameters?
- How does DriveML address reproducibility in any way?
- Is there a convention how to format R packages and functions?
- Reading the "key functionality" one could have the impression that only (b) is done via mlr and everything else is functionality from DriveML itself?
- Why were two implementations of randomForest integrated?
- Fig.1: caret and mlr do not do any automated ML, they provide building blocks for creating ML pipelines and do not enforce any opinionated defaults (as DriveML does).
- Fig.1: The comparison made is not fair as many of the mentioned packages do not make any choice about meaningful settings for tuning and/or preprocessing
- "Thus, Table 1 shows that DriveML is better than almost all the other packages available in CRAN/Github" → This is a very strong statement which I would vote against.
- The paper contains more print output than text and is also very sparse on references
- The coding style is not consistent with respect to whitespace around operators
- No summary/discussion of the results

### Grammar

Suggestions for parts that could be rephrased:

- "we introduce a new package i.e. DriveML" (Summary)
- reduce the use of the word "pillar", maybe replace it completely - I don't think it is the correct word in this context
- Github → GitHub
- "reduce developer’s errors"
- Watch spaces after hyphens
- Watch spaces after brackets
- "HTML slide/vignette" → define what it is. It can't be both.
- "This report follows the industry and academia’s best practices." → What are the best practices and how are they defined?
- "sample HTML report/vignette"
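On the missing-value summary rows questioned under Functions above: the following is a minimal sketch in base R of why two complementary buckets must sum to 100%. The toy data frame and all variable names are hypothetical, and the last bucket is only an assumption about how such a summary might be defined; the thread does not confirm how the reported rows are actually computed.

```r
# Toy data frame (hypothetical); NA marks a missing value.
df <- data.frame(
  a = c(1, NA, 3, NA),   # 50% missing
  b = c(NA, NA, NA, 4),  # 75% missing
  c = 1:4                # 0% missing
)

# Fraction of missing values per variable.
miss_frac <- colMeans(is.na(df))

# Two complementary buckets: every variable falls into exactly one of them,
# so the two percentages add up to 100.
pct_below <- 100 * mean(miss_frac < 0.5)
pct_at_or_above <- 100 * mean(miss_frac >= 0.5)

# Assumption: a bucket that excludes fully observed variables is not a
# partition of all variables, so it can be 0% alongside another 0% row.
pct_some_but_below <- 100 * mean(miss_frac > 0 & miss_frac < 0.5)
```

If the reported rows are meant as a partition, labelling them with the boundary case stated explicitly (for example "< 50%" and ">= 50%") would remove the ambiguity the reviewer points at.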

sayanddude commented 4 years ago

@pat-s Thank you for your detailed comments! We have started working on addressing the issues you raised on Package Structure, Installation, README , Dependencies, Functions, Tests, Wording, Manuscript and Grammar. We plan to complete this in a week or so and we will get back to you with the updated package and manuscript.

dfm commented 4 years ago

@pat-s: Thanks for your thorough review. I'm happy to remove you as the reviewer if you would no longer like to be involved. Either way, we very much appreciate your contribution!

@sayanddude: The editorial board has decided to mark this submission as paused and pending-major-enhancements to give you the opportunity to address the concerns that @pat-s has raised. Please take your time and let us know when/if you feel they have been appropriately addressed.

pat-s commented 4 years ago

> @pat-s: Thanks for your thorough review. I'm happy to remove you as the reviewer if you would no longer like to be involved. Either way, we very much appreciate your contribution!

@dfm Yes, please remove me as a reviewer since I do not have the resources for additional iterations here. Thanks.

dfm commented 4 years ago

@whedon remove @pat-s as reviewer

whedon commented 4 years ago

OK, @pat-s is no longer a reviewer

dfm commented 4 years ago

@pat-s: Done! It's possible that you'll have to manually unsubscribe from notifications on this thread (I'm not sure if @whedon does that!) by clicking unsubscribe in the right hand sidebar. Thanks again!

sayanddude commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

sayanddude commented 4 years ago

Hi @pat-s , Thanks for your comments! I have addressed all the comments and updated both the manuscript and the package. The latest pdf is generated above. Please find below my response to each of your comments (my responses are in italics).

Package Structure

- Roxygen version is outdated
  *Thanks for pointing it out! We have updated it in the latest version of the package.*
- Does the "heart" dataset necessarily need to come with this package or is it already available in other packages?
  *The heart disease dataset is publicly available in the UCI machine learning repository. It is not available in any other package, so it is bundled with our package.*
- Why was the "heart" dataset chosen instead of one of the common ML datasets in R available in various packages?
  *It is a standard, commonly used, publicly available dataset for binary classification.*
- Almost no git history, it seems the package was uploaded to git right before submission
  *We first uploaded everything to CRAN and only committed to GitHub for the final publication. That is why no history was available on GitHub until now. However, for this version one can find the history.*

Authors

- The DESCRIPTION file lists way more authors than there are contributions listed in the GitHub repo.
  *All the listed authors have contributed equally to the development of the package and to the writing of the paper. We simply share different responsibilities.*

Installation

The install command of the dev version could be improved:

- devtools should be replaced by remotes as the former just wraps the latter but comes with way more dependencies
  *Thanks for your comment! It is fixed.*
- No need to set ref = master as this is already the default
  *It is fixed.*
- The user should not be forced to install the vignettes via build_vignettes = TRUE
  *Thanks! It is fixed.*
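The installation points above could be sketched as follows. This is only an illustration: `install_driveml()` is a hypothetical helper, not a function of DriveML, and it assumes the CRAN package name and the GitHub repository given earlier in this thread. It skips the install when the package is already present, relies on the defaults of `remotes::install_github()` for the branch, and does not request vignettes.

```r
# Hypothetical helper (not part of DriveML): installs the package only if it is
# not already available, so re-running a script does not force a reinstall.
install_driveml <- function(dev = FALSE) {
  if (requireNamespace("DriveML", quietly = TRUE)) {
    return(invisible(FALSE))  # already installed; nothing to do
  }
  if (dev) {
    # Development version from GitHub via remotes (lighter than devtools).
    if (!requireNamespace("remotes", quietly = TRUE)) {
      install.packages("remotes")
    }
    remotes::install_github("daya6489/DriveML")
  } else {
    # Released version from CRAN.
    install.packages("DriveML")
  }
  invisible(TRUE)
}

# Usage (left commented out because it touches the network):
# install_driveml()            # CRAN release
# install_driveml(dev = TRUE)  # GitHub development version
```

Note that this sketch returns early whenever any version of the package is installed, so switching from the CRAN release to the development version would need an explicit call to `remotes::install_github()`.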

README

- It is good practice to put a vignette into a pkgdown page or keep it in a .Rmd file rather than putting (parts?) of it into the README. This creates unneeded redundancy.
  *Thanks for pointing it out! We have addressed this in the latest version of the package.*
- It is not good practice to put install.packages() calls into a vignette/README (e.g. install.packages("DriveML")) as these force a reinstallation of packages every time, even though this is often not necessary.
  *Thanks! It’s done.*
- desease → disease
  *Thanks! We have corrected it.*

Dependencies

- Heavy dependencies with recursive heavy dependencies:
  - DriveML → no source code available?
    *The source code is available on GitHub.*
  - SmartEDA
    *It’s fixed! There is no longer a dependency on SmartEDA.*

Functions

- ExpData(data = heart, type = 1)
  *This is a function of the SmartEDA package. Since DriveML no longer depends on SmartEDA, this point is no longer relevant to the DriveML package.*
- ExpData(data = heart, type = 2)
  *This is a function of the SmartEDA package. Since DriveML no longer depends on SmartEDA, this point is no longer relevant to the DriveML package.*
- The help file of ExpData needs a spell check
  *This is a function of the SmartEDA package. Since DriveML no longer depends on SmartEDA, this point is no longer relevant to the DriveML package.*
- ExpNumStat()
  *This is a function of the SmartEDA package. Since DriveML no longer depends on SmartEDA, this point is no longer relevant to the DriveML package.*
- ExpNumViz()
  *This is a function of the SmartEDA package. Since DriveML no longer depends on SmartEDA, this point is no longer relevant to the DriveML package.*
- autoDataprep()
  *This is a relevant function for DriveML.*
  - returns hundreds of lines. At least the printer (list of 12) needs to be adjusted. There is a dedicated printer function printautoDataprep() which should be the S3 printing method for class autoDataprep (which has an inconsistent camel case naming scheme)
    *Thanks for your comment! We have now reduced it to a list of 6.*
  - The arguments have types here whereas other functions have not
    *It’s fixed!*
  - Presumably the user should continue with foo$master_data - this is not clear from the documentation
    *It’s fixed!*
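The reviewer's suggestion of an S3 printing method for class `autoDataprep` could look roughly like the sketch below. It is not DriveML's implementation: the constructor `new_prep_result()` and the `dropped` field are hypothetical, and only the class name and the `master_data` element are taken from this thread.

```r
# Hypothetical constructor: builds a small list and tags it with an S3 class.
# The 'dropped' field is illustrative only.
new_prep_result <- function(master_data, dropped = character()) {
  structure(
    list(master_data = master_data, dropped = dropped),
    class = "autoDataprep"
  )
}

# S3 print method: dispatched automatically by print() and at the console,
# so the user sees a short summary instead of the full list contents.
print.autoDataprep <- function(x, ...) {
  cat("<autoDataprep>\n")
  cat("  master_data:", nrow(x$master_data), "rows x",
      ncol(x$master_data), "columns\n")
  cat("  dropped variables:", length(x$dropped), "\n")
  invisible(x)  # print methods conventionally return their input invisibly
}

res <- new_prep_result(head(mtcars, 10), dropped = c("id"))
print(res)
```

Because dispatch is keyed on the class attribute, a separately named printer function is unnecessary; the prepared data stays reachable as `res$master_data`.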

• autoMLmodel()
o The default settings in the README example take a very long time (100 iters, 20 varImp iterations). Examples should make it 10 iters - Thanks for pointing it out! We have now made it 10 iterations.
o The printer needs to be adjusted - It’s fixed.
• autoMLReport() errors - All errors are fixed.

Wording
o "most difficult machine learning functions" → there is no ML function, these are algorithms. None is "difficult", "complex" would be a more appropriate wording. If "function" is meant as "tasks" or "steps" then this should also be renamed. - Thanks for pointing it out! It is now fixed.
o "DriveML R package has four unique functionalities" → unique is the wrong word here - It's corrected now!
o The grammar and wording feel very sloppy, no grammar or spell check was done. The same applies to the help pages. - We have done an extensive grammar and spell check for both the paper and the help files.

Tests
o Test suite is minimal, no test coverage is provided - Thanks for your comment! We have added a few more tests.

Manuscript
o Review unnecessary uses of "the" - Thanks! We have corrected it.
o Feature engineering is not a necessary step in ML, purely optional - It is one of the important components of an AutoML pipeline as per He, Zhao, & Chu (2020).
o Data Preparation and Feature Engineering can be summarised as "Data preprocessing" - Agreed!
o "DriveML performs the best across different parameters." → which parameters? - Thanks for pointing it out! We meant “across different features”. It is corrected in the latest version of the manuscript.
o How does DriveML address reproducibility in any way? - We now set a fixed seed before running all the models. That should take care of the reproducibility part.
o Is there a convention how to format R packages and functions? - We used a “lintr” check to format functions, question, etc. We have seen that quite a few JOSS papers/packages used lintr checks as a standard.
o Reading the "key functionality" one could have the impression that only (b) is done via mlr and everything else is functionality from DriveML itself? - This issue is resolved, as mlr is no longer considered in the comparison table, i.e. Table 1.
o Why were two implementations of randomForest integrated? - We just wanted to give the user multiple options. By the way, a random forest model developed using ranger is much faster than one developed using the randomForest package.
o Fig.1: caret and mlr do not do any automated ML, they provide building blocks for creating ML pipelines and do not enforce any opinionated defaults (as DriveML does). - We have removed caret and mlr from Table 1 (it was Fig. 1 in the earlier version of the manuscript) in the latest version of the manuscript.
o Fig.1: The comparison made is not fair as many of the mentioned packages do not make any choice about meaningful settings for tuning and/or preprocessing - That is true! However, since tuning and preprocessing are an integral part of AutoML, we have included them as desired characteristics for comparing different packages.
o "Thus, Table 1 shows that DriveML is better than almost all the other packages available in CRAN/Github" → This is a very strong statement which I would vote against. - Thanks for pointing it out! We have modified the last paragraph of the Section “Comparison of DriveML with other relevant R Packages”.
o The paper contains more print output than text and is also very sparse on references - Thanks for your comment! We wanted to showcase the functions of DriveML and how to use them, and that’s why the Illustration section has detailed R code. As for references, although quite a few works on AutoML have been reported in the literature, we wanted to keep the paper short (as it is one of the guidelines of JOSS) and so have kept only the most relevant references.
o The coding style is not consistent with respect to whitespace around operators - We have performed a “lintr” check that removes all the unnecessary whitespace and thus makes the code standardized. We have used this in one of our earlier packages that was published by JOSS.
o No summary/discussion of the results - The results are discussed in the Illustration section where we explain Figures 2, 3, 4 and 5.

Grammar
Suggestions for parts that could be rephrased:
o "we introduce a new package i.e. DriveML" (Summary) - Replaced it with “we introduce a new package named "DriveML" for automated machine learning.”
o reduce the use of the word "pillar", maybe replace it completely - I don't think it is the correct word in this context. - Thanks for pointing this out! We have replaced it with “components”.
o Github → GitHub - Corrected!
o "reduce developer’s errors" - Corrected!
o Watch spaces after hyphens - Corrected!
o Watch spaces after brackets - Corrected!
o "HTML slide/vignette" → define what it is. It can't be both. - Thanks for pointing it out! We meant “Vignette” and have made the correction.
o "This report follows the industry and academia’s best practices." → What are the best practices and how are they defined? - Agreed, it sounds too vague. So, we have removed this line from the latest version of the manuscript.
o "sample HTML report/vignette" - Corrected!

Overall, we strongly feel that the DriveML package adds a lot of value for its users, not only by consolidating different ML model implementations in one function but also through its efficient data pre-processing (data preparation and feature engineering) and hyper-parameter tuning functionalities, which most competing R packages don't focus on. DriveML also gives the user access to the pre-processed dataset, so they can run whatever analysis they want on it. This is something that many AutoML R packages don't provide. For future upgrades, we plan to add more options for hyper-parameter tuning, including grid search and Bayesian optimisation; the default setting currently uses random search. In terms of ML methods, we plan to add more methods, such as neural networks and stacked models/ensembling, in one of the next versions of DriveML. We are always open to feedback from the users of the package and strive for continuous improvement!
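
For readers unfamiliar with the distinction drawn above between random search (the current default) and grid search (planned), here is a minimal, language-agnostic sketch written in Python using only the standard library. It is not DriveML code: the search space, the parameter names and the helper functions `grid_candidates` and `random_candidates` are hypothetical and chosen purely for illustration. It also shows how a fixed seed makes a random search repeatable, which is the idea behind the reproducibility response given earlier.

```python
import itertools
import random

# Hypothetical search space; these names are illustrative, not DriveML parameters.
SPACE = {
    "max_depth": [2, 4, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "n_trees": [50, 100, 200],
}


def grid_candidates(space):
    """Grid search: enumerate every combination of the listed values."""
    keys = list(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


def random_candidates(space, n_iter, seed):
    """Random search: draw n_iter configurations independently.

    A private, seeded generator keeps the draws repeatable without
    touching the global random state.
    """
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_iter)]


grid = grid_candidates(SPACE)  # 4 * 4 * 3 combinations
sample_a = random_candidates(SPACE, n_iter=10, seed=1991)
sample_b = random_candidates(SPACE, n_iter=10, seed=1991)
```

With the same seed, `sample_a` and `sample_b` are identical, so a rerun evaluates exactly the same ten configurations. The grid grows multiplicatively with every added parameter, whereas the cost of the random search is fixed by `n_iter`.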

sayanddude commented 4 years ago

Hi @dfm, sorry it took us some time to update the package and address some of the concerns of the reviewers in the paper/package. Please find above the latest pdf and my response to the comments of the second reviewer. We have now addressed all the comments of both the reviewers. Please "un-pause" this thread and let us know the next steps.

Thanks and Regards, Sayan

sayanddude commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

sayanddude commented 3 years ago

Hi @dfm and @arfon , we have updated the package and have addressed all the concerns of the reviewers related to the package. Please find above the latest pdf of the paper and my response to the comments of the second reviewer. We have now addressed all the comments of both the reviewers. Please "un-pause" this thread and let us know the next steps.

Thanks and Regards, Sayan

dfm commented 3 years ago

@sayanddude: Thanks for checking in. I'm working on finding a new reviewer since pat-s is no longer available. I will update you when I have found someone.

sayanddude commented 3 years ago

Thanks @dfm !

mirca commented 3 years ago

@sayanddude thanks for addressing some of the issues pointed out in the review process. I have opened issues at https://github.com/daya6489/DriveML with some of my suggestions. The quality of the writing is still a major concern, in my view.

sayanddude commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

sayanddude commented 3 years ago

Hi @mirca , Thank you for your suggestions! Please find above the updated version of the paper. We have addressed your comments raised in the couple of GitHub issues. Please find below my response to each of the comments (my responses are in italics):

As part of the ongoing JOSS review, could you please update the README and/or related documentation to address how third parties would go about: 1) contributing to the software, 2) reporting issues or problems with the software, and 3) seeking support? -- Thanks for your comment! We have now added two new files, CONTRIBUTING.md and GOVERNANCE.md, to address the points you have raised.
I strongly suggest the removal of the sentence: "These two features would be incorporated in the next release, and we are currently working on it.". The rationale being that the support for these features might not be as straightforward as the authors claim. -- Thanks for your comment! The sentence in question has been removed from the latest version of the manuscript.
Instances of "codes" should be written "code" instead. -- We have corrected it!
There are many wrong usages of the symbol "-" throughout the text. -- Thanks for pointing this out! We have replaced '-' with ':' wherever appropriate.
There are still several grammar and punctuation flaws in the text. See this: "it performs various operations such as missing at random features, Date variable transformation, bulk interactions for numerical features, one-hot encoding for categorical variables and finally, feature selection using zero variance, correlation and Area under the curve (AUC) method." -- Thanks for your comment! We have corrected the issue with the line in question and have also done a thorough proof-reading to rectify any grammatical inconsistencies.
I think the citation style is not very adequate, especially for entries with many authors. Some lines in the text consist almost entirely of a citation reference. In this case, you should use something like (FIRST_AUTHOR_LAST_NAME et al. YEAR). -- Thanks for pointing this out! We have corrected this issue in the latest version of the manuscript.

Overall, we have done another round of proof-reading (both manual and using tools such as Grammarly) to remove any grammatical inconsistencies.

sayanddude commented 3 years ago

Hi @dfm , Hope you are doing great! We just wanted to check with you (out of curiosity) as we were wondering if it's possible to remove the "pending-major-enhancements" label from this thread given that we have already addressed almost all the comments of the reviewers. Thanks and Regards, Sayan

dfm commented 3 years ago

Thanks for checking in! I'm not going to remove that label until I have found another reviewer and they have had a chance to go through the checklist. I'm sorry that it's taking so long, but it continues to be hard to find someone who is qualified and available. I'll definitely keep you posted! Thanks for your patience.

sayanddude commented 3 years ago

Ok sure! Thanks!

openjournals / joss-reviews

[REVIEW]: DriveML: An R Package for Driverless Machine Learning #2278
