openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal

[REVIEW]: LFSpy: A Python Implementation of Local Feature Selection for Data Classification with scikit-learn Compatibility #1958

Closed whedon closed 4 years ago

whedon commented 4 years ago

Submitting author: @kiretd (Kiret Dhindsa) Repository: https://github.com/McMasterRS/LFSpy Version: v1.0.4 Editor: @dfm Reviewers: @effigies, @sauln Archive: 10.5281/zenodo.3813708

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/e8b1f0500a1397299d7f2ab30bcd0afb"><img src="https://joss.theoj.org/papers/e8b1f0500a1397299d7f2ab30bcd0afb/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/e8b1f0500a1397299d7f2ab30bcd0afb/status.svg)](https://joss.theoj.org/papers/e8b1f0500a1397299d7f2ab30bcd0afb)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@effigies & @mnarayan, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. If you have any questions or concerns, please let @dfm know.

Please try to complete your review in the next two weeks.

Review checklist for @effigies

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @sauln

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 4 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @effigies, @mnarayan it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means that, with GitHub's default behaviour, you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' at https://github.com/openjournals/joss-reviews

  2. You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 4 years ago

Attempting to check references...

whedon commented 4 years ago

Attempting PDF compilation. Reticulating splines etc...

whedon commented 4 years ago

OK DOIs

- 10.1109/TPAMI.2015.2478471 is OK
- 10.1109/TNNLS.2017.2676101 is OK
- 10.1109/JBHI.2018.2877738 is OK

MISSING DOIs

- None

INVALID DOIs

- None

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

effigies commented 4 years ago

I've started on this, but may not finish until after the holidays. Apologies if you need a quicker review.

dfm commented 4 years ago

@effigies: No worries and thanks for letting us know! I'll remind you in the new year if we don't hear from you.

kiretd commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

Attempting PDF compilation. Reticulating splines etc...
whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

dfm commented 4 years ago

@effigies, @mnarayan: Happy new year! I'm checking in to remind you about this review and to see if you have any questions going through the review checklist. Please let me know if there are any issues. Thanks!

kiretd commented 4 years ago

Hi @dfm, Just wondering if there's anything we should be doing on our end?

effigies commented 4 years ago

Ah, sorry. I'll try to have a look on Friday.

dfm commented 4 years ago

Thanks @effigies!

@mnarayan: Any updates from your end? Thanks!

effigies commented 4 years ago

Review

LFSpy is a Python reimplementation of LFS, originally written by Narges Armanfard (@armanfn).

Much of the requested review is straightforward box-checking, so I'm restricting these comments to those where I have questions.

General checks

  • [ ] Contribution and authorship: Has the submitting author (@kiretd) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

@kiretd wrote the paper and also contributed tests. According to the commit log, the contributors to the repository were:

    28  Oliver Cook <cooko@mcmaster.ca>
    23  Areeb Khawaja <khawaa1@mcmaster.ca>
    16  Kiret Dhindsa <kiretd@gmail.com>
    10  Tom Mudway <mudwayt@mcmaster.ca>

Areeb Khawaja appears to be missing from the author list, while Ron Harwood and Ranil Sonnadara have no commits in the repository. Ranil appears to be a PI and Ron a staff member. I'll simply link to the Authorship guidelines and would appreciate an affirmation that Ron and Ranil made substantial contributions to the effort, and that @khawaa1 has elected not to be an author on this submission.

Documentation

Documentation currently has the form of a README.md. Although there are extensive comments in the code, there is not rendered documentation anywhere.

  • [ ] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?

This is currently missing

  • [x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.

This is there, although the installation requirements currently include pytest. The convention is to limit installation requirements to those needed to use the software, and to declare tests and optional functionality as extras_require arguments to the setuptools.setup function. I will submit a pull request to the main repository to use this. (Authors are not obligated to accept this. Some projects prefer to treat tests as first-class requirements.)

  • [ ] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).

The example usage is not directly runnable. From the perspective of users of scikit-learn, it is very likely enough, but fully runnable examples would be preferable, IMO. The test_lfs.py module fetches two datasets. One of these could be used to flesh out the example usage.
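As a sketch of what a fully runnable example could look like, using a built-in scikit-learn dataset — note that SelectKBest and LinearSVC stand in for LFS here so the snippet runs without LFSpy installed; the LFS estimator would drop into the same pipeline slot:

```python
# Runnable end-to-end example using only scikit-learn components.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Feature selection followed by classification; an LFS estimator
# could replace either step in a scikit-learn-compatible pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=2), LinearSVC(dual=False))
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print(f"held-out accuracy: {score:.2f}")
```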

  • [ ] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?

No. Examples of standard scikit-learn API documentation strings can be found here.

For these to be readable (as in sklearn.linear_model.LinearRegression), the documentation would need to be rendered.
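For illustration, a numpydoc-style docstring of the kind scikit-learn uses (and that Sphinx can render) might look like this — the class name, parameters, and attributes below are hypothetical, not LFSpy's actual API:

```python
# Hypothetical estimator showing the numpydoc docstring layout
# (Parameters / Attributes sections) used throughout scikit-learn.
class LocalFeatureSelection:
    """Local feature selection and classification.

    Parameters
    ----------
    alpha : int, default=19
        Maximum number of selected features per representative point.
    gamma : float, default=0.2
        Impurity-level tolerance during feature selection.

    Attributes
    ----------
    fstar : ndarray of shape (n_features, n_samples)
        Binary matrix indicating the features selected for each
        training sample.
    """

    def __init__(self, alpha=19, gamma=0.2):
        self.alpha = alpha
        self.gamma = gamma
```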

  • [x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?

Yes, however, tests fail. Opened https://github.com/McMasterRS/LFSpy/issues/3.

  • [ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

This is missing.

Software paper

  • [ ] State of the field: Do the authors describe how this software compares to other commonly-used packages?

This currently feels implicit, and explicit comparisons to other methods would be helpful. I think comparisons to other classifiers within scikit-learn would serve equally as well as independent packages, if that's easier.

I suspect the authors are more familiar with the current state of the art in machine learning than I am, but if it's useful: My impression is that random forests (or other decision tree ensembles) will do a similar task, splitting the data into subspaces and selecting the most effective dimensions to distinguish classes.

dfm commented 4 years ago

@mnarayan: Can you please let us know when you'll have a chance to look at this submission? Thanks!

kiretd commented 4 years ago

Hi @effigies, thanks for your review!

General checks

  • [ ] Contribution and authorship: Has the submitting author (@kiretd) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

@kiretd wrote the paper and also contributed tests. According to the commit log, the contributors to the repository were:

Sorry that the contributions are a bit difficult to track. Most of the work on LFSpy was done prior to putting this up on github, so the github commits don’t reflect everyone’s contributions. In addition, a good chunk of the testing we do is done locally as a group before someone pushes it to git. If you have further questions about who contributed what, please feel free to ask.

Areeb Khawaja appears to be missing from the author list, while Ron Harwood and Ranil Sonnadara have no commits in the repository. Ranil appears to be a PI and Ron a staff member. I'll simply link to the Authorship guidelines and would appreciate an affirmation that Ron and Ranil made substantial contributions to the effort, and that @khawaa1 has elected not to be an author on this submission.

Both Ron Harwood and Ranil Sonnadara were responsible for direction of the project, including leading the software planning, requirements gathering, and some of the user testing necessary for development of the final project. I’ve also added Areeb to the authorship now that he has an ORCID.

Documentation

Documentation currently has the form of a README.md. Although there are extensive comments in the code, there is not rendered documentation anywhere.

We now have a separate documentation page at https://lfspy.readthedocs.io/en/latest/ It’s not 100% complete yet, though it’s not too far off, and we’ll be keeping this up to date over the lifetime of this project. Please take a look and let us know if there are any pieces missing.

  • [ ] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?

This is currently missing

We’ve added a Statement of Need to the docs page and the README on github.

  • [x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.

This is there, although the installation requirements currently include pytest. The convention is to limit installation requirements to those needed to use the software, and to declare tests and optional functionality as extras_require arguments to the setuptools.setup function. I will submit a pull request to the main repository to use this. (Authors are not obligated to accept this. Some projects prefer to treat tests as first-class requirements.)

This has been updated per your (generous) recommendations on the issues thread on Github.

  • [ ] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).

The example usage is not directly runnable. From the perspective of users of scikit-learn, it is very likely enough, but fully runnable examples would be preferable, IMO. The test_lfs.py module fetches two datasets. One of these could be used to flesh out the example usage.

Great suggestion! I’ve added an example to the README and docs, and an example.py to the repository.

  • [ ] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?

No. Examples of standard scikit-learn API documentation strings can be found here.

For these to be readable (as in sklearn.linear_model.LinearRegression), the documentation would need to be rendered.

Another great suggestion! We’ve added this here: https://lfspy.readthedocs.io/en/latest/Functionality.html

  • [x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?

Yes, however, tests fail. Opened McMasterRS/LFSpy#3.

Thanks for finding this. Apparently an update broke the test, but it should be fixed now (see response on the repo).

  • [ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

This is missing.

We now have a contribution guidelines page on our docs: https://lfspy.readthedocs.io/en/latest/Community%20Guidelines.html

Software paper

  • [ ] State of the field: Do the authors describe how this software compares to other commonly-used packages?

This currently feels implicit, and explicit comparisons to other methods would be helpful. I think comparisons to other classifiers within scikit-learn would serve equally as well as independent packages, if that's easier.

I suspect the authors are more familiar with the current state of the art in machine learning than I am, but if it's useful: My impression is that random forests (or other decision tree ensembles) will do a similar task, splitting the data into subspaces and selecting the most effective dimensions to distinguish classes.

I’ve added a section on comparison with other standard classifiers, including plots using both sample datasets used in test_lfs.py. You were correct that random forests have some similarities and are a good point of comparison. I’ve also included another standard approach; a linear SVM with univariate feature selection using the F-statistic. These comparisons are in comparisons.py, which has been added to the tests folder.

kiretd commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

PDF failed to compile for issue #1958 with the following error:

    ORCID looks to be the wrong length
    /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-9847f98e9ec6/lib/whedon.rb:153:in `block in check_orcids': Problem with ORCID (00000-0003-4528-9146) for Areeb Khawaja (RuntimeError)
        from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-9847f98e9ec6/lib/whedon.rb:151:in `each'
        from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-9847f98e9ec6/lib/whedon.rb:151:in `check_orcids'
        from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-9847f98e9ec6/lib/whedon.rb:88:in `initialize'
        from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-9847f98e9ec6/lib/whedon/processor.rb:36:in `new'
        from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-9847f98e9ec6/lib/whedon/processor.rb:36:in `set_paper'
        from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-9847f98e9ec6/bin/whedon:55:in `prepare'
        from /app/vendor/bundle/ruby/2.4.0/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
        from /app/vendor/bundle/ruby/2.4.0/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
        from /app/vendor/bundle/ruby/2.4.0/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
        from /app/vendor/bundle/ruby/2.4.0/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
        from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-9847f98e9ec6/bin/whedon:116:in `<top (required)>'
        from /app/vendor/bundle/ruby/2.4.0/bin/whedon:23:in `load'
        from /app/vendor/bundle/ruby/2.4.0/bin/whedon:23:in `'

kiretd commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

PDF failed to compile for issue #1958 with the following error:

    [WARNING] Could not convert image '/tmp/tex2pdf.-f4119cc7dbe480a9/20ced9490086bcac201d809020b064159913517e.shtml': Cannot load file
        Jpeg    Invalid marker used
        PNG     Invalid PNG file, signature broken
        Bitmap  Invalid Bitmap magic identifier
        GIF     Invalid Gif signature
        HDR     Invalid radiance file signature
        Tiff    Invalid endian tag value
        TGA     Invalid bit depth (104)
    [WARNING] Could not convert image '/tmp/tex2pdf.-f4119cc7dbe480a9/1ff300c76fa24ed9469b4aea44b91258d01e7f3d.shtml': Cannot load file
        Jpeg    Invalid marker used
        PNG     Invalid PNG file, signature broken
        Bitmap  Invalid Bitmap magic identifier
        GIF     Invalid Gif signature
        HDR     Invalid radiance file signature
        Tiff    Invalid endian tag value
        TGA     Invalid bit depth (104)
    Error producing PDF.
    ! LaTeX Error: Cannot determine size of graphic in /tmp/tex2pdf.-f4119cc7dbe480a9/20ced9490086bcac201d809020b064159913517e.shtml (no BoundingBox).

    See the LaTeX manual or LaTeX Companion for explanation.
    Type H for immediate help.
    ...
    l.522 ...490086bcac201d809020b064159913517e.shtml}

Looks like we failed to compile the PDF

kiretd commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

kiretd commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

kiretd commented 4 years ago

@dfm Hey, sorry about this, but I think I need help including figures in the pdf. Using standard markdown syntax (same as the example given on the submission instructions page), whedon spits out the error a few posts above. I switched to html because it was also the only method that allowed me to resize the images, and now the pdf compiles without images. Hopefully I'm just missing something really obvious?

dfm commented 4 years ago

@kiretd: I just opened a pull request fixing the syntax.

kiretd commented 4 years ago

@dfm Thanks so much. It looks like the images just needed to be in the same folder for whedon to find them? Also, is there no way to resize the images with markdown syntax? I assume the latex conversion won't work if the images are included using html syntax.

kiretd commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

kiretd commented 4 years ago

Ok, I think everything should be ready for review. I believe we addressed everything from the last review. Thanks again @dfm and @effigies !

dfm commented 4 years ago

Since @mnarayan seems to have lost track of this review, I'm currently looking for another referee and I'll keep you all posted.

dfm commented 4 years ago

@effigies: Thanks for your review so far! When you get a chance, please take a look at the changes that @kiretd has made and continue through the checklist. Thanks again!

dfm commented 4 years ago

@sauln has agreed to take over for mnarayan. Thanks @sauln!

I'll add you as a reviewer and update the checklist. Please let me know if you have any questions or issues as you review.

dfm commented 4 years ago

@whedon remove @mnarayan as reviewer

whedon commented 4 years ago

OK, @mnarayan is no longer a reviewer

dfm commented 4 years ago

@whedon add @sauln as reviewer

whedon commented 4 years ago

OK, @sauln is now a reviewer

sauln commented 4 years ago

👋 Looks like @effigies did a great review. I'm hoping to have my review complete by early next week.

kiretd commented 4 years ago

@whedon generate pdf

kiretd commented 4 years ago

Just a small update to add the figure captions.

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

sauln commented 4 years ago

I'll have to provide a more thorough review near the beginning of next week, but I can supply a few quick notes from perusing the documentation:

dfm commented 4 years ago

Hi team! Just checking in here on the status.

@kiretd: I wanted to double check that you got @sauln's recommendations. Let us know if you have any questions!

@effigies: It seems like there might be some remaining issues for you. Can you let us know if there's more that you would like to see beyond this response: https://github.com/openjournals/joss-reviews/issues/1958#issuecomment-583011174

Thanks!

effigies commented 4 years ago

Hi, thanks for the reminder. I'll try to have a look on Friday.

effigies commented 4 years ago

@dfm My concerns have been addressed.

I submitted a minor PR (https://github.com/McMasterRS/LFSpy/pull/6) for some cleanups in the documentation, but that should not be considered blocking.

kiretd commented 4 years ago

Hi @dfm Thanks for checking in. Yes, I did get the recommendations by @sauln (thanks for those @sauln). I just moved continents last week, so I haven’t gotten much done on this myself yet. The team has been working on some of the recommendations in the meantime, and hopefully we’ll have an update soon.

@sauln: Thanks for your suggestions! I’ll respond one by one.

  • The docs link in the readme doesn't work. You'll need to use [lfspy.readthedocs.io](https://lfspy.readthedocs.io)

Thanks, looks fixed on our end now.

  • It looks like you manually wrote the Functionality page, and I see a lot of good comments in the code. If you reformatted the docstrings to comply with an autodocs format, they would be exposed through the __doc__ attribute, and you could use autodocs to render the functionality page.

This sounds really great, and I think we’ll likely start implementing it for our projects going forward. I’m not sure there’s much value in spending the time to redo our comments and functionality page for LFSpy right now, but we’re taking a look at whether it makes sense to switch it over gradually over the next few months or so.

  • It looks like each section in the homepage of the docs is repeated as an individual page. Is this necessary?

Maybe not necessary, but the way we have our hyperlinked table of contents set up is that each section has a link on the sidebar for easier navigation.

  • The contribution guide says that contributions should follow PEP8, but the codebase itself doesn't appear to follow PEP8. Have you run a linter or autoformatter over the code? I highly recommend black, but any will do.

Oliver from our team ran it through Black as you recommended. Pretty cool tool! I'm not an expert on this part, but I believe the code still isn't fully PEP8-compliant even after running Black. For now, we've just removed the PEP8 guideline so there's no inconsistency.

  • Could you provide more examples with larger data sets?

Do you have any particular datasets in mind? Typically sample datasets used for illustration are small so that the tests and examples can be run relatively quickly. In addition, LFS is particularly well suited for p>>n and small n problems, as per our statement of need, so we think our examples (one custom, the other the standard sk-learn sample dataset used in many of their examples) are rather suitable.

  • The table of tunable parameters in the docs homepage hasn't rendered correctly.

This looks to be fixed on our end. Can you confirm it looks right on your end too?

  • The tests appear to be end-to-end tests. I would like to see more unit tests that focus on individual pieces of logic.

LFSpy is meant to be run as a single estimator, usually in a pipeline, either as a classifier or as a feature selector (both through the same interface). Do you mean we should publish tests of internal functions as would be used for internal validation?

kiretd commented 4 years ago

I submitted a minor PR (McMasterRS/LFSpy#6) for some cleanups in the documentation, but that should not be considered blocking.

Thanks @effigies looks good to me, just waiting for one more person to review in case I missed something.

@dfm Just want to check in to see where we are with the process. I know I took some time to respond while we made the updates, so hopefully nothing is waiting on me right now.

dfm commented 4 years ago

@sauln: Do you feel that @kiretd has addressed some of your concerns? Do you have more feedback?

@kiretd: In response to your question to @sauln:

Do you mean we should publish tests of internal functions as would be used for internal validation?

I believe that's what he was saying. In general, unit tests should exercise each sub-component of the code, not just the end-to-end result. Unit testing, specifically, is not a requirement for JOSS, but it is recommended best practice.
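As an illustration of the distinction, a unit test targets one small piece of logic in isolation rather than the full fit/predict pipeline. The helper below is a made-up example, not part of LFSpy's actual internals:

```python
# Sketch of unit-test style: each test checks one property of one
# small helper function. The helper here is hypothetical.
import numpy as np


def normalize_features(X):
    """Scale each column to zero mean and unit variance (toy helper)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid division by zero on constant columns
    return (X - mu) / sigma


def test_normalize_centers_columns():
    # Each column of the normalized matrix should have (near-)zero mean.
    X = np.array([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
    Z = normalize_features(X)
    assert np.allclose(Z.mean(axis=0), 0.0)


def test_constant_column_stays_finite():
    # A constant column must not produce NaN/inf via division by zero.
    X = np.ones((4, 3))
    Z = normalize_features(X)
    assert np.isfinite(Z).all()
```

Tests like these pin down edge cases (here, a constant column) that an end-to-end run might never exercise.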

kiretd commented 4 years ago

@dfm and @sauln Thanks, yep, unit testing makes sense. I'm going to look into whether we have the resources to add that in now, or if we'll need to add it in gradually. I'm looking into exactly what would be the best way to do it now.