openjournals / joss-reviews

Reviews for the Journal of Open Source Software

[REVIEW]: Clustergram: Visualization and diagnostics for cluster analysis #5240

Closed by editorialbot 1 year ago

editorialbot commented 1 year ago

Submitting author: @martinfleis (Martin Fleischmann)
Repository: https://github.com/martinfleis/clustergram
Branch with paper.md (empty if default branch):
Version: v0.8.0
Editor: @csoneson
Reviewers: @csadorf, @gagolews
Archive: 10.5281/zenodo.8202396

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/f3bab5d7bbdf2ed70dbea435a616ad18"><img src="https://joss.theoj.org/papers/f3bab5d7bbdf2ed70dbea435a616ad18/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/f3bab5d7bbdf2ed70dbea435a616ad18/status.svg)](https://joss.theoj.org/papers/f3bab5d7bbdf2ed70dbea435a616ad18)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@csadorf & @gagolews, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @csoneson know.

Please start on your review when you are able, and be sure to complete your review within the next six weeks, at the very latest.

Checklists

📝 Checklist for @csadorf

📝 Checklist for @gagolews

editorialbot commented 1 year ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 1 year ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.22 s (149.4 files/s, 73588.6 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
SVG                              3              0             63          10499
Python                           4            351            441           1277
Markdown                         7            225              0            702
TeX                              1             13              0            200
Jupyter Notebook                 5              0           1930            179
YAML                             8             21              0            173
TOML                             1              7              0             49
CSS                              1             15              0             48
DOS Batch                        1              8              1             26
make                             1              4              7              9
reStructuredText                 1              4              4              2
-------------------------------------------------------------------------------
SUM:                            33            648           2446          13164
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 1 year ago

Wordcount for paper.md is 1361

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10/ggj45f is OK
- 10.1016/j.habitatint.2022.102641 is OK
- 10.1038/s41597-022-01640-8 is OK
- 10.1109/MCSE.2007.55 is OK
- 10.1158/1538-7445.AM2022-5038 is OK
- 10.1016/j.dib.2022.108335 is OK
- 10/ghh97z is OK
- 10.1007/BF02915278 is OK
- 10.1016/j.compenvurbsys.2022.101802 is OK
- 10.1115/IPC2022-87145 is OK
- 10.1016/j.jag.2022.102911 is OK
- 10.5281/zenodo.3960218 is OK
- 10.1007/s12061-022-09490-y is OK

MISSING DOIs

- 10.3390/info11040193 may be a valid DOI for title: Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence

INVALID DOIs

- None
csoneson commented 1 year ago

👋🏼 @martinfleis, @csadorf, @gagolews - this is the review thread for the submission. All of our communications will happen here from now on.

Please check the post at the top of the issue for instructions on how to generate your own review checklist. As you go over the submission, please check any items that you feel have been satisfied. There are also links to the JOSS reviewer guidelines.

The JOSS review process is different from that of most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues directly in the software repository. If you do so, please mention this thread so that a link is created (and I can keep an eye on what is happening). Please also feel free to comment and ask questions in this thread. It is often easier to post comments/questions/suggestions as you come across them instead of waiting until you've reviewed the entire package.

Please feel free to ping me (@csoneson) if you have any questions or concerns. Thanks!

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

gagolews commented 1 year ago

I'll complete the review in 2-3 weeks.

gagolews commented 1 year ago

Review checklist for @gagolews

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Please refer to my remarks below.

Comments

An interesting contribution, but with a "but".

Clustergrams are rather simple – merely a dozen or so lines of plotting code https://github.com/martinfleis/clustergram/blob/main/clustergram/clustergram.py – therefore the scope of the package is quite limited... I am unsure if it is not a "missed opportunity" for something more substantial.

I wonder if the Python community would benefit more from the current author's contribution to the seaborn package, for example?

Or adding more diagnostic tools for clustering so that its scope is not limited to a single visualisation type?

It seems that there is an implementation of clustergrams in the EcotoneFinder package for R (in the form of a single function within a much larger package), see <https://cran.r-project.org/web/packages/EcotoneFinder/index.html>; this probably should be mentioned as well.

According to this journal's scope and submission requirements, I read that "Minor 'utility' packages, including 'thin' API clients, and single-function packages are not acceptable." – I am not completely convinced that this is exactly the case here...

Other remarks:

martinfleis commented 1 year ago

Thank you for the comments @gagolews!

I am fully aware that the size of the submission is borderline and I assumed it would undergo a scope query prior to the actual review. I am copying below the note I made in the submission regarding this.

It is not exceeding the scope limits by a lot, but based on my experience, this should pass a possible query-scope process if you decide it needs to go through one. It is being used in papers as is, even by people I've never heard of before in fields far from mine, so I believe that also confirms the scientific benefits (all cited in the paper).

So I am happy if @csoneson requests a scope check from the editorial team here.

As said above, I believe this is borderline but on the side of large enough, though I may be wrong.

Clustergrams are rather simple – merely a dozen or so lines of plotting code https://github.com/martinfleis/clustergram/blob/main/clustergram/clustergram.py – therefore the scope of the package is quite limited

I believe that the whole package should be considered when assessing the scope, not only the plotting part. It is not that straightforward to create the data that needs to be plotted from the various clustering engines. I think that this processing part is as valuable as the plotting segments (either static or interactive).
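For illustration, the processing referred to above amounts roughly to clustering the data for every candidate number of clusters k, aggregating per-cluster values (e.g. the mean of the first principal component) together with cluster sizes, and linking assignments between consecutive values of k. Below is a minimal sketch of that aggregation step using scikit-learn's KMeans; it is an illustration only, not clustergram's actual implementation (which also supports SciPy and cuML backends):

```python
# Minimal sketch of the per-k aggregation a clustergram needs, using
# scikit-learn's KMeans. Illustration only, not clustergram's own code.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Values placed on the y-axis: here the mean of the first principal
# component within each cluster.
pc1 = PCA(n_components=1).fit_transform(X).ravel()

results = {}
for k in range(1, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = {
        "labels": labels,
        "means": np.array([pc1[labels == c].mean() for c in range(k)]),
        "sizes": np.bincount(labels, minlength=k),
    }

# A clustergram then links results[k] to results[k + 1] by counting how many
# observations move from each cluster at k to each cluster at k + 1, which is
# what the plotted transitions encode.
```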

I wonder if the Python community would benefit more from the current author's contribution to the seaborn package, for example?

You'll probably agree that seaborn is not the ideal candidate. I'd be happy to contribute the code elsewhere but I haven't found a good home for it. Hence it lives alone. If someone has good suggestions, I am all ears. It would certainly be better than maintaining this as a small package on its own.

I am unsure if it is not a "missed opportunity" for something more substantial. ... Or adding more diagnostic tools for clustering so that its scope is not limited to a single visualisation type?

As much as I'd love to, the time I can spend on this is fairly limited, so if the current size of the package does not meet the scope requirements, it will simply not be published.

It seems that there is an implementation of the clustergrams in the EcotoneFinder

Thanks, I wasn't aware of it. It is a bit tough to assess, though; I haven't found any documentation or repository apart from what is on CRAN.

is such a self-promotion necessary in a research paper?

One of the points in the evaluation of substantive scholarly effort is "Whether the software has already been cited in academic papers." This shows evidence for that point.

gagolews commented 1 year ago

@martinfleis I'm not going to be too stubborn; the package itself is quite nice. Let's see what others say about it...

csadorf commented 1 year ago

I am sorry, but I was a bit swamped over the past two weeks. I plan on doing the review in the middle of next week.

csoneson commented 1 year ago

👋🏻 @csadorf - just wanted to check whether you had time to take a first look at this submission. Thanks!

csoneson commented 1 year ago

Ping @csadorf

👋🏻 @csadorf - just wanted to check whether you had time to take a first look at this submission. Thanks!

csadorf commented 1 year ago

@csoneson Very sorry about the delay; I'll have a look at it this weekend.

csoneson commented 1 year ago

@csadorf - no worries, thank you!

csadorf commented 1 year ago

Review checklist for @csadorf

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

csoneson commented 1 year ago

👋🏻 @csadorf - could you provide an update on the progress of your review? Thanks!

csadorf commented 1 year ago

The paper is very well written and illustrated, and the references are almost complete. The code is professionally developed, very well documented, tested, and distributed. I think the paper reaches the level of scholarly effort required for a publication in JOSS, however – as acknowledged by the author themself – just barely, because the majority of the code consists of thin wrappers around packages that readily provide clustering algorithms (notably sklearn, scipy, and RAPIDS cuML), whereas the core function of the package is to generate plots (the clustergrams), which is achieved in a handful of functions and very few lines of code. It is the substantial amount of testing and documentation that really provides the net value of this package.

The author claims to embrace scikit-learn's API style; however, that is unfortunately not entirely achieved in my view. While the main class does provide an estimator-like interface, accepting hyperparameters as constructor arguments and offering a fit() function, other parts of the scikit-learn API guide are ignored, such as avoiding the modification of hyperparameters within the constructor, storing fitted parameters with an underscore suffix, and allowing for easy composability. Other plotting classes within the scikit-learn package typically implement "from_estimator" and "from_predictions" class methods (e.g. ConfusionMatrixDisplay, PredictionErrorDisplay, etc.). In this way it is easy to use such a class with any class that provides clustering data. The current implementation provides from_data() and from_centers() methods, which achieve similar, but not directly compatible, behavior.
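For reference, the constructor and fit() conventions mentioned above are described in the scikit-learn developer guide; the following is a minimal sketch of an estimator that follows them (the class and attribute names are hypothetical and are not taken from clustergram):

```python
# Minimal sketch of the scikit-learn estimator conventions referred to above;
# class and attribute names are hypothetical, not clustergram's API.
from sklearn.base import BaseEstimator


class ToyClusterDiagnostic(BaseEstimator):
    def __init__(self, k_range=(2, 3, 4), method="kmeans"):
        # Convention: store hyperparameters unchanged; no validation or
        # modification inside __init__.
        self.k_range = k_range
        self.method = method

    def fit(self, X, y=None):
        # Convention: anything learned from the data is stored only in fit()
        # and carries a trailing underscore.
        self.n_features_in_ = X.shape[1]
        self.results_ = {k: None for k in self.k_range}  # placeholder
        return self  # returning self allows chaining and composability
```

The Display classes mentioned above additionally expose class-method constructors such as ConfusionMatrixDisplay.from_predictions(y_true, y_pred), so they can be fed by any compatible estimator or set of predictions.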

Since the author explicitly references scikit-learn's API style, I would suggest citing Buitinck et al. (arXiv:1309.0238 [cs.LG]), as recommended by scikit-learn's citation guide.

A minor point, but since plotting is the core function of this package I think it warrants mentioning: I would recommend ensuring that the ticks on the x-axis of all clustergram diagrams are natural numbers, since there are no fractional numbers of clusters.

There are no instructions on how to contribute to the package.

I would recommend that the author adopt some of the recommendations above on making the class more modular and independent, consistently follow the scikit-learn API guidelines, and then consider pursuing a contribution to the scikit-learn package.

I recommend the paper for acceptance with minor revisions: despite the scholarly effort being marginal, I believe it provides a benefit to the community and would simplify referencing in future publications.

martinfleis commented 1 year ago

The author claims to embrace scikit-learn's API style; however, that is unfortunately not entirely achieved in my view.

Thanks for the comment! I'll take a deeper look to see whether the API can follow the style more closely. I suppose the issue will be the difference between sklearn's estimators producing clustering labels for a fixed k, whereas clustergram requires results across a range of k values, but let me see what can be done here.

is explicitly referencing scikit-learn’s API style, I would suggest to cite Buitinck et al.

Will do, thanks!

I would recommend ensuring that the ticks on the x-axis of all clustergram diagrams are natural numbers

Good point. Will fix.

There are no instructions on how to contribute to the package.

The contributing guide is available in the documentation: https://clustergram.readthedocs.io/en/stable/contributing.html. I could copy its content to CONTRIBUTING.md and store it at the root of the repo, but I didn't think the duplication was necessary here.

Thank you for your comments!

martinfleis commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

martinfleis commented 1 year ago

I have finished revising the package and the paper based on the comments by @csadorf and @gagolews above. Below are responses to the points that have not been discussed in the individual replies above.

From @gagolews's review (see also https://github.com/openjournals/joss-reviews/issues/5240#issuecomment-1490353259):

It seems that there is an implementation of clustergrams in the EcotoneFinder package for R, see <https://cran.r-project.org/web/packages/EcotoneFinder/index.html>; this probably should be mentioned as well.

EcotoneFinder is now mentioned in the paper; I wasn't previously aware of the implementation.

Other remarks: [...]

All the comments related to the paper have been resolved in https://github.com/martinfleis/clustergram/commit/52676ab7966f0d3a183e48b1a93af79d6ea2e55b

From @csadorf's review:

The author claims to embrace scikit-learn's API style; however, that is unfortunately not entirely achieved in my view. While the main class does provide an estimator-like interface, accepting hyperparameters as constructor arguments and offering a fit() function, other parts of the scikit-learn API guide are ignored, such as avoiding the modification of hyperparameters within the constructor, storing fitted parameters with an underscore suffix, and allowing for easy composability. Other plotting classes within the scikit-learn package typically implement "from_estimator" and "from_predictions" class methods (e.g. ConfusionMatrixDisplay, PredictionErrorDisplay, etc.). In this way it is easy to use such a class with any class that provides clustering data. The current implementation provides from_data() and from_centers() methods, which achieve similar, but not directly compatible, behavior.

The API has been adjusted in https://github.com/martinfleis/clustergram/pull/59. See the PR for the details.

Since the author is explicitly referencing scikit-learn’s API style, I would suggest to cite Buitinck et al. (arXiv:1309.0238 [cs.LG]) as recommended by the scikit-learn's citation guide.

Included now.

A minor point, but since plotting is the core function of this package I think it warrants mentioning: I would recommend ensuring that the ticks on the x-axis of all clustergram diagrams are natural numbers, since there are no fractional numbers of clusters.

Both plotting backends (matplotlib and bokeh) are now limited to integer ticks.
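For context, one common way to restrict ticks to integers in matplotlib is MaxNLocator(integer=True); the sketch below illustrates that general approach and is not necessarily the exact change made in the package (the bokeh backend needs a separate, analogous adjustment):

```python
# One common way to restrict x-axis ticks to integers in matplotlib;
# shown for illustration, not necessarily the exact change made in clustergram.
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4, 5], [0.2, 0.5, 0.4, 0.6, 0.55], marker="o")
ax.xaxis.set_major_locator(MaxNLocator(integer=True))  # no fractional cluster counts
ax.set_xlabel("Number of clusters (k)")
plt.show()
```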

There are no instructions on how to contribute to the package.

As mentioned in https://github.com/openjournals/joss-reviews/issues/5240#issuecomment-1587065252, the contributing guide was part of the documentation. It is now also available as CONTRIBUTING.md in the root of the repo.


@csoneson I believe that all comments are now either resolved or discussed in my earlier replies.

gagolews commented 1 year ago

Review checklist for @gagolews

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

gagolews commented 1 year ago

I deem the revised version acceptable. Well done.

csoneson commented 1 year ago

Thank you @gagolews!

csadorf commented 1 year ago

I agree, the revisions look good to me. 👍🏻

martinfleis commented 1 year ago

Thank you @csadorf and @gagolews!

@csoneson I have released the updated version as v0.8.0. Its Zenodo DOI is 10.5281/zenodo.8202396.

csoneson commented 1 year ago

Thank you @csadorf - could you please also tick the remaining boxes in the checklist above, for completeness?

@martinfleis - I will also have a quick look through the submission, and get back to you with the next steps.

csoneson commented 1 year ago

@editorialbot generate pdf

csoneson commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10/ggj45f is OK
- 10.1016/j.habitatint.2022.102641 is OK
- 10.1038/s41597-022-01640-8 is OK
- 10.1109/MCSE.2007.55 is OK
- 10.1158/1538-7445.AM2022-5038 is OK
- 10.1016/j.dib.2022.108335 is OK
- 10/ghh97z is OK
- 10.1007/BF02915278 is OK
- 10.1016/j.compenvurbsys.2022.101802 is OK
- 10.1115/IPC2022-87145 is OK
- 10.1016/j.jag.2022.102911 is OK
- 10.5281/zenodo.3960218 is OK
- 10.1007/s12061-022-09490-y is OK

MISSING DOIs

- 10.3390/info11040193 may be a valid DOI for title: Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence

INVALID DOIs

- https://doi.org/10.1016/j.ins.2021.10.004 is INVALID because of 'https://doi.org/' prefix
editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

martinfleis commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10/ggj45f is OK
- 10.1016/j.habitatint.2022.102641 is OK
- 10.1038/s41597-022-01640-8 is OK
- 10.1109/MCSE.2007.55 is OK
- 10.1158/1538-7445.AM2022-5038 is OK
- 10.1016/j.dib.2022.108335 is OK
- 10/ghh97z is OK
- 10.1007/BF02915278 is OK
- 10.1016/j.compenvurbsys.2022.101802 is OK
- 10.1115/IPC2022-87145 is OK
- 10.1016/j.jag.2022.102911 is OK
- 10.3390/info11040193 is OK
- 10.5281/zenodo.3960218 is OK
- 10.1007/s12061-022-09490-y is OK
- 10.1016/j.ins.2021.10.004 is OK

MISSING DOIs

- None

INVALID DOIs

- None
martinfleis commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

csoneson commented 1 year ago

👋🏻 @csadorf - as I noticed you reacted with 👍🏻 to the message below; could you also tick the remaining boxes in https://github.com/openjournals/joss-reviews/issues/5240#issuecomment-1559574286, to make sure everything is documented? Of course, let us know if you need more time! Thanks 🙂

Thank you @csadorf - could you please also tick the remaining boxes in the checklist above, for completeness?

csoneson commented 1 year ago

Ping @csadorf

👋🏻 @csadorf - as I noticed you reacted with 👍🏻 to the message below; could you also tick the remaining boxes in https://github.com/openjournals/joss-reviews/issues/5240#issuecomment-1559574286, to make sure everything is documented? Of course, let us know if you need more time! Thanks 🙂

csadorf commented 1 year ago

@csoneson I've checked all the boxes now. All good from my side.

csoneson commented 1 year ago

Thanks @csadorf!

csoneson commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

csoneson commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10/ggj45f is OK
- 10.1016/j.habitatint.2022.102641 is OK
- 10.1038/s41597-022-01640-8 is OK
- 10.1109/MCSE.2007.55 is OK
- 10.1158/1538-7445.AM2022-5038 is OK
- 10.1016/j.dib.2022.108335 is OK
- 10/ghh97z is OK
- 10.1007/BF02915278 is OK
- 10.1016/j.compenvurbsys.2022.101802 is OK
- 10.1115/IPC2022-87145 is OK
- 10.1016/j.jag.2022.102911 is OK
- 10.3390/info11040193 is OK
- 10.5281/zenodo.3960218 is OK
- 10.1007/s12061-022-09490-y is OK
- 10.1016/j.ins.2021.10.004 is OK

MISSING DOIs

- None

INVALID DOIs

- None
csoneson commented 1 year ago

Post-Review Checklist for Editor and Authors

Additional Author Tasks After Review is Complete

Editor Tasks Prior to Acceptance

csoneson commented 1 year ago

@editorialbot set v0.8.0 as version

editorialbot commented 1 year ago

Done! version is now v0.8.0

csoneson commented 1 year ago

@martinfleis I have gone through the paper and it looks good - one reference (Pedregosa et al) is missing a DOI but I don't think it actually has one. Before I recommend acceptance, could you please update the title and author list of the Zenodo archive to match those of the paper, and also specify the license in the Zenodo archive?

martinfleis commented 1 year ago

@csoneson Thanks! I have updated Zenodo as requested. The DOI of the latest fixed version is 10.5281/zenodo.8202396.

one reference (Pedregosa et al) is missing a DOI but I don't think it actually has one

That was my understanding as well.

All tasks from the author checklist above (checks, etc.) have also been completed.

csoneson commented 1 year ago

@editorialbot set 10.5281/zenodo.8202396 as archive