openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
725 stars 38 forks source link

[REVIEW]: HiPart: Hierarchical divisive clustering toolbox #5024

Closed editorialbot closed 1 year ago

editorialbot commented 1 year ago

Submitting author: !--author-handle-->@panagiotisanagnostou<!--end-author-handle-- (Panagiotis Anagnostou) Repository: https://github.com/panagiotisanagnostou/HiPart Branch with paper.md (empty if default branch): paper Version: v0.3.1 Editor: !--editor-->@jbytecode<!--end-editor-- Reviewers: @AP6YC, @jjerphan Archive: 10.5281/zenodo.7814113

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/1c40d7fca511858540163a9a034a08e3"><img src="https://joss.theoj.org/papers/1c40d7fca511858540163a9a034a08e3/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/1c40d7fca511858540163a9a034a08e3/status.svg)](https://joss.theoj.org/papers/1c40d7fca511858540163a9a034a08e3)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@AP6YC & @jjerphan, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @jbytecode know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest

Checklists

📝 Checklist for @jjerphan

📝 Checklist for @AP6YC

editorialbot commented 1 year ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 1 year ago

Checking the BibTeX entries failed with the following error:

Failed to parse BibTeX entry: cite-key missing
editorialbot commented 1 year ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.13 s (231.1 files/s, 59013.8 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          14           1060           1799           3390
CSS                              1             83             67            370
Markdown                         3            105              0            338
TeX                              1             37              0            332
YAML                             4             21             21            118
reStructuredText                 3             26             36             34
DOS Batch                        1              8              1             26
JSON                             1              1              0             16
make                             1              4              7              9
TOML                             1              0              0              6
INI                              1              0              0              2
-------------------------------------------------------------------------------
SUM:                            31           1345           1931           4641
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 1 year ago

Wordcount for paper.md is 1727

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

jjerphan commented 1 year ago

Review checklist for @jjerphan

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

jbytecode commented 1 year ago

Dear @AP6YC and @jjerphan

This is the review thread. Firstly, type

@editorialbot generate my checklist

to generate your own checklist. In that checklist, there are 23 check items. Whenever you complete the corresponding task, you can check off them.

Please write your comments as separate posts and do not modify your checklist descriptions.

The review process is interactive so you can always interact with the authors, reviewers, and the editor. You can also create issues and pull requests in the target repo. Please do mention this thread's URL in the issues so we can keep tracking what is going on out of our world.

Please do not hesitate to ask me about anything, anytime.

Thank you in advance!

jbytecode commented 1 year ago

@jjerphan - Thank you for creating your checklist.

@AP6YC - Could you please create your checklist as I described above?

AP6YC commented 1 year ago

Review checklist for @AP6YC

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

jbytecode commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago

Checking the BibTeX entries failed with the following error:

Failed to parse BibTeX entry: cite-key missing
jbytecode commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

jjerphan commented 1 year ago

I started reviewing the HiPart in the last few days.

I am drafting https://github.com/panagiotisanagnostou/HiPart/pull/19 to provide explicit suggestions for the repository.

jbytecode commented 1 year ago

Dear reviewers @AP6YC and @jjerphan

Could you please update us on how is your review going?

Thank you in advance

jjerphan commented 1 year ago

Hi @jbytecode, I did a first review and need to find some time to get in the details for the last items which are not checked yet (e.g. reproducing results) and provide comprehensive feedback to authors. To me, I am on track.

jbytecode commented 1 year ago

@jjerphan - I am appreciated of your great effort and spending your valuable time for JOSS. Thank you for your quick response.

jjerphan commented 1 year ago

You are welcome. Ideally I will finish my review this weekend.

AP6YC commented 1 year ago

@jjerphan Sorry for the delay; I am prioritizing my review and suggestions to the authors to be complete this week.

jjerphan commented 1 year ago

A remark of Conflict of interest

I am one of the maintainers of scikit-learn which is providing reference implementation of clustering algorithms. I do not have any conflict of interest in the favor of the acceptance of the rejection of this paper: scikit-learn contributors and maintainers welcome qualitative third party packages extending scikit-learn in the ecosystem.

Last review items' clarification

As of https://github.com/panagiotisanagnostou/HiPart/tree/f6c0d3785741f752a5a2cecafcef1d1eb03347bb, I have not checked the following 7 items of the 21 items mainly because I have questions, remarks or suggestions to improve of them:

Bellow are my questions, remarks or suggestions for each of them.


  • Contribution and authorship: Has the submitting author (@panagiotisanagnostou) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

It is clear that @panagiotisanagnostou is the main major contributor to HiPart. @stevestavropoulos also has contributed notable to HiPart.

To validate this item I would like to know the rationale for other authors' presence given that authors are indicated as having contributed equally.

  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

I think HiPart is rather small as a package with perfectible aspects. To me, HiPart currently do not meet all the criteria for being eligible to be published in JOSS. Yet, I think it is on track and that this submission is appreciable and legitimate.

Perfectible aspects of HiPart are:

Functionality: Have the functional claims of the software been confirmed? Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

  • Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.

I unfortunately can't reproduce results (benchmarks and figures) using script in paper/paper_scripts. Some import paths are invalid and data is missing.

To validate this item, please do check that those scripts work (I think using relative paths might help prevent issues).

Moreover, from my reading of the implementations (which are mainly backed by NumPy), I hardly think implementations can be scalable or efficient for the targeted problems.

I can't say if HiPart is completely functional, is performant, and if claims are backed by reproducible results.

  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Guidelines for 2) and 3) are clear to me but guidelines for 1) are missing. Adding a simple contributing guide as CONTRIBUTING.md would suffice to me.

  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?

See https://github.com/panagiotisanagnostou/HiPart/pull/23 for remarks for this items.

jjerphan commented 1 year ago

@jjerphan Sorry for the delay; I am prioritizing my review and suggestions to the authors to be complete this week.

@AP6YC: I am not sure if you instead intended to mention @jbytecode. To me, it's fine: you still have at least two weeks left. :slightly_smiling_face:

jbytecode commented 1 year ago

@panagiotisanagnostou - could you please update your status and inform us on how is your edits going?

panagiotisanagnostou commented 1 year ago

@jbytecode - Tomorrow I will update you on my edits, they are almost ready. We tried to finilize each one of them before responding, apologies for the delay!

panagiotisanagnostou commented 1 year ago

In order to address the comments made by @jjerphan, I will respond by quoting his questions and providing my answers in the section below.

  • Contribution and authorship: Has the submitting author (@panagiotisanagnostou) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

It is clear that @panagiotisanagnostou is the main major contributor to HiPart. @stevestavropoulos also has contributed notable to HiPart.

To validate this item I would like to know the rationale for other authors' presence given that authors are indicated as having contributed equally.

The list of authors came after careful consideration of their contribution to the quality of the work done. Apparently the submitting author (@panagiotisanagnostou) is maintaining the GitHub repository but there are a lot of work done prior to uploading the source code. Prof. Sotiris Tasoulis was the original developer, prototyping the divisive clustering concept, so his insights into developing the basic design of the package were invaluable. Dr. Dimitris Tasoulis provided invaluable insight in the debugging process as a developer with industry experience, while also contributing in the preparation of the submitted manuscript. Finally, Prof. Vassilis Plagianakos contributed mainly to the visualization tools of the HiPart package with solutions that allow great expandability. We also highlight that all authors have published articles regarding divisive clustering, and thus contributed through their deep understanding and expertise of the domain.

  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

I think HiPart is rather small as a package with perfectible aspects. To me, HiPart currently do not meet all the criteria for being eligible to be published in JOSS. Yet, I think it is on track and that this submission is appreciable and legitimate.

Perfectible aspects of HiPart are:

  • its documentation: for now, there is a small number of undocumented examples. Notebook styles examples for all methods might improve HiPart approachability and adoption.

  • names containing typos and not PEP8-compliant: as an example, I think bicecting_kmeans must be renamed BisectingKmeans

  • lack comparisons with other libraries in the Scientific Python Ecosystem, in particular proof of functionality for the targeted problems (scalable methods for high dimensional data clustering) with respect to alternatives.

  • code quality: there's currently a lot of duplication on some files (see the ones graded as F). Due to the size of the library, some more effort can bring it to a A grade on Codacy.

Thank you for legitimizing our submission to the JOSS. Our goal for the following weeks is to add more methods and expand the documentation significantly. In the meantime, according to your suggestion, we have included much more well-presented examples in the documentation.

As far as your concerns on the PEP8 compliance and the quality of code, as you can see in the recent commits, we have addressed most issues.

The algorithm members of the HiPart package are binary space partition tree-based algorithms. Specifically, each split the algorithms conduct can be represented by a node in an unbalanced binary tree. The final clustering of the data in all the five algorithms are the samples contained as data in the leaves of the binary tree.

Moreover, as mentioned in the CONTRIBUTING.md the algorithms have the same three member methods in their implementation. Those method are, one to split the node into two children, one for selecting the nest node to split and one to calculate the data each node will contain. This fact has the unfortunate effect of creating duplication in the code, especially if there is a need to keep the same basic structure in a source code. Nevertheless, we tried to reduce the duplication errors as much as we can.

After extensive research in the Scientific Python Ecosystem, to the best of our research efforts, divisive clustering algorithmic implementations are missing. This fact limits our options for comparisons; Still, we now decided to include further comparisons with more generic clustering toolboxes.

Functionality: Have the functional claims of the software been confirmed? Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

  • Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.

I unfortunately can't reproduce results (benchmarks and figures) using script in paper/paper_scripts. Some import paths are invalid and data is missing.

To validate this item, please do check that those scripts work (I think using relative paths might help prevent issues).

Moreover, from my reading of the implementations (which are mainly backed by NumPy), I hardly think implementations can be scalable or efficient for the targeted problems.

I can't say if HiPart is completely functional, is performant, and if claims are backed by reproducible results.

Thank you for highlighting this problem. We have now included all files needed for reproduction with the paper_results_generation.py and the generate_paper_fig.py scripts. As far as the paper_scripts we uploaded the data in the repository in a compressed form.

We have now clarified the scalability of the included implementations for the target problems. We now clearly mention that the package focuses on Large Scale data that fit in memory and that the key aspect which makes it possible is the pass-by-reference nature of the Python programming language. In addition, we have also indicated that HiPart provide algorithmic options that differ significantly in complexity.

  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Guidelines for 2) and 3) are clear to me but guidelines for 1) are missing. Adding a simple contributing guide as CONTRIBUTING.md would suffice to me.

Thank you for the suggestion we created a CONTRIBUTING.md file. We believe that it contains all the necessary information for someone that wants to contribute to the repository to do so. We hope it is up to community standards.

  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?

See panagiotisanagnostou/HiPart#23 for remarks for this items.

I have added a response in the pull request @jjerphan created.

jbytecode commented 1 year ago

@AP6YC and @jjerphan - It would be nice to hear from you. Thank you in advance.

jjerphan commented 1 year ago

I will have a look tomorrow.

jjerphan commented 1 year ago

Last review items' clarification

As of panagiotisanagnostou/HiPart@efb7859, I have now checked:

I have not checked the following 4 items of the 21 items mainly because I still either have remark or need more time to assess it:


Responding to https://github.com/openjournals/joss-reviews/issues/5024#issuecomment-1412414533:

Contribution and authorship

Authorships is now clear. I guess part of this paragraph can be included in the project's README.md.

Substantial scholarly effort

Thank you for legitimizing our submission to the JOSS. Our goal for the following weeks is to add more methods and expand the documentation significantly. In the meantime, according to your suggestion, we have included much more well-presented examples in the documentation.

As far as your concerns on the PEP8 compliance and the quality of code, as you can see in the recent commits, we have addressed most issues.

Thank you, for mentioning this recent work.

The algorithm members of the HiPart package are binary space partition tree-based algorithms. Specifically, each split the algorithms conduct can be represented by a node in an unbalanced binary tree. The final clustering of the data in all the five algorithms are the samples contained as data in the leaves of the binary tree.

Moreover, as mentioned in the CONTRIBUTING.md the algorithms have the same three member methods in their implementation. Those method are, one to split the node into two children, one for selecting the nest node to split and one to calculate the data each node will contain. This fact has the unfortunate effect of creating duplication in the code, especially if there is a need to keep the same basic structure in a source code. Nevertheless, we tried to reduce the duplication errors as much as we can.

Thank you, I need more time to validate those aspects.

After extensive research in the Scientific Python Ecosystem, to the best of our research efforts, divisive clustering algorithmic implementations are missing. This fact limits our options for comparisons; Still, we now decided to include further comparisons with more generic clustering toolboxes.

Thank you, I need more time to validate this.

Functionality, Performance and Reproducibility

I still can't run the scripts:

 → python paper_scripts/generate_paper_fig.py 
Traceback (most recent call last):
  File "/home/jjerphan/dev/HiPart/paper_scripts/generate_paper_fig.py", line 29, in <module>
    import __utilities as util
ModuleNotFoundError: No module named '__utilities'
~/dev/HiPart HiPart ± paper
 → python paper_scripts/generate_paper_fig.py
~/dev/HiPart HiPart ± paper
 → python paper_scripts/paper_results_generation.py 
Traceback (most recent call last):
  File "/home/jjerphan/dev/HiPart/paper_scripts/paper_results_generation.py", line 26, in <module>
    import __utilities as util
ModuleNotFoundError: No module named '__utilities'

You need to fix the imports.

Community guidelines

Thank you for the suggestion we created a CONTRIBUTING.md file. We believe that it contains all the necessary information for someone that wants to contribute to the repository to do so. We hope it is up to community standards.

Thank you, this now qualifies to me.

Quality of writing

I have added a response in the pull request @jjerphan created.

Thank you, this now qualifies to me.


Edit:

@panagiotisanagnostou, I think providing working scripts would allow me to better access the remaining items. Do you think you can provide such scripts?

@jbytecode: are there any strict deadlines?

panagiotisanagnostou commented 1 year ago

Response to comment:

Functionality, Performance and Reproducibility

I still can't run the scripts:

 → python paper_scripts/generate_paper_fig.py 
Traceback (most recent call last):
  File "/home/jjerphan/dev/HiPart/paper_scripts/generate_paper_fig.py", line 29, in <module>
    import __utilities as util
ModuleNotFoundError: No module named '__utilities'
~/dev/HiPart HiPart ± paper
 → python paper_scripts/generate_paper_fig.py
~/dev/HiPart HiPart ± paper
 → python paper_scripts/paper_results_generation.py 
Traceback (most recent call last):
  File "/home/jjerphan/dev/HiPart/paper_scripts/paper_results_generation.py", line 26, in <module>
    import __utilities as util
ModuleNotFoundError: No module named '__utilities'

You need to fix the imports.

Thanks for bringing this up. It seems that, the problem you encountered with the scripts is that the working directory for the Python interpreter during execution should be the location of the script. We have updated the code, and this is no longer the case. The scripts are now executable from every directory in your system. In addition, the two required packages are installed automatically upon execution.

The results of the execution can be found in the folder that contains the scripts under the names “dendrogram.pdf” and “paper_results.csv”.

Thank you so much for your time.

jbytecode commented 1 year ago

@jjerphan - In page (https://joss.readthedocs.io/en/latest/editing.html#overview-of-editorial-process) it is stressed that

We aim for reviews to be completed within about 2-4 weeks. Please let me know if any of you require some more time. We can also use EditorialBot (our bot) to set automatic reminders if you know you'll be away for a known period of time.

of course, any exceptions and excuses are welcome.

jjerphan commented 1 year ago

Thanks for bringing this up. It seems that, the problem you encountered with the scripts is that the working directory for the Python interpreter during execution should be the location of the script. We have updated the code, and this is no longer the case. The scripts are now executable from every directory in your system. In addition, the two required packages are installed automatically upon execution.

@panagiotisanagnostou: I am afraid, but I am still meeting the problem using the latest revision of the main branch (i.e. 0af766c2896070cce7fa54059a0401e308160072) and the latest revision of paper branch (i.e. 85e1426c9e14f6dc084d3b2ec75dc37dc71472da). Can you provide the commit of the resolution?

panagiotisanagnostou commented 1 year ago

@jjerphan, The commit of the resolution can be located at the last commit of the Enhancements (i.e. f26080243e75e8aff0a6e88817a8f8bf180afacb). In this commit, you will see changes in three files. The "__utilities.py", the "generate_paper_fig.py" and the "paper_results_generation.py". The changes are (i) the change in the interpreter working directory, (ii) the try-except block for automatic installation of the "fuzzy-c-means" package, and the "h5py" package.

Please provide me with the error report in order to try and find a suitable solution. I really need help seeing what the problem might be.

jjerphan commented 1 year ago

OK, the problem was that I ran python in isolated mode. Disabling this mode allows importing from __utilities.py. Generally, I think it is better to use relative import à la:

from .local_submodule import an_interface

as it allows supporting this mode.

What do you think?

panagiotisanagnostou commented 1 year ago

The main problem in this mode is the following, as stated in isolated mode:

In isolated mode sys.path contains neither the script's directory nor the user's site-packages directory. All PYTHON* environment variables are ignored, too.

This resulted in the sys.path variable to contain the following paths:

Unfortunately, according to the above paths, there is no way to access the __utilities.py script in a relative way. The solution I used for this issue is to manually add the path to the script I execute to the sys.path variable, with the help of the __file__ dunder variable, and this solves the problem of the imports.

I hope that this is a satisfactory solution.

jbytecode commented 1 year ago

@jjerphan - Does that modification solve the problem encountered here?

jjerphan commented 1 year ago

There's no problem if the script is executed without the isolated mode.

jjerphan commented 1 year ago

Hi, I need some time to assess the "Substantial scholarly effort" item.

jjerphan commented 1 year ago

To me, the given elements for substantial scholarly effort are met:

Hence I am validating the last "Substantial scholarly effort" item. I have not time to provide room of improvements but think previous indications have helped already.

Thank you all for your patience.

jbytecode commented 1 year ago

@jjerphan - Thank you for finalizing your review.

@AP6YC - Could you please update your status? Thank you in advance.

panagiotisanagnostou commented 1 year ago

@jjerphan We appreciate the time and effort you put into completing this review. Thank you!

jbytecode commented 1 year ago

@AP6YC - Could you please update your status and inform us on how is your review going? Thank you in advance.

jbytecode commented 1 year ago

@AP6YC - It would be great to hear from you. Could you please update your status and send us a life signal? Thank you in advance. I am so sorry if I am bothering.

panagiotisanagnostou commented 1 year ago

@jbytecode I would like to include some new features in the package that are not documented in the paper I submitted. Can I do it while the package is under review? Thank you in advance for your time.

jbytecode commented 1 year ago

@panagiotisanagnostou - We have almost finished the review. It would be better to save your changes in a seperate branch and merge them after we have all of the stuff done. Thank you in advance.

AP6YC commented 1 year ago

@panagiotisanagnostou Sorry for the significant delay and oversight on my end; I am finishing up my review now after @jbytecode reached out to me directly.

panagiotisanagnostou commented 1 year ago

@jbytecode and @AP6YC Thank you very much! I already have a branch with some changes!

jbytecode commented 1 year ago

Post-Review Checklist for Editor and Authors

Additional Author Tasks After Review is Complete

Editor Tasks Prior to Acceptance

jbytecode commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago

Checking the BibTeX entries failed with the following error:

Failed to parse BibTeX entry: cite-key missing
jbytecode commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left: