Closed by editorialbot 1 month ago
Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.
For a list of things I can do to help you, just type:
@editorialbot commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@editorialbot generate pdf
Software report:
github.com/AlDanial/cloc v 1.90 T=0.09 s (1051.6 files/s, 236080.3 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
CSV 12 0 0 11084
Python 32 1001 2591 3596
TeX 1 32 36 397
Markdown 9 99 0 201
Ruby 1 28 12 106
reStructuredText 28 356 375 89
YAML 3 11 25 59
DOS Batch 1 8 1 26
JSON 1 0 0 23
TOML 1 5 0 23
make 1 4 7 9
-------------------------------------------------------------------------------
SUM: 90 1544 3047 15613
-------------------------------------------------------------------------------
Commit count by author:
267 Natacha, Galmiche
3 Natacha Galmiche
Paper file info:
📄 Wordcount for paper.md is 932
✅ The paper includes a Statement of need section
License info:
✅ License found: MIT License (Valid open source OSI approved license)
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1016/j.patcog.2012.07.021 is OK
- 10.1080/03610927408827101 is OK
- 10.18637/jss.v061.i06 is OK
- 10.1109/tpami.1979.4766909 is OK
- 10.1080/01969727408546059 is OK
- 10.1016/j.ins.2021.10.004 is OK
- 10.1016/j.patrec.2010.11.006 is OK
- 10.1007/3-540-45372-5_26 is OK
- 10.1109/icdm.2001.989517 is OK
- 10.1002/9780470316801.ch2 is OK
- 10.1016/j.patrec.2005.04.007 is OK
- 10.1109/tpami.2002.1114856 is OK
- 10.1007/978-3-540-45167-9_14 is OK
- 10.1016/j.patcog.2010.09.013 is OK
- 10.48550/ARXIV.1912.01703 is OK
- 10.1016/0377-0427(87)90125-7 is OK
- 10.1007/978-3-540-73499-4_14 is OK
- 10.1109/icdcsw.2011.20 is OK
- 10.48550/ARXIV.1909.07872 is OK
- 10.2307/2529577 is OK
- 10.1016/B978-1-59749-272-0.50018-9 is OK
- 10.1007/bf02289263 is OK
- 10.1111/1467-9868.00293 is OK
- 10.1080/01621459.1963.10500845 is OK
- 10.1109/34.85677 is OK
MISSING DOIs
- No DOI given, and none found for title: aeon
- No DOI given, and none found for title: Clustering benchmarks
- No DOI given, and none found for title: Using Dynamic Time Warping to Find Patterns in Tim...
- No DOI given, and none found for title: The UCR Time Series Classification Archive
- 10.1109/tit.1982.1056489 may be a valid DOI for title: Least squares quantization in PCM
- No DOI given, and none found for title: Scikit-learn: Machine Learning in Python
- No DOI given, and none found for title: Scikit-learn Extra
- No DOI given, and none found for title: TensorFlow: Large-Scale Machine Learning on Hetero...
- No DOI given, and none found for title: Tslearn, A Machine Learning Toolkit for Time Serie...
INVALID DOIs
- None
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@robcaulk, @wob86, @laurent-vouriot, @nglm this is the review thread for the paper. All of our communications will happen here (and on issues/prs on the submitted repository) from now on.
@robcaulk, @wob86, @laurent-vouriot, as reviewers, the first step for each of you is to create a checklist for your review by entering
@editorialbot generate my checklist
at the top of a new comment in this thread. These checklists contain the JOSS requirements. As you go over the submission, please check any items that you feel have been satisfied. The first comment in this thread also contains links to the JOSS reviewer guidelines.
The JOSS review is different from most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues and pull requests on the software repository. When doing so, please mention openjournals/joss-reviews#6841
so that a link is created to this thread (and so that I can keep an eye on what is happening!). Please also feel free to comment and ask questions in this thread. In my experience, it is better to post comments/questions/suggestions as you come across them instead of waiting until you have reviewed the entire package.
We aim for reviews to be completed within about 2-4 weeks. Please let me know if any of you require some more time. We can also use EditorialBot (our bot) to set automatic reminders if you know you'll be away for a known period of time.
Please feel free to ping me (@lrnv) if you have any questions/concerns.
@robcaulk, @wob86, @laurent-vouriot, how is it going since last week? Do not hesitate to tell me if there is anything I might be able to help with.
Hello, I just got around to working on this today. There are some issues preventing me from moving forward for now: https://github.com/nglm/pycvi/issues/4
Some other questions I have regarding the paper:
Please note that I would not consider myself a scholar in clustering. I am, however, a "power-user" of scikit-learn pipelines/clustering, so I am bringing a user perspective to this review. I would appreciate it if you could correct any of my misunderstandings around the use of variation of information and CVIs, as my questions may be naive. My aim here is to evaluate this repository from a practical perspective as well as from a tool-use perspective.
My principal question here relates to the "Substantial scholarly effort" checkbox, which concerns the likelihood of citation in academia.
If I understand correctly, this library is an aggregation of many existing CVI methods. I'm not convinced that an academic would need to cite this library instead of simply citing the individual clustering/CVI methods directly. Perhaps the argument is that an academic would cite both the method and the library. Ok, but I think the library in that case needs to bring something novel, like some "auto-selection" of the CVI based on some analytics of the clusters, or something (please see my note above about how I am not a scholar on clustering methods). It seems they are doing something with the variation of information (VI), but if I understood correctly, it is not used for automatically choosing the best CVI.
This leads to my next question: "why did they elect against contributing more CVI methods to scikit-learn?" In other words, "what exactly made a new library necessary?" If I understand correctly, scikit-learn already has 3 CVI methods, while this library has 12. Why not simply augment the offering in scikit-learn, which would open up these algorithms to a much wider user base?
In general, I think they need to highlight exactly what novelty this repository brings. I believe the novelty/answer to my above questions is hidden somewhere in this statement from the paper:
Unfortunately, DTW and DBA are not compatible with the libraries mentioned above.
and
Then, in PyCVI their definition is extended in order to make them compatible with DTW and DBA in addition to non time-series data.
But I cannot find any support or rationale for why DTW and DBA are not compatible with scikit-learn, or why this was a difficult implementation that needed a separate library. Again, I am not a scholar in clustering, so maybe this is obvious to the authors, but I think a couple of sentences on this contribution are key to the "Scholarly Contribution" of this library.
Another question that is not clear to me is the claim that:
PyCVI is entirely compatible with scikit-learn clustering pipelines
I was unable to locate an example showing the use of pycvi in a scikit-learn Pipeline. So I am curious what the authors meant by this, or if they can show an example to support the claim.
From a practical perspective, I was unable to run basic_usage.py; please review the steps that I took in the issue at the target repository:
Thank you @robcaulk and @wob86 for your feedback! I'll be working on it today and tomorrow.
Thank you all again for your feedback. I am now back from holidays and have started working on this review again.
In this post, I will answer @robcaulk's questions.
If I understand correctly, this library is an aggregation of many existing CVI methods. I'm not convinced that an academic would need to cite this library instead of simply citing the individual clustering/CVI methods directly.
The CVIs implemented in this library are indeed all already existing, but there is no available implementation of them, which slows down further research on the topic and lowers the quality of the scripts of general users, who have to rely on the very few methods available in sklearn. You then brought up the natural follow-up question: "why did they elect against contributing more CVI methods to scikit-learn?" I will detail the answer to this question below.
I think the library in that case needs to bring something novel, like some "auto-selection" of the CVI based on some analytics of the clusters, or something.
There is, to this day, not enough research on this topic to be able to select the right CVI for the right dataset, although some papers have initiated this effort, see for example [^1] or [^2]. Another obstacle is that, for this to work, one would have to obtain proper "analytics" of the clusters, which is not really feasible outside of toy examples generated to have specific characteristics, or very well studied datasets.
It seems they are doing something with the variation of information (VI), but if I understood correctly, it is not used for automatically choosing the best CVI.
VI can only be used when one knows the true clusters. It is therefore meant here to be used on known datasets in order to study the performance of CVIs or clustering methods (in an approach similar to the one detailed in [^2]), but it cannot be used in a real-life clustering setting, in which we precisely do not know the ground truth.
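To make this concrete, here is a generic, standalone illustration of the variation of information between two label partitions, based on the standard definition VI = H(A) + H(B) - 2 I(A, B). This is only a sketch for readers unfamiliar with the measure, not PyCVI's own implementation:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A) + H(B) - 2 I(A, B), in nats (0 iff the partitions are identical)."""
    h_a = entropy(np.unique(labels_a, return_counts=True)[1])
    h_b = entropy(np.unique(labels_b, return_counts=True)[1])
    return h_a + h_b - 2 * mutual_info_score(labels_a, labels_b)

true_labels = [0, 0, 1, 1, 2, 2]
print(variation_of_information(true_labels, [1, 1, 2, 2, 0, 0]))  # 0.0: same partition, relabelled
print(variation_of_information(true_labels, [0, 1, 0, 1, 0, 1]))  # > 0: a poorer computed clustering
```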
The first naive observation is that sklearn only allows datasets with the shape (n_samples, n_features). This could of course still work with time series if we either flattened the time series first, or contributed to sklearn to allow datasets of different shapes. The second option is more ambitious than it seems, given the size of the sklearn library and given how heavily machine learning libraries rely on a specific data shape to compute distances. The first option, flattening the data before using sklearn as is, would add some ugly lines of code, which in itself could be an acceptable sacrifice, but that is not the main problem. Still, naive as it is, this issue is already a reason to use aeon, tslearn or sktime instead of sklearn in the case of time-series data.
First of all, very few of the clustering methods implemented in sklearn are actually compatible with a custom distance function (the "metric" parameter). For example, KMeans, MeanShift and SpectralClustering are not, while AgglomerativeClustering and DBSCAN are. This is, by the way, one reason why users might use scikit-learn-extra (e.g. KMedoids or CLARA) instead of sklearn.
In addition, this custom distance function, when available, has to take (n_samples, n_features) data as input. Again, at the cost of a few more ugly lines of code, we could in theory define a distance function that first unflattens a time series (provided that we know the original shape) before calling the actual DTW distance function.
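For concreteness, a minimal sketch of that workaround (purely illustrative: the shapes, the eps value and the use of tslearn's dtw are my assumptions here, not PyCVI code):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from tslearn.metrics import dtw  # DTW distance between two time series

rng = np.random.default_rng(0)
length, dim = 50, 1
X_ts = rng.random((30, length, dim))     # toy time-series dataset
X_flat = X_ts.reshape(len(X_ts), -1)     # flatten to (n_samples, length * dim) for sklearn

def dtw_on_flat(a, b):
    # sklearn passes flattened 1D rows; unflatten before computing the DTW distance
    return dtw(a.reshape(length, dim), b.reshape(length, dim))

labels = DBSCAN(eps=2.0, metric=dtw_on_flat).fit_predict(X_flat)
```

Even with this trick, only the methods that accept a callable metric (such as DBSCAN or AgglomerativeClustering) are covered, and the averaging problem described next remains.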
Finally, these tricks might let us use sklearn with time-series data and DTW for the clustering methods that allow it, but this would still not solve the issue of how to define the average in the case of time-series data. DTW can compute pairwise distances, but it cannot directly compute the average of a group of time series. If the usual definition of the average is used instead (i.e. using the euclidean distance), then the resulting average time series might be very far from representing what we would expect the average time series to look like. For more explanation, and a more thorough motivation of this problem, I can recommend the original DBA paper by Petitjean et al. [^DBA], the corresponding GitHub repository, and the cluster center example in PyCVI. The concept of average is essential in clustering, as it is often used to create clusters (by comparing the distances to the different cluster centroids) but also to evaluate them (by computing, for example, the inertia or most CVIs). Supporting DTW-based averages would completely change the code of, for example, the KMeans or Ward algorithms. This is where time-series-adapted libraries such as aeon, sktime and tslearn become strictly necessary.
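As a small illustration of the averaging problem (using tslearn's DBA implementation; the toy data below are mine, not from PyCVI):

```python
import numpy as np
from tslearn.barycenters import dtw_barycenter_averaging

# Three identical bumps, shifted in time: a cluster whose members only differ by a time shift
t = np.linspace(0, 1, 100)
series = np.stack([np.exp(-((t - c) ** 2) / 0.002) for c in (0.3, 0.4, 0.5)])

euclidean_mean = series.mean(axis=0)         # smeared, multi-bump "average"
dba_mean = dtw_barycenter_averaging(series)  # DBA barycenter: a single, well-shaped bump

print(euclidean_mean.max())  # clearly below 1: the peaks cancel each other out
print(dba_mean.max())        # much closer to 1: the shape of the members is largely preserved
```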
I don't see where your quote comes from. My guess is that you refer to this sentence:
PyCVI is entirely compatible with scikit-learn, scikit-learn-extra, aeon and sktime, in order to be easily integrated into any clustering pipeline in python.
That sentence means something quite different and does not refer to the Pipeline class in sklearn, although I understand the confusion.
As explained above, sklearn is not compatible with time-series data, nor with distances adapted to time series. If a python user wants to use clustering methods that are adapted to time-series data, they have to switch to another library (e.g. aeon, sktime or tslearn), which results in differently structured scripts compared to when they are not working with time-series data (in which case they would use sklearn).
The first motivation of PyCVI is to fill the gap in terms of CVIs, and this originally has nothing to do with time-series data, since the gap is present even for non-time-series data. But the same reasons that made time-series-specific libraries necessary would also make a time-series-compatible CVI library necessary, if it were not designed to support both from the start.
PyCVI is compatible with both time-series and non-time-series data, with the exact same syntax and functions in both cases. In addition, PyCVI is compatible with sklearn as well as aeon and sktime, with the exact same syntax when combined with sklearn clustering methods as with aeon clustering methods. This means that users call PyCVI in the exact same way regardless of the type of data they are handling, and can integrate PyCVI snippets into their existing scripts in the exact same way regardless of the library originally used (sklearn, scikit-learn-extra, aeon or sktime).
See my reply on the corresponding issue, please let me know if there is anything more I should do on that matter.
[^1]: Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, pp. 982–994, June 2013.
[^2]: I. Gurrutxaga, J. Muguerza, O. Arbelaitz, J. M. Pérez, and J. I. Martín, "Towards a standard methodology to evaluate internal cluster validity indices," Pattern Recognition Letters, vol. 32, pp. 505–515, Feb. 2011.
[^DBA]: F. Petitjean, A. Ketterlin, and P. Gançarski, "A global averaging method for dynamic time warping, with applications to clustering," Pattern Recognition, vol. 44, pp. 678–693, Mar. 2011.
@nglm I saw that you wrote responses to the reviewers' questions. I have not read the discussion yet, but I am thinking that, maybe, you could (at a later stage of the review) curate those discussions as an FAQ in your docs. Some of your answers might be useful to the user base more generally.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Hi @nglm, some feedback on the paper:
Overall, the paper is well written and easy to understand. Here are my few concerns and suggestions:
I would define Variation of Information a bit more precisely, and earlier in the paper.
I understand that the aim of the library is not to define new CVIs and advocate for their quality, but to make all of them accessible and easy to use. One of the arguments is to facilitate research in the field of new CVIs, clustering and evaluation methods, as mentioned:
This illustrates the need of further research on CVIs, which is facilitated by PyCVI, notably in the case of concentric subgroups.
Therefore, when the quality of the clustering selected by the CVI is poor it can then either be due to the clustering method or due to the CVI, hence the necessity of robust evaluation pipeline for both clustering methods and CVIs.
and here :
The poor results shown in the corresponding histogram \autoref{fig:agglo-hist} tells us that the CVI used here are not adapted to this dataset.
However, I would add an example showing CVIs of good quality, just for the sake of displaying a context where they can be meaningful.
In terms of scholarly effort, the paper aligns well with most of the JOSS guidelines, particularly in the area of "making addressing research challenges significantly better (e.g., faster, easier, simpler)". Although I do not specialize in clustering, colleagues who do have noted the lack of accessible CVIs in standard libraries. They usually implement their own versions for their studies and have expressed interest in this library.
A potential concern is how the paper will be cited. As noted by @robcaulk, users might prefer citing the original CVI papers.
In your full example, all CVIs are computed as follows:
```python
for cvi_class in CVIs:
    # Instantiate a CVI model
    cvi = cvi_class()

    t_start = time.time()
    print(f" ====== {cvi} ====== ")

    # Compute CVI values for all clusterings
    scores = compute_all_scores(
        cvi,
        X,
        clusterings,
        DTW=DTW,
        scaler=StandardScaler(),
    )
```
How about implementing some kind of wrapper, analogous to sklearn's scorer, in your library? This wrapper could allow running multiple CVIs of choice at once, providing a scoring strategy more aligned with sklearn's usual practices. This would offer a more structured and reusable way to handle multiple CVIs, making the process more direct and user friendly. It may also encourage citing the library when users select CVIs using PyCVI.
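For instance, a purely hypothetical sketch of the kind of wrapper meant here (not an existing PyCVI or sklearn API; it assumes compute_all_scores is imported as in the full example above):

```python
def make_multi_cvi_scorer(cvi_classes, **kwargs):
    """Return a callable that runs several CVIs at once, in the spirit of sklearn scorers (hypothetical)."""
    def scorer(X, clusterings):
        return {
            cvi_class.__name__: compute_all_scores(cvi_class(), X, clusterings, **kwargs)
            for cvi_class in cvi_classes
        }
    return scorer

# Hypothetical usage: scorer = make_multi_cvi_scorer(CVIs, scaler=StandardScaler())
#                     all_scores = scorer(X, clusterings)
```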
Additionally, putting a stronger focus on the time series clustering aspect could also help differentiate the contribution and encourage citations.
Once again, thank you so much for the very useful feedback! I have now updated the library, the paper and the documentation in an attempt to answer @laurent-vouriot's questions and concerns. This post tries to summarize the different updates.
A new class, pycvi.cvi.CVIAggregator, and an example were added to the documentation and to the paper, showing successful CVIs for a given dataset (see details on pycvi.cvi.CVIAggregator below).

By scorer-like wrapper, I mean (based on what I understand from your post) a sklearn-like scorer as described here and there, which can be summarized as a callable that takes (estimator, X, y) as input (whereas in PyCVI the best clustering is selected with "k_selected = cvi.select(scores)"). Note that we did attempt to extend some existing CVIs precisely to make their outputs more interpretable and comparable with other CVIs, but in vain for most of them. There were a few CVIs for which we could do it and run experiments suggesting that this extension is relevant, but this is ongoing work and outside the scope of this JOSS submission.

Problem number 3 is the one that makes this feature impossible to implement for all CVIs. One possibility would be to make the feature available only for compatible CVIs, but this would probably discourage users from using the other CVIs, which is also something we would rather avoid. I am however willing to do it if you think that missing this feature would actually discourage users from using the PyCVI library at all. Please let me know what you think about it.
In the meantime, to facilitate the use of multiple CVIs at once and to encourage citations, I have now implemented the pycvi.cvi.CVIAggregator class, which allows users to combine several CVIs of their choice and to easily select the best clustering using a majority-vote rule. The CVIAggregator syntax for computing scores and for selecting the best clustering is completely identical to that of any individual CVI. To illustrate and showcase this added feature, I have added an example to the documentation, "CVIAggregator: Combining CVIs", and I have updated the paper.
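As a rough sketch of the intended usage (reusing the X, clusterings, DTW and StandardScaler variables from the full example quoted earlier; the exact constructor argument of CVIAggregator is an assumption based on the description above, so please refer to the "CVIAggregator: Combining CVIs" example in the documentation for the actual API):

```python
from pycvi.cvi import CVIAggregator

# Combine several CVIs of choice (passing a list of CVI classes is assumed here)
aggregator = CVIAggregator(CVIs[:5])

# Exactly the same syntax as with a single CVI
scores = compute_all_scores(aggregator, X, clusterings, DTW=DTW, scaler=StandardScaler())
k_selected = aggregator.select(scores)  # best clustering by majority vote among the combined CVIs
```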
In summary, the main additions are the pycvi.cvi.CVIAggregator class and an example featuring pycvi.cvi.CVIAggregator.

@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@editorialbot commands
Hello @lrnv, here are the things you can ask me to do:
# List all available commands
@editorialbot commands
# Add to this issue's reviewers list
@editorialbot add @username as reviewer
# Remove from this issue's reviewers list
@editorialbot remove @username from reviewers
# Get a list of all editors' GitHub handles
@editorialbot list editors
# Assign a user as the editor of this submission
@editorialbot assign @username as editor
# Remove the editor assigned to this submission
@editorialbot remove editor
# Remind an author, a reviewer or the editor to return to a review after a
# certain period of time (supported units days and weeks)
@editorialbot remind @reviewer in 2 weeks
# Adds a checklist for the reviewer using this command
@editorialbot generate my checklist
# Set a value for version
@editorialbot set v1.0.0 as version
# Set a value for branch
@editorialbot set joss-paper as branch
# Set a value for repository
@editorialbot set https://github.com/organization/repo as repository
# Set a value for the archive DOI
@editorialbot set 10.5281/zenodo.6861996 as archive
# Mention the EiCs for the correct track
@editorialbot ping track-eic
# Run checks and provide information on the repository and the paper file
@editorialbot check repository
# Check the references of the paper for missing DOIs
@editorialbot check references
# Generates the pdf paper
@editorialbot generate pdf
# Recommends the submission for acceptance
@editorialbot recommend-accept
# Generates a LaTeX preprint file
@editorialbot generate preprint
# Flag submission with questionable scope
@editorialbot query scope
# Get a link to the complete list of reviewers
@editorialbot list reviewers
# Creates a post-review checklist with editor and authors tasks
@editorialbot create post-review checklist
# Open the review issue
@editorialbot start review
@editorialbot check repository
Software report:
github.com/AlDanial/cloc v 1.90 T=0.11 s (921.4 files/s, 327951.6 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
CSV 16 0 0 26706
Python 36 1197 3074 4256
TeX 1 32 36 397
Markdown 12 142 0 267
reStructuredText 32 390 405 114
Ruby 1 28 12 106
YAML 3 11 25 59
TOML 1 7 3 29
DOS Batch 1 8 1 26
JSON 1 0 0 23
make 1 4 7 9
-------------------------------------------------------------------------------
SUM: 105 1819 3563 31992
-------------------------------------------------------------------------------
Commit count by author:
318 Natacha, Galmiche
3 Natacha Galmiche
Paper file info:
📄 Wordcount for paper.md is 1780
✅ The paper includes a Statement of need section
License info:
✅ License found: MIT License (Valid open source OSI approved license)
@nglm The word count for the paper doubled, why is that? In JOSS, we aim for 1000 words max and 1-to-2-page papers; this is now way too many figures and too much text.
@lrnv This was a slow accumulation of a few additions that were suggested during the review process.
I will try to make it more concise, but to get back under 1000 words, I am afraid some of the explanations will have to be dropped.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@lrnv I have shortened the paper (from 1780 counted words to 1244). Among those 1244 remaining words, more than 100 are actually not visible in the PDF but consist of the metadata and the reference tags in the markdown, so the PDF now has fewer than 1144 visible words. I also removed some figures and combined others. The text itself takes up less than three pages; the rest is references and figures.
I am aware that it is still slightly above the recommendation, but shortening it even more would come at the cost of clarity and of the improvements we have made based on the reviews.
@nglm I appreciate that. Without even looking at the text, I see a few things:
1) Reducing the figure sizes is indeed a good idea, but it sometimes produces weird output with a lot of empty space. See e.g. page 2, which is mostly empty; maybe you could reformat the figure so that it fills the page? See screenshot:
2) Figures 3 and 4 could be one figure with proper faceting.
3) Figures 1 and 2 could also be one figure with proper faceting.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@editorialbot generate pdf
@lrnv I have updated the figures. (And sorry for spamming with the bot)
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@robcaulk, @wob86, @laurent-vouriot, your three reviews still have unchecked boxes; could you please either check them or give me an update on what is happening?
@robcaulk, @wob86, @laurent-vouriot, gentle bump
Sorry, I had not had a chance to update; I am happy with all the changes made and I am happy to approve this now. Thanks.
Hi @lrnv and @nglm. I just read the updated version of the paper. My key concerns were taken into consideration, and I'm happy with the results. A last few very minor concerns about the paper:
There exists many mature libraries in python for machine learning and in particular clustering scikit-learn ...
It should be "exist" and I would add ":" between clustering and scikit-learn.
PyCVI fills that gap by implementing 12 state-of-the-art internal CVIs: [...] . Then, in PyCVI their definition is extended ...
Starting the sentence with "Then" feels a bit awkward to me.
High VI values there mean large distances between the true clustering and the computed clusterings, meaning poor computed clusterings
"there mean" sounds akward, and "poor computed clustering" also. poor results, poor quality, bad quality ?
Since the generated clusterings are poor...
Same concern.
Aside from these small details, everything looks good to me, and I'm glad to approve it.
Thanks for the effort on this, I am OK with the updates.
Hi all @lrnv @robcaulk @wob86 @laurent-vouriot. Thank you once more for your help throughout this review! I have now updated the paper according to @laurent-vouriot's comments.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@editorialbot set <DOI here> as archive
@editorialbot set <version here> as version
@editorialbot generate pdf
@editorialbot check references and ask author(s) to update as needed
@editorialbot recommend-accept
@nglm could you please handle the post-review items on the checklist above and report back to me once this is done?
Hi @lrnv This is now done!
Submitting author: @nglm (Natacha Galmiche)
Repository: https://github.com/nglm/pycvi
Branch with paper.md (empty if default branch):
Version: v0.1.4
Editor: @lrnv
Reviewers: @robcaulk, @wob86, @laurent-vouriot
Archive: 10.5281/zenodo.13838192
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@robcaulk & @wob86 & @laurent-vouriot, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create your checklist:

@editorialbot generate my checklist
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @lrnv know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Checklists
📝 Checklist for @robcaulk
📝 Checklist for @laurent-vouriot
📝 Checklist for @wob86