Closed by editorialbot 1 month ago
Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.
For a list of things I can do to help you, just type:
@editorialbot commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@editorialbot generate pdf
Software report:
github.com/AlDanial/cloc v 1.90 T=0.09 s (1051.6 files/s, 236080.3 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
CSV 12 0 0 11084
Python 32 1001 2591 3596
TeX 1 32 36 397
Markdown 9 99 0 201
Ruby 1 28 12 106
reStructuredText 28 356 375 89
YAML 3 11 25 59
DOS Batch 1 8 1 26
JSON 1 0 0 23
TOML 1 5 0 23
make 1 4 7 9
-------------------------------------------------------------------------------
SUM: 90 1544 3047 15613
-------------------------------------------------------------------------------
Commit count by author:
267 Natacha, Galmiche
3 Natacha Galmiche
Paper file info:
📄 Wordcount for paper.md is 932
✅ The paper includes a Statement of need section
License info:
✅ License found: MIT License (Valid open source OSI approved license)
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1016/j.patcog.2012.07.021 is OK
- 10.1080/03610927408827101 is OK
- 10.18637/jss.v061.i06 is OK
- 10.1109/tpami.1979.4766909 is OK
- 10.1080/01969727408546059 is OK
- 10.1016/j.ins.2021.10.004 is OK
- 10.1016/j.patrec.2010.11.006 is OK
- 10.1007/3-540-45372-5_26 is OK
- 10.1109/icdm.2001.989517 is OK
- 10.1002/9780470316801.ch2 is OK
- 10.1016/j.patrec.2005.04.007 is OK
- 10.1109/tpami.2002.1114856 is OK
- 10.1007/978-3-540-45167-9_14 is OK
- 10.1016/j.patcog.2010.09.013 is OK
- 10.48550/ARXIV.1912.01703 is OK
- 10.1016/0377-0427(87)90125-7 is OK
- 10.1007/978-3-540-73499-4_14 is OK
- 10.1109/icdcsw.2011.20 is OK
- 10.48550/ARXIV.1909.07872 is OK
- 10.2307/2529577 is OK
- 10.1016/B978-1-59749-272-0.50018-9 is OK
- 10.1007/bf02289263 is OK
- 10.1111/1467-9868.00293 is OK
- 10.1080/01621459.1963.10500845 is OK
- 10.1109/34.85677 is OK
MISSING DOIs
- No DOI given, and none found for title: aeon
- No DOI given, and none found for title: Clustering benchmarks
- No DOI given, and none found for title: Using Dynamic Time Warping to Find Patterns in Tim...
- No DOI given, and none found for title: The UCR Time Series Classification Archive
- 10.1109/tit.1982.1056489 may be a valid DOI for title: Least squares quantization in PCM
- No DOI given, and none found for title: Scikit-learn: Machine Learning in Python
- No DOI given, and none found for title: Scikit-learn Extra
- No DOI given, and none found for title: TensorFlow: Large-Scale Machine Learning on Hetero...
- No DOI given, and none found for title: Tslearn, A Machine Learning Toolkit for Time Serie...
INVALID DOIs
- None
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@robcaulk, @wob86, @laurent-vouriot, @nglm this is the review thread for the paper. All of our communications will happen here (and on issues/prs on the submitted repository) from now on.
@robcaulk, @wob86, @laurent-vouriot, as reviewers, the first step for each of you is to create a checklist for your review by entering
@editorialbot generate my checklist
at the top of a new comment in this thread. These checklists contain the JOSS requirements. As you go over the submission, please check any items that you feel have been satisfied. The first comment in this thread also contains links to the JOSS reviewer guidelines.
The JOSS review is different from most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues and pull requests on the software repository. When doing so, please mention openjournals/joss-reviews#6841
so that a link is created to this thread (and so that I can keep an eye on what is happening!). Please also feel free to comment and ask questions in this thread. In my experience, it is better to post comments/questions/suggestions as you come across them instead of waiting until you have reviewed the entire package.
We aim for reviews to be completed within about 2-4 weeks. Please let me know if any of you require some more time. We can also use EditorialBot (our bot) to set automatic reminders if you know you'll be away for a known period of time.
Please feel free to ping me (@lrnv) if you have any questions/concerns.
@robcaulk, @wob86, @laurent-vouriot, how is it going since last week? Do not hesitate to tell me if there is anything I might be able to help with.
Hello, I just got around to working on this today. There are some issues preventing me from moving forward for now: https://github.com/nglm/pycvi/issues/4
Some other questions I have regarding the paper:
Please note that I would not consider myself a scholar in clustering. I am, however, a "power-user" of scikit-learn pipelines/clustering, so I am bringing a user perspective to this review. I would appreciate it if you could correct any of my misunderstandings around the use of variation of information and CVIs, as my questions may be naive. My aim here is to evaluate this repository from a practical perspective as well as from a tool-use perspective.
My principal question here relates to the "Substantial scholarly effort" checkbox, which concerns the likelihood of citation in academia.
If I understand correctly, this library is an aggregation of many existing CVI methods. I'm not convinced that an academic would need to cite this library instead of simply citing the individual clustering/CVI methods directly. Perhaps the argument is that an academic would cite both the method and the library. Ok, but I think the library in that case needs to bring something novel, like some "auto-selection" of the CVI based on some analytics of the clusters, or something (please see my note above about how I am not a scholar on clustering methods). It seems they are doing something with the variation of information (VI), but if I understood correctly, it is not used for automatically choosing the best CVI.
This leads to my next question: "why did they elect against contributing more CVI methods to scikit-learn?" In other words, "what exactly made a new library necessary?" If I understand correctly, scikit-learn already has 3 CVI methods, while this library has 12. Why not simply augment the offering in scikit-learn, which would open up these algorithms to a much wider user base?
In general, I think they need to highlight exactly what novelty this repository brings. I believe the novelty/answer to my above questions is hidden somewhere in this statement from the paper:
Unfortunately, DTW and DBA are not compatible with the libraries mentioned above.
and
Then, in PyCVI their definition is extended in order to make them compatible with DTW and DBA in addition to non time-series data.
But I cannot find any support or rationale for why DTW and DBA are not compatible with scikit-learn, or why this was a difficult implementation that needed a separate library. Again, I am not a scholar in clustering, so maybe this is obvious to the authors, but I think a couple of sentences on this contribution are key to the "Scholarly Contribution" of this library.
Another question that is not clear to me is the claim that:
PyCVI is entirely compatible with scikit-learn clustering pipelines
I was unable to locate an example showing the use of pycvi in a scikit-learn Pipeline. So I am curious what the authors meant by this, or if they can show an example to support the claim.
From a practical perspective, I was unable to run basic_usage.py; please review the steps that I took in the issue at the target repository:
Thank you @robcaulk and @wob86 for your feedback! I'll be working on it today and tomorrow.
Thank you all again for your feedback. I am now back from holidays and have started working on this review again.
In this post, I will answer @robcaulk's questions.
If I understand correctly, this library is an aggregation of many existing CVI methods. I'm not convinced that an academic would need to cite this library instead of simply citing the individual clustering/CVI methods directly.
The CVIs implemented in this library are indeed all already existing, but there is no available implementation of them, which slows down further research on the topic and lowers the quality of the scripts of general users, who have to rely on the very few methods available in sklearn. You then brought up the natural follow-up question: "why did they elect against contributing more CVI methods to scikit-learn?" I will detail the answer to this question below.
I think the library in that case needs to bring something novel, like some "auto-selection" of the CVI based on some analytics of the clusters, or something.
There is, to this day, not enough research on this topic to be able to select the right CVI for the right dataset, although some papers have initiated this effort, see for example [^1] or [^2]. Another obstacle is that, for this to work, one would have to obtain proper "analytics" of the clusters, which is not really feasible outside of toy examples generated to have specific characteristics, or very well studied datasets.
It seems they are doing something with the variation of information (VI), but if I understood correctly, it is not used for automatically choosing the best CVI.
VI can only be used when one knows the true clusters. It is therefore meant here to be used on known datasets in order to study the performance of CVIs or clustering methods (in an approach similar to the one detailed in [^2]), but it cannot be used in a real-life clustering setting, in which we precisely do not know the ground truth.
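To make this concrete, here is a generic, standalone illustration of the variation of information between two label partitions, based on the standard definition VI = H(A) + H(B) - 2 I(A, B). This is only a sketch for readers unfamiliar with the measure, not PyCVI's own implementation:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A) + H(B) - 2 I(A, B), in nats (0 iff the partitions are identical)."""
    h_a = entropy(np.unique(labels_a, return_counts=True)[1])
    h_b = entropy(np.unique(labels_b, return_counts=True)[1])
    return h_a + h_b - 2 * mutual_info_score(labels_a, labels_b)

true_labels = [0, 0, 1, 1, 2, 2]
print(variation_of_information(true_labels, [1, 1, 2, 2, 0, 0]))  # 0.0: same partition, relabelled
print(variation_of_information(true_labels, [0, 1, 0, 1, 0, 1]))  # > 0: a poorer computed clustering
```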
The first naive observation is that sklearn only allows datasets with the shape (n_samples, n_features). This could of course still work with time series if we either flattened the time series first, or contributed to sklearn to allow datasets of different shapes. The second option is more ambitious than it seems, given the size of the sklearn library and given how heavily machine learning libraries rely on a specific data shape to compute distances. The first option, flattening the data before using sklearn as is, would add some ugly lines of code, which in itself could be an acceptable sacrifice, but that is not the main problem. Still, naive as it is, this issue is already a reason to use aeon, tslearn or sktime instead of sklearn in the case of time-series data.
First of all, very few of the clustering methods implemented in sklearn are actually compatible with a custom distance function (the "metric" parameter). For example, KMeans, MeanShift and SpectralClustering are not, while AgglomerativeClustering and DBSCAN are. This is, by the way, one reason why users might use scikit-learn-extra (e.g. KMedoids or CLARA) instead of sklearn.
In addition, this custom distance function, when available, has to take (n_samples, n_features) data as input. Again, at the cost of a few more ugly lines of code, we could in theory define a distance function that first unflattens a time series (provided that we know the original shape) before calling the actual DTW distance function.
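For concreteness, a minimal sketch of that workaround (purely illustrative: the shapes, the eps value and the use of tslearn's dtw are my assumptions here, not PyCVI code):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from tslearn.metrics import dtw  # DTW distance between two time series

rng = np.random.default_rng(0)
length, dim = 50, 1
X_ts = rng.random((30, length, dim))     # toy time-series dataset
X_flat = X_ts.reshape(len(X_ts), -1)     # flatten to (n_samples, length * dim) for sklearn

def dtw_on_flat(a, b):
    # sklearn passes flattened 1D rows; unflatten before computing the DTW distance
    return dtw(a.reshape(length, dim), b.reshape(length, dim))

labels = DBSCAN(eps=2.0, metric=dtw_on_flat).fit_predict(X_flat)
```

Even with this trick, only the methods that accept a callable metric (such as DBSCAN or AgglomerativeClustering) are covered, and the averaging problem described next remains.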
Finally, these tricks might let us use sklearn with time-series data and DTW for the clustering methods that allow it, but this would still not solve the issue of how to define the average in the case of time-series data. DTW can compute pairwise distances, but it cannot directly compute the average of a group of time series. If the usual definition of the average is used instead (i.e. using the euclidean distance), then the resulting average time series might be very far from representing what we would expect the average time series to look like. For more explanation, and a more thorough motivation of this problem, I can recommend the original DBA paper by Petitjean et al. [^DBA], the corresponding GitHub repository, and the cluster center example in PyCVI. The concept of average is essential in clustering, as it is often used to create clusters (by comparing the distances to the different cluster centroids) but also to evaluate them (by computing, for example, the inertia or most CVIs). Supporting DTW-based averages would completely change the code of, for example, the KMeans or Ward algorithms. This is where time-series-adapted libraries such as aeon, sktime and tslearn become strictly necessary.
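As a small illustration of the averaging problem (using tslearn's DBA implementation; the toy data below are mine, not from PyCVI):

```python
import numpy as np
from tslearn.barycenters import dtw_barycenter_averaging

# Three identical bumps, shifted in time: a cluster whose members only differ by a time shift
t = np.linspace(0, 1, 100)
series = np.stack([np.exp(-((t - c) ** 2) / 0.002) for c in (0.3, 0.4, 0.5)])

euclidean_mean = series.mean(axis=0)         # smeared, multi-bump "average"
dba_mean = dtw_barycenter_averaging(series)  # DBA barycenter: a single, well-shaped bump

print(euclidean_mean.max())  # clearly below 1: the peaks cancel each other out
print(dba_mean.max())        # much closer to 1: the shape of the members is largely preserved
```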
I don't see where your quote comes from. My guess is that you refer to this sentence:
PyCVI is entirely compatible with scikit-learn, scikit-learn-extra, aeon and sktime, in order to be easily integrated into any clustering pipeline in python.
That sentence means something quite different and does not refer to the Pipeline class in sklearn, although I understand the confusion.
As explained above, sklearn is not compatible with time-series data, nor with distances adapted to time series. If a python user wants to use clustering methods that are adapted to time-series data, they have to switch to another library (e.g. aeon, sktime or tslearn), which results in differently structured scripts compared to when they are not working with time-series data (in which case they would use sklearn).
The first motivation of PyCVI is to fill the gap in terms of CVIs, and this originally has nothing to do with time-series data, since the gap is present even for non-time-series data. But the same reasons that made time-series-specific libraries necessary would also make a time-series-compatible CVI library necessary, if it were not designed to support both from the start.
PyCVI is compatible with both time-series and non-time-series data, with the exact same syntax and functions in both cases. In addition, PyCVI is compatible with sklearn as well as aeon and sktime, with the exact same syntax when combined with sklearn clustering methods as with aeon clustering methods. This means that users call PyCVI in the exact same way regardless of the type of data they are handling, and can integrate PyCVI snippets into their existing scripts in the exact same way regardless of the library originally used (sklearn, scikit-learn-extra, aeon or sktime).
See my reply on the corresponding issue, please let me know if there is anything more I should do on that matter.
[^1]: Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, pp. 982–994, June 2013.
[^2]: I. Gurrutxaga, J. Muguerza, O. Arbelaitz, J. M. Pérez, and J. I. Martín, "Towards a standard methodology to evaluate internal cluster validity indices," Pattern Recognition Letters, vol. 32, pp. 505–515, Feb. 2011.
[^DBA]: F. Petitjean, A. Ketterlin, and P. Gançarski, "A global averaging method for dynamic time warping, with applications to clustering," Pattern Recognition, vol. 44, pp. 678–693, Mar. 2011.
@nglm I saw that you wrote responses to the reviewers' questions. I have not read the discussion yet, but I am thinking that, maybe, you could (at a later stage of the review) curate those discussions as an FAQ in your docs. Some of your answers might be useful to the user base more generally.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Hi @nglm, some feedback on the paper:
Overall, the paper is well written and easy to understand. Here are my few concerns and suggestions:
I would define Variation of Information a bit more precisely, and earlier in the paper.
I understand that the aim of the library is not to define new CVIs and advocate for their quality, but to make all of them accessible and easy to use. One of the arguments is to facilitate research in the field of new CVIs, clustering and evaluation methods, as mentioned:
This illustrates the need of further research on CVIs, which is facilitated by PyCVI, notably in the case of concentric subgroups.
Therefore, when the quality of the clustering selected by the CVI is poor it can then either be due to the clustering method or due to the CVI, hence the necessity of robust evaluation pipeline for both clustering methods and CVIs.
and here :
The poor results shown in the corresponding histogram \autoref{fig:agglo-hist} tells us that the CVI used here are not adapted to this dataset.
However, I would add an example showing CVIs of good quality, just for the sake of displaying a context where they can be meaningful.
In terms of scholarly effort, the paper aligns well with most of the JOSS guidelines, particularly in the area of "making addressing research challenges significantly better (e.g., faster, easier, simpler)". Although I do not specialize in clustering, colleagues who do have noted the lack of accessible CVIs in standard libraries. They usually implement their own versions for their studies and have expressed interest in this library.
A potential concern is how the paper will be cited. As noted by @robcaulk, users might prefer citing the original CVI papers.
In your full example, all CVIs are computed as follows:
```python
for cvi_class in CVIs:
    # Instantiate a CVI model
    cvi = cvi_class()

    t_start = time.time()
    print(f" ====== {cvi} ====== ")

    # Compute CVI values for all clusterings
    scores = compute_all_scores(
        cvi,
        X,
        clusterings,
        DTW=DTW,
        scaler=StandardScaler(),
    )
```
How about implementing some kind of wrapper, analogous to sklearn's scorer, in your library? This wrapper could allow running multiple CVIs of choice at once, providing a scoring strategy more aligned with sklearn's usual practices. This would offer a more structured and reusable way to handle multiple CVIs, making the process more direct and user friendly. It may also encourage citing the library when users select CVIs using PyCVI.
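For instance, a purely hypothetical sketch of the kind of wrapper meant here (not an existing PyCVI or sklearn API; it assumes compute_all_scores is imported as in the full example above):

```python
def make_multi_cvi_scorer(cvi_classes, **kwargs):
    """Return a callable that runs several CVIs at once, in the spirit of sklearn scorers (hypothetical)."""
    def scorer(X, clusterings):
        return {
            cvi_class.__name__: compute_all_scores(cvi_class(), X, clusterings, **kwargs)
            for cvi_class in cvi_classes
        }
    return scorer

# Hypothetical usage: scorer = make_multi_cvi_scorer(CVIs, scaler=StandardScaler())
#                     all_scores = scorer(X, clusterings)
```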
Additionally, putting a stronger focus on the time series clustering aspect could also help differentiate the contribution and encourage citations.
Once again, thank you so much for the very useful feedback! I have now updated the library, the paper and the documentation in an attempt to answer @laurent-vouriot's questions and concerns. This post tries to summarize the different updates.
A new class, pycvi.cvi.CVIAggregator, and an example were added to the documentation and to the paper, showing successful CVIs for a given dataset (see details on pycvi.cvi.CVIAggregator below).

By scorer-like wrapper, I mean (based on what I understand from your post) a sklearn-like scorer as described here and there, which can be summarized as a callable that takes (estimator, X, y) as input (whereas in PyCVI the best clustering is selected with "k_selected = cvi.select(scores)"). Note that we did attempt to extend some existing CVIs precisely to make their outputs more interpretable and comparable with other CVIs, but in vain for most of them. There were a few CVIs for which we could do it and run experiments suggesting that this extension is relevant, but this is ongoing work and outside the scope of this JOSS submission.

Problem number 3 is the one that makes this feature impossible to implement for all CVIs. One possibility would be to make the feature available only for compatible CVIs, but this would probably discourage users from using the other CVIs, which is also something we would rather avoid. I am however willing to do it if you think that missing this feature would actually discourage users from using the PyCVI library at all. Please let me know what you think about it.
In the meantime, to facilitate the use of multiple CVIs at once and to encourage citations, I have now implemented the pycvi.cvi.CVIAggregator class, which allows users to combine several CVIs of their choice and to easily select the best clustering using a majority-vote rule. The CVIAggregator syntax for computing scores and for selecting the best clustering is completely identical to that of any individual CVI. To illustrate and showcase this added feature, I have added an example to the documentation, "CVIAggregator: Combining CVIs", and I have updated the paper.
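As a rough sketch of the intended usage (reusing the X, clusterings, DTW and StandardScaler variables from the full example quoted earlier; the exact constructor argument of CVIAggregator is an assumption based on the description above, so please refer to the "CVIAggregator: Combining CVIs" example in the documentation for the actual API):

```python
from pycvi.cvi import CVIAggregator

# Combine several CVIs of choice (passing a list of CVI classes is assumed here)
aggregator = CVIAggregator(CVIs[:5])

# Exactly the same syntax as with a single CVI
scores = compute_all_scores(aggregator, X, clusterings, DTW=DTW, scaler=StandardScaler())
k_selected = aggregator.select(scores)  # best clustering by majority vote among the combined CVIs
```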
In summary, the main additions are the pycvi.cvi.CVIAggregator class and an example featuring pycvi.cvi.CVIAggregator.

@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@editorialbot commands
Hello @lrnv, here are the things you can ask me to do:
# List all available commands
@editorialbot commands
# Add to this issue's reviewers list
@editorialbot add @username as reviewer
# Remove from this issue's reviewers list
@editorialbot remove @username from reviewers
# Get a list of all editors' GitHub handles
@editorialbot list editors
# Assign a user as the editor of this submission
@editorialbot assign @username as editor
# Remove the editor assigned to this submission
@editorialbot remove editor
# Remind an author, a reviewer or the editor to return to a review after a
# certain period of time (supported units days and weeks)
@editorialbot remind @reviewer in 2 weeks
# Adds a checklist for the reviewer using this command
@editorialbot generate my checklist
# Set a value for version
@editorialbot set v1.0.0 as version
# Set a value for branch
@editorialbot set joss-paper as branch
# Set a value for repository
@editorialbot set https://github.com/organization/repo as repository
# Set a value for the archive DOI
@editorialbot set 10.5281/zenodo.6861996 as archive
# Mention the EiCs for the correct track
@editorialbot ping track-eic
# Run checks and provide information on the repository and the paper file
@editorialbot check repository
# Check the references of the paper for missing DOIs
@editorialbot check references
# Generates the pdf paper
@editorialbot generate pdf
# Recommends the submission for acceptance
@editorialbot recommend-accept
# Generates a LaTeX preprint file
@editorialbot generate preprint
# Flag submission with questionable scope
@editorialbot query scope
# Get a link to the complete list of reviewers
@editorialbot list reviewers
# Creates a post-review checklist with editor and authors tasks
@editorialbot create post-review checklist
# Open the review issue
@editorialbot start review
@editorialbot check repository
Software report:
github.com/AlDanial/cloc v 1.90 T=0.11 s (921.4 files/s, 327951.6 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
CSV 16 0 0 26706
Python 36 1197 3074 4256
TeX 1 32 36 397
Markdown 12 142 0 267
reStructuredText 32 390 405 114
Ruby 1 28 12 106
YAML 3 11 25 59
TOML 1 7 3 29
DOS Batch 1 8 1 26
JSON 1 0 0 23
make 1 4 7 9
-------------------------------------------------------------------------------
SUM: 105 1819 3563 31992
-------------------------------------------------------------------------------
Commit count by author:
318 Natacha, Galmiche
3 Natacha Galmiche
Paper file info:
📄 Wordcount for paper.md is 1780
✅ The paper includes a Statement of need section
License info:
✅ License found: MIT License (Valid open source OSI approved license)
@nglm The word count for the paper doubled, why is that? In JOSS, we aim for 1000 words max and 1-to-2-page papers; this is now way too many figures and too much text.
@lrnv This was a slow accumulation of a few additions that were suggested during the review process.
I will try to make it more concise, but to get back under 1000 words, I am afraid some of the explanations will have to be dropped.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@lrnv I have shortened the paper (from 1780 counted words to 1244). Among those 1244 remaining words, more than 100 are actually not visible in the PDF but consist of the metadata and the reference tags in the markdown, so the PDF now has fewer than 1144 visible words. I also removed some figures and combined others. The text itself takes up less than three pages; the rest is references and figures.
I am aware that it is still slightly above the recommendation, but shortening it even more would come at the cost of clarity and of the improvements we have made based on the reviews.
@nglm I appreciate that. Without even looking at the text, I see a few things:
1) Reducing the figure sizes is indeed a good idea, but it sometimes produces weird output with a lot of empty space. See e.g. page 2, which is mostly empty; maybe you could reformat the figure so that it fills the page? See screenshot:
2) Figures 3 and 4 could be one figure with proper faceting.
3) Figures 1 and 2 could also be one figure with proper faceting.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@editorialbot generate pdf
@lrnv I have updated the figures. (And sorry for spamming with the bot)
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@robcaulk, @wob86, @laurent-vouriot, your three reviews still have unchecked boxes; could you please either check them or give me an update on what is happening?
@robcaulk, @wob86, @laurent-vouriot, gentle bump
Sorry, I had not had a chance to update; I am happy with all the changes made and I am happy to approve this now. Thanks.
Hi @lrnv and @nglm. I just read the updated version of the paper. My key concerns were taken into consideration, and I'm happy with the results. A last few very minor concerns about the paper:
There exists many mature libraries in python for machine learning and in particular clustering scikit-learn ...
It should be "exist" and I would add ":" between clustering and scikit-learn.
PyCVI fills that gap by implementing 12 state-of-the-art internal CVIs: [...] . Then, in PyCVI their definition is extended ...
Starting the sentence with "Then" feels a bit awkward to me.
High VI values there mean large distances between the true clustering and the computed clusterings, meaning poor computed clusterings
"there mean" sounds akward, and "poor computed clustering" also. poor results, poor quality, bad quality ?
Since the generated clusterings are poor...
Same concern.
Aside from these small details, everything looks good to me, and I'm glad to approve it.
Thanks for the effort on this, I am OK with the updates.
Hi all @lrnv @robcaulk @wob86 @laurent-vouriot. Thank you once more for your help throughout this review! I have now updated the paper according to @laurent-vouriot's comments.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@editorialbot set <DOI here> as archive
@editorialbot set <version here> as version
@editorialbot generate pdf
@editorialbot check references and ask author(s) to update as needed
@editorialbot recommend-accept
@nglm could you please handle the post-review items on the checklist above and report back to me once this is done?
Hi @lrnv This is now done!
Submitting author: @nglm (Natacha Galmiche)
Repository: https://github.com/nglm/pycvi
Branch with paper.md (empty if default branch):
Version: v0.1.4
Editor: @lrnv
Reviewers: @robcaulk, @wob86, @laurent-vouriot
Archive: 10.5281/zenodo.13838192
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@robcaulk & @wob86 & @laurent-vouriot, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create your checklist:

@editorialbot generate my checklist
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @lrnv know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Checklists
📝 Checklist for @robcaulk
📝 Checklist for @laurent-vouriot
📝 Checklist for @wob86