pyOpenSci / software-submission

Submit your package for review by pyOpenSci here! If you have questions please post them here: https://pyopensci.discourse.group/

cardsort: analyzing data from open card sorting tasks #102

Closed: katoss closed this issue 11 months ago

katoss commented 1 year ago

Submitting Author: (@katoss)
All current maintainers: (@katoss)
Package Name: cardsort
One-Line Description of Package: A python package to analyse data from open card sorting tasks
Repository Link: https://github.com/katoss/cardsort
Version submitted: 0.2.2
Editor: @Batalex
Reviewer 1: @Robaina
Reviewer 2: @khynder
Archive: DOI
JOSS DOI: N/A
Version accepted: v0.2.36
Date accepted (month/day/year): 08/16/2023


Code of Conduct & Commitment to Maintain Package

Description

Card sorting is a standard research method in the human-computer interaction field to design information architectures. The cardsort package analyzes and visualizes data from open card sorting tasks. Using CSV data in the format returned by the kardsort tool (or any other tool outputting the same columns), it produces a dendrogram based on hierarchical cluster analysis and pairwise similarity scores. It can also return category labels, to show which labels study participants gave to combinations of cards from emerging clusters.
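A minimal usage sketch (the function names are those discussed later in this review; the exact signatures and the CSV file name are assumptions and may differ from the released API):

import pandas as pd
from cardsort import get_distance_matrix, create_dendrogram, get_cluster_labels

# kardsort-style export with columns: card_id, card_label, category_id, category_label, user_id
df = pd.read_csv("kardsort_export.csv")

distance_matrix = get_distance_matrix(df)             # pairwise (dis)similarity across participants
create_dendrogram(distance_matrix)                    # hierarchical cluster analysis + dendrogram plot
get_cluster_labels(df, cluster_cards=["Dog", "Cat"])  # labels participants gave to this card group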

Scope

Domain Specific & Community Partnerships

- [ ] Geospatial
- [ ] Education
- [ ] Pangeo

Community Partnerships

If your package is associated with an existing community please check below:

[^1]: Please fill out a pre-submission inquiry before submitting a data visualization package.

The target audience is researchers and practitioners in human-computer interaction and user experience. Open card sorting is a popular user research method used to understand how people group and conceptualize information when designing information architectures. To make sense of the data, clusters are often visualized in the form of a dendrogram. To do so, pairwise similarity scores are calculated for all cards, followed by a hierarchical cluster analysis. The cardsort package provides this functionality. It can also return category labels, so researchers can learn which labels study participants gave to combinations of cards in the emerging clusters.

I did not find any Python packages, or other open source tools, that accomplish this, which was my motivation to build this package while running a card sorting study during my PhD. While research articles describe the steps of the analysis (pairwise similarity scores --> hierarchical cluster analysis --> dendrogram; see the articles linked in the question above), I did not find any open source tools that help carry it out. I collected my data with the widely used kardsort.com tool, which points users to rather dated, closed-source Windows software for analysis, which reinforces my impression that open source tooling is lacking. The approach itself is fairly simple: scipy handles the hierarchical cluster analysis and dendrogram, a custom function creates the similarity scores, and everything is wrapped in an easy-to-use pipeline. Nevertheless, I think cardsort can help remove barriers to applying this user research method: it is easy to use even for Python beginners, and there is no need for closed-source software that only runs on Windows or for expensive subscription tools. In any case, I would have liked to have this package when I started my study :)
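For concreteness, here is a self-contained sketch of that pipeline using only pandas, numpy, and scipy. It illustrates the general approach described above, not the package's exact implementation; the toy data, the averaging of distances across participants, and the linkage method are all assumptions:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Toy data in the kardsort column format: one row per card per participant
df = pd.DataFrame({
    "card_id":        [1, 2, 3, 1, 2, 3],
    "card_label":     ["Dog", "Cat", "Apple", "Dog", "Cat", "Apple"],
    "category_label": ["pets", "pets", "food", "animals", "animals", "food"],
    "user_id":        [1, 1, 1, 2, 2, 2],
})

labels = ["Dog", "Cat", "Apple"]  # card labels in card_id order
n = len(labels)

# Pairwise distance: fraction of participants who placed two cards in different categories
dist = np.zeros((n, n))
for _, df_user in df.groupby("user_id"):
    cats = df_user.sort_values("card_id")["category_label"].to_numpy()
    dist += (cats[:, None] != cats[None, :]).astype(float)
dist /= df["user_id"].nunique()

# Hierarchical cluster analysis and dendrogram
Z = linkage(squareform(dist), method="average")
dendrogram(Z, labels=labels, orientation="left")
plt.show()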


Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication Options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option allows reviewers to open smaller issues that can then be linked to PRs, rather than submitting a denser, text-based review. It also allows you to demonstrate that you addressed an issue via PR links.

Confirm each of the following by checking the box.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

Batalex commented 1 year ago

Hey @katoss, I am Alex, and I will be acting as the editor-in-chief for this submission. I am looking forward to working with you 👋 Over this weekend, I will review the checklist below to see if we need to address anything before beginning the review process.

Batalex commented 1 year ago

Editor in Chief checks

Hi there! Thank you for submitting your package for pyOpenSci review. Below are the basic checks that your package needs to pass to begin our review. If some of these are missing, we will ask you to work on them before the review process begins.

Please check our Python packaging guide for more information on the elements below.



Editor comments

katoss commented 1 year ago

Thank you @Batalex ! I will try to address your comments in the course of the week. :)

katoss commented 1 year ago

Hi @Batalex, thank you very much for your feedback. I tried addressing everything, but have some open questions:

The version in pyproject.toml does not match the latest release. Is there any way to use semantic release to update the pyproject.toml file based on the tag? If not, maybe we should set the version here to something like "0.0.0" to indicate that this is not the true version.

Normally semantic release should bump the version automatically, but I keep running into issues. It only seems to work via getting the release version from the GitHub tag. I set the version to 0.0.0 for now, as you suggested. I will try to find a better solution, but for now at least the version is correctly updated on github and pypi.

This is on the verge of reviewing the code, so I will not go any further at the moment, but the dependency constraints are quite strict:

pandas = "^1.5.3"
numpy = "^1.24.2"
matplotlib = "^3.7.1"
scipy = "^1.10.1"

You could increase the pool of potential users by loosening the constraints. Based on what I saw in the sources, we do not need the latest versions of these four packages.

I am sorry if this is a stupid question, but how do I know which are the minimum versions of packages I can allow without breaking my code? I tried using pipdeptree, but did not find it very helpful, because it only tells me the dependencies of the currently installed package versions, not whether I could go lower.

  • myst-nb should be a dev dependency

True, thank you. Fixed it.

I believe the imports would be clearer if we just used the functionalities from the top module namespace, e.g. from cardsort import get_distance_matrix or import cardsort; cardsort.get_distance_matrix(), instead of propagating analysis all the way up to the user code. This is quite easy: you just need to import the functions in __init__.py.

I wasn't aware that this was possible. Thank you, fixed it!

Best,

@katoss

Batalex commented 1 year ago

I am sorry if this is a stupid question, but how do I know which are the minimum versions of packages I can allow without breaking my code?

This is an excellent question! And a truly difficult one to answer in Python, given the dynamic nature of the language. Maybe someday, a tool will be able to parse the source code and match it against the dependencies changelogs, but I am dreaming 🫠

If the dependencies were perfectly following the semver convention, you could use the latest major releases for each dependency. This is extreme in a practical setting, so I would not advise going so far.

You are using but a few basic numpy functions, so you should be fine with the same constraint as pandas. The 1.5.X series defines the following minimal dependencies: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.5.0.html#backwards-incompatible-api-changes. You should be good using:

pandas = "^1.5.0"
numpy = "^1.20.3"
matplotlib = "^3.3.2"
scipy = "^1.7.1"

Maybe we could go even lower with pandas, but this change is enough IMO. With the recent release of pandas 2.0.0, it might also be worth loosening the upper constraint so that users can use the latest version. Based on what I see, this should not be an issue for cardsort.
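One way to gain confidence in lower bounds like these is to run the test suite against the minimum pins, for example with a nox session (a sketch; the pins simply mirror the versions above, and the session name and Python version are arbitrary):

# noxfile.py
import nox


@nox.session(python="3.9")
def minimum_versions(session):
    """Run the test suite against the lowest allowed dependency versions."""
    session.install(
        "pandas==1.5.0",
        "numpy==1.20.3",
        "matplotlib==3.3.2",
        "scipy==1.7.1",
        "pytest",
    )
    # Install cardsort itself without letting pip upgrade the pinned dependencies
    session.install(".", "--no-deps")
    session.run("pytest")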

katoss commented 1 year ago

Thank you for the explanations! So it is really not that easy. For now, I reduced the constraints to the versions you indicated.

NickleDave commented 1 year ago

Hi @Batalex @katoss just adding one more possibly useful piece of info: this Scientific Python Ecosystem Coordination (SPEC) document specifies when packages should drop support for versions of core scientific Python packages, which includes your dependencies @katoss:
https://scientific-python.org/specs/spec-0000/

Unfortunately it does not say what is the minimum that you should support 🤷 but I basically go with the minimum for whatever Python is still supported -- right now that's Python 3.9 -> numpy 1.21.0, etc. ...

It looks like you ended up with roughly those versions as your lower bounds anyways.

Not to overwhelm you with info, but if you don't know about it already you might want to check out https://iscinumpy.dev/post/bound-version-constraints/. For now this probably won't matter (we won't get scipy version 2 anytime soon), but in general you might want to use loose lower bounds (e.g. pandas = ">=1.5.0") instead of the poetry-specific caret operator (pandas = "^1.5.0").

Batalex commented 1 year ago

Hey @katoss! You are not done with me yet, as I will be the editor for this review! 🙌 I will be recruiting reviewers in the coming days/weeks.

Batalex commented 1 year ago

:wave: Hi @robaina and @khynder! Thank you for volunteering to review for pyOpenSci! I am thrilled to have you both on board for this review. Feel free to introduce yourselves to our very own submitter @katoss 🤗

@katoss, meet @robaina, whom I had the pleasure of working with on his submission to pyOpenSci. As for @khynder, I have enjoyed working with her over the past five years and can vouch for her dedication and passion for science.

The following resources will help you complete your review:

  1. Here is the reviewers guide. This guide contains all of the steps and information needed to complete your review.
  2. Here is the review template that you will need to fill out and submit here as a comment, once your review is complete.

Please get in touch with any questions or concerns! Your review is due: on June 5th.

Reviewers: @robaina, @khynder Due date: June 5th, 2023

Robaina commented 1 year ago

Hi @katoss, nice to meet you! I'll be one of the reviewers of cardsort. I had such a nice experience when I submitted my package to pyOpenSci. Hope you do too! Looking forward to the review.

katoss commented 1 year ago

Nice to meet you @Robaina and @khynder! Thank you for taking the time to review my package :) Looking forward to working with you.

Robaina commented 1 year ago

Hi @Batalex, @katoss,

a quick question: according to the pyOpenSci docs, I understand that I should review cardsort v0.2.2 and not the latest release (v0.2.12), since this is the version that was submitted to pyOpenSci, right? Just to be sure.

Batalex commented 1 year ago

Hey, This is indeed a tricky question. On the one hand, it would be inconvenient to ask submitting authors to freeze the project development during submission. On the other hand, reviewing the complete code base is already very time-consuming for the reviewers; they do not have the time to review the whole code repeatedly. We are going to proceed with the following approach:

The rationale behind this approach is that we want to ensure the review scope stays more or less the same. No one wants to commit to a review on a fixed scope only to discover that the goalposts moved a few weeks later. Furthermore, some of the reviewers' comments might already be addressed in the development branch. The submitting author would then be the most likely to answer with something like "thanks for the comment. Are you satisfied with the solution we made a few days ago on ecb645 / PR # 32?"

Based on what I saw, the most recent releases deal with a few mishaps on the packaging side. Most of the code base remains untouched so this approach should be fine!

Robaina commented 1 year ago

Alright! Thanks for the explanation

Robaina commented 1 year ago

Hi @katoss, @Batalex,

⭐ I have completed the review ⭐

I find cardsort a nice contribution! You have put a lot of work into structuring and documenting the package; it has sufficient tests and a cool CI/CD pipeline 🚀🚀. I have some comments, mostly suggestions to make the code more efficient and readable. I hope these will be useful to you! Please let me know if something isn't clear from my comments. Also, I'm willing to help you implement these suggestions with PRs if you want.

Ok, here it goes:

CHECKLIST:

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

It is succinct but sufficient. However, I think the repo would benefit from a little more explanation for non-experts. This may be as easy as providing a link to a URL explaining why card sorting is useful in UX design.

The instructions are located within the 'CONTRIBUTING.md' file. I would add the link to this file directly in the Contributing section of the README.md file.

The documentation is extensive and hosted on readthedocs.io. I suggest providing the URL to the docs in the About section of the repo to increase visibility.

I suggest adding the link to the contributing file in the contributing section of the README.md file.

Readme file requirements The package meets the readme requirements below:

The README should include, from top to bottom:

The package name, 'cardsort', is missing from the title of the README file. Also, if you like, you could create a simple logo for the package to attract more users (see this repo as an example).

There are no badges in the README. You can read about badges here in case you are not familiar. Also, I can help with this if you need it.

As stated above, I suggest adding the URL to the About section of the repo to increase visibility.

No citation info was provided. I suggest publishing the repo on Zenodo and adding the citation and the DOI to the README.md file. Once published, Zenodo provides a badge containing the DOI, which you can display in the badge section of the README file. You could also add a CITATION.cff file to the repo so users can easily get the citation string.

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

While the purpose of this package is indicated in the README, I think a little more explanation for non-experts would be useful. This may be as easy as providing a link to a URL explaining why card sorting is useful in UX design.

Functionality

As far as I can tell, the package works as expected.

For packages also submitting to JOSS

Not applicable.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 3.5


Review Comments

General comments:

  1. While not necessary, I think using type hints would improve the readability of the functions. (Comments 1-3 are illustrated together in the short sketch after this list.)

  2. Inconsistent use of .query and .loc as selectors. I would recommend using .loc in all cases as the code is not using complex queries and .loc may be more readable for Python users.

  3. PEP 8 recommends using format strings instead of concatenation, as it improves readability. Overall, I would recommend using a formatter/linter such as black or ruff (highly recommended) to ensure the code follows PEP 8. The linter could also be added to the CI pipeline (let me know if you need help with this).

  4. There are several cases in which a while loop followed by a try-except block is used. This choice makes the code more complex than necessary. In these cases, it can be replaced by a simpler loop, removing the need for the try-except block (see below).

  5. `example-data.csv` is located within /docs; you could indicate the relative path in the example to help users find the file within the repo.
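To make comments 1-3 concrete, here is a small hypothetical helper (not taken from the cardsort source) that combines type hints, .loc selection, and an f-string:

import pandas as pd


def _get_user_data(df: pd.DataFrame, user_id: int) -> pd.DataFrame:
    """Return the rows belonging to a single participant (hypothetical helper)."""
    # Boolean-mask .loc instead of .query
    df_user = df.loc[df["user_id"] == user_id]
    # f-string instead of string concatenation
    print(f"User {user_id} sorted {len(df_user)} cards")
    return df_user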

Detailed comments for each function:

_get_distance_matrix_for_user
  1. There is no need to update the index variables i and j within the loop; their values are set by the for loop itself.

  2. Given that the distance matrix is symmetric, you could save computational time by only calculating the upper triangle of the matrix and then copying the values to the lower triangle. For example:

for i in range(n):
    for j in range(i, n):
        cat1 = ...
        cat2 = ...
        X[i, j] = X[j, i] = 0 if cat1 == cat2 else 1

Note that you could also write the if - else statement in a single line as shown above.

  1. Perhaps using loc instead of query would be more readable for Python users, we are not making a complex query here anyway.
get_distance_matrix
  1. You could replace the while loop with a for loop and avoid the need to update the index variable, id, within the loop. To this end, you could first find all unique user_ids in the dataframe: user_ids = df['user_id'].unique(), and then loop over them: for id in user_ids:.
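As a generic illustration of that pattern (the helper name is hypothetical and not part of cardsort):

import pandas as pd


def _iter_user_frames(df: pd.DataFrame):
    """Yield the sub-DataFrame for each participant, one at a time."""
    for id in df["user_id"].unique():
        yield df.loc[df["user_id"] == id]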
create_dendrogram
  1. You could assign a value to color_threshold in a single line as follows:
color_threshold = 0.75 if color_threshold is None else color_threshold

and similarly for the second definition, to simplify the nested if-else statements.

  1. You could also write the if-else statement in plt.xticks in a single line:
 plt.xticks(np.arange(0.0, 1.1, 0.1) if x_max <= 1 else np.arange(0, x_max + 1, 1))
_get_cluster_label_for_user
  1. It seems that index is not used within the function, hence enumerate and updating its value within the loop (index += 1) are not needed.

  2. The try-except block adds complexity and it isn't really needed. You could replace it with an if-else block:

if card in df_u["card_label"].values:
    ...
else:
    print(f'"{card}" is not a valid card label. Removed from list.')
    cluster_cards.remove(card)
    print("Continue with cards: %s" % cluster_cards)
_get_cards_for_label
  1. As mentioned above, I would recommend using loc.
get_cluster_labels
  1. The while loop and the try-except block add complexity and are not necessary here. You could replace the while loop with a for loop after listing all unique user_ids (similar to the suggestion above for get_distance_matrix). This way, the try-except block can be removed as we can guarantee that df_u is not empty (also getting rid of the UnboundLocalError which isn't specific and may be confusing)

  2. PEP-8 recommends using format strings instead of concatenation. For example, you could replace the following line:

print("User " + str(id) + " labeled card(s): " + cluster_label)

with:

print(f"User {id} labeled card(s): {cluster_label}")
get_cluster_labels_df
  1. Similarly to the previous function, I would recommend using a for loop instead of a while loop and removing the try-except block to simplify the code and improve readability.
katoss commented 1 year ago

Thanks a lot for your review @Robaina ! I had a quick look over your comments, and will try to address them over the weekend :)

Batalex commented 1 year ago

That was fast 🚀 Thanks a bunch, @Robaina!

katoss commented 1 year ago

Hi @Robaina, thanks a lot again for the thorough review! 🌟

I have started to address your comments, mainly so far by adapting the functions:

Thank you for helping me improve the code quality! Please let me know if I managed to adapt the functions as you imagined.

Regarding the rest of the feedback, I will try to address it as soon as possible.

katoss commented 1 year ago

I started to wonder about a more general question regarding versioning. My CI/CD pipeline publishes a new version with each commit. Yesterday, for example, I went from 0.2.12 to 0.2.20 by making a commit for each change (since I did not want to make the commits too big). Based on your experience @Robaina @Batalex, is this the usual way to go, or is it too much?

Robaina commented 1 year ago

Hi @Robaina, thanks a lot again for the thorough review! 🌟

I have started to address your comments, mainly so far by adapting the functions:

  • _get_distance_matrix_for_user (commits: 0c3b4dd)
  • get_distance_matrix (commits: 269a52a, e6be892)
  • create_dendrogram (commits: 8fc8679)
  • _get_cluster_label_for_user (commits: 7bbdb84)
  • _get_cards_for_label (commits: b882486)
  • get_cluster_labels (commits: 031ae0d)
  • get_cluster_labels_df (commits: 2ed2dba)
  • I also added the link to the docs to the about section of the repository

Thank you for helping me improve the code quality! Please let me know if I managed to adapt the functions as you imagined.

Regarding the rest of the feedback, I will try to address it as soon as possible.

Great! Glad my comments were useful! I'll take a deeper look at them later and come back to you.

Robaina commented 1 year ago

I started to wonder about a more general question regarding versioning. My CI/CD pipeline publishes a new version with each commit. Yesterday, for example, I went from 0.2.12 to 0.2.20 by making a commit for each change (since I did not want to make the commits too big). Based on your experience @Robaina @Batalex, is this the usual way to go, or is it too much?

Based on my experience I would say this is too much; I would update the version only after a significant change to the codebase. Of course, "significant" is subjective. Here are some guidelines for semantic versioning.

On the other hand, I actually forgot to ask you about your CI/CD pipeline the other day. I observed that currently the package is built and uploaded to PyPI on every push to the main branch. I am by no means an expert in DevOps, but I think that if you are directly pushing every commit to main, then building the package and uploading it to PyPI each time is perhaps too much. I would do this only after collecting some changes that address a specific issue (or issues). This workflow would be fine if there were another branch, say dev (for development), meant to collect changes, with the CD workflow triggered on merging dev into main.

Definitely @Batalex will know more about this.

katoss commented 1 year ago

Ok, thank you! Then my intuition wasn't wrong. I followed this tutorial to create the package and CI/CD, and did not question the way they did it, since it was the first time for me setting all of this up. Is it possible to trigger the whole CD manually so it only publishes a new version when I want it to? Or maybe the easiest thing would be that I create a development branch and collect the changes there, as you said.

Robaina commented 1 year ago

Perhaps the easiest solution is to trigger the PyPI upload only on release, i.e. automatically after a new release of the repo (with a new version number). I could do a PR with this if you like.

katoss commented 1 year ago

@Robaina please go ahead if you like, that'd be great!

Batalex commented 1 year ago

Or maybe the easiest thing would be that I create a development branch and collect the changes there, as you said.

This is the approach I usually use. I think however that Robaina's approach is easier to implement in this case. The main change would be to run the GH workflow cd on tags rather than on every commit (and letting semantic-release create a new tag anyway).

It might look like this

on:
  push:
    tags:
      - 'v[0-9]+.[0-9]+.[0-9]+'
Robaina commented 1 year ago

Interesting, I was thinking of using:

on:
  release:
    types: [published]

This should trigger the workflow only after a new release is created. Either way should work, though!

Batalex commented 1 year ago

Both approaches should work. The "release published" one would work on all kinds of releases, whereas the tags one is limited by the regexp. The latter has the benefit of being actionable with basic git commands:

git tag x.y.z
git push origin x.y.z

In both cases, we would need to run semantic-release ourselves to get the proper "x.y.z" to use

Robaina commented 1 year ago

Hi @katoss ,

great work! I reviewed your commits and left some minor comments. Only one is related to efficiency, in _get_distance_matrix_for_user: you forgot to iterate only over the upper triangle.

I suggest that you use GitHub issues and pull requests in future interactions, it's easier to get code reviewed by a peer this way. Also, the code can be reviewed before a merge is done (for instance from a dev branch to main).

Cheers

katoss commented 1 year ago

Only one is related to efficiency, in _get_distance_matrix_for_user: you forgot to iterate only over the upper triangle.

I suggest that you use GitHub issues and pull requests in future interactions, it's easier to get code reviewed by a peer this way. Also, the code can be reviewed before a merge is done (for instance from a dev branch to main).

Thank you @Robaina ! I fixed the function in a new branch. I invited you and @Batalex as collaborators so I can assign you as reviewers to a pull request

katoss commented 1 year ago

And with regard to the CD, the line that will have to be changed is this one, and the rest can stay the same?

# Only run this job if new work is pushed to the "main" branch
if: github.event_name == 'push' && github.ref == 'refs/heads/main'

Batalex commented 1 year ago

And with regard to the CD, the line that will have to be changed is this one, and the rest can stay the same?

With this configuration, you would make a new release each time main is updated. You might want to restrict releases to git tags or GitHub releases (which are based on git tags).

katoss commented 1 year ago

Ok @Batalex, thank you for the clarification. I have never worked with tags, but am trying to understand them.

Is it right that I would need to put this code in the CD pipeline:

on:
  push:
    tags:
      - 'v[0-9]+.[0-9]+.[0-9]+'

instead of

 # Only run this job if new work is pushed to the "main" branch
 if: github.event_name == 'push' && github.ref == 'refs/heads/main'

?

And then I can assign tags to commits in the command line? If I understood right, this is done retrospectively, e.g. I decide at some point that everything up to commit abcdef123456 shall be included in the new release, so I'd do git tag v1.0.0 abcdef123456, push the tag, and then the release should be triggered?

khynder commented 1 year ago

Hi @katoss @Batalex @Robaina !

Here is my review: First of all, I really like the idea behind this package; the use of a dendrogram to visualize the card sorting is intelligent and useful. The package is well documented and easy to read and use! Congrats 🎉

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme file requirements The package meets the readme requirements below:

The README should include, from top to bottom:

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that a badge for the pyOpenSci peer review will be provided upon acceptance.)

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

Functionality

For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 5 hours


Review Comments

General remarks:

It would be safer if a function were added to check that the data format is correct, especially if the package is used with data that does not come from kardsort.com. More specifically:
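For example, a minimal sketch of such a check (the helper name is hypothetical; the expected columns are the ones used throughout this review):

import pandas as pd

EXPECTED_COLUMNS = {"card_id", "card_label", "category_id", "category_label", "user_id"}


def _check_input_format(df: pd.DataFrame) -> None:
    """Raise a clear error if the input does not look like a kardsort-style export."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Input data is missing expected column(s): {sorted(missing)}")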

Function by function:

_get_distance_matrix_for_user:

It can be written as a list comprehension if you want (no obligation; if you feel more at ease with the current code, keep it 😊)

df_user = df_user.set_index("card_id")
X = np.array(
    [
        [
            df_user.loc[i + 1, "category_label"]
            != df_user.loc[j + 1, "category_label"]
            for i in range(n)
        ]
        for j in range(n)
    ]
).astype(int)

get_distance_matrix:

create_dendrogram:

when you define labels

labels = df.loc[df["user_id"] == 1]["card_label"].squeeze().to_list()

It is very important that the order of the card_labels in labels is exactly the same as the one used to create the distance_matrix. The distance_matrix is created with the card_ids in the order 1, 2, 3, 4, …, n. So labels should be: first the card_label corresponding to card_id = 1, then the card_label corresponding to card_id = 2, then the card_label corresponding to card_id = 3, etc. The above code line works fine if the card_ids for user_id 1 are in the order 1, 2, 3, etc., but does not work if the card_ids are shuffled. I'm not sure if I'm clear, so I tried with an example: if you replace the results for user_id 1 in test_data with:

card_id  card_label  category_id  category_label  user_id
9        Mooncake    4            Moon-shaped     1
10       Moon        4            Moon-shaped     1
1        Dog         1            pets            1
2        Tiger       1            pets            1
3        Cat         1            pets            1
4        Apple       2            lunch           1
5        Sandwich    2            lunch           1
6        Banana      3            long food       1
7        Hot Dog     3            long food       1
8        Croissant   4            Moon-shaped     1

Then the dendrogram will look like the attached image, which makes no sense.

One way to correct this behavior is to add a sort_values("card_id") to the labels definition.

labels = df.loc[df["user_id"] == 1].sort_values("card_id")["card_label"].squeeze().to_list()

Note that it should also be changed in the example.ipynb.

get_cluster_labels:

It could look like this:

def get_cluster_labels(df, cluster_cards, print_results=True, return_df_results=False):
    if return_df_results:
        cluster_df = pd.DataFrame(columns=["user_id", "cluster_label", "cards"])

    user_ids = df["user_id"].unique()

    for id in user_ids:
        df_u = df.loc[df["user_id"] == id]
        cluster_label = _get_cluster_label_for_user(df_u, cluster_cards)
        if cluster_label is not None:
            if print_results:
                print("User " + str(id) + " labeled card(s): " + cluster_label)
            if return_df_results:
                cards = _get_cards_for_label(cluster_label, df_u)
                cluster_df = pd.concat(
                    [
                        cluster_df,
                        pd.DataFrame.from_records(
                            [
                                {
                                    "user_id": id,
                                    "cluster_label": cluster_label,
                                    "cards": cards,
                                }
                            ]
                        ),
                    ],
                    ignore_index=True,
                )
        else:
            if print_results:
                print("User " + str(id) + " did not cluster cards together.")

    if return_df_results:
        return cluster_df

# Check up front that all requested cards exist in the data, dropping any that do not:
if not set(cluster_cards) <= set(df["card_label"]):
    missing_card_labels = set(cluster_cards) - set(df["card_label"])
    print(
        f'"{missing_card_labels}" is/are not a valid card label. Removed from list.'
    )
    cluster_cards = [
        card_label
        for card_label in cluster_cards
        if card_label not in missing_card_labels
    ]
    print("Continue with cards: %s" % cluster_cards)

_get_cluster_label_for_user

It can be simplified with the "isin()" pandas method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html

    list_cat = df_u.loc[
        df_u["card_label"].isin(cluster_cards), "category_label"
    ].unique()
    if len(list_cat) == 1:
        return list_cat.squeeze()

Documentation

A “)” is missing at the end of the line starting with "distance_matrix", in the “Adapting the dendrogram” section.

Batalex commented 1 year ago

If I understood right, this is done retrospectively, e.g. I decide at some point that everything up to commit abcdef123456 shall be included in the new release

Yes, that would be the process

Batalex commented 1 year ago

Thanks @khynder, I knew you would like this package 🤗

@katoss, I have a few remarks as well. This is not a thorough review, but I like to give my opinion even as the editor.

katoss commented 1 year ago

Thanks a lot for your reviews @khynder and @Batalex! I will try to address them as soon as possible (it might take a bit because I'm traveling for work in the coming days and there are quite a few points to address).

I will start by fixing the CD pipeline so it does not disturb the review process. I will also start using pull requests, for easier review. I added you three as collaborators on the project so I can assign PRs to you (@Batalex and @khynder, please check your invites; @Robaina already accepted) :)

katoss commented 1 year ago

Ok @Batalex, thank you for the clarification. I have never worked with tags, but am trying to understand them.

Is it right that I would need to put this code in the CD pipeline:

on:
  push:
    tags:
      - 'v[0-9]+.[0-9]+.[0-9]+'

instead of

 # Only run this job if new work is pushed to the "main" branch
 if: github.event_name == 'push' && github.ref == 'refs/heads/main'

?

And then I can assign tags to commits in the command line? If I understood right, this is done retrospectively, e.g. I decide at some point that everything up to commit abcdef123456 shall be included in the new release, so I'd do git tag v1.0.0 abcdef123456, push the tag, and then the release should be triggered?

Just created a PR with these changes :)

Robaina commented 1 year ago

Hi @katoss, @khynder, @Batalex,

I checked the PR above and left some comments. I'm not entirely sure that the current solution will work as intended; I would appreciate it if you could take a look.

katoss commented 1 year ago

Just commented. And I second @Robaina: it would be great if you could have a look over our solution ideas, @Batalex and @khynder!

katoss commented 1 year ago

Just pushed a commit and commented on the PR :) Will have a look at the other feedback now.

katoss commented 1 year ago

Made another draft PR for some general formatting issues (linting, type hints, etc).

@Batalex and @khynder, could you accept my invite to be collaborators on this repo? Then I can add you as reviewers on PRs.

As an overview, as far as I can see the main blocks of feedback to address are CI/CD improvements, general formatting, function improvements, and documentation. I will hold off on further function improvements until the CI/CD and general formatting blocks (the two open PRs) are resolved, but I can address documentation independently.

Thanks again for your reviews, it's great to see the package progressing! :)

Robaina commented 1 year ago

Great to hear @katoss. Will have a look at the PR in the following days

katoss commented 1 year ago

Hey, I just made a PR for improvements of the documentation @Robaina @khynder @Batalex 🌟

And I converted the linting/code formatting PR from draft to ready for review; I think it should be ready to be merged.

Once these are merged I will start addressing the feedback from @khynder and @Batalex regarding functions 👩‍💻

katoss commented 1 year ago

Hi, I just started a draft PR for changes in functions based on the reviews of @khynder and @Batalex :)

It is quite a lot: I listed everything, but I have addressed only some issues so far and have questions about others. @khynder, could you accept the invitation to become a collaborator on cardsort, so I can assign you as a reviewer on the PR?

I also feel there might still be an issue with the CD trigger, since the CD workflow ran for every commit with patch-level changes in the new PR (see Actions).

Batalex commented 1 year ago

Hi, I just started a draft PR for changes in functions based on the reviews of @khynder and @Batalex :)

It is quite a lot: I listed everything, but I have addressed only some issues so far and have questions about others. @khynder, could you accept the invitation to become a collaborator on cardsort, so I can assign you as a reviewer on the PR?

I also feel there might still be an issue with the CD trigger, since the CD workflow ran for every commit with patch-level changes in the new PR (see Actions).

Can you try removing the pull_request trigger in CI.yml? According to the documentation, there are many possible sub-triggers for this one.

katoss commented 1 year ago

Can you try removing the pull_request trigger in CI.yml? According to the documentation, there are many possible sub-triggers for this one.

Thank you, just changed it in the fix-functions branch, let's see if it changes something.

Edit: I guess the change was not meant to impact the CD and was more of a general suggestion, but in any case the CD is still getting triggered. Probably the

release:
  types: [published]

does not work. I will try to think about another solution

katoss commented 1 year ago

I worked on the fix-functions PR, I think it might be good to go @khynder @Batalex

katoss commented 1 year ago

I merged the fix-functions PR 🚀, thank you for your help @khynder @Batalex. I also added the repo to Zenodo and added a badge, as you suggested @Robaina (on the update-readme branch so far). I'm also about to add a list of contributors and will try to figure out how to use the all-contributors bot in the coming days. I should probably also change the repo status from WIP to active?

I think I worked through all the feedback now 🤔 What are the next steps? :)

Batalex commented 1 year ago

Congrats @katoss, that was a big one!
Somehow I forgot to keep up, did you manage to fix the CD pipeline?

What are the next steps? :)

For now, I'll ask @Robaina and @khynder to check the final approval checkbox in their review message. Then, I will wrap up the review by checking some other stuff. It's great that you already took care of Zenodo and the badge!

katoss commented 1 year ago

Somehow I forgot to keep up, did you manage to fix the CD pipeline?

Yes, I did; I think I forgot to mention it. Now the CD only triggers when changes are pushed or merged into the main branch. So far it seems to work :)

For now, I'll ask @Robaina and @khynder to check the final approval checkbox in their review message. Then, I will wrap up the review by checking some other stuff. It's great that you already took care of zenodo and the badge!

Ok amazing!