pyOpenSci / software-submission

Submit your package for review by pyOpenSci here! If you have questions please post them here: https://pyopensci.discourse.group/

cardsort: analyzing data from open card sorting tasks #102

Closed: katoss closed this issue 11 months ago

katoss commented 1 year ago

Submitting Author: (@katoss)
All current maintainers: (@katoss)
Package Name: cardsort
One-Line Description of Package: A python package to analyse data from open card sorting tasks
Repository Link: https://github.com/katoss/cardsort
Version submitted: 0.2.2
Editor: @Batalex
Reviewer 1: @Robaina
Reviewer 2: @khynder
Archive: DOI
JOSS DOI: N/A
Version accepted: v0.2.36
Date accepted (month/day/year): 08/16/2023


Code of Conduct & Commitment to Maintain Package

Description

Card sorting is a standard research method in the human-computer interaction field to design information architectures. The cardsort package analyzes and visualizes data from open card sorting tasks. Using CSV data in the format returned by the kardsort tool (or any other tool outputting the same columns), it produces a dendrogram based on hierarchical cluster analysis and pairwise similarity scores. It can also return category labels, to show which labels study participants gave to combinations of cards from emerging clusters.
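A minimal usage sketch (the function names are those discussed later in this review; the exact signatures and the CSV file name are assumptions and may differ from the released API):

import pandas as pd
from cardsort import get_distance_matrix, create_dendrogram, get_cluster_labels

# kardsort-style export with columns: card_id, card_label, category_id, category_label, user_id
df = pd.read_csv("kardsort_export.csv")

distance_matrix = get_distance_matrix(df)             # pairwise (dis)similarity across participants
create_dendrogram(distance_matrix)                    # hierarchical cluster analysis + dendrogram plot
get_cluster_labels(df, cluster_cards=["Dog", "Cat"])  # labels participants gave to this card group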

Scope

Domain Specific & Community Partnerships

- [ ] Geospatial
- [ ] Education
- [ ] Pangeo

Community Partnerships

If your package is associated with an existing community please check below:

[^1]: Please fill out a pre-submission inquiry before submitting a data visualization package.

The target audience is researchers and practitioners in human-computer interaction and user experience. Open card sorting is a popular user research method used to understand how people group and conceptualize information when designing information architectures. To make sense of the data, clusters are often visualized in the form of a dendrogram. To do so, pairwise similarity scores are calculated for all cards, followed by a hierarchical cluster analysis. The cardsort package provides this functionality. It can also return category labels, so researchers can learn which labels study participants gave to combinations of cards in the emerging clusters.

I did not find any Python packages, or other open source tools, that accomplish this, which was my motivation to build this package while running a card sorting study during my PhD. While research articles describe the steps of the analysis (pairwise similarity scores --> hierarchical cluster analysis --> dendrogram; see the articles linked in the question above), I did not find any open source tools that help carry it out. I collected my data with the widely used kardsort.com tool, which points users to rather dated, closed-source Windows software for analysis, which reinforces my impression that open source tooling is lacking. The approach itself is fairly simple: scipy handles the hierarchical cluster analysis and dendrogram, a custom function creates the similarity scores, and everything is wrapped in an easy-to-use pipeline. Nevertheless, I think cardsort can help remove barriers to applying this user research method: it is easy to use even for Python beginners, and there is no need for closed-source software that only runs on Windows or for expensive subscription tools. In any case, I would have liked to have this package when I started my study :)
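For concreteness, here is a self-contained sketch of that pipeline using only pandas, numpy, and scipy. It illustrates the general approach described above, not the package's exact implementation; the toy data, the averaging of distances across participants, and the linkage method are all assumptions:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Toy data in the kardsort column format: one row per card per participant
df = pd.DataFrame({
    "card_id":        [1, 2, 3, 1, 2, 3],
    "card_label":     ["Dog", "Cat", "Apple", "Dog", "Cat", "Apple"],
    "category_label": ["pets", "pets", "food", "animals", "animals", "food"],
    "user_id":        [1, 1, 1, 2, 2, 2],
})

labels = ["Dog", "Cat", "Apple"]  # card labels in card_id order
n = len(labels)

# Pairwise distance: fraction of participants who placed two cards in different categories
dist = np.zeros((n, n))
for _, df_user in df.groupby("user_id"):
    cats = df_user.sort_values("card_id")["category_label"].to_numpy()
    dist += (cats[:, None] != cats[None, :]).astype(float)
dist /= df["user_id"].nunique()

# Hierarchical cluster analysis and dendrogram
Z = linkage(squareform(dist), method="average")
dendrogram(Z, labels=labels, orientation="left")
plt.show()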


Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication Options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option allows reviewers to open smaller issues that can then be linked to PRs, rather than submitting a denser, text-based review. It also allows you to demonstrate that you addressed an issue via PR links.

Confirm each of the following by checking the box.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

Batalex commented 1 year ago

Hey @katoss, I am Alex, and I will be acting as the editor-in-chief for this submission. I am looking forward to working with you 👋 Over this weekend, I will review the checklist below to see if we need to address anything before beginning the review process.

Batalex commented 1 year ago

Editor in Chief checks

Hi there! Thank you for submitting your package for pyOpenSci review. Below are the basic checks that your package needs to pass to begin our review. If some of these are missing, we will ask you to work on them before the review process begins.

Please check our Python packaging guide for more information on the elements below.



Editor comments

katoss commented 1 year ago

Thank you @Batalex ! I will try to address your comments in the course of the week. :)

katoss commented 1 year ago

Hi @Batalex, thank you very much for your feedback. I tried addressing everything, but have some open questions:

The version in pyproject.toml does not match the latest release. Is there any way to use semantic release to update the pyproject.toml file based on the tag? If not, maybe we should set the version here to something like "0.0.0" to indicate that this is not the true version.

Normally semantic release should bump the version automatically, but I keep running into issues. It only seems to work via getting the release version from the GitHub tag. I set the version to 0.0.0 for now, as you suggested. I will try to find a better solution, but for now at least the version is correctly updated on github and pypi.

This is on the verge of reviewing the code, so I will not go any further at the moment, but the dependency constraints are quite strict:

pandas = "^1.5.3"
numpy = "^1.24.2"
matplotlib = "^3.7.1"
scipy = "^1.10.1"

You could increase the pool of potential users by loosening the constraints. Based on what I saw in the sources, we do not need the latest versions of these four packages.

I am sorry if this is a stupid question, but how do I know which are the minimum versions of packages I can allow without breaking my code? I tried using pipdeptree, but did not find it very helpful, because it only tells me the dependencies of the currently installed package versions, not whether I could go lower.

  • myst-nb should be a dev dependency

True, thank you. Fixed it.

I believe the imports would be clearer if we just used the functionalities from the top module namespace, e.g. from cardsort import get_distance_matrix or import cardsort; cardsort.get_distance_matrix(), instead of propagating analysis all the way up to the user code. This is quite easy: you just need to import the functions in __init__.py.

I wasn't aware that this was possible. Thank you, fixed it!

Best,

@katoss

Batalex commented 1 year ago

I am sorry if this is a stupid question, but how do I know which are the minimum versions of packages I can allow without breaking my code?

This is an excellent question! And a truly difficult one to answer in Python, given the dynamic nature of the language. Maybe someday, a tool will be able to parse the source code and match it against the dependencies changelogs, but I am dreaming 🫠

If the dependencies were perfectly following the semver convention, you could use the latest major releases for each dependency. This is extreme in a practical setting, so I would not advise going so far.

You are using but a few basic numpy functions, so you should be fine with the same constraint as pandas. The 1.5.X series defines the following minimal dependencies: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.5.0.html#backwards-incompatible-api-changes. You should be good using:

pandas = "^1.5.0"
numpy = "^1.20.3"
matplotlib = "^3.3.2"
scipy = "^1.7.1"

Maybe we could go even lower with pandas, but this change is enough IMO. With the recent release of pandas 2.0.0, it might also be worth loosening the upper constraint so that users can use the latest version. Based on what I see, this should not be an issue for cardsort.
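One way to gain confidence in lower bounds like these is to run the test suite against the minimum pins, for example with a nox session (a sketch; the pins simply mirror the versions above, and the session name and Python version are arbitrary):

# noxfile.py
import nox


@nox.session(python="3.9")
def minimum_versions(session):
    """Run the test suite against the lowest allowed dependency versions."""
    session.install(
        "pandas==1.5.0",
        "numpy==1.20.3",
        "matplotlib==3.3.2",
        "scipy==1.7.1",
        "pytest",
    )
    # Install cardsort itself without letting pip upgrade the pinned dependencies
    session.install(".", "--no-deps")
    session.run("pytest")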

katoss commented 1 year ago

Thank you for the explanations! So it is really not that easy. For now, I reduced the constraints to the versions you indicated.

NickleDave commented 1 year ago

Hi @Batalex @katoss just adding one more possibly useful piece of info: this Scientific Python Ecosystem Coordination (SPEC) document specifies when packages should drop support for versions of core scientific Python packages, which includes your dependencies @katoss:
https://scientific-python.org/specs/spec-0000/

Unfortunately it does not say what is the minimum that you should support 🤷 but I basically go with the minimum for whatever Python is still supported -- right now that's Python 3.9 -> numpy 1.21.0, etc. ...

It looks like you ended up with roughly those versions as your lower bounds anyways.

Not to overwhelm you with info, but if you don't know about it already you might want to check out https://iscinumpy.dev/post/bound-version-constraints/. For now this probably won't matter (we won't get scipy version 2 anytime soon), but in general you might want to use loose lower bounds (e.g. pandas = ">=1.5.0") instead of the poetry-specific caret operator (pandas = "^1.5.0").

Batalex commented 1 year ago

Hey @katoss! You are not done with me yet, as I will be the editor for this review! 🙌 I will be recruiting reviewers in the coming days/weeks.

Batalex commented 1 year ago

:wave: Hi @robaina and @khynder! Thank you for volunteering to review for pyOpenSci! I am thrilled to have you both on board for this review. Feel free to introduce yourselves to our very own submitter @katoss 🤗

@katoss, meet @robaina, whom I had the pleasure of working with on his submission to pyOpenSci. As for @khynder, I have enjoyed working with her over the past five years and can vouch for her dedication and passion for science.

The following resources will help you complete your review:

  1. Here is the reviewers guide. This guide contains all of the steps and information needed to complete your review.
  2. Here is the review template that you will need to fill out and submit here as a comment, once your review is complete.

Please get in touch with any questions or concerns! Your review is due: on June 5th.

Reviewers: @robaina, @khynder Due date: June 5th, 2023

Robaina commented 1 year ago

Hi @katoss, nice to meet you! I'll be one of the reviewers of cardsort. I had such a nice experience when I submitted my package to pyOpenSci. Hope you do too! Looking forward to the review.

katoss commented 1 year ago

Nice to meet you @Robaina and @khynder! Thank you for taking the time to review my package :) Looking forward to working with you.

Robaina commented 1 year ago

Hi @Batalex, @katoss,

a quick question: according to the pyOpenSci docs, I understand that I should review cardsort v0.2.2 and not the latest release (v0.2.12), since this is the version that was submitted to pyOpenSci, right? Just to be sure.

Batalex commented 1 year ago

Hey, This is indeed a tricky question. On the one hand, it would be inconvenient to ask submitting authors to freeze the project development during submission. On the other hand, reviewing the complete code base is already very time-consuming for the reviewers; they do not have the time to review the whole code repeatedly. We are going to proceed with the following approach:

The rationale behind this approach is that we want to ensure the review scope stays more or less the same. No one wants to commit to a review on a fixed scope only to discover that the goalposts moved a few weeks later. Furthermore, some of the reviewers' comments might already be addressed in the development branch. The submitting author would then be the most likely to answer with something like "thanks for the comment. Are you satisfied with the solution we made a few days ago on ecb645 / PR # 32?"

Based on what I saw, the most recent releases deal with a few mishaps on the packaging side. Most of the code base remains untouched so this approach should be fine!

Robaina commented 1 year ago

Alright! Thanks for the explanation

Robaina commented 1 year ago

Hi @katoss, @Batalex,

⭐ I have completed the review ⭐

I find cardsort a nice contribution! You have put a lot of work into structuring and documenting the package; it has sufficient tests and a cool CI/CD pipeline 🚀🚀. I have some comments, mostly suggestions to make the code more efficient and readable. I hope these will be useful to you! Please let me know if something isn't clear from my comments. Also, I'm willing to help you implement these suggestions with PRs if you want.

Ok, here it goes:

CHECKLIST:

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

It is succinct but sufficient. However, I think the repo would benefit from a little more explanation for non-experts. This may be as easy as providing a link to a URL explaining why card sorting is useful in UX design.

The instructions are located within the 'CONTRIBUTING.md' file. I would add the link to this file directly in the Contributing section of the README.md file.

The documentation is extensive and hosted on readthedocs.io. I suggest providing the URL to the docs in the About section of the repo to increase visibility.

I suggest adding the link to the contributing file in the contributing section of the README.md file.

Readme file requirements The package meets the readme requirements below:

The README should include, from top to bottom:

The package name, 'cardsort', is missing from the title of the README file. Also, if you like, you could create a simple logo for the package to attract more users (see this repo as an example).

There are no badges in the README. You can read about badges here in case you are not familiar. Also, I can help with this if you need it.

As stated above, I suggest adding the URL to the About section of the repo to increase visibility.

No citation info was provided. I suggest publishing the repo on Zenodo and adding the citation and the DOI to the README.md file. Once published, Zenodo provides a badge containing the DOI, which you can display in the badge section of the README file. You could also add a CITATION.cff file to the repo so users can easily get the citation string.

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

While the purpose of this package is indicated in the README, I think a little more explanation for non-experts would be useful. This may be as easy as providing a link to a URL explaining why card sorting is useful in UX design.

Functionality

As far as I can tell, the package works as expected.

For packages also submitting to JOSS

Not applicable.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 3.5


Review Comments

General comments:

  1. While not necessary, I think using type hints would improve the readability of the functions. (Comments 1-3 are illustrated together in the short sketch after this list.)

  2. Inconsistent use of .query and .loc as selectors. I would recommend using .loc in all cases as the code is not using complex queries and .loc may be more readable for Python users.

  3. PEP 8 recommends using format strings instead of concatenation, as it improves readability. Overall, I would recommend using a formatter/linter such as black or ruff (highly recommended) to ensure the code follows PEP 8. The linter could also be added to the CI pipeline (let me know if you need help with this).

  4. There are several cases in which a while loop followed by a try-except block is used. This choice makes the code more complex than necessary. In these cases, it can be replaced by a simpler loop, removing the need for the try-except block (see below).

  5. `example-data.csv` is located within /docs; you could indicate the relative path in the example to help users find the file within the repo.
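To make comments 1-3 concrete, here is a small hypothetical helper (not taken from the cardsort source) that combines type hints, .loc selection, and an f-string:

import pandas as pd


def _get_user_data(df: pd.DataFrame, user_id: int) -> pd.DataFrame:
    """Return the rows belonging to a single participant (hypothetical helper)."""
    # Boolean-mask .loc instead of .query
    df_user = df.loc[df["user_id"] == user_id]
    # f-string instead of string concatenation
    print(f"User {user_id} sorted {len(df_user)} cards")
    return df_user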

Detailed comments for each function:

_get_distance_matrix_for_user
  1. There is no need to update the index variables i and j within the loop; their values are set by the for loop itself.

  2. Given that the distance matrix is symmetric, you could save computational time by only calculating the upper triangle of the matrix and then copying the values to the lower triangle. For example:

for i in range(n):
    for j in range(i, n):
        cat1 = ...
        cat2 = ...
        X[i, j] = X[j, i] = 0 if cat1 == cat2 else 1

Note that you could also write the if - else statement in a single line as shown above.

  1. Perhaps using loc instead of query would be more readable for Python users, we are not making a complex query here anyway.
get_distance_matrix
  1. You could replace the while loop with a for loop and avoid the need to update the index variable, id, within the loop. To this end, you could first find all unique user_ids in the dataframe: user_ids = df['user_id'].unique(), and then loop over them: for id in user_ids:.
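As a generic illustration of that pattern (the helper name is hypothetical and not part of cardsort):

import pandas as pd


def _iter_user_frames(df: pd.DataFrame):
    """Yield the sub-DataFrame for each participant, one at a time."""
    for id in df["user_id"].unique():
        yield df.loc[df["user_id"] == id]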
create_dendrogram
  1. You could assign a value to color_threshold in a single line as follows:
color_threshold = 0.75 if color_threshold is None else color_threshold

and similarly for the second definition, to simplify the nested if-else statements.

  1. You could also write the if-else statement in plt.xticks in a single line:
 plt.xticks(np.arange(0.0, 1.1, 0.1) if x_max <= 1 else np.arange(0, x_max + 1, 1))
_get_cluster_label_for_user
  1. It seems that index is not used within the function, hence enumerate and updating its value within the loop (index += 1) are not needed.

  2. The try-except block adds complexity and it isn't really needed. You could replace it with an if-else block:

if card in df_u["card_label"].values:
    ...
else:
    print(f'"{card}" is not a valid card label. Removed from list.')
    cluster_cards.remove(card)
    print("Continue with cards: %s" % cluster_cards)
_get_cards_for_label
  1. As mentioned above, I would recommend using loc.
get_cluster_labels
  1. The while loop and the try-except block add complexity and are not necessary here. You could replace the while loop with a for loop after listing all unique user_ids (similar to the suggestion above for get_distance_matrix). This way, the try-except block can be removed as we can guarantee that df_u is not empty (also getting rid of the UnboundLocalError which isn't specific and may be confusing)

  2. PEP-8 recommends using format strings instead of concatenation. For example, you could replace the following line:

print("User " + str(id) + " labeled card(s): " + cluster_label)

with:

print(f"User {id} labeled card(s): {cluster_label}")
get_cluster_labels_df
  1. Similarly to the previous function, I would recommend using a for loop instead of a while loop and removing the try-except block to simplify the code and improve readability.
katoss commented 1 year ago

Thanks a lot for your review @Robaina ! I had a quick look over your comments, and will try to address them over the weekend :)

Batalex commented 1 year ago

That was fast 🚀 Thanks a bunch, @Robaina!

katoss commented 1 year ago

Hi @Robaina, thanks a lot again for the thorough review! 🌟

I have started to address your comments, mainly so far by adapting the functions:

Thank you for helping me improve the code quality! Please let me know if I managed to adapt the functions as you imagined.

Regarding the rest of the feedback, I will try to address it as soon as possible.

katoss commented 1 year ago

I started to wonder about a more general question regarding versioning. My CI/CD pipeline publishes a new version with each commit. Yesterday, for example, I went from 0.2.12 to 0.2.20 by making a commit for each change (since I did not want to make the commits too big). Based on your experience @Robaina @Batalex, is this the usual way to go, or is it too much?

Robaina commented 1 year ago

Hi @Robaina, thanks a lot again for the thorough review! 🌟

I have started to address your comments, mainly so far by adapting the functions:

  • _get_distance_matrix_for_user (commits: 0c3b4dd)
  • get_distance_matrix (commits: 269a52a, e6be892)
  • create_dendrogram (commits: 8fc8679)
  • _get_cluster_label_for_user (commits: 7bbdb84)
  • _get_cards_for_label (commits: b882486)
  • get_cluster_labels (commits: 031ae0d)
  • get_cluster_labels_df (commits: 2ed2dba)
  • I also added the link to the docs to the about section of the repository

Thank you for helping me improve the code quality! Please let me know if I managed to adapt the functions as you imagined.

Regarding the rest of the feedback, I will try to address it as soon as possible.

Great! Glad my comments were useful! I'll take a deeper look at them later and come back to you.

Robaina commented 1 year ago

I started to wonder about a more general question regarding versioning. My CI/CD pipeline publishes a new version with each commit. Yesterday, for example, I went from 0.2.12 to 0.2.20 by making a commit for each change (since I did not want to make the commits too big). Based on your experience @Robaina @Batalex, is this the usual way to go, or is it too much?

Based on my experience I would say this is too much; I would update the version only after a significant change to the codebase. Of course, "significant" is subjective. Here are some guidelines for semantic versioning.

On the other hand, I actually forgot to ask you about your CI/CD pipeline the other day. I observed that currently the package is built and uploaded to PyPI on every push to the main branch. I am by no means an expert in DevOps, but I think that if you are directly pushing every commit to main, then building the package and uploading it to PyPI each time is perhaps too much. I would do this only after collecting some changes that address a specific issue (or issues). This workflow would be fine if there were another branch, say dev (for development), meant to collect changes, with the CD workflow triggered on merging dev into main.

Definitely @Batalex will know more about this.

katoss commented 1 year ago

Ok, thank you! Then my intuition wasn't wrong. I followed this tutorial to create the package and CI/CD, and did not question the way they did it, since it was the first time for me setting all of this up. Is it possible to trigger the whole CD manually so it only publishes a new version when I want it to? Or maybe the easiest thing would be that I create a development branch and collect the changes there, as you said.

Robaina commented 1 year ago

Perhaps the easiest solution is to trigger the PyPI upload only on release, i.e. automatically after a new release of the repo (with a new version number). I could do a PR with this if you like.

katoss commented 1 year ago

@Robaina please go ahead if you like, that'd be great!

Batalex commented 1 year ago

Or maybe the easiest thing would be that I create a development branch and collect the changes there, as you said.

This is the approach I usually use. I think however that Robaina's approach is easier to implement in this case. The main change would be to run the GH workflow cd on tags rather than on every commit (and letting semantic-release create a new tag anyway).

It might look like this

on:
  push:
    tags:
      - 'v[0-9]+.[0-9]+.[0-9]+'
Robaina commented 1 year ago

Interesting, I was thinking of using:

on:
  release:
    types: [published]

This should trigger the workflow only after a new release is created. Either way should work, though!

Batalex commented 1 year ago

Both approaches should work. The "release published" one would work on all kinds of releases, whereas the tags one is limited by the regexp. The latter has the benefit of being actionable with basic git commands:

git tag x.y.z
git push origin x.y.z

In both cases, we would need to run semantic-release ourselves to get the proper "x.y.z" to use

Robaina commented 1 year ago

Hi @katoss ,

great work! I reviewed your commits and left some minor comments. Only one is related to efficiency, in _get_distance_matrix_for_user: you forgot to iterate only over the upper triangle.

I suggest that you use GitHub issues and pull requests in future interactions, it's easier to get code reviewed by a peer this way. Also, the code can be reviewed before a merge is done (for instance from a dev branch to main).

Cheers

katoss commented 1 year ago

Only one is related to efficiency, in _get_distance_matrix_for_user: you forgot to iterate only over the upper triangle.

I suggest that you use GitHub issues and pull requests in future interactions, it's easier to get code reviewed by a peer this way. Also, the code can be reviewed before a merge is done (for instance from a dev branch to main).

Thank you @Robaina ! I fixed the function in a new branch. I invited you and @Batalex as collaborators so I can assign you as reviewers to a pull request

katoss commented 1 year ago

And with regard to the CD, the line that will have to be changed is this one, and the rest can stay the same?

# Only run this job if new work is pushed to the "main" branch
if: github.event_name == 'push' && github.ref == 'refs/heads/main'

Batalex commented 1 year ago

And with regard to the CD, the line that will have to be changed is this one, and the rest can stay the same?

With this configuration, you would make a new release each time main is updated. You might want to restrict releases to git tags or GitHub releases (which are based on git tags).

katoss commented 1 year ago

Ok @Batalex, thank you for the clarification. I have never worked with tags, but am trying to understand them.

Is it right that I would need to put this code in the CD pipeline:

on:
  push:
    tags:
      - 'v[0-9]+.[0-9]+.[0-9]+'

instead of

 # Only run this job if new work is pushed to the "main" branch
 if: github.event_name == 'push' && github.ref == 'refs/heads/main'

?

And then I can assign tags to commits in the command line? If I understood right, this is done retrospectively, e.g. I decide at some point that everything up to commit abcdef123456 shall be included in the new release, so I'd do git tag v1.0.0 abcdef123456, push the tag, and then the release should be triggered?

khynder commented 1 year ago

Hi @katoss @Batalex @Robaina !

Here is my review: First of all, I really like the idea behind this package; the use of a dendrogram to visualize the card sorting is intelligent and useful. The package is well documented and easy to read and use! Congrats 🎉

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme file requirements The package meets the readme requirements below:

The README should include, from top to bottom:

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that a badge for the pyOpenSci peer review will be provided upon acceptance.)

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

Functionality

For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 5 hours


Review Comments

General remarks:

It would be safer if a function were added to check that the data format is correct, especially if the package is used with data that does not come from kardsort.com. More specifically:
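For example, a minimal sketch of such a check (the helper name is hypothetical; the expected columns are the ones used throughout this review):

import pandas as pd

EXPECTED_COLUMNS = {"card_id", "card_label", "category_id", "category_label", "user_id"}


def _check_input_format(df: pd.DataFrame) -> None:
    """Raise a clear error if the input does not look like a kardsort-style export."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Input data is missing expected column(s): {sorted(missing)}")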

Function by function:

_get_distance_matrix_for_user:

It can be written as a list comprehension if you want (no obligation; if you feel more at ease with the current code, keep it 😊)

df_user = df_user.set_index("card_id")
X = np.array(
    [
        [
            df_user.loc[i + 1, "category_label"]
            != df_user.loc[j + 1, "category_label"]
            for i in range(n)
        ]
        for j in range(n)
    ]
).astype(int)

get_distance_matrix:

create_dendrogram:

when you define labels

labels = df.loc[df["user_id"] == 1]["card_label"].squeeze().to_list()

It is very important that the order of the card_labels in labels is exactly the same as the one used to create the distance_matrix. The distance_matrix is created with the card_ids in the order 1, 2, 3, 4, …, n. So labels should be: first the card_label corresponding to card_id = 1, then the card_label corresponding to card_id = 2, then the card_label corresponding to card_id = 3, etc. The above code line works fine if the card_ids for user_id 1 are in the order 1, 2, 3, etc., but does not work if the card_ids are shuffled. I'm not sure if I'm clear, so I tried with an example: if you replace the results for user_id 1 in test_data with:

card_id  card_label  category_id  category_label  user_id
9        Mooncake    4            Moon-shaped     1
10       Moon        4            Moon-shaped     1
1        Dog         1            pets            1
2        Tiger       1            pets            1
3        Cat         1            pets            1
4        Apple       2            lunch           1
5        Sandwich    2            lunch           1
6        Banana      3            long food       1
7        Hot Dog     3            long food       1
8        Croissant   4            Moon-shaped     1

Then the dendrogram will look like the attached image, which makes no sense.

One way to correct this behavior is to add a sort_values("card_id") to the labels definition.

labels = df.loc[df["user_id"] == 1].sort_values("card_id")["card_label"].squeeze().to_list()

Note that it should also be changed in the example.ipynb.

get_cluster_labels:

It could look like this:

def get_cluster_labels(df, cluster_cards, print_results=True, return_df_results=False):
    if return_df_results:
        cluster_df = pd.DataFrame(columns=["user_id", "cluster_label", "cards"])

    user_ids = df["user_id"].unique()

    for id in user_ids:
        df_u = df.loc[df["user_id"] == id]
        cluster_label = _get_cluster_label_for_user(df_u, cluster_cards)
        if cluster_label is not None:
            if print_results:
                print("User " + str(id) + " labeled card(s): " + cluster_label)
            if return_df_results:
                cards = _get_cards_for_label(cluster_label, df_u)
                cluster_df = pd.concat(
                    [
                        cluster_df,
                        pd.DataFrame.from_records(
                            [
                                {
                                    "user_id": id,
                                    "cluster_label": cluster_label,
                                    "cards": cards,
                                }
                            ]
                        ),
                    ],
                    ignore_index=True,
                )
        else:
            if print_results:
                print("User " + str(id) + " did not cluster cards together.")

    if return_df_results:
        return cluster_df

# Check up front that all requested cards exist in the data, dropping any that do not:
if not set(cluster_cards) <= set(df["card_label"]):
    missing_card_labels = set(cluster_cards) - set(df["card_label"])
    print(
        f'"{missing_card_labels}" is/are not a valid card label. Removed from list.'
    )
    cluster_cards = [
        card_label
        for card_label in cluster_cards
        if card_label not in missing_card_labels
    ]
    print("Continue with cards: %s" % cluster_cards)

_get_cluster_label_for_user

It can be simplified with the "isin()" pandas method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html

    list_cat = df_u.loc[
        df_u["card_label"].isin(cluster_cards), "category_label"
    ].unique()
    if len(list_cat) == 1:
        return list_cat.squeeze()

Documentation

A “)” is missing at the end of the line starting with "distance_matrix", in the “Adapting the dendrogram” section.

Batalex commented 1 year ago

If I understood right, this is done retrospectively, e.g. I decide at some point that everything up to commit abcdef123456 shall be included in the new release

Yes, that would be the process

Batalex commented 1 year ago

Thanks @khynder, I knew you would like this package 🤗

@katoss, I have a few remarks as well. This is not a thorough review, but I like to give my opinion even as the editor.

katoss commented 1 year ago

Thanks a lot for your reviews @khynder and @Batalex! I will try to address them as soon as possible (it might take a bit because I'm traveling for work in the coming days and there are quite a few points to address).

I will start by fixing the CD pipeline so it does not disturb the review process. I will also start using pull requests, for easier review. I added you three as collaborators on the project so I can assign PRs to you (@Batalex and @khynder, please check your invites; @Robaina already accepted) :)

katoss commented 1 year ago

Ok @Batalex, thank you for the clarification. I have never worked with tags, but am trying to understand them.

Is it right that I would need to put this code in the CD pipeline:

on:
  push:
    tags:
      - 'v[0-9]+.[0-9]+.[0-9]+'

instead of

 # Only run this job if new work is pushed to the "main" branch
 if: github.event_name == 'push' && github.ref == 'refs/heads/main'

?

And then I can assign tags to commits in the command line? If I understood right, this is done retrospectively, e.g. I decide at some point that everything up to commit abcdef123456 shall be included in the new release, so I'd do git tag v1.0.0 abcdef123456, push the tag, and then the release should be triggered?

Just created a PR with these changes :)

Robaina commented 1 year ago

Hi @katoss, @khynder, @Batalex,

I checked the PR above and left some comments. I'm not entirely sure that the current solution will work as intended; I would appreciate it if you could take a look.

katoss commented 1 year ago

Just commented. And I second @Robaina: it would be great if you could have a look over our solution ideas, @Batalex and @khynder!

katoss commented 1 year ago

Just pushed a commit and commented on the PR :) Will have a look at the other feedback now.

katoss commented 1 year ago

Made another draft PR for some general formatting issues (linting, type hints, etc).

@Batalex and @khynder, could you accept my invite to be collaborators on this repo? Then I can add you as reviewers on PRs.

As an overview, as far as I can see the main blocks of feedback to address are CI/CD improvements, general formatting, function improvements, and documentation. I will hold off on further function improvements until the CI/CD and general formatting blocks (the two open PRs) are resolved, but I can address documentation independently.

Thanks again for your reviews, it's great to see the package progressing! :)

Robaina commented 1 year ago

Great to hear @katoss. Will have a look at the PR in the following days

katoss commented 1 year ago

Hey, I just made a PR for improvements of the documentation @Robaina @khynder @Batalex 🌟

And I converted the linting/code formatting PR from draft to ready for review; I think it should be ready to be merged.

Once these are merged I will start addressing the feedback from @khynder and @Batalex regarding functions 👩‍💻

katoss commented 1 year ago

Hi, I just started a draft PR for changes in functions based on the reviews of @khynder and @Batalex :)

It is quite a lot: I listed everything, but I have addressed only some issues so far and have questions about others. @khynder, could you accept the invitation to become a collaborator on cardsort, so I can assign you as a reviewer on the PR?

I also feel there might still be an issue with the CD trigger, since the CD workflow ran for every commit with patch-level changes in the new PR (see Actions).

Batalex commented 1 year ago

Hi, I just started a draft PR for changes in functions based on the reviews of @khynder and @Batalex :)

It is quite a lot: I listed everything, but I have addressed only some issues so far and have questions about others. @khynder, could you accept the invitation to become a collaborator on cardsort, so I can assign you as a reviewer on the PR?

I also feel there might still be an issue with the CD trigger, since the CD workflow ran for every commit with patch-level changes in the new PR (see Actions).

Can you try removing the pull_request trigger in CI.yml? According to the documentation, there are many possible sub-triggers for this one.

katoss commented 1 year ago

Can you try removing the pull_request trigger in CI.yml? According to the documentation, there are many possible sub-triggers for this one.

Thank you, just changed it in the fix-functions branch, let's see if it changes something.

Edit: I guess the change was not meant to impact the CD and was more of a general suggestion, but in any case the CD is still getting triggered. Probably the

release:
  types: [published]

does not work. I will try to think about another solution

katoss commented 1 year ago

I worked on the fix-functions PR, I think it might be good to go @khynder @Batalex

katoss commented 1 year ago

I merged the fix-functions PR 🚀, thank you for your help @khynder @Batalex. I also added the repo to Zenodo and added a badge, as you suggested @Robaina (on the update-readme branch so far). I'm also about to add a list of contributors and will try to figure out how to use the all-contributors bot in the coming days. I should probably also change the repo status from WIP to active?

I think I worked through all the feedback now 🤔 What are the next steps? :)

Batalex commented 1 year ago

Congrats @katoss, that was a big one!
Somehow I forgot to keep up, did you manage to fix the CD pipeline?

What are the next steps? :)

For now, I'll ask @Robaina and @khynder to check the final approval checkbox in their review message. Then, I will wrap up the review by checking some other stuff. It's great that you already took care of Zenodo and the badge!

katoss commented 1 year ago

Somehow I forgot to keep up, did you manage to fix the CD pipeline?

Yes, I did; I think I forgot to mention it. Now the CD only triggers when changes are pushed or merged into the main branch. So far it seems to work :)

For now, I'll ask @Robaina and @khynder to check the final approval checkbox in their review message. Then, I will wrap up the review by checking some other stuff. It's great that you already took care of zenodo and the badge!

Ok amazing!