Closed katoss closed 11 months ago
Hey @katoss, I am Alex, and I will be acting as the editor-in-chief for this submission. I am looking forward to working with you 👋 Over this weekend, I will review the checklist below to see if we need to address anything before undergoing the review process.
Hi there! Thank you for submitting your package for pyOpenSci review. Below are the basic checks that your package needs to pass to begin our review. If some of these are missing, we will ask you to work on them before the review process begins.
Please check our Python packaging guide for more information on the elements below.
- `import package-name`
- `README.md` file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
- `CONTRIBUTING.md` file that details how to install and contribute to the package.
- Code of Conduct file.
- YAML header of the issue (located at the top of the issue template).

The version in `pyproject.toml` does not match the latest release. Is there any way to use semantic release to update the `pyproject.toml` file based on the tag? If not, maybe we should set the version here to something like "0.0.0" to indicate that this is not the true version.

This is on the verge of reviewing the code, so I will not go any further at the moment, but the dependency constraints are quite strict:

```toml
pandas = "^1.5.3"
numpy = "^1.24.2"
matplotlib = "^3.7.1"
scipy = "^1.10.1"
```
You could increase the pool of potential users by loosening the constraints. Based on what I saw in the sources, we do not need the latest versions for these four packages.
- `myst-nb` should be a dev dependency
I believe the import would be clearer if we just used the functionalities from the top module namespace, e.g. `from cardsort import get_distance_matrix` or `import cardsort; cardsort.get_distance_matrix()`, instead of propagating `analysis` all the way up to the user code. This is quite easy; you just need to import the functions in `__init__.py`.
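A minimal sketch of that re-export in `cardsort/__init__.py` (assuming, as the thread suggests, that the user-facing functions live in an internal `analysis` module; the exact list of function names here is illustrative):

```python
# cardsort/__init__.py
# Re-export the user-facing functions so callers can write
# `from cardsort import get_distance_matrix` directly.
from cardsort.analysis import (
    get_distance_matrix,
    create_dendrogram,
    get_cluster_labels,
)

__all__ = ["get_distance_matrix", "create_dendrogram", "get_cluster_labels"]
```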
Thank you @Batalex ! I will try to address your comments in the course of the week. :)
Hi @Batalex, thank you very much for your feedback. I tried addressing everything, but have some open questions:
> The version in `pyproject.toml` does not match the latest release. Is there any way to use semantic release to update the `pyproject.toml` file based on the tag? If not, maybe we should set the version here to something like "0.0.0" to indicate that this is not the true version.
Normally, semantic release should bump the version automatically, but I keep running into issues. It only seems to work by getting the release version from the GitHub tag. I set the version to 0.0.0 for now, as you suggested. I will try to find a better solution, but for now at least the version is correctly updated on GitHub and PyPI.
> This is on the verge of reviewing the code, so I will not go any further at the moment, but the dependency constraints are quite strict: `pandas = "^1.5.3"`, `numpy = "^1.24.2"`, `matplotlib = "^3.7.1"`, `scipy = "^1.10.1"`. You could increase the pool of potential users by loosening the constraints. Based on what I saw in the sources, we do not need the latest versions for these four packages.
I am sorry if this is a stupid question, but how do I know the minimum versions of packages I can allow without breaking my code? I tried using `pipdeptree`, but did not find it very helpful, because it only tells me the dependencies of the currently installed package versions, not whether I could use something lower.
> - `myst-nb` should be a dev dependency
True, thank you. Fixed it.
> I believe the import would be clearer if we just used the functionalities from the top module namespace, e.g. `from cardsort import get_distance_matrix` or `import cardsort; cardsort.get_distance_matrix()`, instead of propagating `analysis` all the way up to the user code. This is quite easy; you just need to import the functions in `__init__.py`.
I wasn't aware that this was possible. Thank you, fixed it!
Best,
@katoss
> I am sorry if this is a stupid question, but how do I know which are the minimum versions of packages I can allow without breaking my code?
This is an excellent question! And a truly difficult one to answer in Python, given the dynamic nature of the language. Maybe someday, a tool will be able to parse the source code and match it against the dependencies changelogs, but I am dreaming 🫠
If the dependencies were perfectly following the semver convention, you could use the latest major releases for each dependency. This is extreme in a practical setting, so I would not advise going so far.
You are using but a few basic numpy functions, so you should be fine with the same constraint as pandas.
The 1.5.X series defines the following minimal dependencies: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.5.0.html#backwards-incompatible-api-changes.
You should be good using:

```toml
pandas = "^1.5.0"
numpy = "^1.20.3"
matplotlib = "^3.3.2"
scipy = "^1.7.1"
```
Maybe we could go even lower with pandas, but this change is enough IMO.
With the recent release of pandas 2.0.0, it might be worth loosening the upper constraint so that users can use the latest version. Based on what I see, this should not be an issue for cardsort.
Thank you for the explanations! So it is really not that easy. For now, I reduced the constraints to the versions you indicated.
Hi @Batalex @katoss just adding one more possibly useful piece of info:
this Scientific Python Ecosystem Coordination (SPEC) document specifies when packages should drop support for versions of core scientific Python packages, which includes your dependencies @katoss:
https://scientific-python.org/specs/spec-0000/
Unfortunately it does not say what minimum you should support 🤷 but I basically go with the minimum for whatever Python version is still supported -- right now that's Python 3.9 -> numpy 1.21.0, etc.
It looks like you ended up with roughly those versions as your lower bounds anyways.
Not to overwhelm you with info, but if you don't know about it already you might want to check out: https://iscinumpy.dev/post/bound-version-constraints/. For now this probably won't matter (we won't get scipy version 2 anytime soon), but in general you might want to use loose lower bounds instead of the poetry-specific caret operator.
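In a Poetry `pyproject.toml`, that change amounts to replacing the caret constraints with plain lower bounds, e.g. (a sketch using the versions discussed above; the `[tool.poetry.dependencies]` placement is the usual Poetry convention):

```toml
[tool.poetry.dependencies]
# Caret pins an upper bound too: "^1.5.0" means ">=1.5.0,<2.0.0".
# Plain lower bounds leave the upper end open, so pandas 2.x is allowed.
pandas = ">=1.5.0"
numpy = ">=1.20.3"
matplotlib = ">=3.3.2"
scipy = ">=1.7.1"
```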
Hey @katoss! You are not done with me yet, as I will be the editor for this review! 🙌 I will be recruiting reviewers in the coming days/weeks.
:wave: Hi @robaina and @khynder! Thank you for volunteering to review for pyOpenSci! I am thrilled to have you both on board for this review. Feel free to introduce yourselves to our very own submitter @katoss 🤗
@katoss, meet @robaina, whom I had the pleasure of working with on his submission to pyOpenSci. As for @khynder, I have enjoyed working with her over the past five years and can vouch for her dedication and passion for science.
The following resources will help you complete your review:
Reviewers: @robaina, @khynder Due date: June 5th, 2023
Hi @katoss, nice to meet you! I'll be one of the reviewers of cardsort. I had such a nice experience when I submitted my package to pyOpenSci. Hope you do too! Looking forward to the review.
Nice to meet you @Robaina and @khynder! Thank you for taking the time to review my package :) Looking forward to working with you.
Hi @Batalex, @katoss,
a quick question: according to the pyOpenSci docs, I understand that I should review cardsort v0.2.2 and not the latest release (v0.2.12), since this is the version that was submitted to pyOpenSci, right? Just to be sure.
Hey, This is indeed a tricky question. On the one hand, it would be inconvenient to ask submitting authors to freeze the project development during submission. On the other hand, reviewing the complete code base is already very time-consuming for the reviewers; they do not have the time to review the whole code repeatedly. We are going to proceed with the following approach:
The rationale behind this approach is that we want to ensure the review scope stays more or less the same. No one wants to commit to a review of a fixed scope only to discover that the goalposts moved a few weeks later. Furthermore, some of the reviewers' comments might already be addressed in the development branch. The submitting author would then be most likely to answer with something like "thanks for the comment. Are you satisfied with the solution we made a few days ago in ecb645 / PR #32?"
Based on what I saw, the most recent releases deal with a few mishaps on the packaging side. Most of the code base remains untouched so this approach should be fine!
Alright! Thanks for the explanation
Hi @katoss, @Batalex,
⭐ I have completed the review ⭐
I find cardsort a nice contribution! You have put a lot of work into structuring and documenting the package; it has sufficient tests and a cool CI/CD pipeline 🚀🚀. I have some comments, mostly suggestions to make the code more efficient and readable. I hope these will be useful to you! Please let me know if something isn't clear from my comments. Also, I'm willing to help you implement these suggestions with PRs if you want.
Ok, here it goes:
CHECKLIST:
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
It is succinct but sufficient. However, I think the repo would benefit from a little more explanation for non-experts. This may be as easy as providing a link to a URL explaining why card sorting is useful in UX design.
The instructions are located within the 'CONTRIBUTING.md' file. I would add the link to this file directly in the Contributing section of the README.md file.
[x] Vignette(s) demonstrating major functionality that runs successfully locally.
[x] Function Documentation: for all user-facing functions.
The documentation is extensive and hosted on readthedocs.io. I suggest providing the URL to the docs in the About section of the repo to increase visibility.
[x] Examples for all user-facing functions.
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
I suggest adding the link to the contributing file in the contributing section of the README.md file.
`pyproject.toml` file or elsewhere.

Readme file requirements: The package meets the readme requirements below:
The README should include, from top to bottom:
The package name, 'cardsort', is missing from the title of the README file. Also, if you like, you could create a simple logo for the package to attract more users (see this repo as an example).
There are no badges in the README. You can read about badges here in case you are not familiar. Also, I can help with this if you need it.
As stated above, I suggest adding the URL to the About section of the repo to increase visibility.
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
[x] Citation information
No citation info was provided. I suggest publishing the repo on Zenodo and adding the citation and the DOI to the README.md file. Once published, Zenodo provides a badge containing the DOI, which you can display in the badge section of the README file. You could also add a CITATION.cff file to the repo so users can easily get the citation string.
Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:
[x] Package documentation is clear and easy to find and use.
[x] The need for the package is clear
While the purpose of this package is indicated in the README, I think a little more explanation for non-experts would be useful. This may be as easy as providing a link to a URL explaining why card sorting is useful in UX design.
[x] All functions have documentation and associated examples for use
[x] The package is easy to install
As far as I can tell, the package works as expected.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
[x] Package supports modern versions of Python and not End of life versions.
[x] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)
There are no linting tests in the CI. Also, the original code does not follow PEP8 in some cases (see below). I suggest using a linter such as black or ruff (highly recommended) to ensure the code follows PEP8.
Not applicable.
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a `paper.md` matching JOSS's requirements with:
Estimated hours spent reviewing: 3.5
General comments:
While not necessary, I think using type hints would improve the readability of the functions.
Inconsistent use of `.query` and `.loc` as selectors. I would recommend using `.loc` in all cases, as the code is not using complex queries and `.loc` may be more readable for Python users.
PEP-8 recommends using format strings instead of concatenation as it improves readability. Overall, I would recommend using a linter such as black or ruff (highly recommended) to ensure the code follows PEP8. The linter could also be added to a CI pipeline (let me know if you need help with this)
There are several cases in which a `while` loop followed by a `try-except` block is used. This choice makes the code more complex, and it is unnecessary. In these cases, it can be replaced by a simpler loop, removing the need for the `try-except` block (see below).
`example-data.csv` is located within `/docs`; you could indicate the relative path in the example to help users find the file within the repo.
Detailed comments for each function:
`_get_distance_matrix_for_user`

There is no need to update the index variables, `i, j`, within the loop; their values are set by the for loop itself.
Given that the distance matrix is symmetric, you could save computational time by only calculating the upper triangle of the matrix and then copying the values to the lower triangle. For example:
```python
for i in range(n):
    for j in range(i, n):
        cat1 = ...
        cat2 = ...
        X[i, j] = X[j, i] = 0 if cat1 == cat2 else 1
```
Note that you could also write the if - else statement in a single line as shown above.
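Filled in with toy data, the upper-triangle approach might look like this (a runnable sketch; the category list is hypothetical, standing in for one user's `category_label` values in `card_id` order):

```python
import numpy as np

# Hypothetical category label for each card, in card_id order
categories = ["pets", "pets", "lunch", "Moon-shaped"]
n = len(categories)

X = np.zeros((n, n), dtype=int)
for i in range(n):
    # j starts at i: only the upper triangle is computed,
    # and each value is mirrored into the lower triangle
    for j in range(i, n):
        X[i, j] = X[j, i] = 0 if categories[i] == categories[j] else 1
```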
Using `loc` instead of `query` would be more readable for Python users; we are not making a complex query here anyway.

`get_distance_matrix`

You could first collect the unique user ids, `user_ids = df['user_id'].unique()`, and then loop over them: `for id in user_ids:`.

`create_dendrogram`

You could define `color_threshold` in a single line as follows: `color_threshold = 0.75 if color_threshold is None else color_threshold`, and similarly for the second definition, to simplify the nested if-else statements.

The same goes for `plt.xticks`, in a single line: `plt.xticks(np.arange(0.0, 1.1, 0.1) if x_max <= 1 else np.arange(0, x_max + 1, 1))`.
`_get_cluster_label_for_user`

It seems that `index` is not used within the function, hence `enumerate` and updating its value within the loop (`index += 1`) are not needed.
The try-except block adds complexity and isn't really needed. You could replace it with an if-else block:

```python
if card in df_u["card_label"].values:
    ...
else:
    print(f'"{card}" is not a valid card label. Removed from list.')
    cluster_cards.remove(card)
    print("Continue with cards: %s" % cluster_cards)
```
`_get_cards_for_label`

`get_cluster_labels`

The while loop and the try-except block add complexity and are not necessary here. You could replace the while loop with a for loop after listing all unique user_ids (similar to the suggestion above for `get_distance_matrix`). This way, the try-except block can be removed, as we can guarantee that `df_u` is not empty (also getting rid of the `UnboundLocalError`, which isn't specific and may be confusing).
PEP-8 recommends using format strings instead of concatenation. For example, you could replace the following line:

```python
print("User " + str(id) + " labeled card(s): " + cluster_label)
```

with:

```python
print(f"User {id} labeled card(s): {cluster_label}")
```
get_cluster_labels_df
Thanks a lot for your review @Robaina ! I had a quick look over your comments, and will try to address them over the weekend :)
That was fast 🚀 Thanks a bunch, @Robaina!
Hi @Robaina, thanks a lot again for the thorough review! 🌟
I have started to address your comments, mainly so far by adapting the functions:
- `_get_distance_matrix_for_user` (commit 0c3b4dd)
- `get_distance_matrix` (commits 269a52a, e6be892)
- `create_dendrogram` (commit 8fc8679)
- `_get_cluster_label_for_user` (commit 7bbdb84)
- `_get_cards_for_label` (commit b882486)
- `get_cluster_labels` (commit 031ae0d)
- `get_cluster_labels_df` (commit 2ed2dba)

I also added the link to the docs to the About section of the repository.

Thank you for helping me improve the code quality! Please let me know if I managed to adapt the functions as you imagined.
Regarding the rest of the feedback, I will try to address it as soon as possible.
I started to wonder about a more general question regarding versioning. My CI/CD pipeline publishes a new version with each commit. Yesterday, for example, I went from 0.2.12 to 0.2.20 by making a commit for each change (since I did not want to make the commits too big). Based on your experience @Robaina @Batalex, is this the usual way to go about it, or is it too much?
Great! Glad my comments were useful! I'll take a deeper look at them later and come back to you.
> I started to wonder about a more general question regarding versioning. My CI/CD pipeline publishes a new version with each commit. Based on your experience @Robaina @Batalex, is this the usual way to go about it, or is it too much?
Based on my experience I would say this is too much, I would update the version only after a significant change to the codebase. Of course this "significant" is subjective. Here are some guidelines for semantic versioning.
On the other hand, I actually forgot to ask you about your CI/CD pipeline the other day. I observed that currently the package is built and uploaded to PyPI on every push to the main branch. I am by no means an expert in DevOps, but I think that if you are directly pushing every commit to main, then building the package and uploading it to PyPI is perhaps too much. I would do this only after collecting some changes that address a specific issue (or issues). This workflow would be fine if there were another branch, say, dev (for development), meant to collect changes; the CD workflow would then get triggered on merging dev into main.
Definitely @Batalex will know more about this.
Ok, thank you! Then my intuition wasn't wrong. I followed this tutorial to create the package and CI/CD, and did not question the way they did it, since it was the first time for me setting all of this up. Is it possible to trigger the whole CD manually so it only publishes a new version when I want it to? Or maybe the easiest thing would be that I create a development branch and collect the changes there, as you said.
Perhaps the easiest solution is to trigger the PyPI upload only on release, i.e. automatically after a new release of the repo (with a new version number). I could do a PR with this if you like.
@Robaina please go ahead if you like, that'd be great!
> Or maybe the easiest thing would be that I create a development branch and collect the changes there, as you said.
This is the approach I usually use. I think, however, that Robaina's approach is easier to implement in this case. The main change would be to run the GH workflow `cd` on tags rather than on every commit (and letting semantic-release create a new tag anyway).
It might look like this:

```yaml
on:
  push:
    tags:
      - 'v[0-9]+.[0-9]+.[0-9]+'
```
Interesting, I was thinking of using:

```yaml
on:
  release:
    types: [published]
```

This should trigger the workflow only after a new release is created. Either way should work, though!
Both approaches should work. The "release published" one would work on all kinds of releases, whereas the tags one is limited by the regexp. The latter has the benefit of being actionable with basic git commands:

```shell
git tag x.y.z
git push origin x.y.z
```

In both cases, we would need to run semantic-release ourselves to get the proper "x.y.z" to use.
Hi @katoss ,
great work! I reviewed your commits and left some minor comments. Only one is related to efficiency, in `_get_distance_matrix_for_user`: you forgot to iterate only over the upper triangle.
I suggest that you use GitHub issues and pull requests in future interactions; it's easier to get code reviewed by a peer this way. Also, the code can be reviewed before a merge is done (for instance from a dev branch to main).
Cheers
> Only one is related to efficiency, in `_get_distance_matrix_for_user`: you forgot to iterate only over the upper triangle.
>
> I suggest that you use GitHub issues and pull requests in future interactions; it's easier to get code reviewed by a peer this way. Also, the code can be reviewed before a merge is done (for instance from a dev branch to main).
Thank you @Robaina ! I fixed the function in a new branch. I invited you and @Batalex as collaborators so I can assign you as reviewers to a pull request
And with regards to the CD, is this the line that has to be changed, with the rest staying the same?

```yaml
# Only run this job if new work is pushed to the "main" branch
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
```
> And with regards to the CD, is this the line that has to be changed, with the rest staying the same?
With this configuration, you would make a new release each time `main` is updated. You might want to restrict releases to git tags or GitHub releases (based on git tags).
Ok @Batalex, thank you for the clarification. I have never worked with tags, but am trying to understand them.
Is it right that I would need to put this code in the CD pipeline:
```yaml
on:
  push:
    tags:
      - 'v[0-9]+.[0-9]+.[0-9]+'
```
instead of
```yaml
# Only run this job if new work is pushed to the "main" branch
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
```
?
And then I can assign tags to commits on the command line? If I understood right, this is done retrospectively, e.g. I decide at some point that everything up to commit `abcdef123456` shall be included in the new release, so I'd do `git tag v1.0.0 abcdef123456`, push the tag, and then the release should be triggered?
Hi @katoss @Batalex @Robaina !
Here is my review. First of all, I really like the idea behind this package; the use of a dendrogram to visualize the card sorting is intelligent and useful. The package is well documented and easy to read and use! Congrats 🎉.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
`pyproject.toml` file or elsewhere.

Readme file requirements: The package meets the readme requirements below:
The README should include, from top to bottom:
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is tall. (Note that a badge for pyOpenSci peer review will be provided upon acceptance.)
Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a `paper.md` matching JOSS's requirements with:
Estimated hours spent reviewing: 5 hours
It would be safer if a function was added to check that the data format is correct, especially if the package is used with data which does not come from kardsort.com. More specifically:
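Such a check might look like this (a sketch only: the function name `check_data_format` is hypothetical, and the required columns are taken from the example table shown later in this review):

```python
import pandas as pd

# Columns expected in kardsort-style data (taken from the example table below)
REQUIRED_COLUMNS = {"card_id", "card_label", "category_id", "category_label", "user_id"}


def check_data_format(df: pd.DataFrame) -> None:
    """Raise a ValueError if the dataframe does not match the expected format."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required column(s): {sorted(missing)}")
    if df[["card_id", "user_id"]].isna().any().any():
        raise ValueError("card_id and user_id must not contain missing values")
```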
It can be written as a list comprehension if you want (no obligation; if you feel more at ease with the current code, keep it 😊):
```python
df_user = df_user.set_index("card_id")
X = np.array(
    [
        [
            df_user.loc[i + 1, "category_label"]
            != df_user.loc[j + 1, "category_label"]
            for i in range(n)
        ]
        for j in range(n)
    ]
).astype(int)
```
Regarding `print` in code: if you have a lot of users in the database, it could slow down the execution. You can add a Boolean parameter `print_progression` or use loggers (@Batalex could explain loggers better than me, I think).

The computation requires `user_id = 1` to be present, otherwise `distance_matrix_all` won't exist. It can stay that way, with a check at the beginning of the function to stop the computation if the first user_id is not 1. Or you can create, before the for loop, a variable `distance_matrix_all = None` and replace the `id == 1` condition with a `distance_matrix_all is not None` condition.

When you define `labels`:

```python
labels = df.loc[df["user_id"] == 1]["card_label"].squeeze().to_list()
```

It is very important that the order of the card_labels in `labels` is exactly the same as the one used to create the distance_matrix. The distance_matrix is created with the card_ids in the order 1, 2, 3, 4, ..., n. So `labels` should be: first the card_label corresponding to card_id = 1, then the card_label corresponding to card_id = 2, then the card_label corresponding to card_id = 3, etc. The above code line works fine if the card_ids for user_id 1 are in the order 1, 2, 3, etc., but does not work if the card_ids are shuffled. I'm not sure if I'm clear, so I tried with an example: if you replace the user_id 1 results in test_data with:
card_id | card_label | category_id | category_label | user_id
---|---|---|---|---
9 | Mooncake | 4 | Moon-shaped | 1 |
10 | Moon | 4 | Moon-shaped | 1 |
1 | Dog | 1 | pets | 1 |
2 | Tiger | 1 | pets | 1 |
3 | Cat | 1 | pets | 1 |
4 | Apple | 2 | lunch | 1 |
5 | Sandwich | 2 | lunch | 1 |
6 | Banana | 3 | long food | 1 |
7 | Hot Dog | 3 | long food | 1 |
8 | Croissant | 4 | Moon-shaped | 1 |
Then the resulting dendrogram [image not shown here] makes no sense.
One way to correct this behavior is to add a `sort_values("card_id")` to the labels definition:

```python
labels = df.loc[df["user_id"] == 1].sort_values("card_id")["card_label"].squeeze().to_list()
```

Note that it should also be changed in the `example.ipynb`.
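A toy frame makes the difference visible (a sketch; the data follows the shuffled example above, and `.squeeze()` is dropped for brevity):

```python
import pandas as pd

# card_ids for user 1 arrive shuffled, as in the example above
df = pd.DataFrame({
    "card_id": [9, 1, 2],
    "card_label": ["Mooncake", "Dog", "Tiger"],
    "user_id": [1, 1, 1],
})

# Without sorting, labels follow row order: ['Mooncake', 'Dog', 'Tiger']
unsorted_labels = df.loc[df["user_id"] == 1]["card_label"].to_list()

# With sort_values("card_id"), labels align with the card_id order
# used to build the distance matrix: ['Dog', 'Tiger', 'Mooncake']
labels = (
    df.loc[df["user_id"] == 1].sort_values("card_id")["card_label"].to_list()
)
```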
`get_cluster_labels` and `get_cluster_labels_df`: these two functions can be factorized into one function, with two boolean parameters to indicate what type of return you want, for example `print_results` and `return_df_results`. It could look like this:
```python
def get_cluster_labels(df, cluster_cards, print_results=True, return_df_results=False):
    if return_df_results:
        cluster_df = pd.DataFrame(columns=["user_id", "cluster_label", "cards"])
    user_ids = df["user_id"].unique()
    for id in user_ids:
        df_u = df.loc[df["user_id"] == id]
        cluster_label = _get_cluster_label_for_user(df_u, cluster_cards)
        if cluster_label is not None:
            if print_results:
                print("User " + str(id) + " labeled card(s): " + cluster_label)
            if return_df_results:
                cards = _get_cards_for_label(cluster_label, df_u)
                cluster_df = pd.concat(
                    [
                        cluster_df,
                        pd.DataFrame.from_records(
                            [
                                {
                                    "user_id": id,
                                    "cluster_label": cluster_label,
                                    "cards": cards,
                                }
                            ]
                        ),
                    ],
                    ignore_index=True,
                )
        else:
            if print_results:
                print("User " + str(id) + " did not cluster cards together.")
    if return_df_results:
        return cluster_df
```
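As mentioned earlier in this review, the `print` calls could also go through the standard `logging` module instead of a `print_results` flag; a minimal sketch (the logger name `cardsort` and the helper function are assumptions, not the package's actual API):

```python
import logging

# Users can silence or redirect these messages via their logging configuration
logger = logging.getLogger("cardsort")


def report_cluster_label(user_id, cluster_label):
    # Hypothetical helper standing in for the print() calls above
    if cluster_label is not None:
        logger.info("User %s labeled card(s): %s", user_id, cluster_label)
    else:
        logger.info("User %s did not cluster cards together.", user_id)
```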
The check on missing card labels can be done once for the whole dataframe instead of at the `user_id` level:

```python
if not set(cluster_cards) <= set(df["card_label"]):
    missing_card_labels = set(cluster_cards) - set(df["card_label"])
    print(
        f'"{missing_card_labels}" is/are not a valid card label. Removed from list.'
    )
    cluster_cards = [
        card_label
        for card_label in cluster_cards
        if card_label not in missing_card_labels
    ]
    print("Continue with cards: %s" % cluster_cards)
```
It can be simplified with the `isin()` pandas method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html

```python
list_cat = df_u.loc[
    df_u["card_label"].isin(cluster_cards), "category_label"
].unique()
if len(list_cat) == 1:
    return list_cat.squeeze()
```
A ")" is missing at the end of the line starting with "distance_matrix", in the "Adapting the dendrogram" section.
> If I understood right, this is done retrospectively, e.g. I decide at some point that everything up to commit `abcdef123456` shall be included in the new release
Yes, that would be the process
Thanks @khynder, I knew you would like this package 🤗
@katoss, I have a few remarks as well. This is not a thorough review, but I like to give my opinion even as the editor.
- `__all__` should go below. I would advise, as suggested by @Robaina, using a code formatter and a linter.
- `_get_distance_matrix_for_user` could be rewritten using scipy's `pdist`, but it is worth checking whether that rewrite or the upper-triangular computation is more efficient.
- `get_cluster_labels`, `get_distance_matrix` and `get_cluster_labels_df` use the form `id = 1; while id <= max(df.user.id)`. We can make this more robust by using a for loop over the values in the dataframe. The original version would fail for string ids, or even if a value is missing, e.g. `[1, 2, 4, 5]`. Since the value `1` is used to initialize the distance matrix, you may add a boolean condition (to be changed in the first iteration), or you may create a matrix with `np.zeros` before the for loop. I like using `for user_id, sub_df in df.groupby("user_id"):`; you are guaranteed to loop over all `user_id` values in the dataframe.
- `get_cluster_labels` could be cut down by checking the values of `cluster_cards` against the possible values before entering the loop. It is quite risky to mutate a data structure (especially removing values) while using it in a loop, and it is usually considered a bad practice.

Thanks a lot for your reviews @khynder and @Batalex! I will try to address them as soon as possible (it might take a bit, because I'm traveling for work in the next days and there are quite some points to address).
I will start with fixing the CD pipeline, so it does not disturb the review process. Also I will start using pull requests, for easier review. I added you three as collaborators for the project, so I can assign PRs to you (@Batalex and @khynder please check your invites, @Robaina already accepted) :)
Just created a PR with these changes :)
Hi @katoss, @khynder, @Batalex,
I checked the PR above. I left some comments. I'm not entirely sure that the current solution will work as intended. Would appreciate it if you take a look.
Just commented. And I second @Robaina, would be great if you could have a look over our solution ideas, @Batalex and @khynder !
Just pushed a commit and commented on the PR :) Will have a look at the other feedback now.
Made another draft PR for some general formatting issues (linting, type hints, etc).
@Batalex and @khynder, could you accept my invite to be collaborator in this repo? Then I can add you as reviewers to PRs.
So as an overview, as far as I see the main blocks of feedback to address are the CI/CD improvements, general formatting, improvement of functions, and documentation. I think I will wait with further function improvements until CI/CD and general formatting blocks (the two open PRs) are fixed, but will be able to address documentation quite independently.
Thanks again for your reviews, great to see the package progressing! :)
Great to hear @katoss. Will have a look at the PR in the following days
Hey, I just made a PR for improvements of the documentation @Robaina @khynder @Batalex 🌟
And I converted the linting/code formatting PR from draft to PR, I think it should be ready to be merged.
Once these are merged I will start addressing the feedback from @khynder and @Batalex regarding functions 👩💻
Hi, I just started a draft PR with changes to the functions based on the reviews of @khynder and @Batalex :)
It is quite a lot; I listed everything, but have only addressed some of the issues so far, and have questions about others. @khynder, could you accept the invitation to become a collaborator on cardsort, so I can assign you as a reviewer to the PR?
I also feel there might still be an issue with the CD trigger, since the CD ran for every commit with patch-level changes in the new PR (see Actions).
Can you try removing the `pull_request` trigger in CI.yml? According to the documentation, there are many possible sub-triggers for this one.
Thank you, just changed it in the fix-functions branch, let's see if it changes something.
Edit: I guess the change was not meant to affect the CD and was more of a general suggestion; in any case, the CD is still getting triggered. Probably the `release: types: [published]` trigger does not work as expected. I will try to think of another solution.
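For reference, the release-based trigger under discussion looks like this in GitHub Actions syntax (a sketch; the rest of the workflow file is omitted):

```yaml
# Run the CD job only when a GitHub release is published
on:
  release:
    types: [published]
```

One common reason such a trigger appears "not to work" is that the workflow file also keeps an unconditional `push` trigger alongside it, in which case the job still runs on every commit.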
I worked on the `fix-functions` PR, I think it might be good to go @khynder @Batalex
I merged the `fix-functions` PR 🚀, thank you for your help @khynder @Batalex.
I also added the repo to Zenodo and added a badge, as you suggested @Robaina (on the `update-readme` branch so far). I'm also about to add a list of contributors, and will try to figure out how to use the all-contributors bot in the next few days. I should probably also change the repo status from WIP to active?
I think I worked through all the feedback now 🤔 What are the next steps? :)
Congrats @katoss, that was a big one!
Somehow I forgot to keep up, did you manage to fix the CD pipeline?
> What are the next steps? :)
For now, I'll ask @Robaina and @khynder to check the final approval checkbox in their review message. Then, I will wrap up the review by checking some other stuff. It's great that you already took care of zenodo and the badge!
> Somehow I forgot to keep up, did you manage to fix the CD pipeline?
Yes, I did, I think I forgot to mention it. Now the CD only triggers when changes are pushed or merged into the main branch. So far it seems to work :)
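The behavior described (CD running only on pushes or merges to the main branch) typically corresponds to a trigger along these lines (a sketch, not the project's actual workflow file):

```yaml
on:
  push:
    branches:
      - main
```

Merged PRs are covered too, because merging a PR produces a push to `main`.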
> For now, I'll ask @Robaina and @khynder to check the final approval checkbox in their review message. Then, I will wrap up the review by checking some other stuff. It's great that you already took care of zenodo and the badge!
Ok amazing!
Submitting Author: @katoss
All current maintainers: @katoss
Package Name: cardsort
One-Line Description of Package: A Python package to analyse data from open card sorting tasks
Repository Link: https://github.com/katoss/cardsort
Version submitted: 0.2.2
Editor: @Batalex
Reviewer 1: @Robaina
Reviewer 2: @khynder
Archive:
JOSS DOI: N/A
Version accepted: v0.2.36
Date accepted (month/day/year): 08/16/2023
Code of Conduct & Commitment to Maintain Package
Description
Card sorting is a standard research method in the human-computer interaction field used to design information architectures. The `cardsort` package analyzes and visualizes data from open card sorting tasks. Using CSV data in the format returned by the kardsort tool (or any other tool outputting the same columns), it outputs a dendrogram based on hierarchical cluster analysis and pairwise similarity scores. It can also return category labels, to learn which labels study participants gave to combinations of cards from emerging clusters.

Scope
Please indicate which category or categories. Check out our package scope page to learn more about our scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):
Domain Specific & Community Partnerships
Community Partnerships
If your package is associated with an existing community please check below:
For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
The target audience is researchers and practitioners in human-computer interaction and user experience. Open card sorting is a popular user research method used to understand how people order and conceptualize information, in order to design information architectures. To make sense of the data, clusters are often visualized in the form of a dendrogram. To do so, pairwise similarity scores need to be calculated for all cards, followed by a hierarchical cluster analysis. This functionality is provided by the cardsort package. It also offers functionality to return category labels, in order to learn which labels study participants gave to combinations of cards in the emerging clusters.
I did not find any Python packages, or other open source tools, that accomplish this, which motivated me to create this package while running a card sorting study during my PhD. While research articles describe the steps of this analysis (i.e., pairwise similarity scores --> hierarchical cluster analysis --> dendrogram; see the articles linked in the question above), I did not find any open source tools that help with it. I collected the data with the widely used kardsort.com tool, which refers users to rather dated, closed-source Windows software for analysis, which underpins my assumption that open source tools are lacking. The approach is rather simple: it uses scipy for the hierarchical cluster analysis and dendrogram, adds a custom function to create the similarity scores, and puts everything in an easy-to-use pipeline. Nevertheless, I think the cardsort package can help remove barriers to applying this user research method (it is easy to use even for Python beginners, with no need for closed-source software that only runs on Windows, or for expensive subscription tools). In any case, I would have liked to have this package when I started my study :)
@tag the editor you contacted:
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication Options
JOSS Checks
- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor 'utility' packages, including 'thin' API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
Confirm each of the following by checking the box.
Please fill out our survey
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
The editor template can be found here.
The review template can be found here.