Unsupervised Learning and K-Means Clustering with Python

The Programming Historian has received the following tutorial on 'Unsupervised Learning and K-Means Clustering with Python' by @thomjur. This lesson is now under review and can be read at:

Old slug: https://programminghistorian.github.io/ph-submissions/lessons/k-means-clustering-with-scikit-learn-in-python

Final url for submission lesson: https://programminghistorian.github.io/ph-submissions/lessons/clustering-with-scikit-learn-in-python

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback which should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me.

Our dedicated Ombudsperson is (Ian Milligan - http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to requests for clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. Thank you for helping us to create a safe space.

This is a solid tutorial @thomjur. It’s very useful for exploring and teaching a specific method for data analysis. There are places I identify below where clarifications and reorganizations will help guide the reader. The key revisions to make before I send this out for peer review are:

This should be restructured to introduce your dataset sooner, and to focus on explicating the algorithmic methods through specific reference to your dataset.
Stylistically, you could aim for slightly shorter paragraphs, with more clear topic sentences going step by step through the lesson.
Framing the lesson slightly differently, with a more generalized title, will also help situate this lesson in a recognizable field and method – perhaps something like K-Means Clustering with Scikit-Learn in Python?
Finally, as you know, the rendering of the Markdown file didn’t go smoothly – when you submit a revised draft, let’s make sure that’s fixed, and any images, bibliographical references, etc. are uploaded correctly to the submissions page.

Comments organized by Section follow below. Can we aim to receive a revised draft back from you by January 31st? Please note Programming Historian editors will be taking a two-week break starting Monday, December 21st until January 4th.

[x] Introduction: The introduction could be a little longer, with a brief gloss of what clustering algorithms are, why scikit-learn, and why the dataset you’re using. It would also be helpful to include more hyperlinks to main reference points like scikit-learn.
[x] The section, Introducing the Case Study” is a little misleading. You mostly focus here on the clustering algorithm and possible questions it could answer. It would be helpful to partition out more clearly what each paragraph/section is doing. Create one section where you describe the case study / sample dataset in detail. Then a separate section where you describe what kind of analytical research question you are hoping to answer by analyzing the data, and what kind of method you’re explicating.
[x] The section, “why k-means clustering” seems worth addressing sooner, once you get into the details of the clustering algorithm, but after you introduce your dataset and the need to analyze it through clustering. It’s not necessary to say you’re doing a tutorial on K-means because it’s easy – instead, focus on why a researcher might want to use this particular clustering algorithm to answer specific research questions.
[x] Section on “Prerequisites for Python” (note the typo in your markdown for ‘Prerequisites’) – throughout this lesson plan, and especially here, instead of recommending readers look to Wikipedia, provide embedded links for each term you think is necessary, and mention and briefly summarize the key ones. If you can explain supervised vs unsupervised machine learning, with links for further information, it will make your lesson more self-sufficient. Some of this you go on to describe in the subsequent paragraphs, so it seems unnecessary to mention previous knowledge of machine learning being necessary. You’re lesson can provide the necessary information, with references to more reading.
[x] The section on “Unsupervised Machine Learning” quickly transitions to a discussion of clustering. Ideally organize this whole section to primarily explain unsupervised vs supervised methods specifically for tasks involving methods like clustering algorithms. In other words, lead with clustering algorithms, and then distinguish between supervised and unsupervised methods for clustering – beyond that, the distinction is not essential for this lesson plan. You could therefore begin with this section being entitled “Different clustering algorithms” and then transition to a discussion of types of machine learning in subsections. Ideally this section would also explicate the differences here by using your case study, rather than a random dataset.
[x] The section on “Different clustering algorithms” is crucial but difficult to follow. Reorganizing some of this discussion to integrate prior sections may help. But also I think you can avoid some of the overly general questions about machine learning, cutting edge methods, etc. What’s most important here is to explain to the reader, they have a research question and a dataset, why does scikit learn’s K-means clustering tool serve to answer the question?
[x] The section, “Intro to K-Means Clustering,” is where your lesson really gets going – ideally get to this point quicker. You’re discussion of scikit-learn’s implementation should also be more central. You conclude paragraph 19 by saying: “the scikit-learn model uses a different approach called "k-means++”, which raises a lot of questions you could answer here.
[x] On “how many clusters to include,” this is the type of question you should answer in relation to the specific dataset you’ve selected. You’re right to say there is no ‘one-size-fits-all,’ but this tutorial should show us how you think through picking a number for your dataset.
[x] “Exploratory data analysis” – this section is good. One thing that I find distracting is when you talk about what the tutorial won’t cover. For instance, when you mention further extensions of this code for integrating a “cluster analysis of the author’s locations,” it’s more confusing and distracting to the main flow of the tutorial. Again, later in this section, it’s distracting to discuss how you will “forgo writing a function that wraps everything up”. If you want to identify what’s outside the scope of this tutorial, ideally do it in a footnote or a separate section near the end where you discuss how a more complicated analysis and pipeline could be developed out of this tutorial.

I think any comments I would make about the latter half of the tutorial are mostly going to reinforce what I’ve said above. Overall the code works, the analysis makes sense, you deal with all the technical issues fluently. At times, you could afford to provide more brief statements defining terms, summarizing ideas. If you get to the code sooner in your tutorial, some of the opening sections explaining K-Means Clustering would be easier to follow as you cite the specific example in your code to explain the ideas behind the analysis.

The last two sections, “Standardizing the data” and “Summary” both seem too long without enough signposts guiding the reader. Add more subsections – for instance, you could include a separate section on “How could we proceed from here?” where you could also discuss some of the alternative approaches you mentioned earlier in the tutorial. These last sections are really the most important part of the tutorial, so guiding the reader through these steps carefully, explicating ideas and the code in more detail, is what will be most important.

Dear Alex @hawc2,

Thanks again for your constructive comments. I agree with all of them (ok, one tiny exception, see elbow-method).

Let me start with the four major issues you mentioned (plus some general comments):

Dataset: I am now introducing the dataset right in the beginning (introduction).
Language: this is still a problem, I guess, but I can ask a native speaker for the language-related issues once we've agreed on the content of my tutorial.
Title: d'accord, your title suggestion is much better.
I tried my best to fix the MD rendering; mathjax seems to be working; however, the figs don't, but I think this is due to a missing link in the yaml - and the form says "LEAVE BLANK," so I assume this is up to the editor to change.
I generally tried to refer to my dataset in the examples more often (for instance, when introducing how k-means clustering works).
I have changed the overall structure, particularly in the first half of the lesson.
I removed most of the footnotes.

Responses to section comments:

[x] the introduction is now longer, explains why scikit-learn, introduces the dataset, tries to implement fewer footnotes and more direct links to scikit-learn docs instead, etc.
[x] deleted the "Introducing Case Study" section; the dataset is now presented right in the beginning.
[x] "Why K-Means Clustering" now follows directly after the Intro and the very short "Prerequisites." I got rid of the "Machine Learning" part in the Prerequisites, too.
[x] elaborated more on "unsupervised machine learning;" added an extra section for clustering.
[x] "Different Clustering Algorithms": I got rid of all the "blabla cutting edge, ask your parents before applying machine learning, etc."
[x] K-Means clustering is now introduced right after the introduction.
[x] Well, the "problem" with the elbow-method remains actually; it does not really make sense to refer to the results of the analysis at this early point; however, I added a sentence to tell the reader that he or she will see another example of how to apply the elbow-method throughout the following analysis of the dataset.
[x] added a separate section on "how you could proceed from here."
[x] I have added quite a few new sections in the second half of the tutorial; I hope this helps the reader better understand where she/he is.

Thanks again for your help, and I am looking forward to your comments!

Best, Thomas

Added a revised version of the tutorial (minor changes, nothing content-related). Exchanged two images (ax labels were cut off in the former version).

Thanks @thomjur. We're sending this out for peer review. You'll hopefully receive feedback in the next month or two.

Alex @hawc2, thanks for the opportunity to review this tutorial. @thomjur, this is a very useful and accessible tutorial about using k-means clustering in Python. I believe someone who is not a humanities scholar and wants to learn k-means clustering will also have interests reading this tutorial. This tutorial will contribute to the goals of Programming Historian. The draft is in good shape and I have the following comments.

Major issues:

[ ] Figure 3 suggests that the decision of selecting 4 clusters is not a clear-cut one. In this type of situation, I understand that a qualitative examination and interpretation of clustering results is crucial. However, I’d suggest that you mention alternative quantitative methods of choosing the number of clusters. For example, it may be helpful to refer readers to, or briefly explain (better) the method that uses the Silhouette coefficient as a criterion. I see that in Paragraph 31, you mentioned alternative methods briefly. Maybe you could refer to the Silhouetter coefficient here.
[ ] Regarding “the K-Means Algorithm” section (starting from Paragraph 21): a) You now have a toy example using one variable. Because the key concept is distance, I think it’d be extremely helpful (especially for a beginner) to add another example using two variables. b) Adding a visual illustration for the toy examples will greatly aid comprehension. I see there is a link to Wikipedia for a visualization, but I think a visualization based on your toy examples would better help a visual learner.

Minor issues:

[x] For the quote blocks (Paragraph 3 & Paragraph 28), the texts don’t seem to be centered vertically.
[x] In Paragraph 8, a) should it be “looking at” or “looking any?” b) the last sentence implies that you are identifying clusters by number of articles only, but I think the goal is to identify clusters by both number of articles and word count?
[x] In Paragraph 9, it says that the Wikipedia links for some basic statistics concepts will be provided. In this draft, I only saw the link for mean (Paragraph 25). Please add the remaining links.
[x] Again in Paragraph 9, should it be “pandas and matplotlib?” Also, please insert a link for matplotlib.
[x] In Paragraph 12, I’m a bit lost reading the phrase “without being explicitly programmed.” Could you clarify a bit more here? Also, please insert a link for the cited source.
[x] In Paragraph 12, it implies that hyperparameter tuning only applies to supervised learning, but the Jupyter notebook (PH_Clustering_Tutorial.ipynb, Heading 5) implies that exploring the number of clusters is also a form of hyperparameter tuning. I may have misunderstood, could you clarify this more?
[x] I appreciate the explanation of the three approaches of machine learning. But I think the part for supervised learning (Paragraph 13) can be more concise.
[x] The name of the dataset in the heading after Paragraph 32 can be removed to improve readability.
[x] In Paragraph 39, to justify data standardization, it may help strengthen the argument by stating that one unit of ‘word_count’ and one unit of ‘articles’ are not treated as the same in the process of calculating distance.
[x] In Paragraph 41, it may help a reader if you can link to a source explaining the relationship between reproducibility and the ‘random_state’ parameter.
[x] Could you make Figure 2 and Figure 4 bigger and with greater dpi to improve readability?
[x] In Paragraph 47, when you refer to outliers, it implies that an outlier is determined based on a visual inspection on Figure 2 only. It is worth noting (maybe in a footnote) that there are formal ways to define and detect outliers. For example, in the case of bivariate outliers, one can use Mahalanobis Distance (see https://towardsdatascience.com/detecting-and-treating-outliers-in-python-part-2-3a3319ec2c33).
[x] Regarding the heading after Paragraph 49, should it be “number of clusters?”

Hi @thomjur, I want to begin by highlighting things that are working well in this tutorial. Your explanation for how to run k-means clustering with scikit-learn is mostly sound and clear — well-done. I also like that you use and discuss variable scaling — that's great! Lastly, I appreciate that you are not using random or artificial data but a dataset that has an actual connection to the humanities.

However, I believe that there are some significant issues with this tutorial (as it is currently written), and they mostly stem from the choice of dataset.

Big points:

[x] While I appreciate your attempt to find a relevant DH dataset, the Der Neue Pauly authors data does not seem well-suited for k-means clustering. The data does not contain coherent clusters from a mathematical perspective and from an intuitive perspective. Additionally, the features — word count and article count — are not particularly interesting or informative. I could imagine an improved version of this tutorial that instead considered the vocabulary of each article in the DNP (words converted to numbers via TF-IDF).
[x] Relatedly, the research question driving the tutorial and motivating the use of k-means clustering — to measure scholars' impact in a specific domain — is not as strong or compelling as it could be. I think it's important to anchor this tutorial with a stronger question that needs k-means clustering to be answered.
[x] Based on the above two points, I would recommend choosing a different dataset and different research question for this tutorial. Believe me, I know it's hard to find a great example dataset, but I think if you do, it will be a huge service to the DH community and ensure that DH researchers and teachers are using k-means clustering appropriately. A lot of the potential alternative datasets that I keep coming back to in my mind are related to text documents that have been converted to numbers via TF-IDF. I'm not sure if you want to go that route, but that could be one possibility, and if you still wanted to plot the clusters, you could reduce the data with PCA or t-SNE.

Smaller points:

[x] The tutorial dedicates a lot of space to parsing the differences between supervised vs unsupervised vs reinforced machine learning. I think a sentence or two would suffice there, and you could spend more time discussing the k-means algorithm in particular.
[x] The tutorial links to Wikipedia for a visualization of how the k-means algorithm works, but I think a visualization within the tutorial itself would be super helpful (perhaps you could even include that GIF or a similar one with credit to the original authors?).
[x] The tutorial instructs the reader to set the random_state parameter to 42 but does not explain the significance of random state or why it should be set to 42 (a common convention in machine learning supposedly inspired by Hitchhiker's Guide to the Galaxy). I would suggest that you either forgo the random_state argument entirely, or, if you choose to set a random state, that you explain the concept and why you are choosing 42.
[x] This is a tiny point, but the tutorial explains that k-means work well for "unknown data," but I feel like digital humanists often know things about their data but want to uncover unknown structures in order to gain a new perspective on their data. Perhaps you could spend a sentence or two explaining the significance of k-means for this audience.

Thanks to @hluling and @melaniewalsh for your thorough reviews of this lesson. @thomjur, I'll leave it to you to address the minor revisions, as they seem straightforward. As for major revisions, @hluling's one suggestion seems easily doable. @melaniewalsh's revision advice will require some change to the lesson's structure and length, but I think you can do it without rewriting the lesson entirely.

It is ok if you find it too difficult to change the dataset for this lesson, but it is still important you explain why you chose this dataset in particular, and how K-Means clustering works and doesn't work for this dataset. More importantly, Melanie's suggestion that you use TF-IDF to enhance your analysis seems worth doing.

Along the way, you can more clearly clarify how this process can be used to answer research questions with K-means clustering of textual features. In the end, the most important thing isn't that this lesson perfectly answers its mock research question, so long as it explains how researchers could use it to answer various types of research questions with varying amounts of success.

Your lesson is currently only approximately 4000 words, so you have a lot of room to expand as necessary. I'm optimistic that some additions to the lesson to address the reviewers concerns will result in a lesson that will prove more persuasive to readers trying to decide when to use K-Means clustering over other algorithmic methods for answering their research questions.

In particular, I think the conclusion could be expanded to better address the issues reviewers raise, giving more time to explain what is useful and not useful about K-Means clustering. Throughout the lesson, you could also weave in more comparisons of K-means with other methods - this will help better draw out for the reader why this method is appropriate. Like Melanie says, it seems less necessary to discuss in detail the differences between supervised and unsupervised machine learning, and more time differentiating K-Means from other algorithms like PCA that offer comparable analyses.

Please try to make your revisions in the next month, or two at most. Let us know if you'll need more time, and please feel free to add here any further thoughts you have as you proceed with revisions.

@hluling @melaniewalsh @hawc2

Thank you all for your constructive suggestions. I am happy to revise my tutorial. However, I am still struggling with the major issue raised by @melaniewalsh. Even though I like the idea of using TF-IDF and actual documents for K-Means clustering, this would mean changing the entire article because such an analysis (including TF-IDF) is simply not possible with the current dataset (which was trying to mock customer data analysis, a viral example to illustrate clustering algorithms).

Thus, as far as I can see, I need to change the dataset in order to fully cope with all your suggestions (since, as I said, implementing TF-IDF is simply not possible with the current dataset). I am willing to do this, but I will first need to think of a dataset that is neither too complex for a tutorial (which would be the case with thousands of documents with very different sizes from Brill's New Pauly) nor poses any legal issues (which would most likely be the case with scraping all articles from Brill's New Pauly as well; it's a protected website with paid content). Of course, it should also make sense actually to analyze the data. I will ventilate this question for a while and see if I can come up with something. I have some first ideas, but I am not sure whether they will actually work. Of course, I will then need to rewrite the tutorial (that is based on the current dataset) and change the code, which is fine. However, I am not sure if I can do this within a month or two (family, covid, and a non-existing vaccination strategy in Germany are a bad combination). However, I honestly like the idea of using documents with K-Means clustering, so I see what I can do (maybe I'll have some sort of inspiration). Ooohhhhmmmm....

I will keep you updated, and thanks again for your evaluation!

Thanks @thomjur for clarifying. I'd be curious to hear what @melaniewalsh or @hluling would say. Please chime in if you have any recommendations or further feedback.

For my part, I think it's possible you could work with this dataset and find another way to make your analysis and results more robust. You do a good job in this tutorial of discussing how to pick a number of clusters, but the transitions between method explication and dataset seem to quick, leaving vague why K-Means clustering is right for this particular dataset and research questions. I do wonder if there's another approach with this current dataset that might work as an implementary or subsequent step.

Your section, "How to Proceed?", in particular, opens more questions than it answers, and is one place I'd encourage you to expand the lesson. You write, "Usually, it is a good idea to implement different clustering algorithms, such as hierarchical clustering and k-means, and to compare their results. Even though we have skipped this part in this tutorial, it should be pretty easy for you to add other algorithms now that you have understood how to work with scikit-learn." To me, this is something this tutorial should go into in some detail, to show why K-Means is appropriate here. Perhaps that is an alternative to a new dataset/TF-IDF route, if that proves too difficult to rewrite into the lesson.

Regardless, considering this will require major revisions, it is fine if you need more than two months to get the final draft in shape. Keep us updated on your progress, and have a good spring!

@thomjur @hawc2 @hluling

I hope it's ok that I discussed this tutorial in some detail with my research group, but they also feel that the tutorial is solid except for the choice of dataset. I still think that incorporating TF-IDF could be useful and cool, but I think what's most important is that whatever dataset you choose has real clusters in it (e.g., in the plot, there would be space between the clusters). One of the drawbacks of K-means clustering is that it will identify clusters even if your data does not contain clusters, as discussed at the beginning of this StackOverflow answer. And, from my perspective, it's important that readers know what kind of data is appropriate to use with this technique.

I'm sorry that this suggestion requires more revision work in such a difficult year, and I can empathize with wanting to get this piece published more quickly because there's so much good material here, but that's my take on the tutorial. @thomjur if you choose to go with another dataset, I would be happy to brainstorm or discuss alternatives. I'm excited about this tutorial and think it could be really useful for the community!

Thanks a lot for your encouraging response, @melaniewalsh!

No worries, I am happy about every kind of feedback that I can get, so I am glad that you discussed my draft in your group. And I am also excited to revise my tutorial. The more I think about it, the more I have to agree with what you suggest. My "lazy" defense, so far, was that real-world data oftentimes does not include strictly separated clusters. But even though this might be true, it does not make sense to use such data in a tutorial. Still, I think that it is, generally speaking, reasonable to use K-Means with such data, and if it's to demonstrate that your customer/author data does not include any clusters (and I agree, two to five outliers are not really a "cluster"), which could also be a valuable insight.

I will think some more about textual data that could be interesting to analyze with K-Means. Due to legal aspects, I tend towards Wikipedia; yet, Wikipedia is also somewhat overused, although maybe not to the same extent as Twitter data.

I am also struggling with this whole data/privacy thing... I am not sure if this is due to my German angst background (since Germans tend to exaggerate this whole data privacy/protection stuff, IMHO), but I see potential problems/lawsuits everywhere.

Glad to hear we're coming to a consensus.

@thomjur, as far as copyright/privacy goes, feel free to email me some of your thoughts. I can guide you through deciding about those copyright/privacy issues, and I'll consult with other Programming Historian editors if necessary. Wikipedia is certainly safe, and not infrequently used by PH, in part because it is more likely to ensure sustainable lessons. The ideal dataset would be one familiar to a large audience, and appropriate to produce meaningful clusters with K-means.

@thomjur @hawc2 @melaniewalsh Sorry to jump in late. I'm glad to see the consensus too. I just wanted to point out that clustering techniques are exploratory in nature. I believe this tutorial reflects a common process whose goal is to explore any possible patterns without knowing whether there are 'real' clusters or not before running the analysis. I agree with the StackOverflow post mentioned by @melaniewalsh that we should examine the required assumptions before running an analysis. However, when there are four or more variables so that it is difficult to visualize the raw data, checking all the assumptions for K-means is not feasible (maybe there are formal tests of these assumptions that I don't know). So my position is "we won't know until we try something out." The current dataset and the exploratory research question, as shown in the current version of the tutorial, do not preclude the use of K-means clustering. That being said, I do agree with @melaniewalsh @hawc2 that the ambiguous result may be problematic for publication due to a lack of coherent narrative. So I'm excited to see a revised version of the tutorial no matter which suggested route is taken.

Just a quick update from my site: I hope to send you a revised version of my tutorial by the end of this week.

@hawc2 @melaniewalsh @hluling

I have just uploaded a revised version of my tutorial. Since I have changed major parts of the tutorial, I will not respond to every minor point mentioned by the reviewers.

Major/Minor changes:

[x] I have changed the dataset; the tutorial now examines two new datasets, one about ancient authors (taken from the DNP) and a dataset containing textual data (abstracts from the journal Religion) using TF-IDF
[x] particularly the first dataset and its examination now show "real" clusters; even though the situation in the case of the second dataset is more difficult, the results are still promising, and querying the clusters of the textual data indeed show some reasonable clustering of the article abstracts
[x] I have added a second clustering algorithm (DBSCAN) to the analysis; consequently, I also changed the title of this tutorial to "Clustering with Scikit-learn in Python"
[x] I have added PCA for dimensionality reduction both for plotting and actual feature reduction
[x] I have also added other ways to reduce high dimensional feature sets (very basic)
[x] I have added the silhouette score and silhouette plots/diagrams as a second measure to evaluate the clustering process
[x] I have shortened the part on "supervised machine learning"
[x] I have added several visualizations related to my own data to illustrate how k-means etc. work

I am looking forward to your comments!

There seems to be problem displaying the tables. Is there a way to fix that, @hawc2? If not, I could simply delete the longer ones.

@hawc2 @melaniewalsh @thomjur I’m glad to see the added text-based dataset and a much more expanded discussion on different techniques involved in clustering. I like the added visualization in the section that explains how k-means clustering works and why standardization is needed, as well as the added parts on the silhouette scores. I also appreciate the detailed explanation about DBSCAN.

I think this version meets publication standards as a tutorial. I just have the following comments for improving clarity.

Major issues:

[ ] On feature selection (para. 68 and para. 70), I don’t think how this is done is clearly explained for a reader. I understand the goal is to reduce the number of features from ten to three based on the silhouette score, but I don’t get how this is done exactly based on my reading. Maybe adding a two-to-three-sentence summary of progressiveFeatureSelection( ) would be helpful.
Specifically on progressiveFeatureSelection( ), why did you set the default n_clusters as 3?
[ ] Similarly, the section of finding the optimal eps value may be unclear to a reader. It took me some time to figure the whole thing out by reading the towardsdatascience piece and the piece cited in the towardsdatascience piece. Regarding para. 56, after “we can plot the distance between each data point in a dataset and its nearest neighbor,” maybe just add “we then sort the distance in ascending order.” I suppose this would make it easier to understand Figure 11. Also, for Figure 11, please add the axis titles.
[ ] I think the big takeaway from DBSCAN is that the result does not lead to groundbreaking findings. This is fine for a tutorial. But I’d like to see a more concrete and forward-looking suggestion for the next step of analysis. For example, what do you think would be some new research questions (para. 118) generated for the next step of research?

Minor issues:

[ ] The language of DBSCAN allowing for (para. 104) or accepting (para. 14) outliers is problematic. It implies that k-means cannot run when there is an outlier. Do you mean that DBSCAN is able to leave the outliers alone and exclude them from the clusters? But maybe the term outlier itself is problematic for this tutorial’s purpose. I prefer the name used in the towardsdatascience piece: noise point. If you look at the last figure in the towardsdatascience piece, the navy blue points (noise) are not really outliers in the traditional statistical sense.
[ ] Figures 1, 2, and 3: It might be a good idea to add the author names as labels next to the points.
[ ] Figure 2: Where is the green dot?
[ ] Could you show a figure (like Figure 9 and the last figure in the towardsdatascience piece) for the result from DBSCAN?

@thomjur @hawc2

I think this revision improves the tutorial a lot. The Greco-Roman author clusters are more coherent, the use of PCA is nice and very helpful, and there seems to be a better communication of how clustering works with real data vs artificial data (e.g., "As we can see, there is no real elbow in our plot this time"). The author has clearly has put a lot of work into this revision. Well-done!

Similar to @hluling, I like the addition of text data, TF-IDF, and DBSCAN. However, it also makes the tutorial lengthier and more complicated. I hesitate to suggest this because I know it would be even more work, but it occurred to me that this version of the tutorial could usefully be split into two parts: Part 1 (K-means) and Part 2 (DBSCAN). That could be one way of capitalizing on all this great material while also paring things down for two easier, more focused reads with two different datasets and clustering methods. Would that be possible?

My comments will mostly focus on the K-means section, but the general concepts apply to the DBSCAN section as well.

Major issues

[x] It would be great if there were a bigger payoff/more discussion of results and conclusions from the K-means clustering of ancient authors. I really wanted to explore some of these beautiful clusters! Perhaps you could include a few tables with especially interesting ones. For example, when I looked at your Jupyter notebook, I was really satisfied to see that Plato, Sophocles, and Aeschylus were all in a cluster together.
[x] The author does a good job explaining important concepts with snippets of data in the subsections "The K-Means Algorithm," "Elbow Method," and "Silhouette Score." But I was wondering if these explanations might be interwoven into the actual exploration of the data in the section "Applying K-Means on the Ancient Authors Dataset from Brill’s New Pauly," rather than coming before it and focusing only on snippets of data. I think one drawback of the snippets is that it makes it seem like you're only considering two features of data when you really you consider more features later on.
[x] I appreciated that potential research questions/hypotheses were included near the beginning of the tutorial, but I wondered if these questions could be moved up a little to come even earlier so that they were almost motivating the dataset choice:

For instance, there might be groups of authors who are discussed at length but to whom limited manuscripts are attributed. On the contrary, other groups may include authors to whom many surviving manuscripts are ascribed but who only have short entries in the DNP. Another scenario could be that we may find groups of authors with many early editions but only a few modern ones.

Maybe the possible research questions could come after the sentence "The original German version of the DNP has been translated into English (starting in 2002)." It might also be helpful to mention/discuss specific authors more, to ground the research questions even further.

[x] The subsection "How Does Clustering Work?" seems slightly redundant. I feel like the previous part of that section does a good job communicating how clustering works, as does the "How Does K-Means Work?" that comes after it.

Minor issues

[x] Figures 2 and 3 use red centroids, while Figure 6 uses yellow centroids. It might be good to keep the centroid colors consistent. It might also be useful to put a little cross over the centroids to further visually distinguish them (you could just manually add that in Photoshop or Preview, etc.)
[x] Paragraph 2 and 3 could be condensed, perhaps concluding with the sentence "The original German version of the DNP has been translated into English (starting in 2002)."
[x] In the Jupyter notebook, you might mention that yellowbrick needs to be installed, since it doesn't come with popular Python distributions like Anaconda, or you might even include a pip/conda install yellowbrick line
[x] I agree with @hluling that author name labels would be very useful for the figures!

Thank you for your constructive criticism, @hluling @melaniewalsh @hawc2.

I agree with most of your comments, and I will try to implement them as soon as possible. Most of them are pretty straightforward (adding axis labels, adding a couple of sentences here and there, etc.).

However, particularly Melanie mentioned a couple of things that I would like to discuss first.

I'd be happy to add some additional sentences or even tables to the analysis. Yet, I am somewhat hesitant to dive too deep into the discussion of these still mock-up use cases. This is simply not a research article, and although I find the results sort of interesting too, I am not sure if this text isn't already too long for a tutorial (depending on if people reading it are looking for a concise recipe of how to use clustering or want to know more about Greco-Roman authors).
Regarding the second point mentione dby Melanie: I personally thought that it might be a good idea to explain the concepts based on a reduced and somewhat adopted dataset. This allows showing the general idea behind these methods more clearly. Also, including only two to three dimensions (features) make the visualization a lot easier. This section is meant to show "how it is supposed to be/look like" (in an ideal world with perfect data) and the actual analysis "what it usually looks like." But I am curious what @hawc2 and @hluling think about this. It would also imply another substantial rework of major parts of the tutorial.
Should I delete the section on "How Does Clustering Work?" I am open to doing so; just let me know what you think (it's becoming more and more difficult for me to have an "objective" view on my tutorial)

Thanks in advance for your help!

@thomjur @melaniewalsh @hawc2

I agree with @thomjur that the explanation about k-means with a reduced dataset is necessary, because the start-from-ground-zero approach with the two-dimensional visualization is learner-friendly.
I agree with @melaniewalsh that the subsection "How Does Clustering Work?" may be redundant, especially now due to the concern about the tutorial's length. It wouldn't hurt to delete this subsection.

Hey all, it sounds like most final revisions are clear and in agreement by all.

As far as walking through some introductory info about clustering using a toy (reduced) dataset, I think that's ok to do, so long as you explain what you've described in your response here, telling the reader why you are using a smaller dataset to explain certain concepts, and pointing out to the reader how this demo analysis differs slightly from the number of features you'll explore later.

I can understand why it might not be worth it to weave this part into the broader exploration of your main dataset, even if it would make the lesson a little more elegant, but it will be important at the very least that you give signposts so readers don't get led off track or get confused about how to connect different sections of the tutorial.

Thanks everyone for your time contributing to such a generative conversation about this important tutorial

Thanks for your follow-up comments, @hawc2 and @hluling.

Also @thomjur, to address your first question about whether or not to briefly show/discuss results, I still think it would be useful even though this isn't a research article because it will help show people how to interpret their own results. And I think that's part of what distinguishes Programming Historian from any other generalized scikit-learn tutorial on the internet. For a good model of briefly discussing results/interpretation, you might look at Matt Lavin's TF-IDF tutorial.

@hawc2 @melaniewalsh @hluling

I have now revised parts of my tutorial. Please let me know what you think. Note that some of the images do not seem to be displayed correctly (maybe it takes some time to refresh gh pages); also, the problem with the tables remains. It might make sense to wait a couple of days before you go through the tutorial again. I still have to adjust a few things. It's just that I had some spare time today and wanted to proceed with my tutorial.

Major issues

[x] progressiveFeatureSelection( ): I have added a couple of sentences to describe the overall idea of the algorithm. I have also tried to explain why I selected n=3 clusters (and why it might make sense to play with this parameter in the context of hyperparameter tuning).
[x] I have added a short explanation why I think it makes sense to introduce the algorithms with the help of reduced datasets (and that we will later work with more features).
[x] eps function: I have added your helpful sentence and also added axis labels
[ ] @hluling 's last point regarding DBSCAN remains somewhat open; I agree, the major outcome is that DBSCAN itself does not add a lot to the results already provided by k-means clustering (which is also underlined by the added figure 12 showing a PCA-reduced version of the data); yet, combined with the results from k-means clustering, it indeed allows some interesting insights into the data; how to proceed from here? I have mentioned some potential next steps such as hyperparameter tuning and/or adding additional clustering algorithms. I also mention that we have found interesting groups (which I also show) and created a basic "recommender system." To be honest, this is all I would expect from an exploratory data analysis. But I am happy to elaborate on this if necessary. It's just that I do not have any immediate ideas for a research question based on DBSCAN right now.
[x] I have added some tables showing the results of the k-means clustering of the ancient authors with some very basic discussion of the authors in these tables.
[x] Even though I did not split the text into two parts, I tried to make it more clear that we have two separate sections that are each dedicated to clustering different datatypes; I have added an additional paragraph in the introduction and corresponding subtitles ("Case Study 1 & 2").
[x] I moved the motivation to the part suggested by @melaniewalsh
[x] Deleted the subsection "How Does Clustering Work?"

Minor Issues

[x] I have changed some parts where I discuss outliers/noise (and how k-means clustering/DBSCAN deals with them); I mostly replaced outliers with "noise points;" however, in some cases, I left "noise points or outliers;" please let me know if this use is correct @hluling
[x] figures 1,2,3 added author name labels
[x] Regarding your question why Ambrosius (the green dot) does not show up in figure 4: This is because the centroid in a single point cluster has the same position as the single point (which makes sense when looking at the algorithm and the way the cluster centers are calculated). This is also explained in my tutorial when introducing inertia (where I mention that the inertia is zero when n=number of data points).
[x] added a cross to the centroids (which hopefully also makes the coloring less important)
[x] condensed paragraphs 2 and 3 as suggested by @melaniewalsh
[x] added an additional figure for DBSCAN (figure 12)
[x] added yellowbrick pip install to jupyter notebook

Just a short update: The images should now be displayed correctly. I have also added a couple of sentences to the DBSCAN discussion and the summary related to @hluling claim to discuss/propose potential outcomes of the clustering.

@hluling @hawc2 @melaniewalsh

I have now revised my tutorial. In addition to the points mentioned above, I have also added some paragraphs to discuss potential further steps in the DSCAN section (see @hluling comments). Besides, I have also corrected some other mistakes; for instance, I realized that I was first talking about using n=3 clusters in the first part of the k-means clustering of the ancient author data and then suddenly changing to n=5 clusters in the actual analysis. I hope that I have now corrected everything to n=5.

Also, I was wondering about my use of "standardization" and "normalization." I think it's ok to use both terms since standardization is z-score normalization to my understanding, but maybe @hluling, as a trained statistician, could say some more about whether this use is correct, misleading, or simply wrong (I've made it through half a BA program in computer sciences by now, but I have not taken the - optional - class on statistics so far^^).

Thanks in advance for your help, and I am very much looking forward to your comments!

@thomjur Thanks for the updated version and some of the clarifications. I'm not a trained statistician, but I think the term 'standardization' better describes what you did based on para. 33. I think normalization typically means rescaling values to [0, 1].

Two more comments:

In para. 100, regarding choosing the number of components in PCA, it may be useful to add a footnote saying that the elbow method also applies here.
Thanks for elaborating on some future steps generated from the DBSCAN results. You pointed out that "combining the results of different clustering algorithms also helps to better discover structures and thematic clusters in or data" (para. 121). It seems that a more direct comparison between k-means (Case Study 2, Sec. 4) and DBSCAN (Case Study 2, Sec. 5 [there is a typo in section number here, btw]) would be important. Therefore, I wonder whether you have tried running a k-means with much fewer clusters. You started with n = 100 for k-means (Case Study 2, Sec. 4), but choosing such a large number of clusters may not be helpful (even though you interpreted 3 clusters successfully to some extent). Speaking of Figure 10, I wonder whether an elbow-like pattern may appear with much fewer clusters. I'm curious about the result with a much smaller n for k-means, because then we can have a more direct comparison between k-means and DBSCAN by using figures and analyzing the titles within each cluster. It seems that how noise/outliers is dealt with is the major distinction between k-means and DBSCAN. So if we have a figure from k-means, and a figure from DBSCAN, with same or similar number of clusters, then we may be able to see whether it makes sense to categorize an article as a member of a cluster (k-means) or noise (DBSCAN). In general, I'm echoing @melaniewalsh here regarding what distinguishes PH from other more generic tutorials: Interpretation and discussion of results that attract readers in humanities. I understand that this may be the most challenging part. But feel free to let us know what you find and/or what you think.

@hluling Thanks for your comments. I've replaced "normalized/normalization" with "standardized/standardization" in the tutorial and the Jupyter notebook. Thanks as well for the recommendation to mention scree plots (that's what you meant, I guess?). I've added a sentence to paragraph 100.

Regarding your second question: First of all, I tried running k-means with only a few clusters in the beginning. However, the elbow plot shows an almost linear line in this case, and the results when looking at the articles in each cluster are not very promising. If you want, you can try this out yourself in the Jupyter notebook (which is something I recommend doing throughout the entire tutorial anyway, because obviously, my choice of clusters might not be the perfect solution - which is nothing I claim anyway).

The clusters are too big and include too many diverse articles with such a low number of clusters. I am also mentioning this in my article when discussing the elbow plot: "But is it likely that a journal such as Religion that covers a vast spectrum of phenomena (which are all, of course, related to religion) only comprises a few thematic clusters? Probably not." (para. 103) I think that the "nature" of the data demands a more fine-grained solution because the articles are indeed diverse. This is why I run k-means with n=100, and I really think that the results are astonishingly good. I checked again with several different randomly selected clusters in my notebook, and I am honestly surprised how well this works (I am only showing three exemplary clusters in the text, since I think that this is enough to demonstrate the basic idea and encourage the reader to explore the data him- or herself). In addition, the rather small number of articles in each cluster allows checking the coherency of the clustering results more easily (again, there are obviously clusters that don't make much sense).

To be honest, I was not really sold when @melaniewalsh initially told me to use k-means with textual data because I thought that this is not the best way to analyze textual data. Yet, I am really surprised how well this worked. Is it perfect? Probably not. Could I sell this solution to a major company as a professional recommendation system? For sure not. But it delivers decent/coherent results and insights into the data, and it hopefully encourages the reader to try it out with his/her own (textual) data.

Now, when it comes to DBSCAN, I think we both agree (and I also say this in the tutorial several times) that the clustering, in this case, isn't really convincing. This is, among others, also because at least some of the clusters are too big. Yet, it is difficult to play around with the eps value until I have a similar outcome as the k-means algorithm because this is not how one would use DBSCAN. On the other hand, some of the few DBSCAN clusters actually show interesting results, which is why I recommend combining these results with the results provided by k-means, may it be as part of a recommender system or as an insight into the data as part of the exploratory data analysis. I mean, we are not restricted to only using one algorithm during the exploratory data analysis. We can apply as many as we like, look at the results (of which some are more and some are less indicative), and just combine the insights we gained from each in our final evaluation (in the way ensemble learning does for classification tasks by using different models and then combining their results).

I just read through the revised version of the tutorial, and I have some thoughts about the DBSCAN conversation. I also found some typos and small issues while re-reading, which I've noted below.

Major While I agree with @hluling that a better illustration of how DBSCAN deals with noise/outliers would be nice, I think I understand what you're saying @thomjur and that you're more interested in demonstrating that DBSCAN produces some interesting clusters, too, which we might think of as additional insights in our exploration.

Either way, upon re-reading, I noticed that it's hard to read the Religion article titles for each DBSCAN cluster, and that it's also hard to read the sample abstract when the Religion data is first introduced in paragraph 11 (the abstract simply says "In contemporary..." before it gets cut off).

In the DBSCAN section at the end, would it be possible to make the clusters into regular tables instead of code outputs so that they're easier to read? Something like...

`df_abstracts_labeled[df_abstracts_labeled["cluster"] == 75]["title"]`	Religion Article Title	Cluster
Checking the heavenly ‘bank account of karma’: cognitive metaphors for karma in Western perception and early Theravāda Buddhism	75
Karma accounts: supplementary thoughts on Theravāda, Madhyamaka, theosophy, and Protestant Buddhism	75
Resonant paradigms in the study of religions and the emergence of Theravāda Buddhism	75

Minor

[x] In Paragraph 30, "This, in turn" instead of "This, in return"
[x] In Paragraph 45, "machine learning" instead of "ML"
[x] In Paragraph 60, what about "in Manoranjan Dash and Huan Liu's paper 'Feature Selection for Clustering' and Salem Alelyani, Jiliang Tang, and Huan Liu's 'Feature Selection for Clustering: A Review" with hyperlinks instead of simply hyperlinking "this and this paper"? (I'm not sure if Programming Historian has a house style for citation and hyperlinks @hawc2)
[ ] In Paragraph 79, do you need the block quote from Ankur Patel? Maybe that could be a quick parenthetical citation
[x] In Paragraph 81, when you say, "refers to a famous film (which one is it...?)", are you being playful and trying to get the reader to guess? If so, maybe you could also link to something that has a real answer
[x] In Paragraph 82, the wording seems off in this phrase "Our plot looks like in figure 9"
[x] In Paragraph 85, small typo in this phrase "many authors with only very few kown works," and also I think it should be "no" instead of "none" in the phrase "few to none commentaries"
[x] In Paragraph 86, "including the first then entries in cluster 0," first ten?
[x] In Paragraph 89, maybe put "curse of dimensionality" in quotation marks instead of italics and hyperlink these words to the Wikipedia page?
[x] In Paragraph 107, "interpret" in "harder to interprete"
[x] In Paragraph 116, when it says, "As we can see, fitting a DBSCAN instance under this circumstances results in only four clusters and a vast noise points cluster (-1)" I'm not sure what we're supposed to "see" at that point — where do we look to see the four clusters?
[x] In Paragraph 124, maybe "despite" instead of "albeit" in "Albeit this general ambiguity when applying machine learning algorithms"

@melaniewalsh Thanks a lot for your constructive comments! I've changed the code output to markdown tables in both the k-means and DBSCAN sections, and indeed, it looks much cleaner and is more readable. Great, thanks a lot! However, I left the abbreviated "In contemporary (...)" in the introductory part since this table only illustrates the structure of the table (otherwise, the table would be way too big/long when displaying the entire abstract). Yet, I am happy to change this if you would like me to. This also depends on the general question of how the tables will be displayed in the final tutorial. For instance, some of them currently include too many columns that are not shown when I open the draft web version in my browser. I am not sure whether there will be automatic breaks or something like this in the final template. Maybe @hawc2 knows more.

I've also corrected all the minor issues, except the quote by Patel . I personally like it, and I find it very helpful (plus, it makes my tutorial more "professional" when quoting someone who actually is an expert in this field), but I am also happy to delete it. I've also deleted the awkward film reference, which was, on top, wrong. -.-

Thanks again for your help, and have a great weekend!

@thomjur @hluling @melaniewalsh thank you all for this incredibly generative peer review process! It's amazing how much this tutorial has grown in complexity and detail over the last couple months. We have a few loose ends to tie up, but the big problems sound resolved. I'm recommending we now close the review and move on to the next stage of the publication process.

@thomjur I'll follow up with you about next steps. If you have any further chagnes to make, please go ahead and do so in the coming weeks.

@hluling and @melaniewalsh thank you so much for your investment in reviewing and improving this tutorial. It is a very difficult subject, especially to make user-friendly, and your expertise and attention to detail has really ensured it will be a popular Programming Historian lesson!

I the author Thomas Jurczyk hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

@svmelton, this lesson is ready for next stages of the publication process. I’m providing the location of all relevant files, let me know if you need anything else.

lesson file - /lessons/clustering-with-scikit-learn-in-python.md
images folder - /images/clustering-with-scikit-learn-in-python/
assets folder - /assets/clustering-with-scikit-learn-in-python/
gallery image - /gallery/clustering-with-scikit-learn-in-python.png
gallery original - /gallery/originals/clustering-with-scikit-learn-in-python.png

The following author bio is for adding to the /_date/authors.yml file on the other repository:

- name: Thomas Jurczyk 
  team: false
  orcid: 0000-0002-5943-2305
  bio:
      en: |
            Thomas Jurczyk is a German historian and religious studies' scholar based in Bochum (GER)

Thank you @hawc2! This will be the first lesson that our new publishing assistant will work on for the copyediting! She's just started and I'll be meeting with her next week, and I'll let you know once the copyediting is complete and we're ready to publish.

Hi @Anisa-ProgHist! This is the lesson that's now ready for copyediting. Please let me know if you have any questions.

Thank you, @svmelton! I will take a look and be in touch if questions arise. Looking forward to working on this!

Hello @thomjur and @hawc2 ,

Thank you for your patience. My copyedits of this lesson are now ready for review.

I have applied my suggested revisions directly to the markdown file in our Submissions Repository, where you be able to view the additions and subtractions I’ve made in the Commit History

In addition, I will paste a list of my comments and suggested revisions here, so that the copyediting process is transparent and can invite discussion.

I have formatted my comments/suggested revisions as a task list. You will notice that many of these tasks are ‘checked’, because I have made the changes directly. A small number remain ‘unchecked’ because they ask questions.

I hope my comments will prove clear and useful. I’m happy to talk through anything which isn’t, and will not be offended if you choose not to absorb my suggestions.

A few general thoughts first, if I may.

In Paragraph 37, you provide a link to the source of your dataset for the First Case Study. I note that access to this data is not free, and I would like to ask if you think it is important to clarify that?
In Paragraph 82, you establish the lesson's prerequisites of knowledge and understanding. I think it would also be useful to establish the technological dependencies and operating system requirements, and I would like to suggest that this is established as early as possible in the text. As a clustering novice (I realise that this lesson is defined as 'difficult') I am not sure whether the K-means and DBSCAN algorithms are suitable for use across Mac, Windows and Linux. I think it is important that the lesson makes clear which technical environments has the lesson been developed and tested in, and which technical environments it can be applied in.

[x] Para. 30. Italicise the k in k-means throughout
[x] Para. 30. Suggestion: add a link to Wikipedia definition k-means clustering
[x] Para. 30. Suggestion: add a link to Wikipedia definition DBSCAN
[x] Para. 30. Suggested revision: replace ‘among’. Instead: ’in combination with’
[x] Para. 30. Suggested revision: replace 'with', Instead 'using'
[x] Para. 32. Suggested revision: replace ‘the’ with ‘these two datasets’
[x] Para. 32. Suggested revision: replace ‘that’ with ‘how clustering algorithms’
[x] Para. 32. Suggested revision: replace ‘cover’ with ‘be applied to a broad range’
[x] Line 34. Suggested revision: replace ‘part’ with ‘section’
[x] Heading Line 36. Italicise Brill's New Pauly
[x] Para. 37. Suggested revision: replace ‘exemplary’ with ‘example’. I think it is rare to use exemplary as a noun in this way. Or is this an exemplary use case?
[x] Para. 37. Suggested revision: In this example, we will use k-means to analyze …
[x] Para. 37. Do you think it is important to say that access is not free? This access appears to be available for purchase (both for individuals or institutions)

It isn't clear until you read on, what the abbreviation DNP represents. Perhaps simpler to introduce the abbreviation after the first use of the original German title. Also, if you restructure this sentence, you can avoid the awkwardness of two apostrophes in Brill's New Pauly’s. Suggest: revision of sentence,

The data was taken from the official Brill's New Pauly website, and originates from Supplement I Volume 2: Dictionary of Greek and Latin Authors and Texts. Der Neue Pauly: Realenzyklopädie der Antike (1996–2002) is a well-known encyclopedia of the ancient world with contributions from established international scholars. The original German version has been translated into English since 2002. I will refer to the text using its German abbreviation (DNP) from here onwards.

[ ] Q: ’encyclopedia’ or ‘encyclopaedia’?
[x] Para. 39. Suggested revision: add comma after 'structure,' and replace ‘hypotheses that we already had’ with ‘existing hypotheses about the data.’ or simply, ‘existing hypotheses.’
[x] Para. 39. Suggested revision: replace ‘limited’ with ‘few’
[x] Para. 39. Suggested revision: replace ‘One the contrary’ with ‘Meanwhile’
[x] Para 39. The word 'may' is not needed in addition to the conditional ‘could’. Suggest delete ‘may’. ‘Another scenario could be that we find’
[x] Para. 39. Suggested revision: add 'associated' - 'groups of authors associated with many'
[x] Para. 39. Suggested revision: replace ‘still’ with ‘continue to rely’
[x] List, lines 45-52. Full stops / periods are not required after each list item
[x] Line 54. Suggested revision: replace ‘Consequently’ with ‘So, a single row …’
[x] Para. 62. I’m not sure if the word ‘posses’ is right here. They did not have scrape-able abstracts, or did not have abstracts at all? Suggested revision: However, some articles, particularly those published in older volumes, do not have abstracts, and thus were not scraped.
[ ] In your original sentence, Q: ‘scrapable’ or ‘scrape-able’?
[x] Para. 62. I’m not clear about the meaning of this sentence: ‘The same holds true for other contribution types such as reviews or miscellanea.’ What holds true? Their data was not scrape-able? Reviews and miscellanea have been left out of the dataset?
[x] List, lines 66-69. Full stops / periods are not required after each list item
[x] Line 71. Suggest: replace ‘Consequently’ with ‘So, a single row …’
[x] Line 79. Replace ‘with’ with ‘containing’ or ‘including’, […] a Jupyter notebook containing the code […]’
[x] Para 82. This paragraph briefly sets out the lesson's prerequisites of knowledge and understanding. However, the technological dependencies and operating system requirements are not defined anywhere in the lesson. Are K-means and DBSCAN suitable for use across Mac, Windows and Linux? Which environments has the lesson been developed and tested in?
[x] Heading Line 84. Italicise K-means
[x] Para. 85. Suggest replace ‘k-means’ with 'it': ‘which makes it an excellent model to start with,’
[x] Para. 85. Suggested revision: ‘Among others’ to ‘Among other capabilities’
[x] Para. 85. Suggested revision: ‘allows focusing on’ to ‘allows you to focus on’
[x] Para. 85. Suggest deleting ‘of’ to revise ‘while leaving noise points or outliers outside of the dense clusters’ to ‘while leaving noise points or outliers outside the dense clusters’
[x] Para. 85. Suggest addition: which is something that k-means cannot do independently
[x] Para. 87. Suggest addition: implementing other clustering algorithms in combination with scikit-learn […]
[x] Para. 87. Suggest revise ‘should be fairly easy’ to ‘should be fairly straight-forward’ to avoid repetition of easy/easily in this paragraph.
[x] Para. 87 Suggest choose between either ‘in general’ or ‘it is often’. Suggested revision: ’In general, it is advisable ...'
[x] Para. 90 Suggested revision: Machine learning is an artificial intelligence process by which computers can learn from data without being explicitly programmed
[x] Para. 90 Suggest add links: supervised, unsupervised, and reinforcement
[x] Para.92 Suggest add link: labeled data
[x] Para. 92. This is unclear: ‘predict already labeled data’ Suggested revision: ’One way to assess a supervised machine learning model's accuracy is to test it on some pre-labeled data, then compare the machine learning model's labeling predictions with the original output.
[x] Para. 92. Suggested revision: ‘Among others’ to ‘Among other things,’
[x] Para. 92. Suggested revision: ‘amount’ to ‘quantity’
[x] Para. 94. Suggested revision: ‘mostly works with unlabeled data.’ to ‘is applied to unlabeled data’
[x] Para. 94. Suggested revision: ‘Among others’ to ‘Among other things,’
[ ] Para. 94. Q: prelabeled or pre-labeled? appears twice in this paragraph
[x] Para. 94. Suggested revision: ’A good example are the datasets in our tutorial: We are only passing our model the author or abstract data,’ to ‘The datasets in this tutorial are a good example: we are only feeding our model either the author or abstract data,
[x] Para.99 Suggest add link: so-called naive k-means
[x] Para.99 This is unclear. Could you consider revising: ‘meaning that the cluster centers’ to ‘in which the cluster centers’
[ ] Para.99 Suggest add link to definition of ‘centroid’. I can’t find a straight-forward definition on Wikipedia, but I found this “A centroid is a data point (imaginary or real) at the center of a cluster” at https://www.fon.hum.uva.nl/praat/manual/k-means_clustering_1__How_does_k-means_clustering_work_.html Do you think this would be suitable?
[x] Para. 102. Suggested revision: ‘it is helpful to only consider very few features in this introductory part’ to ‘it is helpful to focus on a few key features in this introductory section’
[x] Para. 111. Suggested revision: ‘Since you mostly don't know how many clusters exist in your data’ to ‘In most cases, you will not know how many clusters exist in your data, so choosing the appropriate initial number of clusters is already a tricky question.’
[x] Para. 125. Suggested revision: ‘centroids are placed between’ to ‘centroids are positioned between’
[x] Para. 125. Suggested revision: ‘that we intuitively assume to represent one cluster’
[x] Para. 138. Suggested revision: ‘a change of words’ to ‘a change of word count’
[x] Para. 138. Suggested revision: ‘three authors who have all entries of approximately the same length in the DNP, but vary significantly concerning the number of their known works’ to ‘three authors who each have entries of approximately the same word count in the DNP, but who have a significantly different number of known published works’.
[x] Para. 138. Suggest: delete 'still'
[x] Para. 138. Suggested revision: the same number of documented works
[x] Para. 153. Suggested revision: ‘changing the number of words’ to ‘changing the word count’
[x] Para. 158. Suggested revision: There is no one-size-fits-all solution to this problem Q: unless, in this case, there is a one-size-fits-all solution?
[x] Para. 158. I intuit that ’n’ = number? If so, then it may be simpler to write: ‘right number of clusters for your data’, and in the following sentence ‘different numbers of clusters’
[x] Para. 158. Suggest add link: so-called elbow method
[ ] Para. 162. Q: should n be written as code?
[ ] Para. 162. This is unclear: ‘The inertia decreases with the n number of clusters to the extreme that the inertia is zero when n is equal to the number of data points’. Can it be broken into two sentences?

Suggested revision: ‘The inertia decreases with the number of clusters. The extreme is that inertia will be zero when n is equal to the number of data points’.

[x] Para. 169. Suggested revision: ‘Another possibility to evaluate’ to ‘Another possible way to evaluate’
[x] Para. 173. Suggested revision: ‘Plotting the average silhouette score of all data points combined with the silhouette score of each data point in a cluster’ … to ‘Plotting the average silhouette score of all data points against the silhouette score of each data point in a cluster’
[x] Para. 173. Suggested revision: ‘can help to evaluate the model quality and the current choice of parameter values’ to ‘can help you to evaluate the quality of your model and the suitability of your current choice of parameter values’
[x] Para. 175. Q: do the words ‘dummy’ and ‘fictive’ need to be in brackets?
[x] Line 196. k-keams I think this is a typing error: k-means
[x] Para. 200. With accessibility in mind, I think it might be best to adjust ’vertical red line’ to ‘vertical dashed line’
[x] Para. 206. Suggested revision: ‘classifying data points without any direct neighbors as noise points’’ to ‘classifying data points without any direct neighbors as outliers or ‘noise points’.’
[x] Para. 206. Suggested revision: ‘include dense regions with data points’ to ‘include dense regions of data points’
[x] Para. 211. Suggested revision: ‘the minimum amount of data points that should be located in the neighborhood of data point to define a region as dense’ to ‘the minimum number of data points that should be located in the neighborhood of a data point to define its region as dense’
[x] Para. 212. Suggested revision: ‘minimum amount’ to ‘minimum number’
[x] Para. 213. Suggested revision: ‘If an initial cluster was found’ to ‘If an initial cluster is found’
[x] Para. 213. Suggested revision: ‘they are also included’ to ‘they will also be included’
[x] Para. 213. Suggested revision: ‘In case border points are part of different clusters, they are associated with the nearest cluster’ to ‘In cases where border points are part of different clusters, they will be associated with the nearest cluster’
[x] Para. 216. Suggested revision: ‘A helpful method to find the proper eps value is explained’ to ‘A helpful method for finding the proper eps value is explained’
[x] Para. 216. These sentences seem fragmentary and are unclear when the paragraph is read: ‘In short, we can plot the distance between each data point in a dataset and its nearest neighbor. We then sort the distance in ascending order. Finally, we look for the point in the plot with the steepest ascent (which allows a visual evaluation of the eps value, which is quite similar to the elbow method in the case of k-means)’.

How about:

‘In short, DBSCAN enables us to plot the distance between each data point in a dataset and identify its nearest neighbor. It is then possible to sort by distance in ascending order. Finally, we can look for the point in the plot which initiates the steepest ascent and make a visual evaluation of the eps value, similar to the ‘elbow’ evaluation method described above in the case of k-means)’.

[x] Heading Line 220. Suggested revision: Applying K-Means on the Ancient Authors Dataset from Brill's New Pauly to Applying K-Means to the Ancient Authors Dataset from Brill's New Pauly
[ ] Line 224. Q: Should pandas be italicised?
[ ] Para. 277. Q: Is Euclidean distance the same as ε-distance (eps) (Line 211)?
[x] Para. 279. Suggested revision: 75th percentile value and 90th percentile range
[x] Para. 365. Suggested revision: ‘The algorithm first looks for a’ to ‘The algorithm first identifies a’
[x] Para. 365. Q: What is a ‘feature’?
[x] Para. 415. Suggest add comma: ‘cast the mean of the columns to approximately zero and the standard deviation to one, to account for the huge differences between’
[x] Para. 425. Suggested revision: As we are only dealing with ten features, we could theoretically do this manually. However, because we have already implemented a basic algorithm to help us find potentially interesting combinations of features, we can also use our progressiveFeatureSelection() function.

Replacing the word ‘since’ with ‘as’ and ‘because’

[x] Para. 425. Suggested revision: ‘was a random choice which unexpectedly led to some exciting results; however, this does not mean that there are no other promising combinations which might be worth examining.
[x] Para 459. Suggested revision: Yet, because the two other clusters are far below the average silhouette score for n=3 clusters, we decide to analyze the dataset with k-means using n=5 clusters.
[x] Para. 462. Suggested revision: I prefer plotting in two dimensions in Python, so we will use PCA() (Principal Component Analysis) to reduce the dimensionality of our dataset to two dimensions.

Deleting the word ‘since’ and adding ‘so’

[x] Para 466. Suggested revision: One huge disadvantage of using PCA is that we lose our initial features and create new ones that are somewhat nebulous to us, as they do not allow us to look at specific aspects of our data anymore (such as word counts or known works).

Deleting the word ‘since’ and adding comma + as

[x] Para 468. Suggest: deleting the word ‘however’ and beginning the sentence with ’42 is an arbitrary choice’. I feel that ‘however’ reads a little strangely when combined with ‘but’ in this sentence.
[x] Para 491. Suggested revision: ‘We were able to observe’ and ‘analyze whether their grouping’
[x] Para 493. Suggested revision: ‘only few known works’ to ‘very few known works’ and ‘it mostly consists of’ to ‘it largely consists of’
[x] Para. 508. Suggest revise this sentence: As we can see in this snippet showing the first ten entries in cluster 0, the authors' names (except Aesop) more or less support our initial hypothesis that we are mainly dealing with relatively unknown authors, particularly compared to the authors in cluster 4.

There are a lot of adverbs in this sentence. The combination of mainly, relatively and particularly make what is being said here quite vague. Can you be more specific?

relatively unknown authors = authors whose work is produced in few modern editions?

How about:

we are predominately dealing with authors whose work is produced in few modern editions, particularly compared to the authors in cluster 4.

[x] Para. 510. Suggested revision: The authors in cluster 4 (the less cohesive cluster at the upper right of our plot) comprise well-known and extensively discussed authors including Plato and Aristophanes, who have all written several works that are still famous and have remained relevant over the centuries, demonstrated by the high number of modern editions and commentaries.
[x] Para 526. Suggested revision: For instance, we could now take these clusters and apply our hypothesis about their relevance to further explore clustering the authors, based on their early and modern translations/editions. However, this is beyond the scope of this tutorial, which is primarily concerned with introducing tools and methods to examine such research questions.
[x] Para 529. Suggested revision: The second section of this tutorial will deal with textual data,
[x] Para 529. Suggest add link: TF-IDF
[x] Para 532. Suggested revision: Using a similar method as that used to analyze the DNP_ancient_authors.csv dataset, we will first load the RELIGION_abstracts.csv into our program and look at some summary statistics.
[x] Para. 550. I am still not clear on the definition of ‘features’…
[x] Para. 558. Suggest adding clarification of the verb ‘lemmitize’ because it is very unusual.

How about: As an optional step, I have implemented a function called lemmatizeAbstracts() that groups, or ‘lemmatizes’ the abstracts

[x] Para. 558. Suggest revise: Considering that we are not interested in stylistic similarities between the abstracts, this step helps to reduce the overall amount of features (words) in our dataset.

Q: features = words? Q: what is meant by ‘lemmatized version of the abstracts’? This is unclear.

[x] Para. 585. Suggest format the word argument as code at each mention because it is a Reserved Word.
[x] Para. 596. Suggested revision: in at least five different documents but in no more than 200.
[x] Line 598. Suggest add link: Principal Component Analysis
[x] Line 598. Suggest add link: reduce the dimensionality
[x] Para. 609. Suggested revision: setting the dimensionality to d=10 was a random choice that happened to produce promising results. However, feel free to play around with these parameters while conducting a more elaborate hyperparameter tuning. Maybe you can find values for these parameters that result in an even more effective clustering of the data. For instance, you might want to use a scree plot to figure out the optimal number of principal components in PCA, which works quite similarly to our elbow method in the context of k-means.
[x] Para. 619. Suggested revision: Next, we try to find a reasonable method for clustering the abstracts using k-means. As we did in the case of the DNP_ancient_authors.csv dataset, we will start by searching for the right number of clusters applying the elbow method and the silhouette score.
[x] Para. 623. Suggested revision: Therefore, let us continue by skipping the silhouette score plots (which are most likely of no value with such a huge number of clusters
[x] Para. 678. Suggest replace ‘albeit’ with ‘despite’
[x] Para 678. This is unclear. ‘able to assist us as a basic recommender system’ How about: ‘able to assist us by doing the work of a basic recommender system’
[x] Para. 678. Suggest add link: recommender system
[x] Para. 680. Suggested revision: Yet, as the textual data in this example is rather difficult to cluster and includes noise points or clusters that contain very few articles, it might make better sense to apply a different clustering algorithm and see how it performs.
[x] Para 686. Capitalise Euclidean
[x] Para 686. Suggested revision: If we were to use the initial TF-IDF matrix with 250 features, we would need to consider changing the underlying metric to cosine distance, which is more suitable when dealing with sparse matrices, as in the case of textual data.
[x] Para. 708. This is unclear. ‘fitting a DBSCAN instance under this circumstances’ ‘using a DBSCAN instance under these circumstances’?
[x] Para. 708. Word missing? ‘and a vast noise points cluster (-1) with more than 150 entries and an even bigger cluster with more than 500 entries’
[x] Para. 708. This is unclear. ‘As is also visible in the plot of these clusters in figure 12 (using a PCA-reduced dataset), this is not really helpful, and we should consider using the original TF-IDF matrix with cosine distance instead’. Could this be revised?

How about:

‘These clusters are plotted in figure 12 (using a PCA-reduced dataset), where the inconclusive results become even more visible. In this case, we could consider using the original TF-IDF matrix with cosine distance instead.

[x] Para. 710. Suggested revision: Its shortcomings aside, the current version of our DBSCAN instance does give some promising insights
[x] Para. 749. Suggested revision: Although the clustering was far from perfect in this case, it did produce some valuable information, which we could use in combination with the more promising results of the k-means clustering.
[x] Para. 749. Suggested revision: ‘Of course, we could also apply some other clustering algorithms and then combine the results’.
[x] Para. 751. Suggested revision: As a next step, we could pursue the idea of building a basic recommender system which suggests articles with similar topics to readers based on their previous readings.
[x] Para. 751. Suggested revision: When applied in combination, the rather unsatisfactory results of the DBSCAN model might be less problematic because they are now used as additional information only.
[x] Para. 753. Suggested revision: Of course, we as scholars in the humanities will be more likely to use these techniques as part of our research during the exploratory data analysis phase.
[x] Para. 753. Suggested revision: In this case, combining the results of different clustering algorithms helps us to discover structures and thematic clusters in our data.
[x] Para. 753. Suggested revision: an overview of research trends in the study of religion throughout recent decades.
[x] Bibliography Lines 762-764.

e.g.: https://www.chicagomanualofstyle.org/tools_citationguide/citation-guide-1.html Smith, Zadie. Swing Time. New York: Penguin Press, 2016.

Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. Concepts, tools, and techniques to build intelligent systems, 2nd ed. Sebastopol: O’Reilly, 2019.

Mitchell, Ryan. Web scraping with Python. Collecting more data from the modern web, 1st ed. Sebastopol: O’Reilly, 2018.

Patel, Ankur A. Hands-on unsupervised learning using Python: How to build applied machine learning solutions from unlabeled data, 1st ed. Sebastopol: O’Reilly, 2019.

[x] Footnote Para. 770. Suggest revise: particularly in the second plot on the first row

Hello @thomjur.

A couple of additional notes re: links. At the moment, you define a number of terms/link to a number of sources more than once in this lesson. Each technical term or resource only needs to be defined/linked to upon its first mention.

[x] https://en.wikipedia.org/wiki/K-means_clustering is linked in Paras.30, 85 and 119
[x] https://en.wikipedia.org/wiki/DBSCAN is linked in Paras.30, 85 (alternative link: https://www.machinecurve.com/index.php/2020/12/09/performing-dbscan-clustering-with-python-and-scikit-learn/), 206 and 209
[x] https://www.tandfonline.com/action/aboutThisJournal?show=editorialBoard&journalCode=rrel20 is linked in Paras.32, 62 and 529 (alternative link: https://www.tandfonline.com/toc/rrel20/current)
[x] https://requests.readthedocs.io/en/master/ is linked in both Paras.41 and 62
[x] https://www.crummy.com/software/BeautifulSoup/bs4/doc/ is linked in both Paras.41 and 62
[x] https://pandas.pydata.org/ is linked in Paras.41, 62 and 82
[x] https://github.com/programminghistorian/ph-submissions/tree/gh-pages/assets/clustering-with-scikit-learn-in-python is linked in Para.41, 62, 79 and 524
[x] https://en.wikipedia.org/wiki/Mean is linked in both Para.82 and 117
[ ] https://en.wikipedia.org/wiki/Silhouette_(clustering) is linked in both Para.94 and 169
[x] https://en.wikipedia.org/wiki/Elbow_method_(clustering) is linked in para 94. I have suggested another link in 158, which you may like to ignore, but in this sentence I was struck by the words 'so-called', which I feel imply that the term is commonly used incorrectly or inappropriately?
[x] https://en.wikipedia.org/wiki/Principal_component_analysis is linked in Para.462, and I have mistakenly suggested another link in 598 which again you can ignore. Perhaps the definition of 'dimensionality' which I have suggested in this paragraph would also be better placed in Para.462?

Let me know what you think.

Hi @Anisa-ProgHist ,

Thank you so much for all the work you put into copyediting this article! This looks great, and I agree with (almost) all your revisions.

First of all, let me respond to the first two issues you raised:

Indeed, it should be mentioned that the data is not freely accessible (actually, I thought I had already done so in one of my footnotes... but I have not...). Maybe we can add a sentence to the first paragraph in the DNP introduction right after "... is a well-known encyclopedia of the ancient world with contributions from established international scholars." Something like: "It should be noted that access to the texts (and thus the data) in the New Pauly is not free of charge. I used my university's access to obtain the data from the author entries. For the following analyses, I have not copied any texts from the New Pauly to the dataset. The numerical data in the dataset was extracted and partially accumulated from the author entries in the New Pauly." Please feel free to change this sentence or add some additional information. Also, I agree with your suggestion to add "I will refer to the text using its German abbreviation (DNP) from here onwards." to the paragraph.
Regarding the second point, I think I understand what you mean. However, I assume that people reading this tutorial already know that Python is a programming language that can be used (interpreted) on different operating systems (Linux, Windows, Mac OS, etc.). Since I am using Python (and scikit-learn is a ML framework for Python), I am not sure whether this information is necessary. But I am happy to add this information if you think it's necessary. Maybe it's helpful to ask the @hawc2 what he thinks about this?

Concerning the "minor" issues you mentioned:

I would like to leave the language-related decisions to you. I am (obviously) not a native speaker, so I think it would be better if you decide whether it's "Q:' encyclopedia' or 'encyclopaedia'?" or" 'scrapable' or 'scrape-able'?" I am fine with whatever version you choose.
Para. 62: Yes, this is awkward. It indeed means that "Reviews and miscellanea have been left out of the dataset." Maybe replace with: "I have also excluded other contribution types such as reviews or miscellanea from the dataset."?
Para. 99: I agree with your suggestion. Also, we can add the link to the page you found (I took the term centroid from the book mentioned in the bibliography, but this does not really help).
Para. 162: Good question. I changed this back and forth when writing the tutorial. I think it might make sense to either use italics or monotype fonts (?) such as in the code highlighting. Furthermore, I agree with splitting the sentence into two parts. I am happy with your suggestion.
Para. 175: You are right; I think we should delete the brackets.
Line 224: Should pandas be italicized. This is also something I was not entirely sure about. However, if you think it's better to italicize pandas here, we also have to italicize the other modules/packages such as scikit-learn, requests, etc., that appear in the text.
Para. 277. Q: Is Euclidean distance the same as ε-distance (eps) (Line 211)? Yes and no. According to the scikit-learn documentation (sklearn.cluster.DBSCAN — scikit-learn 0.24.2 documentation), they are using Euclidean distance as the default metric for the ε-distance. However, the DBSCAN "ε-distance" can also apply a different (non-Euclidean) metric (and so does k-Means, by the way). If I remember correctly from my Math classes in university, epsilon is often used when talking about distances/neighborhoods regardless of the actual metric in use (for instance, in calculus; we used to call this "epsilon Umgebung" in German). But this leads too far, I guess, and I am not pretending to be an expert here.
Concerning the "feature" problem: I am briefly (and implicitly) introducing features when describing the datasets right in the beginning of the tutorial. But we could also simply refer to this Wikipedia entry when the term "feature" is first mentioned: Feature (machine learning) - Wikipedia. What do you think?
Para. 558: Maybe we can change the sentence to "In the following analysis, we will continue working with the abstracts that have been lemmatized with the help of the lemmatizeAbstracts() function."?
Concerning the links: Of course, we can only link each term when it is first mentioned.

I have not changed/accepted anything yet, since I feel that there are still a few things that we need to discuss/agree on first. Afterward, I am happy to implement the changes myself if you would like me to. Just let me know whatever works best for you.

Again, thank you very much for your thorough revision of my tutorial! Great work!

All the best from Bochum, Thomas

Thank you for your swift and thoughtful replies, @thomjur.

@hawc2 and I are meeting next week, so perhaps he and I can talk through the points above, and come back to you with our thoughts on these remaining questions.

I appreciate your time on this, as well as your willingness to discuss the suggested adjustments.

Very best for now, Anisa

@thomjur it would be helpful to add a few more sentences describing how to get the code up and running in a Jupyter notebook.

Where you say, "You can download both datasets as well as a Jupyter notebook containing the code we are writing in this tutorial from this GitHub repository," you could tell the reader they can learn more about getting started by looking at this PH tutorial: https://programminghistorian.org/en/lessons/jupyter-notebooks.

Maybe also say something like "This lesson will work on any operating system, as long as you follow these instructions to set up an environment with Anaconda or Google Colab to run the Jupyter notebook locally or in the cloud."

Thank you, @hawc2. Hello there, @thomjur. Thank you for your patience.

[x] 1. I do think it would be useful to clarify that the data is not openly available. You could incorporate these sentences to follow para 37 where you are setting up your First Case Study. By this stage, you have established that you are using the abbreviation DNP.

A slight adjustment to your suggestion:

It should be noted that access to texts (and thus data) in the DNP is not free of charge. I was able to obtain the data discussed in this Case Study via my university’s institutional access provision. However, I have not copied any texts from the DNP to build this dataset. Numerical data used in the following analyses was extracted and partially accumulated from the author entries.

[x] 2. I think it would be useful to clarify the system requirements for this lesson. As Alex suggests (above), we think it would be good practice to add a short description of the technical context within the Introduction. In this case, Alex thinks that citing another PH lesson that would support a reader to gain the skills they'll require would be helpful.

—

[x] 1. Let’s stick with ‘encyclopedia’ in para. 37, as this seems to be the standard American English spelling. This will be consistent with your other choices, analyze, initialized, normalize, etc.
[x] 2. Para. 62. I mentioned that I wasn’t sure about the choice of the word ‘posses’ here, because it is unclear whether you are saying that the articles did not have scrape-able abstracts, or did not have abstracts at all. I have suggested this revision: However, some articles, particularly those published in older volumes, do not have abstracts, and thus were not scraped. If you are happy with my suggestion, then we avoid my question of ‘scrapable' or 'scrape-able’. With your clarification, I think the final sentence of para.62 could be adjusted to read: Other contribution types, including reviews and miscellanea, have also been excluded from this dataset.
[x] 3. Para. 99. I’m pleased if you’re happy with the link I suggested to provide readers with a definition of ‘centroid’.
[x] 4. Para. 162. Alex and I talked about this and agree that n is not to be written as code, so we can leave that as it is. I'm pleased that you're happy with my suggestion to spilt the sentence about inertia into two shorter ones.
[x] 5. Para 175. We’ve agreed that the brackets around the words ‘dummy’ and ‘fictive’ can be removed.
[x] 6. Line 224. Alex and I talked about this too, and agree that pandas does not need to be italicised.
[x] 7. Para. 277. I think it sounds as though part of your explanation here ”according to the scikit-learn documentation (sklearn.cluster.DBSCAN — scikit-learn 0.24.2 documentation), they are using Euclidean distance as the default metric for the ε-distance. However, the DBSCAN "ε-distance" can also apply a different (non-Euclidean) metric” could be usefully incorporated into the text. As a (non-specialist) reader, I was unclear about whether these two were interchangeable.
[x] 8. Para. 365, let’s add a link to the Wikipedia definition of ‘feature’ in the context of machine learning, as you suggest.
[x] 9. Para. 558, my suggested revision was just intended to clarify the meaning of the verb ‘lemmitize’ which I have never heard before. Are you happy with my suggested addition of that groups, or ‘lemmatizes’ the abstracts? Otherwise, I think that sentence is fine and does not need changing.
[x] 10. Excellent. I’m pleased you agreed with us about the links. When the final adjustments are complete, I can go through and take out the repetitions, ensuring that links are associated with the first mention of a technical term/concept only.

If you are able to deal with the two general points 1 and 2 above, then I am happy to make the other agreed changes 1-10 below. Let me know if this sounds good to you.

Very best, Anisa

@hawc2 @Anisa-ProgHist

Thanks for the suggestions! I've uploaded a new version including few more changes. I left those issues that you both need to decide.

[x] Added your suggestion by adding the following lines to the section you mentioned:

This lesson will work on any operating system, as long as you follow these instructions to set up an environment with Anaconda or Google Colab to run the Jupyter notebook locally or in the cloud. If you do not know how to set up a Jupyter notebook locally, this excellent PH tutorial might help you get started.

[x] Added the following phrase:

It should be noted that access to the texts (and thus the data) in the New Pauly is not free of charge. I used my university's access to obtain the data from the author entries. For the following analyses, I have not copied any texts from the New Pauly to the dataset. However, the numerical data in the dataset was extracted and partially accumulated from the author entries in the New Pauly. The original German version has been translated into English since 2002. I will refer to the text using its German abbreviation (DNP) from here onwards.

[x] Replaced "The same holds true for other contribution types such as reviews or miscellanea." with "I have also excluded other contribution types such as reviews or miscellanea from the dataset."
[x] Deleted brackets from "fictive" and "dummy."
[x] Added a link to the Wiki "Feauter (ML)" entry when the term is first mentioned.
[x] Changed ‘meaning that the cluster centers’ to ‘in which the cluster centers’
[x] Deleted the redundant links (however, I think that some of them are actually necessary, see the Wiki link to k-means in line 119. What do you think? I also added you changes to the links (for instance, linking dimenionality reduction in line 462)

Okay! Thank you, @thomjur!

I'll take another look.

Very best, Anisa

@Anisa-ProgHist Thanks! I think I'll leave it to you for the moment. I've adjusted the parts you asked for, I guess. however, you might still need to replace the part about the accessibility of the data with your revised version. But I'll stop working on the text for the moment, does not make much sense if we both work on the text at the same time. Thank you for all your help and let me know if there's anything else I can do!

Thank you for your work on this, @thomjur!

I have made some final small adjustments to the lesson file, which you will soon be able to review via the Commit History. The changes I've made include:

adding a link to the definition of TF-IDF at paragraph 77, and removing it from paragraph 529
removing the link to the definition of hyperparameter tuning in paragraph 94, as it is already available in paragraph 92
adding a link to the definition of centroids to paragraph 99, as discussed above
removing the link to the definition of elbow method at paragraph 158, as it is already available in paragraph 94
In paragraph 99, I'd like to suggest adding a link to the wikipedia definition of k-means++, rather than repeating the link to the Stanford article. I have adjusted the wording very slightly to reflect that: "I recommend reading this article by David Arthur and Sergei Vassilvitskii if you want to learn more."
You're absolutely right, in places the sentence structure means that it makes best sense to repeat a link, for example, as in paragraph 119 where you refer directly to a Wikipedia entry. This is also true at paragraph 169 "Silhouette (clustering)" where the link is repeated from 95, as well as at paragraph 209 where the DBSCAN wikipedia article link is repeated from paragraph 30.
On reflection, I think repeated links are also necessary in sentences where you are directing a reader to assets or data files associated with the lesson. Therefore, I have added back in the links to our GitHub repository at paragraph 79 and 524.
[ ] The only remaining aspect for you to consider relates to paragraph 277 and 211. Do you feel it could be useful to include an additional sentence to explain the relationship/difference between Euclidean distance and ε-distance? Or would you rather leave this as it is? (As I said, I am a non-specialist reader, and we have provided a link to the wikipedia definition of Euclidean distance in paragraph 115, so readers will have access to the information).

Almost there!

N.B.

One other thing you will notice, is that I have substituted all live links in the lesson for perma.cc archived links (excluding those formatted within tables and code). This is a new part of our publishing workflow, which will improve the sustainability of our lessons. If you follow one of the links, for example this one to the wikipedia definition of k-means clustering, you will be able to view the archived page. If you click "Show Record Details" in the upper left of the header banner, you will be able to see the original source page URL, alongside some basic metadata.

Thank you for this, @Anisa-ProgHist! This looks great!

Regarding the remaining issue: I am not sure if we should make things too complex, but it might be helpful to, at least, mention the issue you raised by adding a short statement. I was thinking of something like:

The first step consists of defining an ε-distance (eps) that defines the neighborhood region (radius) of a data point. Just like in the case of k-means-clustering, scikit-learn's DBSCAN implementation uses Euclidean distance as the standard metric to calculate distances between data points.

What do you think, @Anisa-ProgHist and @hawc2?

Have a lovely weekend!

Thank you, @thomjur! I think that's helpful, and I have made this final change. @hawc2 has already provided the file directory locations, and shared your author bio in his comment above, so I think we are ready to hand this lesson over to @svmelton.

Great! I'll start working on this and let you know when it's been published.

We're live! Thanks to everyone for their great work on this. @hawc2 we're ready to add this to the twitter spreadsheet whenever you're good to go, let me know if you have questions.

Thank you all for the tremendous work you put into the publication of this paper. I am happy that it's out there!

Cheers, Thomas

@svmelton @anisa-hawes Just one question: the DOI https://doi.org/10.46430/phen0092 seems to point to the article "Text Reuse ..."; is this only a placeholder or did you assign the wrong DOI? Thanks!

Thank you, @thomjur! Apologies! @svmelton has already noted this, we are working together to resolve it today.

programminghistorian / ph-submissions

Unsupervised Learning and K-Means Clustering with Python #325

Anti-Harassment Policy