programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions

Unsupervised Learning and K-Means Clustering with Python #325

Closed hawc2 closed 2 years ago

hawc2 commented 3 years ago

The Programming Historian has received the following tutorial on 'Unsupervised Learning and K-Means Clustering with Python' by @thomjur. This lesson is now under review and can be read at:

Old slug: https://programminghistorian.github.io/ph-submissions/lessons/k-means-clustering-with-scikit-learn-in-python

Final url for submission lesson: https://programminghistorian.github.io/ph-submissions/lessons/clustering-with-scikit-learn-in-python

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me.

Our dedicated Ombudsperson is Ian Milligan (http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, ask questions, make suggestions, or request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks on community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. Thank you for helping us to create a safe space.

hawc2 commented 3 years ago

This is a solid tutorial @thomjur. It’s very useful for exploring and teaching a specific method for data analysis. There are places I identify below where clarifications and reorganizations will help guide the reader. The key revisions to make before I send this out for peer review are:

Comments organized by Section follow below. Can we aim to receive a revised draft back from you by January 31st? Please note Programming Historian editors will be taking a two-week break starting Monday, December 21st until January 4th.

I think any comments I would make about the latter half of the tutorial would mostly reinforce what I’ve said above. Overall the code works, the analysis makes sense, and you deal with all the technical issues fluently. At times, you could afford to provide more brief statements defining terms and summarizing ideas. If you get to the code sooner in your tutorial, some of the opening sections explaining K-Means Clustering would be easier to follow, as you could cite the specific example in your code to explain the ideas behind the analysis.

The last two sections, “Standardizing the data” and “Summary” both seem too long without enough signposts guiding the reader. Add more subsections – for instance, you could include a separate section on “How could we proceed from here?” where you could also discuss some of the alternative approaches you mentioned earlier in the tutorial. These last sections are really the most important part of the tutorial, so guiding the reader through these steps carefully, explicating ideas and the code in more detail, is what will be most important.

thomjur commented 3 years ago

Dear Alex @hawc2,

Thanks again for your constructive comments. I agree with all of them (ok, one tiny exception, see elbow-method).

Let me start with the four major issues you mentioned (plus some general comments):

Responses to section comments:

Thanks again for your help, and I am looking forward to your comments!

Best, Thomas

thomjur commented 3 years ago

Added a revised version of the tutorial (minor changes, nothing content-related). Exchanged two images (axis labels were cut off in the former version).

hawc2 commented 3 years ago

Thanks @thomjur. We're sending this out for peer review. You'll hopefully receive feedback in the next month or two.

hluling commented 3 years ago

Alex @hawc2, thanks for the opportunity to review this tutorial. @thomjur, this is a very useful and accessible tutorial about using k-means clustering in Python. I believe someone who is not a humanities scholar and wants to learn k-means clustering will also be interested in reading this tutorial. This tutorial will contribute to the goals of Programming Historian. The draft is in good shape, and I have the following comments.

Major issues:

Minor issues:

melaniewalsh commented 3 years ago

Hi @thomjur, I want to begin by highlighting things that are working well in this tutorial. Your explanation for how to run k-means clustering with scikit-learn is mostly sound and clear — well-done. I also like that you use and discuss variable scaling — that's great! Lastly, I appreciate that you are not using random or artificial data but a dataset that has an actual connection to the humanities.
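
For readers following this thread, a minimal sketch of the scaling-plus-clustering workflow being praised here, assuming scikit-learn (the data and variable names are illustrative, not the lesson's own):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Illustrative numeric feature matrix: rows are observations, columns are features.
X = np.array([[1.0, 200.0], [1.2, 210.0], [8.0, 40.0], [8.3, 38.0]])

# Scale first so that no single large-valued feature dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means with a chosen number of clusters and read off the assignments.
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
print(labels)  # one cluster label per row, e.g. [0 0 1 1]
```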

However, I believe that there are some significant issues with this tutorial (as it is currently written), and they mostly stem from the choice of dataset.

Big points:

Smaller points:

hawc2 commented 3 years ago

Thanks to @hluling and @melaniewalsh for your thorough reviews of this lesson. @thomjur, I'll leave it to you to address the minor revisions, as they seem straightforward. As for major revisions, @hluling's one suggestion seems easily doable. @melaniewalsh's revision advice will require some change to the lesson's structure and length, but I think you can do it without rewriting the lesson entirely.

It is ok if you find it too difficult to change the dataset for this lesson, but it is still important that you explain why you chose this dataset in particular, and how K-Means clustering does and doesn't work for this dataset. More importantly, Melanie's suggestion that you use TF-IDF to enhance your analysis seems worth doing.
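
To give a sense of what that TF-IDF step could look like, a hedged sketch with scikit-learn (the toy documents and parameters below are illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "ancient greek historians and their sources",
    "ritual practice in early buddhist communities",
    "roman poetry and its greek models",
]

# Turn the raw documents into a TF-IDF weighted document-term matrix.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Cluster the documents on their TF-IDF vectors (KMeans accepts sparse input).
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(tfidf)
print(labels)
```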

Along the way, you can clarify how this process can be used to answer research questions with K-means clustering of textual features. In the end, the most important thing isn't that this lesson perfectly answers its mock research question; what matters is that it explains how researchers could use the method to answer various types of research questions, with varying degrees of success.

Your lesson is currently only approximately 4,000 words, so you have a lot of room to expand as necessary. I'm optimistic that some additions to the lesson to address the reviewers' concerns will result in a lesson that will prove more persuasive to readers trying to decide when to use K-Means clustering over other algorithmic methods for answering their research questions.

In particular, I think the conclusion could be expanded to better address the issues reviewers raise, giving more time to explaining what is and is not useful about K-Means clustering. Throughout the lesson, you could also weave in more comparisons of K-means with other methods; this will help draw out for the reader why this method is appropriate. As Melanie says, it seems less necessary to discuss in detail the differences between supervised and unsupervised machine learning, and more valuable to spend time differentiating K-Means from other algorithms, like PCA, that offer comparable analyses.

Please try to make your revisions in the next month, or two at most. Let us know if you'll need more time, and please feel free to add here any further thoughts you have as you proceed with revisions.

thomjur commented 3 years ago

@hluling @melaniewalsh @hawc2

Thank you all for your constructive suggestions. I am happy to revise my tutorial. However, I am still struggling with the major issue raised by @melaniewalsh. Even though I like the idea of using TF-IDF and actual documents for K-Means clustering, this would mean changing the entire article, because such an analysis (including TF-IDF) is simply not possible with the current dataset (which mimics customer data, a popular example used to illustrate clustering algorithms).

Thus, as far as I can see, I need to change the dataset in order to fully address all your suggestions (since, as I said, implementing TF-IDF is simply not possible with the current dataset). I am willing to do this, but I will first need to think of a dataset that is neither too complex for a tutorial (which would be the case with thousands of documents of very different sizes from Brill's New Pauly) nor poses any legal issues (which would most likely be the case with scraping all articles from Brill's New Pauly as well; it's a protected website with paid content). Of course, it should also actually make sense to analyze the data. I will mull over this question for a while and see if I can come up with something. I have some first ideas, but I am not sure whether they will actually work. Of course, I will then need to rewrite the tutorial (which is based on the current dataset) and change the code, which is fine. However, I am not sure if I can do this within a month or two (family, covid, and a non-existent vaccination strategy in Germany are a bad combination). Still, I honestly like the idea of using documents with K-Means clustering, so I'll see what I can do (maybe I'll have some sort of inspiration). Ooohhhhmmmm....

I will keep you updated, and thanks again for your evaluation!

hawc2 commented 3 years ago

Thanks @thomjur for clarifying. I'd be curious to hear what @melaniewalsh or @hluling would say. Please chime in if you have any recommendations or further feedback.

For my part, I think it's possible you could work with this dataset and find another way to make your analysis and results more robust. You do a good job in this tutorial of discussing how to pick a number of clusters, but the transitions between method explication and dataset seem too quick, leaving it vague why K-Means clustering is right for this particular dataset and these research questions. I do wonder if there's another approach with the current dataset that might work as a complementary or subsequent step.

Your section, "How to Proceed?", in particular, opens more questions than it answers, and is one place I'd encourage you to expand the lesson. You write, "Usually, it is a good idea to implement different clustering algorithms, such as hierarchical clustering and k-means, and to compare their results. Even though we have skipped this part in this tutorial, it should be pretty easy for you to add other algorithms now that you have understood how to work with scikit-learn." To me, this is something this tutorial should go into in some detail, to show why K-Means is appropriate here. Perhaps that is an alternative to a new dataset/TF-IDF route, if that proves too difficult to rewrite into the lesson.
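
As an illustration of the comparison that passage recommends, a small sketch, assuming scikit-learn and synthetic data in place of the lesson's dataset:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(100, 4))  # stand-in for a standardized feature matrix

# Run k-means and hierarchical (agglomerative) clustering with the same k.
km_labels = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(X_scaled)
hc_labels = AgglomerativeClustering(n_clusters=5).fit_predict(X_scaled)

# A score near 1 means the two algorithms largely agree on the partition.
print(adjusted_rand_score(km_labels, hc_labels))
```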

Regardless, considering this will require major revisions, it is fine if you need more than two months to get the final draft in shape. Keep us updated on your progress, and have a good spring!

melaniewalsh commented 3 years ago

@thomjur @hawc2 @hluling

I hope it's ok that I discussed this tutorial in some detail with my research group; they also feel that the tutorial is solid except for the choice of dataset. I still think that incorporating TF-IDF could be useful and cool, but I think what's most important is that whatever dataset you choose has real clusters in it (e.g., in the plot, there would be space between the clusters). One of the drawbacks of K-means clustering is that it will identify clusters even if your data does not contain any, as discussed at the beginning of this StackOverflow answer. And, from my perspective, it's important that readers know what kind of data is appropriate to use with this technique.
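
The pitfall described here is easy to demonstrate; a quick sketch with synthetic, cluster-free data (not from the lesson):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.uniform(size=(300, 2))  # uniform noise: there are no real clusters here

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

# k-means nonetheless returns three non-empty "clusters" and an inertia value.
print(np.bincount(kmeans.labels_))
print(kmeans.inertia_)
```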

I'm sorry that this suggestion requires more revision work in such a difficult year, and I can empathize with wanting to get this piece published more quickly because there's so much good material here, but that's my take on the tutorial. @thomjur if you choose to go with another dataset, I would be happy to brainstorm or discuss alternatives. I'm excited about this tutorial and think it could be really useful for the community!

thomjur commented 3 years ago

Thanks a lot for your encouraging response, @melaniewalsh!

No worries, I am happy about every kind of feedback that I can get, so I am glad that you discussed my draft in your group. And I am also excited to revise my tutorial. The more I think about it, the more I have to agree with what you suggest. My "lazy" defense, so far, was that real-world data oftentimes does not include strictly separated clusters. But even though this might be true, it does not make sense to use such data in a tutorial. Still, I think that it is, generally speaking, reasonable to use K-Means with such data, even if only to demonstrate that your customer/author data does not include any clusters (and I agree, two to five outliers are not really a "cluster"), which could also be a valuable insight.

I will think some more about textual data that could be interesting to analyze with K-Means. Due to legal aspects, I tend towards Wikipedia; yet, Wikipedia is also somewhat overused, although maybe not to the same extent as Twitter data.

I am also struggling with this whole data/privacy thing... I am not sure if this is due to my German angst background (since Germans tend to exaggerate this whole data privacy/protection stuff, IMHO), but I see potential problems/lawsuits everywhere.

hawc2 commented 3 years ago

Glad to hear we're coming to a consensus.

@thomjur, as far as copyright/privacy goes, feel free to email me some of your thoughts. I can guide you through deciding about those copyright/privacy issues, and I'll consult with other Programming Historian editors if necessary. Wikipedia is certainly safe, and not infrequently used by PH, in part because it is more likely to ensure sustainable lessons. The ideal dataset would be one familiar to a large audience, and appropriate to produce meaningful clusters with K-means.

hluling commented 3 years ago

@thomjur @hawc2 @melaniewalsh Sorry to jump in late. I'm glad to see the consensus too. I just wanted to point out that clustering techniques are exploratory in nature. I believe this tutorial reflects a common process whose goal is to explore any possible patterns without knowing whether there are 'real' clusters or not before running the analysis. I agree with the StackOverflow post mentioned by @melaniewalsh that we should examine the required assumptions before running an analysis. However, when there are four or more variables so that it is difficult to visualize the raw data, checking all the assumptions for K-means is not feasible (maybe there are formal tests of these assumptions that I don't know). So my position is "we won't know until we try something out." The current dataset and the exploratory research question, as shown in the current version of the tutorial, do not preclude the use of K-means clustering. That being said, I do agree with @melaniewalsh @hawc2 that the ambiguous result may be problematic for publication due to a lack of coherent narrative. So I'm excited to see a revised version of the tutorial no matter which suggested route is taken.

thomjur commented 3 years ago

Just a quick update from my side: I hope to send you a revised version of my tutorial by the end of this week.

thomjur commented 3 years ago

@hawc2 @melaniewalsh @hluling

I have just uploaded a revised version of my tutorial. Since I have changed major parts of the tutorial, I will not respond to every minor point mentioned by the reviewers.

Major/Minor changes:

I am looking forward to your comments!

There seems to be a problem displaying the tables. Is there a way to fix that, @hawc2? If not, I could simply delete the longer ones.

hluling commented 3 years ago

@hawc2 @melaniewalsh @thomjur I’m glad to see the added text-based dataset and the much-expanded discussion of the different techniques involved in clustering. I like the added visualization in the section that explains how k-means clustering works and why standardization is needed, as well as the added parts on the silhouette scores. I also appreciate the detailed explanation of DBSCAN.

I think this version meets publication standards as a tutorial. I just have the following comments for improving clarity.

Major issues:

Minor issues:

melaniewalsh commented 3 years ago

@thomjur @hawc2

I think this revision improves the tutorial a lot. The Greco-Roman author clusters are more coherent, the use of PCA is nice and very helpful, and there seems to be better communication of how clustering works with real data vs artificial data (e.g., "As we can see, there is no real elbow in our plot this time"). The author has clearly put a lot of work into this revision. Well-done!

Similar to @hluling, I like the addition of text data, TF-IDF, and DBSCAN. However, it also makes the tutorial lengthier and more complicated. I hesitate to suggest this because I know it would be even more work, but it occurred to me that this version of the tutorial could usefully be split into two parts: Part 1 (K-means) and Part 2 (DBSCAN). That could be one way of capitalizing on all this great material while also paring things down for two easier, more focused reads with two different datasets and clustering methods. Would that be possible?

My comments will mostly focus on the K-means section, but the general concepts apply to the DBSCAN section as well.

Major issues

Maybe the possible research questions could come after the sentence "The original German version of the DNP has been translated into English (starting in 2002)." It might also be helpful to mention/discuss specific authors more, to ground the research questions even further.

Minor issues

thomjur commented 3 years ago

Thank you for your constructive criticism, @hluling @melaniewalsh @hawc2.

I agree with most of your comments, and I will try to implement them as soon as possible. Most of them are pretty straightforward (adding axis labels, adding a couple of sentences here and there, etc.).

However, Melanie in particular mentioned a couple of things that I would like to discuss first.

  1. I'd be happy to add some additional sentences or even tables to the analysis. Yet, I am somewhat hesitant to dive too deep into the discussion of these still mock-up use cases. This is simply not a research article, and although I find the results sort of interesting too, I am not sure whether this text isn't already too long for a tutorial (depending on whether readers are looking for a concise recipe for how to use clustering or want to know more about Greco-Roman authors).

  2. Regarding the second point mentioned by Melanie: I personally thought that it might be a good idea to explain the concepts based on a reduced and somewhat adapted dataset. This allows showing the general idea behind these methods more clearly. Also, including only two to three dimensions (features) makes the visualization a lot easier. This section is meant to show "how it is supposed to look" (in an ideal world with perfect data), and the actual analysis "what it usually looks like." But I am curious what @hawc2 and @hluling think about this. It would also imply another substantial rework of major parts of the tutorial.

  3. Should I delete the section "How Does Clustering Work?" I am open to doing so; just let me know what you think (it's becoming more and more difficult for me to keep an "objective" view of my tutorial).

Thanks in advance for your help!

hluling commented 3 years ago

@thomjur @melaniewalsh @hawc2

hawc2 commented 3 years ago

Hey all, it sounds like the remaining revisions are clear and agreed by all.

As far as walking through some introductory info about clustering using a toy (reduced) dataset, I think that's ok to do, so long as you explain what you've described in your response here, telling the reader why you are using a smaller dataset to explain certain concepts, and pointing out how this demo analysis differs slightly (for instance, in the number of features) from the fuller analysis you'll explore later.

I can understand why it might not be worth it to weave this part into the broader exploration of your main dataset, even if it would make the lesson a little more elegant, but it will be important at the very least that you give signposts so readers don't get led off track or get confused about how to connect different sections of the tutorial.

Thanks, everyone, for your time contributing to such a generative conversation about this important tutorial.

melaniewalsh commented 3 years ago

Thanks for your follow-up comments, @hawc2 and @hluling.

Also @thomjur, to address your first question about whether or not to briefly show/discuss results, I still think it would be useful even though this isn't a research article because it will help show people how to interpret their own results. And I think that's part of what distinguishes Programming Historian from any other generalized scikit-learn tutorial on the internet. For a good model of briefly discussing results/interpretation, you might look at Matt Lavin's TF-IDF tutorial.

thomjur commented 3 years ago

@hawc2 @melaniewalsh @hluling

I have now revised parts of my tutorial. Please let me know what you think. Note that some of the images do not seem to be displayed correctly (maybe it takes some time to refresh gh pages); also, the problem with the tables remains. It might make sense to wait a couple of days before you go through the tutorial again. I still have to adjust a few things. It's just that I had some spare time today and wanted to proceed with my tutorial.

Major issues

Minor Issues

thomjur commented 3 years ago

Just a short update: The images should now be displayed correctly. I have also added a couple of sentences to the DBSCAN discussion and the summary, related to @hluling's suggestion to discuss/propose potential outcomes of the clustering.

thomjur commented 3 years ago

@hluling @hawc2 @melaniewalsh

I have now revised my tutorial. In addition to the points mentioned above, I have also added some paragraphs to discuss potential further steps in the DBSCAN section (see @hluling's comments). I have also corrected some other mistakes; for instance, I realized that I was first talking about using n=3 clusters in the first part of the k-means clustering of the ancient author data and then suddenly changing to n=5 clusters in the actual analysis. I hope that I have now corrected everything to n=5.

Also, I was wondering about my use of "standardization" and "normalization." I think it's ok to use both terms, since standardization is z-score normalization to my understanding, but maybe @hluling, as a trained statistician, could say some more about whether this use is correct, misleading, or simply wrong (I've made it through half a BA program in computer science by now, but I have not taken the - optional - class on statistics so far^^).

Thanks in advance for your help, and I am very much looking forward to your comments!

hluling commented 3 years ago

@thomjur Thanks for the updated version and some of the clarifications. I'm not a trained statistician, but I think the term 'standardization' better describes what you did based on para. 33. I think normalization typically means rescaling values to [0, 1].
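
In scikit-learn terms, that distinction reads roughly as follows (a tiny illustration, not code from the tutorial):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Standardization (z-scores): mean 0, standard deviation 1.
print(StandardScaler().fit_transform(X).ravel())

# Normalization in the min-max sense: values rescaled to [0, 1].
print(MinMaxScaler().fit_transform(X).ravel())
```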

Two more comments:

  1. In para. 100, regarding choosing the number of components in PCA, it may be useful to add a footnote saying that the elbow method also applies here (see the sketch after this list).
  2. Thanks for elaborating on some future steps generated from the DBSCAN results. You pointed out that "combining the results of different clustering algorithms also helps to better discover structures and thematic clusters in our data" (para. 121). It seems that a more direct comparison between k-means (Case Study 2, Sec. 4) and DBSCAN (Case Study 2, Sec. 5 [there is a typo in the section number here, btw]) would be important. Therefore, I wonder whether you have tried running k-means with far fewer clusters. You started with n = 100 for k-means (Case Study 2, Sec. 4), but choosing such a large number of clusters may not be helpful (even though you interpreted 3 clusters successfully to some extent). Speaking of Figure 10, I wonder whether an elbow-like pattern may appear with far fewer clusters. I'm curious about the result with a much smaller n for k-means, because then we can have a more direct comparison between k-means and DBSCAN by using figures and analyzing the titles within each cluster. It seems that how noise/outliers are dealt with is the major distinction between k-means and DBSCAN. So if we have a figure from k-means, and a figure from DBSCAN, with the same or a similar number of clusters, then we may be able to see whether it makes sense to categorize an article as a member of a cluster (k-means) or as noise (DBSCAN). In general, I'm echoing @melaniewalsh here regarding what distinguishes PH from other more generic tutorials: interpretation and discussion of results that attract readers in the humanities. I understand that this may be the most challenging part. But feel free to let us know what you find and/or what you think.
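
To make the footnote suggested in point 1 concrete, a minimal sketch of inspecting explained variance when choosing the number of PCA components (synthetic data stands in for the tutorial's matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in for a standardized feature matrix

pca = PCA().fit(X)

# Cumulative explained variance: look for the "elbow" where the gains flatten out.
print(np.cumsum(pca.explained_variance_ratio_))
```
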
thomjur commented 3 years ago

@hluling Thanks for your comments. I've replaced "normalized/normalization" with "standardized/standardization" in the tutorial and the Jupyter notebook. Thanks as well for the recommendation to mention scree plots (that's what you meant, I guess?). I've added a sentence to paragraph 100.

Regarding your second question: First of all, I tried running k-means with only a few clusters in the beginning. However, the elbow plot shows an almost linear line in this case, and the results when looking at the articles in each cluster are not very promising. If you want, you can try this out yourself in the Jupyter notebook (which is something I recommend doing throughout the entire tutorial anyway, because obviously, my choice of clusters might not be the perfect solution - which is nothing I claim anyway).

The clusters are too big and include too many diverse articles with such a low number of clusters. I am also mentioning this in my article when discussing the elbow plot: "But is it likely that a journal such as Religion that covers a vast spectrum of phenomena (which are all, of course, related to religion) only comprises a few thematic clusters? Probably not." (para. 103) I think that the "nature" of the data demands a more fine-grained solution because the articles are indeed diverse. This is why I run k-means with n=100, and I really think that the results are astonishingly good. I checked again with several different randomly selected clusters in my notebook, and I am honestly surprised how well this works (I am only showing three exemplary clusters in the text, since I think that this is enough to demonstrate the basic idea and encourage the reader to explore the data him- or herself). In addition, the rather small number of articles in each cluster allows checking the coherency of the clustering results more easily (again, there are obviously clusters that don't make much sense).
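
For anyone who wants to reproduce this check, a hedged sketch of the elbow computation described above (random data stands in for the notebook's TF-IDF matrix, so the actual curve will differ):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.random((500, 50))  # stand-in for the TF-IDF (or PCA-reduced) matrix

# Inertia over a range of cluster counts; a near-linear decrease means no clear elbow.
inertias = [
    KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
    for k in range(2, 21)
]
print(inertias)
```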

To be honest, I was not really sold when @melaniewalsh initially told me to use k-means with textual data, because I thought that this was not the best way to analyze textual data. Yet, I am really surprised by how well this worked. Is it perfect? Probably not. Could I sell this solution to a major company as a professional recommendation system? For sure not. But it delivers decent/coherent results and insights into the data, and it hopefully encourages the reader to try it out with his/her own (textual) data.

Now, when it comes to DBSCAN, I think we both agree (and I also say this in the tutorial several times) that the clustering, in this case, isn't really convincing. This is, among other reasons, because at least some of the clusters are too big. Yet, it is difficult to play around with the eps value until I have a similar outcome as the k-means algorithm, because this is not how one would use DBSCAN. On the other hand, some of the few DBSCAN clusters actually show interesting results, which is why I recommend combining these results with the results provided by k-means, be it as part of a recommender system or as an insight into the data as part of the exploratory data analysis. I mean, we are not restricted to only using one algorithm during the exploratory data analysis. We can apply as many as we like, look at the results (of which some are more and some are less indicative), and just combine the insights we gained from each in our final evaluation (in the way ensemble learning does for classification tasks by using different models and then combining their results).

melaniewalsh commented 3 years ago

I just read through the revised version of the tutorial, and I have some thoughts about the DBSCAN conversation. I also found some typos and small issues while re-reading, which I've noted below.

Major

While I agree with @hluling that a better illustration of how DBSCAN deals with noise/outliers would be nice, I think I understand what you're saying @thomjur and that you're more interested in demonstrating that DBSCAN produces some interesting clusters, too, which we might think of as additional insights in our exploration.

Either way, upon re-reading, I noticed that it's hard to read the Religion article titles for each DBSCAN cluster, and that it's also hard to read the sample abstract when the Religion data is first introduced in paragraph 11 (the abstract simply says "In contemporary..." before it gets cut off).

In the DBSCAN section at the end, would it be possible to make the clusters into regular tables instead of code outputs so that they're easier to read? Something like...

df_abstracts_labeled[df_abstracts_labeled["cluster"] == 75]["title"]

| Religion Article Title | Cluster |
| --- | --- |
| Checking the heavenly ‘bank account of karma’: cognitive metaphors for karma in Western perception and early Theravāda Buddhism | 75 |
| Karma accounts: supplementary thoughts on Theravāda, Madhyamaka, theosophy, and Protestant Buddhism | 75 |
| Resonant paradigms in the study of religions and the emergence of Theravāda Buddhism | 75 |
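
For what it's worth, one way to produce such a table directly from the DataFrame is pandas' DataFrame.to_markdown(), which requires the tabulate package (a tiny stand-in frame is built here so the snippet runs on its own):

```python
import pandas as pd

# Stand-in for the tutorial's df_abstracts_labeled DataFrame.
df_abstracts_labeled = pd.DataFrame({
    "title": [
        "Checking the heavenly 'bank account of karma' ...",
        "Karma accounts: supplementary thoughts ...",
    ],
    "cluster": [75, 75],
})

# Select the titles in cluster 75 and emit a Markdown table.
subset = df_abstracts_labeled[df_abstracts_labeled["cluster"] == 75][["title", "cluster"]]
print(subset.to_markdown(index=False))
```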

Minor

thomjur commented 3 years ago

@melaniewalsh Thanks a lot for your constructive comments! I've changed the code output to markdown tables in both the k-means and DBSCAN sections, and indeed, it looks much cleaner and is more readable. Great, thanks a lot! However, I left the abbreviated "In contemporary (...)" in the introductory part, since that table only illustrates the structure of the data (otherwise, the table would be way too big/long if it displayed the entire abstract). Yet, I am happy to change this if you would like me to. This also depends on the general question of how the tables will be displayed in the final tutorial. For instance, some of them currently include too many columns, which are not shown when I open the draft web version in my browser. I am not sure whether there will be automatic breaks or something like that in the final template. Maybe @hawc2 knows more.

I've also corrected all the minor issues, except the quote by Patel. I personally like it and find it very helpful (plus, it makes my tutorial more "professional" to quote someone who actually is an expert in this field), but I am also happy to delete it. I've also deleted the awkward film reference, which was, on top of that, wrong. -.-

Thanks again for your help, and have a great weekend!

hawc2 commented 3 years ago

@thomjur @hluling @melaniewalsh thank you all for this incredibly generative peer review process! It's amazing how much this tutorial has grown in complexity and detail over the last couple months. We have a few loose ends to tie up, but the big problems sound resolved. I'm recommending we now close the review and move on to the next stage of the publication process.

@thomjur I'll follow up with you about next steps. If you have any further changes to make, please go ahead and do so in the coming weeks.

@hluling and @melaniewalsh thank you so much for your investment in reviewing and improving this tutorial. It is a very difficult subject, especially to make user-friendly, and your expertise and attention to detail have really ensured it will be a popular Programming Historian lesson!

thomjur commented 3 years ago

I the author Thomas Jurczyk hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

hawc2 commented 3 years ago

@svmelton, this lesson is ready for the next stages of the publication process. I'm providing the locations of all relevant files below; let me know if you need anything else.

lesson file - /lessons/clustering-with-scikit-learn-in-python.md
images folder - /images/clustering-with-scikit-learn-in-python/
assets folder - /assets/clustering-with-scikit-learn-in-python/
gallery image - /gallery/clustering-with-scikit-learn-in-python.png
gallery original - /gallery/originals/clustering-with-scikit-learn-in-python.png

The following author bio is for adding to the /_data/authors.yml file on the other repository:

- name: Thomas Jurczyk 
  team: false
  orcid: 0000-0002-5943-2305
  bio:
      en: |
            Thomas Jurczyk is a German historian and religious studies scholar based in Bochum (GER)

svmelton commented 3 years ago

Thank you @hawc2! This will be the first lesson that our new publishing assistant will work on for copyediting! She's just started, and I'll be meeting with her next week. I'll let you know once the copyediting is complete and we're ready to publish.

svmelton commented 3 years ago

Hi @Anisa-ProgHist! This is the lesson that's now ready for copyediting. Please let me know if you have any questions.

anisa-hawes commented 3 years ago

Thank you, @svmelton! I will take a look and be in touch if questions arise. Looking forward to working on this!

anisa-hawes commented 3 years ago

Hello @thomjur and @hawc2 ,

Thank you for your patience. My copyedits of this lesson are now ready for review.

I have applied my suggested revisions directly to the markdown file in our Submissions Repository, where you will be able to view the additions and subtractions I’ve made in the Commit History.

In addition, I will paste a list of my comments and suggested revisions here, so that the copyediting process is transparent and can invite discussion.

I have formatted my comments/suggested revisions as a task list. You will notice that many of these tasks are ‘checked’, because I have made the changes directly. A small number remain ‘unchecked’ because they ask questions.

I hope my comments will prove clear and useful. I’m happy to talk through anything which isn’t, and will not be offended if you choose not to absorb my suggestions.

--

A few general thoughts first, if I may.

  1. In Paragraph 37, you provide a link to the source of your dataset for the First Case Study. I note that access to this data is not free, and I would like to ask whether you think it is important to clarify that.
  2. In Paragraph 82, you establish the lesson's prerequisites of knowledge and understanding. I think it would also be useful to establish the technological dependencies and operating system requirements, and I would like to suggest that this is established as early as possible in the text. As a clustering novice (I realise that this lesson is defined as 'difficult'), I am not sure whether the K-means and DBSCAN algorithms are suitable for use across Mac, Windows and Linux. I think it is important that the lesson makes clear which technical environments the lesson has been developed and tested in, and which technical environments it can be applied in.

--

It isn't clear until you read on what the abbreviation DNP represents. Perhaps it is simpler to introduce the abbreviation after the first use of the original German title. Also, if you restructure this sentence, you can avoid the awkwardness of the two apostrophes in Brill's New Pauly’s. Suggested revision:

The data was taken from the official Brill's New Pauly website, and originates from Supplement I Volume 2: Dictionary of Greek and Latin Authors and Texts. Der Neue Pauly: Realenzyklopädie der Antike (1996–2002) is a well-known encyclopedia of the ancient world with contributions from established international scholars. The original German version has been translated into English since 2002. I will refer to the text using its German abbreviation (DNP) from here onwards.

Suggested revision: ‘The inertia decreases with the number of clusters. The extreme is that inertia will be zero when n is equal to the number of data points’.

How about:

‘In short, DBSCAN enables us to plot the distance between each data point in a dataset and identify its nearest neighbor. It is then possible to sort by distance in ascending order. Finally, we can look for the point in the plot which initiates the steepest ascent and make a visual evaluation of the eps value (similar to the ‘elbow’ evaluation method described above in the case of k-means)’.
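
In code, the procedure sketched in that suggestion might look like this (a minimal sketch with synthetic data; the lesson's own implementation may differ):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # stand-in for the scaled feature matrix

# Distance from each point to its nearest neighbor; n_neighbors=2 because the
# closest "neighbor" of a point is the point itself, at distance zero.
distances, _ = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
nearest = np.sort(distances[:, 1])  # drop self-distances, sort ascending

# The point where the curve turns sharply upward suggests a candidate eps value.
plt.plot(nearest)
plt.ylabel("distance to nearest neighbor")
plt.show()
```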

Replacing the word ‘since’ with ‘as’ and ‘because’

Deleting the word ‘since’ and adding ‘so’

Deleting the word ‘since’ and adding comma + as

There are a lot of adverbs in this sentence. The combination of mainly, relatively and particularly make what is being said here quite vague. Can you be more specific?

relatively unknown authors = authors whose work is produced in few modern editions?

How about:

we are predominantly dealing with authors whose work is produced in few modern editions, particularly compared to the authors in cluster 4.

How about: As an optional step, I have implemented a function called lemmatizeAbstracts() that groups, or ‘lemmatizes’, the abstracts

Q: features = words? Q: what is meant by ‘lemmatized version of the abstracts’? This is unclear.
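
To clarify what a ‘lemmatized version of the abstracts’ could mean in practice, a hypothetical sketch of such a helper, assuming spaCy and its small English model (the tutorial's actual lemmatizeAbstracts() may be implemented differently):

```python
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatizeAbstracts(abstracts):
    """Replace each word in each abstract with its dictionary base form (lemma)."""
    lemmatized = []
    for doc in nlp.pipe(abstracts):
        lemmatized.append(" ".join(tok.lemma_ for tok in doc if not tok.is_punct))
    return lemmatized

print(lemmatizeAbstracts(["The scholars were studying ancient religions."]))
# e.g. ['the scholar be study ancient religion']
```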

How about:

‘These clusters are plotted in figure 12 (using a PCA-reduced dataset), where the inconclusive results become even more visible. In this case, we could consider using the original TF-IDF matrix with cosine distance instead.’

e.g.: https://www.chicagomanualofstyle.org/tools_citationguide/citation-guide-1.html

Smith, Zadie. Swing Time. New York: Penguin Press, 2016.

Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. Concepts, tools, and techniques to build intelligent systems, 2nd ed. Sebastopol: O’Reilly, 2019.

Mitchell, Ryan. Web scraping with Python. Collecting more data from the modern web, 1st ed. Sebastopol: O’Reilly, 2018.

Patel, Ankur A. Hands-on unsupervised learning using Python: How to build applied machine learning solutions from unlabeled data, 1st ed. Sebastopol: O’Reilly, 2019.

anisa-hawes commented 3 years ago

Hello @thomjur.

A couple of additional notes re: links. At the moment, you define a number of terms/link to a number of sources more than once in this lesson. Each technical term or resource only needs to be defined/linked to upon its first mention.

Let me know what you think.

thomjur commented 3 years ago

Hi @Anisa-ProgHist ,

Thank you so much for all the work you put into copyediting this article! This looks great, and I agree with (almost) all your revisions.

First of all, let me respond to the first two issues you raised:

  1. Indeed, it should be mentioned that the data is not freely accessible (actually, I thought I had already done so in one of my footnotes... but I have not...). Maybe we can add a sentence to the first paragraph in the DNP introduction right after "... is a well-known encyclopedia of the ancient world with contributions from established international scholars." Something like: "It should be noted that access to the texts (and thus the data) in the New Pauly is not free of charge. I used my university's access to obtain the data from the author entries. For the following analyses, I have not copied any texts from the New Pauly to the dataset. The numerical data in the dataset was extracted and partially accumulated from the author entries in the New Pauly." Please feel free to change this sentence or add some additional information. Also, I agree with your suggestion to add "I will refer to the text using its German abbreviation (DNP) from here onwards." to the paragraph.
  2. Regarding the second point, I think I understand what you mean. However, I assume that people reading this tutorial already know that Python is a programming language that can be used (interpreted) on different operating systems (Linux, Windows, Mac OS, etc.). Since I am using Python (and scikit-learn is an ML framework for Python), I am not sure whether this information is necessary. But I am happy to add it if you think it is. Maybe it's helpful to ask @hawc2 what he thinks about this?

Concerning the "minor" issues you mentioned:

  1. I would like to leave the language-related decisions to you. I am (obviously) not a native speaker, so I think it would be better if you decide whether it's 'encyclopedia' or 'encyclopaedia', and 'scrapable' or 'scrape-able'. I am fine with whatever version you choose.
  2. Para. 62: Yes, this is awkward. It indeed means that "Reviews and miscellanea have been left out of the dataset." Maybe replace with: "I have also excluded other contribution types such as reviews or miscellanea from the dataset."?
  3. Para. 99: I agree with your suggestion. Also, we can add the link to the page you found (I took the term centroid from the book mentioned in the bibliography, but this does not really help).
  4. Para. 162: Good question. I changed this back and forth when writing the tutorial. I think it might make sense to use either italics or a monospace font (?), such as in the code highlighting. Furthermore, I agree with splitting the sentence into two parts. I am happy with your suggestion.
  5. Para. 175: You are right; I think we should delete the brackets.
  6. Line 224: Should pandas be italicized? This is also something I was not entirely sure about. However, if you think it's better to italicize pandas here, we also have to italicize the other modules/packages, such as scikit-learn, requests, etc., that appear in the text.
  7. Para. 277. Q: Is Euclidean distance the same as ε-distance (eps) (Line 211)? Yes and no. According to the scikit-learn documentation (sklearn.cluster.DBSCAN — scikit-learn 0.24.2 documentation), they are using Euclidean distance as the default metric for the ε-distance. However, the DBSCAN "ε-distance" can also apply a different (non-Euclidean) metric (and so does k-Means, by the way). If I remember correctly from my Math classes in university, epsilon is often used when talking about distances/neighborhoods regardless of the actual metric in use (for instance, in calculus; we used to call this "epsilon Umgebung" in German). But this leads too far, I guess, and I am not pretending to be an expert here.
  8. Concerning the "feature" problem: I am briefly (and implicitly) introducing features when describing the datasets right in the beginning of the tutorial. But we could also simply refer to this Wikipedia entry when the term "feature" is first mentioned: Feature (machine learning) - Wikipedia. What do you think?
  9. Para. 558: Maybe we can change the sentence to "In the following analysis, we will continue working with the abstracts that have been lemmatized with the help of the lemmatizeAbstracts() function."?
  10. Concerning the links: Of course, we can only link each term when it is first mentioned.

I have not changed/accepted anything yet, since I feel that there are still a few things that we need to discuss/agree on first. Afterward, I am happy to implement the changes myself if you would like me to. Just let me know whatever works best for you.

Again, thank you very much for your thorough revision of my tutorial! Great work!

All the best from Bochum, Thomas

anisa-hawes commented 3 years ago

Thank you for your swift and thoughtful replies, @thomjur.

@hawc2 and I are meeting next week, so perhaps he and I can talk through the points above, and come back to you with our thoughts on these remaining questions.

I appreciate your time on this, as well as your willingness to discuss the suggested adjustments.

Very best for now, Anisa

hawc2 commented 3 years ago

@thomjur it would be helpful to add a few more sentences describing how to get the code up and running in a Jupyter notebook.

Where you say, "You can download both datasets as well as a Jupyter notebook containing the code we are writing in this tutorial from this GitHub repository," you could tell the reader they can learn more about getting started by looking at this PH tutorial: https://programminghistorian.org/en/lessons/jupyter-notebooks.

Maybe also say something like "This lesson will work on any operating system, as long as you follow these instructions to set up an environment with Anaconda or Google Colab to run the Jupyter notebook locally or in the cloud."

anisa-hawes commented 3 years ago

Thank you, @hawc2. Hello there, @thomjur. Thank you for your patience.

A slight adjustment to your suggestion:

It should be noted that access to texts (and thus data) in the DNP is not free of charge. I was able to obtain the data discussed in this Case Study via my university’s institutional access provision. However, I have not copied any texts from the DNP to build this dataset. Numerical data used in the following analyses was extracted and partially accumulated from the author entries.

If you are able to deal with the two general points 1 and 2 above, then I am happy to make the other agreed changes 1-10 below. Let me know if this sounds good to you.

Very best, Anisa

thomjur commented 3 years ago

@hawc2 @Anisa-ProgHist

Thanks for the suggestions! I've uploaded a new version including a few more changes. I have left open the issues that the two of you still need to decide on.

This lesson will work on any operating system, as long as you follow these instructions to set up an environment with Anaconda or Google Colab to run the Jupyter notebook locally or in the cloud. If you do not know how to set up a Jupyter notebook locally, this excellent PH tutorial might help you get started.

It should be noted that access to the texts (and thus the data) in the New Pauly is not free of charge. I used my university's access to obtain the data from the author entries. For the following analyses, I have not copied any texts from the New Pauly to the dataset. However, the numerical data in the dataset was extracted and partially accumulated from the author entries in the New Pauly. The original German version has been translated into English since 2002. I will refer to the text using its German abbreviation (DNP) from here onwards.

anisa-hawes commented 3 years ago

Okay! Thank you, @thomjur!

I'll take another look.

Very best, Anisa

thomjur commented 3 years ago

@Anisa-ProgHist Thanks! I think I'll leave it to you for the moment. I've adjusted the parts you asked for, I guess. However, you might still need to replace the part about the accessibility of the data with your revised version. I'll stop working on the text for now, as it does not make much sense if we both work on it at the same time. Thank you for all your help, and let me know if there's anything else I can do!

anisa-hawes commented 3 years ago

Thank you for your work on this, @thomjur!

I have made some final small adjustments to the lesson file, which you will soon be able to review via the Commit History. The changes I've made include:

Almost there!

N.B.

thomjur commented 3 years ago

Thank you for this, @Anisa-ProgHist! This looks great!

Regarding the remaining issue: I am not sure if we should make things too complex, but it might be helpful to, at least, mention the issue you raised by adding a short statement. I was thinking of something like:

The first step consists of defining an ε-distance (eps), which determines the neighborhood region (radius) of a data point. Just like in the case of k-means clustering, scikit-learn's DBSCAN implementation uses Euclidean distance as the standard metric to calculate distances between data points.
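
In scikit-learn code, that default is easy to show (a tiny illustration; the eps and min_samples values here are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]])

# metric="euclidean" is scikit-learn's default; other metrics can be passed instead.
labels = DBSCAN(eps=0.5, min_samples=2, metric="euclidean").fit_predict(X)
print(labels)  # e.g. [0 0 -1]: two points form a cluster, the third is noise
```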

What do you think, @Anisa-ProgHist and @hawc2?

Have a lovely weekend!

anisa-hawes commented 3 years ago

Thank you, @thomjur! I think that's helpful, and I have made this final change. @hawc2 has already provided the file directory locations, and shared your author bio in his comment above, so I think we are ready to hand this lesson over to @svmelton.

svmelton commented 3 years ago

Great! I'll start working on this and let you know when it's been published.

svmelton commented 2 years ago

We're live! Thanks to everyone for their great work on this. @hawc2 we're ready to add this to the twitter spreadsheet whenever you're good to go, let me know if you have questions.

thomjur commented 2 years ago

Thank you all for the tremendous work you put into the publication of this paper. I am happy that it's out there!

Cheers, Thomas

@svmelton @anisa-hawes Just one question: the DOI https://doi.org/10.46430/phen0092 seems to point to the article "Text Reuse ..."; is this only a placeholder or did you assign the wrong DOI? Thanks!

anisa-hawes commented 2 years ago

Thank you, @thomjur! Apologies! @svmelton has already noted this, we are working together to resolve it today.