programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions

Lesson proposal: Clustering and Visualising Documents using Word Embeddings (PH/JISC/TNA) #415

Closed tiagosousagarcia closed 11 months ago

tiagosousagarcia commented 2 years ago

The Programming Historian has received the following proposal for a lesson on 'Clustering and Visualising Documents using Word Embeddings' by @jreades and @jenniewilliams. The proposed learning outcomes of the lesson are:

In order to promote speedy publication of this important topic, we have agreed to a submission date of no later than April 2022. The author(s) agree to contact the editor in advance if they need to revise the deadline.

If the lesson is not submitted by April 2022, the editor will attempt to contact the author(s). If they do not receive an update, this ticket will be closed. The ticket can be reopened at a future date at the request of the author(s).

The main editorial contact for this lesson is @tiagosousagarcia.

Our dedicated Ombudsperson is Ian Milligan (http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

svmelton commented 2 years ago

@hawc2 has offered to edit this piece.

hawc2 commented 2 years ago

Hi @jreades and @jenniewilliams, I look forward to reading your submission. Please let me know if you have any questions in the meantime. Feel free to email me or post questions on this ticket.

jreades commented 2 years ago

Hi sorry -- between strikes, childcare, and general... aaaaaaaargh... I'm behind where I'd hoped to be with this! I do have a perfectly serviceable draft of the core explanatory part (what are Word Embeddings, etc.) and I have separate code that I've already used in other analysis of the same data so I know the path to completion...

However, is it helpful for me to share this early draft, or would you prefer to see only a full submission? Having open review creates more opportunities to shape the work as it develops rather than afterwards, but it could also be confusing/unhelpful. Let me know!

If helpful then I can: 1) share access to the GitHub repo where I'm writing the draft (so that we don't pollute this 'timeline'); 2) attach a draft to this thread; 3) submit a draft but recognise that it will need to be versioned later (a pull request or similar).

Best,

Jon

hawc2 commented 2 years ago

If you'd prefer me to look at an early draft and give you feedback over email, I can, but otherwise I'd say let's go ahead and get a rough draft uploaded as a markdown file to the PH-Submissions repo with previews working. I can give you a first round of feedback in this ticket thread, and then you can finalize for submission to peer review. Does that sound good?


jreades commented 2 years ago

It’s ok, I’ll have a go at finishing a proper draft before sending anything over — think I was feeling guilty that I’d been radio-silent for so long and wanted to have a “See, I have done work on this!” moment. ;-)

Jon


jreades commented 2 years ago

I assume the first draft is submitted as an attachment to this issue... so here goes!

The article is a README from our private repo (will make a public version prior to publication): README.md

Images are here:

EThOS
UMAP_Output
DDC_Plot
Dendogram-euclidean-100
DDC_Cloud-c4-ddcBiology-tfidf
DDC_Cloud-c4-ddcEconomics-tfidf
DDC_Cloud-c4-ddcPhysics-tfidf
DDC_Cloud-c4-ddcSocial sciences-tfidf
Word_Cloud-c15-tfidf

hawc2 commented 2 years ago

@jreades, I'll try to get the lesson set up, and I'll email you with more specific questions/issues with the files. More soon!

tiagosousagarcia commented 2 years ago

@hawc2 -- I can set up the lesson later today, if you haven't had a chance

jreades commented 2 years ago

This was my bad — in discussing the submission with Alex it became obvious to me that I'd deviated a long way from a format that worked as a standalone tutorial. I've just this morning sent a substantially rewritten version that I hope will work a lot better: you can copy+paste the code 'as is' from the Markdown document to create a new notebook, but I can also supply a standalone notebook that is ready to run. We've discussed whether or not to split the tutorial at the point where there is a shift from word embeddings to dimensionality reduction, but until Alex has had a chance to have a look it's TBD.

Apologies again for making this such a protracted, difficult process — as I said to Alex: I’d not realised the extent to which my approach to writing has been deeply reshaped by the academic article format.

Jon


tiagosousagarcia commented 2 years ago

no worries @jreades, it's all part of the process. I hadn't seen any development here, which is why I was asking if there was something I could do -- but if @hawc2 has the matter in hand, then we are all good (though the offer to help if needed still stands)

hawc2 commented 2 years ago

It's looking good now, here is a preview link to the lesson: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings

@jreades can you let me know if you see anything basic in the markdown rendering that might be incorrect?

I'll follow up with some preliminary feedback on the lesson itself in the coming week. Once you finish that round of edits, I'll work on sending this out for peer review.

Thanks also @jreades for putting together a GitHub repo that will be linked in the lesson. The repo will include the Python code in a Jupyter notebook, runnable in Google Colab for testing purposes.

jreades commented 2 years ago

Definitely some rendering issues (around some of the maths especially) and a few typos that I’ve just spotted now (naturally).

If you can give me editing access I’ll get this tidied up today.

Jon


hawc2 commented 2 years ago

I'm giving you write access now. Can you also post the Github repo and Colab notebook here for reference?

jreades commented 2 years ago

I’ve updated the tutorial Markdown file with links across to the public GitHub repo and Colab. Fixed the minor typos and one substantive content area that I wanted to correct. Have committed this back to:

https://github.com/programminghistorian/ph-submissions/blob/gh-pages/en/drafts/originals/clustering-visualizing-word-embeddings.md

Jon


hawc2 commented 2 years ago

This is looking like a very solid first draft. My main feedback is pretty general, so I’ll hold off from giving you specific line edits, and just ask for some broad revisions before we send out for review.

My main observation is that this is quite a difficult lesson, and more work will be required to translate terminology for beginner audiences, signpost where the lesson is going, and onboard the reader to each phase of the methodology. It will be helpful for you to do some basic revisions in this direction before I send it out to reviewers, so they don't need to worry as much about how this lesson caters to its audience.

My only other concern is that this lesson is very long. Lessons usually don't go over 8,000 words. I'd rather not see it bulge into a two-part lesson, although that is a possible solution. For now, I'd encourage you to focus on the difficult task of editing this draft for both clarity and length, making it ideally more concise and more concrete at the same time.

As an example of clarifying your language for introductory steps, in your first Learning Outcome, you say: "we use a selection of nearly 50,000 records relating to U.K. PhD completions." Right off the bat, you should use language that more clearly identifies what kind of data your tutorial works with. What kind of records are these? As an American, I'm not sure what "records relating to U.K. PhD completions" would look like, nor why someone would do word embedding analysis on this type of data. I would've expected "a corpus of doctoral dissertations" as the main dataset. In this vein, in Paragraph 9, where you introduce this dataset in more detail, it's still not clear what "textual data" you will be analyzing within the "metadata" about dissertations. I have to admit that the section on the Case Study gets so technical and detailed about the metadata that I lost the main thread: what is the text you are going to model?

The part where you explain word embeddings and compare them to other text mining algorithms also requires more revision. In the learning outcomes, the tutorial jumps right into 'dimensionality reduction' and 'hierarchical clustering', but maybe a preliminary learning outcome should be something about teaching the reader why these methods are appropriate next steps once you've created a word embedding model, in order to pursue a research question about the dataset. Putting it in these less technical terms will help readers understand how the algorithmic processes relate to broader scholarly work.

The subsequent paragraphs do a good job of distinguishing PCA, LDA, and TF-IDF from WEs, but they do assume that the reader knows something about what all these have in common. In these opening paragraphs, try to find more ways to spell this out, in terms of approaches like predictive modeling and latent meaning. For example, this sentence clause doesn't really clarify what TF-IDF is, so its comparison with WEs remains a bit vague: "The benefit of simple frequency-based analyses such as TF/IDF is that they are readily intelligible and fairly easy to calculate..." What seems essential to highlight here is the type of meaning WEs give us insight into that the other approaches overlook. There's a helpful explanation in the Word Embedding section (beginning at Paragraph 39) of why dimensionality reduction is necessary; a brief version of this could be included early in the tutorial to explain why it leads the reader through this specific series of steps. Similarly, under Prerequisites, you explain how this lesson differs from the Scikit-Learn Clustering lesson, but you don't really explain first what the two lessons have in common. A lot of these comparison examples are useful for clarifying what your lesson on word embeddings does, but ideally they'd all occur in one section, and focus mostly on clarifying what word embedding analysis can show about the text.

In this context, the Word Embedding section, in particular paragraphs 40-44, jumps very quickly from the mathematical to the semantic. Could you spend more time here explaining the analogical nature of word embedding models and vector relationships?

The Sample Output section similarly jumps right into the weeds. Could you have a little more introductory info here about the outputs, and how this is a useful sample for elucidating some key points?

A couple of your Tables take up a lot of real estate. Could they be condensed? Table 3, for example, could be just one row. Table 5 is also long.

Paragraph 57 - I agree this is a good break point. You can remove the signpost you put here for review. I think the next section on Words to Documents is very useful, and could be contextualized a bit in terms of Word2Vec and Doc2Vec, or how this method differs from those. I see why it's useful to get into Manifold Learning, but if t-SNE/UMAP is the main point, you should get to that sooner. I kinda got lost in this section. Generally I think you go into too much behind-the-scenes background detail about alternative options, and not enough info on the specific thing you are teaching. Try to offload some of the secondary comparisons with other methods to footnotes.

The Visualization section seems like a good place to conclude. Right now that Figure isn't rendering in Markdown. But Visualizing and Clustering the data ideally could've been foreshadowed earlier in the lesson. The current version of these sections can be condensed to focus on concluding the lesson with some first steps in these visualization directions. What about this section is really essential to this lesson? Are all the validation and related steps necessary, or could they be included as supplemental material in your GitHub repo and Colab notebook for more advanced users? There's a bunch, like the Confusion Matrix, that just seems so dense and complicated that you'd have to do a lot more work to justify its inclusion in the proof-of-concept word embedding methodology. Since that would take up more space, I'm inclined to think a bunch of it can be removed.

If you can try to take a shot at edits along these lines in the next couple weeks, after one round of revision, it’ll be ready to send out for review. Let me know if you have any questions

jreades commented 2 years ago

I've nearly finished -- I just need to review the final bits of analysis in light of the edits above, but I have been able to prune the tutorial down to about 9,800 words. I've fixed the issues with maths rendering (GitHub doesn't actually do this directly for Markdown) and tried to generally tidy things up.

jreades commented 2 years ago

Done. I've gone the whole way through and yanked as much as I think we can while preserving the overall intention of the submission. I'm sure there's more that could be done but I'm not able to see it at this point. The commit is in. The only thing I wasn't sure about is the images: I can see that they eventually go into an images/<tutorial_name>/ folder but figured you'd want to do this yourselves.

Let me know if you need anything else or have any further comments/ideas before sending out for review. As you can see, your initial comments prompted a major rethink and I hope you'll think we've done a good job acting on them.

jreades commented 2 years ago

Quick note to self: clarify that Euclidean distance works well with UMAP in this case because the abstracts don't vary enormously in length; this means that the magnitude of the averaged document vector isn't an issue. Cosine would probably be a better choice where there was significant variation in the length of the documents.
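To make the point concrete, here is a minimal sketch (plain numpy, toy vectors invented for illustration): two averaged document vectors that point in the same direction but differ in magnitude look far apart to Euclidean distance yet identical to cosine similarity.

```python
import numpy as np

# Toy averaged document vectors: same direction, different magnitude,
# roughly what happens when documents differ a lot in length.
doc_a = np.array([1.0, 2.0, 3.0])
doc_b = doc_a * 3.0

euclidean = np.linalg.norm(doc_a - doc_b)
cosine = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))

print(euclidean)  # ~7.48 -- magnitude differences dominate
print(cosine)     # 1.0  -- direction is identical, so cosine sees no difference
```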

jreades commented 2 years ago


I've now fixed this. This is ready for a review... I hope!

hawc2 commented 2 years ago

@jreades regarding the images, let's make sure those are all rendering correctly. I put them all in the directory: https://github.com/programminghistorian/ph-submissions/tree/gh-pages/images/clustering-visualizing-word-embeddings

Can you make sure your markdown file has each embedded in the appropriate place with alt-text? You can see information on naming the image files and inserting them into the markdown here: https://programminghistorian.org/en/author-guidelines

Once the lesson is rendering correctly, I'll do one last skim and send it out for peer review. Thanks so much for your thorough edits

jreades commented 2 years ago

Done. <fingers crossed I’ve done it right>


hawc2 commented 2 years ago

So images don't need the whole directory link, just the name of the image file. You should be able to look at this preview to know when everything looks right: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings

I edited the final image in your lesson to show you what it should look like. The last image now renders correctly. I'd rather you finalize the rest since you know how they should look. Once you think it all looks right, I'll send this out for review. I don't have any other immediate feedback, but after we get reviewer feedback, I'll synthesize it and add any remaining thoughts I have for further revision.

jreades commented 2 years ago

I’m definitely missing something here: I’ve removed the full path and followed the convention used in the other tutorials that I peeked at (image name only, no other path info) but I still can’t get the images to display even though they appear to me to be in the right place for the includes to work. I don’t know if the GitHub pages are only rebuilt intermittently or, more likely, if I’m still mucking up something in the placement of the images/code… but I’m stuck.

I’m sure the images are ‘fine' in the sense that if you can get them working we can sort out any issues that they might present during the review stage. I’m not too worried about minor look-and-feel issues since the reviewers will presumably also comment on these if they noticed anything wrong.

Jon


tiagosousagarcia commented 2 years ago


@jreades and @hawc2, there's apparently something wrong with the preview in the submissions repo which means that relative paths don't work; we need the full path to the image for the preview to display it, according to what @anisa-hawes told me here

tiagosousagarcia commented 2 years ago


I'll go through the file and correct the paths, give me 5 mins

tiagosousagarcia commented 2 years ago

ignore what I just said, clearly not the problem, I'll revert the changes made in 86a1130

tiagosousagarcia commented 2 years ago

after ed3d398 the images are now displaying (to me, at least), though I'm not sure what was wrong with @jreades' earlier attempt: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings

hawc2 commented 2 years ago

Lesson looks great to me! The site can take a bit to rebuild, @jreades, hopefully that was it. Feel free to keep tweaking, but seems to me you got it all right. I'll start the search for our peer reviewers!

jreades commented 2 years ago

Just checking there’s nothing needed on our end? I assume it’s a case of trying to find reviewers alongside the rest of the PH workload and you’ll let us know when things are ready, but just in case…

hawc2 commented 2 years ago

Yep you're good. Finding reviewers may take a month or two, especially at this time of the year. You can wait until you hear back from both reviewers and I make some overarching comments; then you'll have a chance to do comprehensive revisions.

hawc2 commented 1 year ago

@jreades just an update, we're having a hard time finding reviewers for this lesson, but I'll follow up with more soon. Apologies for delays

jreades commented 1 year ago

Ah, I was wondering!

Perhaps Barbara McGillivray (https://scholar.google.co.uk/citations?user=71qKahgAAAAJ&hl=en&oi=ao, and she just published this: https://methods.sagepub.com/how-to-guide/use-word-embeddings-natural-language-processing) or one of the other Turing folks could recommend someone? She’s a very sophisticated user of word embeddings but might know someone…

Full disclosure: I approached her to co-supervise the PhD that generated this tutorial, but various PhD regulations ultimately made it an issue.


hawc2 commented 1 year ago

Thanks I'll inquire!

hawc2 commented 1 year ago

@jreades good news, @quinnanya and @BarbaraMcG have agreed to review your lesson, previewable here: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings

We will hopefully have reviews completed by mid-September or early October. To ensure we keep the lesson moving through our editorial pipeline, I'll save any actionable revision feedback I have until after we hear from the reviewers.

I do want to share a couple notes about this lesson, however, for both the author and reviewers to keep in mind. This is one of the most difficult lessons I've seen PH consider for publication. It goes from zero to one hundred very quickly, and gets into a lot of jargon and complex mathematical problems without sufficient introduction or exposition. Just as one example, the "Load the Data" section skips over all the basic info, and gets into very complex nitty gritty about Unicode way too fast (meanwhile, libraries like pandas are never explained).

The lesson is already 9,000 words long, while our max lesson length is typically 8,000 words, and it is only going to expand as terminology is defined, concepts explained, and so on.

So my recommendation to the reviewers is to consider what can be clarified up front, and what can be cut to hone the lesson. A word embedding lesson that covers UMAP and hierarchical clustering is already incredibly valuable. Do we also need to explore other visualization methods in this same lesson? If so, does it need to be a two-part lesson? Or would it be better to spend more time explaining hierarchical clustering rather than briefly reviewing multiple different methods? Can the code chunks be broken down to better explain the steps in any one of these algorithms?

On a related note, @quinnanya recently proposed and is drafting, with a couple of co-authors, a relevant lesson on Introducing Word Embeddings. I am hoping @quinnanya's lesson can do a lot of the introductory work that allows this lesson to be usable for our readership. Quinn has expressed interest in reviewing with this in mind, and potentially collaborating with @jreades to make the two lessons complementary.

jreades commented 1 year ago

> It goes from zero to one hundred very quickly, and gets into a lot of jargon and complex mathematical problems without sufficient introduction or exposition. Just as one example, the "Load the Data" section skips over all the basic info, and gets into very complex nitty gritty about Unicode way too fast (meanwhile, libraries like pandas are never explained).

😞 And this is after we cut the use of the Confusion Matrix!

> On a related note, @quinnanya recently proposed and is drafting, with a couple of co-authors, a relevant lesson on Introducing Word Embeddings. I am hoping @quinnanya's lesson can do a lot of the introductory work that allows this lesson to be usable for our readership. Quinn has expressed interest in reviewing with this in mind, and potentially collaborating with @jreades to make the two lessons complementary.

We'd be happy to explore this: it was impossible to talk about dimensionality reduction and clustering (and visualisation) without explaining what Word Embeddings are in the first place, but that also seems to mean trying to cover too much ground for one tutorial.

tiagosousagarcia commented 1 year ago

Hello all -- just a quick note to say that this is my last week working for PH. It's been an absolute pleasure working on this, and I'm only sorry I'm not going to be around for its publication (from this side -- I'll definitely be reading and using it as a regular joe). Big thanks to @jreades and @jenniewilliams for writing it, and @hawc2 for taking it forward. Well done everyone!

quinnanya commented 1 year ago

Thanks for this draft lesson! It's been really helpful to read through this, as a starting point for thinking about how we'll tackle a lesson specifically and exclusively about word vectors. Especially given the length challenge of this lesson, I'm writing this feedback with an eye towards how we might be able to take some of the background context off your shoulders, so you can focus on the most interesting end-point of the lesson, specifically, what you're doing with some different clustering methods than have been covered in other lessons.

I've included some comments on most of the sections below, broken down by section. I think a place to start might be reimagining the introduction and goals (what you've got under "Prerequisites" actually feels like a good starting place for that) to emphasize the word vectors less, and the other clustering methods more.

There's a couple places where you've got really good explanations about decisions you've made, or what particular algorithms are doing, and the lesson is at its best in those moments. I've tried to flag some of the other areas that could use more fleshing out in the same way, and hopefully taking the word vectors part off your hands will free up some space to do that.

Happy to answer any questions you've got, look at more drafts, etc!

Lesson goals

Of the five bullet points in the overview, I think the first one (text cleaning, and why you've made the choices you have) is a really important one, and will be a little different than what we're doing with our example. From there, I think we can take care of bullet points 2 and 3; with some rewording and pointing to the intro to word vectors lesson, you can cut down on the word count in the intro + those two bullet points and the sections they represent. Maybe instead you can replace them with a deeper dive into understanding another clustering method. You could do a very brief recap of the clustering lesson's major takeaways on why you'd cluster things, and then explain a little about why word vectors (again, recap the intro lesson) would give us different results, and when we might want to use it compared to what's in the clustering lesson. Bullet point 5 might be worth putting first in a newly rearranged ordering of things: the goal of the lesson is to understand the advantages and disadvantages of a different method for clustering, and when it makes sense to use it compared to other methods discussed on PH or elsewhere.

Case study

The case study itself seems reasonable enough, but I'd like a little more detail about "in aggregate, the data provide a unique perspective on U.K. Higher Education". There's lots of ways to get a unique perspective, but what insight does it give us compared to looking at metadata?

With regard to the selection, there's some jargon you use that could use some more spelling-out for this audience. For starters, you say this will challenge automated text-analysis and classification. Why? What is it that this method is doing that you expect will get tripped up here? You mention you deliberately have the counts unbalanced -- why did you do that? When would you want to make a different choice in creating your sample? I do wonder if biology vs. physics and economics vs. social sciences is the most resonant set of subjects for the PH audience, or how hard it would be to find a similar example from the humanities?

The reference back to TF-IDF at the end of the case study section seems weird -- we haven't tried that to see how well it performs, so I'm not sure how you could quantify the comparison.

Prerequisites

The things you list in the prerequisites actually seem like really good goals for this lesson: the major point is to understand when and why you'd use these different methods for clustering, how the results differ, and what kind of additional insight you get by trying it this way.

Libraries

It might be useful to directly include a block of notebook code (or a list of commands for the command line) for installing the libraries, instead of making people guess which libraries need conda vs. pip.
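As an illustration of what such a block might look like, here is a sketch only: the package list is inferred from libraries mentioned elsewhere in this thread (pandas, umap-learn, wordcloud, kneed, etc.), so the lesson's actual requirements may differ.

```python
# In a Jupyter or Colab notebook cell; %pip installs into the active kernel.
# Package names are assumptions based on this thread, not the lesson's list.
%pip install pandas matplotlib scikit-learn umap-learn wordcloud kneed
```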

I'd probably leave off the Docker reference -- newer users might not understand that that's an alternative to the other options, and you don't want to go down the rabbit hole of explaining Docker.

"skip computationally expensive steps by caching the outputs" -- this could use a little more spelling-out of for novice users. Also, I'd probably give a little more explanation of what the deal is with the random seed.

If you're going to get into fonts for the word cloud, I'd give a little more explanation. Why are you changing it? Is it just to make it look nicer? How do you choose a good font for a word cloud? Is there an easy way to preview the fonts you have? (Maybe not via matplotlib font manager?)

Load the data

I'd mention what a .gz file is, and why you're compressing the CSV ("it's very large" is okay as an explanation, but just leaving it without explanation might be confusing to people more familiar with .zip). At first, I was expecting the file to be one that people would have saved locally, so it might be worth expanding on "we're ready to begin" by explaining what you'll be doing in the next code cell in prose, first.

I appreciate the explanation for ast.literal_eval! It might be worth tweaking the wording slightly to clarify that you're talking about the Python list data structure, since you can write lists of things to a CSV file.
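A minimal sketch covering both of these points (pandas plus the standard library; the file and column names here are hypothetical):

```python
import ast
import pandas as pd

# pandas infers gzip compression from the .gz extension, so the compressed
# CSV loads directly without unpacking it first.
df = pd.read_csv("ethos_sample.csv.gz")  # hypothetical file name

# A column of Python lists written to CSV comes back as strings like
# "['word', 'another']"; ast.literal_eval safely parses them back into lists.
df["tokens"] = df["tokens"].apply(ast.literal_eval)  # hypothetical column
```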

Data cleaning

Will confess to having multilingual DH feelings about removing accents. That one might merit another sentence or two about why you're doing that (and, presumably, why that wouldn't necessarily be something you'd do for languages other than English where accent marks carry a lot of meaning).

Since you won't be under such a space constraint, it could be valuable, honestly, to talk through each of those things that you're doing and explain why and what value they add.

The TCM

This may be the part where we'll have covered most or all of this in the introductory word vectors lesson, and you can give the code you're using but largely elide the details. Same with the next section.

From words to documents

I think this is the point where it might make sense to pick things up again, since we aren't planning to talk about averaging term embeddings and the like.

Dimensionality reduction with UMAP

Might be worth linking the first reference to Euclidean distance to the distance metrics lesson? Might be worth mentioning it again explicitly under "Configuring the process"

Could you spell out a little more when you would use spherical k-means vs. dimensionality reduction?
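For readers skimming this review, a minimal sketch of the UMAP step under discussion (umap-learn assumed; X stands in for a matrix of averaged document vectors, and the parameter values are illustrative rather than the lesson's):

```python
import numpy as np
import umap  # from the umap-learn package

X = np.random.rand(500, 300)  # stand-in for 500 averaged document vectors

# Reduce 300 dimensions down to 4, using Euclidean distance (see the earlier
# note in this thread about the abstracts being similar in length).
reducer = umap.UMAP(n_components=4, metric="euclidean", random_state=42)
X_reduced = reducer.fit_transform(X)
print(X_reduced.shape)  # (500, 4)
```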

Visualizing the results

Are there any strategies for addressing overplotting in a visualization?

I like the explanation of why you chose 4 dimensions here.

Hierarchical clustering

Typo with sentence that ends too soon: "First, while all PhDs are, in some sense, related to one another as they must build on (and distinguish themselves from) what has come before."

Configuring the process

Say more about "Because we’ve used a manifold learning approach to dimensionality reduction it is not appropriate to use a cosine-based approach here." -- I'm not finding the connection obvious.

"A mixture of experimentation and reading indicated that Euclidean distance with Ward’s quality measure is best" -- how do you know when an experiment "works"? What did you read that led you to that conclusion? I appreciate the honesty here, but a little more guidance would be helpful.

The RAM-intensive nature of this is probably something worth mentioning upfront. Colab vs. on your own computer were presented as roughly coequal options (as I read it, anyhow) but it might be worth more explicitly pushing people towards Colab because of this step.

Validation

I wasn't sure what a silhouette plot was until I saw one, and I'm still not entirely sure why it's particularly useful here?
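For reference, the silhouette score measures how well each point sits in its assigned cluster (near +1 is good, near 0 is a boundary case, negative suggests misassignment). A sketch with scikit-learn, reusing the hypothetical X_reduced and labels from above:

```python
from sklearn.metrics import silhouette_score, silhouette_samples

print(silhouette_score(X_reduced, labels))         # mean score over all points
per_point = silhouette_samples(X_reduced, labels)  # values behind a silhouette plot
```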

2 clusters

I'm struggling to follow how we're back to clusters after we were talking about dendrograms not that long ago. I feel like I missed a step somehow. It might be worth spelling out what we're clustering? I think I understand the "how" (most frequent DDC label)?

Are the experts wrong?

Should be a question mark at the end of "Are time-pressured, resource-constrained librarians going to be able to glance at an abstract and always select the most appropriate DDC." and the following sentence

I feel like "misclassified" is kind of a loaded term here, even though you frame it as a hypothetical. Maybe framing it as a "discrepancy" would be a little more neutral?

You describe your approach as producing an output that aligns with the needs of the WordCloud library, which feels a little like jumping ahead. Where did the word cloud come from? Before you get into that, mention that that's going to be your approach to investigate the discrepancies.

15 clusters

"give the clustering process greater importance and simply use the DDC as support for labelling the resulting outputs" -- I'm not sure how we were giving it less importance before? I'm finding that a little confusing.

If you're going to get into scree plots, you need to explain what they are and how they work. Same with the kneed utility.
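As a pointer for that explanation: a scree plot graphs a quality measure against the number of clusters, and kneed locates the 'elbow' where further clusters stop paying off. A toy sketch (values invented for illustration):

```python
from kneed import KneeLocator

n_clusters = list(range(2, 11))
quality = [900, 500, 320, 250, 215, 195, 183, 175, 170]  # toy scree values

kl = KneeLocator(n_clusters, quality, curve="convex", direction="decreasing")
print(kl.elbow)  # suggested cluster count at the bend of the curve
```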

jreades commented 1 year ago

Thank you Quinn! This all seems very sensible, but I'll wait on Barbara's feedback and then work through the tutorial section-by-section with both sets of feedback together for guidance.

As a thought: would it make any sense to use the same EThOS data for the separate embeddings tutorial? I realise that we all have our preferred data sets to teach from, but if it were to work for you, then I think there can be advantages to having more than one tutorial using the same data since readers can see connections/applications.

quinnanya commented 1 year ago

We might be able to include an example from EThOS that could lead into this tutorial! We've got a couple other things in mind as our main use case (e.g. a corpus of prose fiction), but it could be handy to have something really different to illustrate how the concepts hold across very different text types.

BarbaraMcG commented 1 year ago

This is a very relevant, rich and interesting tutorial. I agree with some of the previous comments regarding the breadth of its scope. I wonder if one way to resolve this would be to state more clearly that the aim of the tutorial isn't a general introduction to word embeddings, clustering and dimensionality reduction, but rather an analysis of the PhD thesis corpus in terms of document classification? Following the sections recommended in the Reviewer guidelines (http://programminghistorian.org/en/reviewer-guidelines), here are my comments:

Audience: This tutorial is advanced in its content, and at times the authors explain the concepts in a tone that would be suitable for beginners. However, there is some inconsistency, for example between the way word embeddings are introduced, which is quite accessible, and the way manifold learning is explained, which (perhaps unavoidably) gets very technical. Word embeddings are only briefly explained, so I would recommend pointing readers to other resources where they can find fuller explanations (see my comments further down).

Skimmability: The lesson contains clearly defined learning objectives listed near the top. However, useful secondary skills could be more clearly expressed, for example on corpus cleaning. Below I recommend the use of diagrams and images to make it easier for readers to follow the technical aspects.

Payoff: The tutorial suggests why the explained tools or techniques are useful in a general way, and suggests in general terms how a reader could apply the concepts to their own work.

Workflow: This lesson is quite long. I wonder if it would be best to divide it into two: one on word embeddings and UMAP, and one on clustering.

Sustainability: All software versions and dependencies are listed in the submission and are the most recent versions. The methodology is generally up to date, although in NLP token embeddings (contextualised embeddings) are now the state of the art over type embeddings like word2vec.

And some more specific comments on the content:

Introduction: The text here contains quite a bit of jargon which I think should be simplified. 'Manifold learning' is a very technical term and doesn't seem appropriate for an audience of beginners; how about focussing on dimensionality reduction instead? That phrase is already parsable by non-specialists. Also, I would recommend always adding a gloss next to every technical term you introduce, even in this initial section. For example: "will introduce you to word embeddings, which are numerical representations of words that can be used to capture aspects of their meanings based on their contexts of use". Also, the introduction isn't clear about whether the corpus of 50,000 texts is a case study or the focus of the lesson.

§4: I would recommend starting from the problem: what are we trying to do, and why is it important to automate this type of analysis? Also, this section is quite dense: I would consider adding visual examples along the way. For example, when you talk about the term-document matrix, consider showing a simplified example of it. It's also not clear how you can go from a matrix to a clustering.

Word embeddings are introduced quite quickly, which I understand given the space constraints, but maybe consider adding an image that shows a semantic space, and add references to texts that explain them in more detail, such as https://methods-sagepub-com-christuniversity.knimbus.com/how-to-guide/use-word-embeddings-natural-language-processing (open access version: https://kclpure.kcl.ac.uk/portal/en/publications/how-to-use-word-embeddings-for-natural-language-processing(29e7fd42-728f-4315-aee3-3daba07aab8e).html).

§11: Explain that DDC is the Dewey Decimal Classification field, and also explain how the classification presented in Table 2 was made.

§13: It's not obvious to me that TF/IDF would reproduce the distinctions in Table 2 either: something that would need to be proven.

§14: This section (and the following one) would be better placed before the description of the corpus.

The sections on Required Libraries, Load the Data, and Data Cleaning are very clear!

§35-37: Consider dropping these paragraphs, as they are not essential to the lesson, which is more about text similarity than word similarity.

§54-57: This part should be simplified and possibly cut altogether, as it contains technical details that aren't necessary to understand the rest of the lesson.

§64: Consider dropping this too.

Clusters: I find the analysis presented here very interesting. It may be worth citing https://www.nature.com/articles/s41599-022-01267-5 as an alternative approach to representing disciplines (in that case using a different type of embeddings).

jreades commented 1 year ago

Thank you @BarbaraMcG and @quinnanya for your thoughtful, extensive feedback. We look forward to implementing it and improving the tutorial!

@hawc2: before @jenniewilliams and I do any work on these, do we need to have a chat (possibly via email or as a separate issue so as not to clutter things too much?) including @quinnanya about whether and what to split off into a separate tutorial?

drjwbaker commented 1 year ago

@jreades The way our process works is that the editor should summarise the reviews https://programminghistorian.org/en/editor-guidelines#summarising-the-review before you need to take any actions.

hawc2 commented 1 year ago

@quinnanya and @BarbaraMcG, thank you for your thorough reviews! @jreades and @jenniewilliams, there's a lot here for you to parse through in terms of concrete revisions to sections; if I were you, I'd start with those detailed revisions and check them off your list. @quinnanya and @BarbaraMcG have given you really helpful and insightful pointers for where to tease out all the jargon in this lesson and introduce it incrementally as the lesson progresses.

They also captured the big picture questions, which may take some time for you to resolve. After you do a round of revision, it could be useful to share an updated draft for further feedback. But I think a couple key takeaways for where to focus the lesson are:

  1. @quinnanya's point that "I think a place to start might be reimagining the introduction and goals (what you've got under "Prerequisites" actually feels like a good starting place for that) to emphasize the word vectors less, and the other clustering methods more."
  2. And @BarbaraMcG's question: "I wonder if one way to resolve this would be to state more clearly that the aim of the tutorial isn’t a general introduction to word embeddings, clustering and dimensionality reduction, but rather an analysis of the PhD thesis corpus in terms of document classification?"

I don't think this means you need to abandon "word embeddings," but you can revise this lesson knowing @quinnanya will be composing an introduction to word embeddings, and you can focus on what it means to cluster/classify word vectors, or something along those lines. You've got a lot of material here to work with; it's mostly a matter of narrowing and focusing the tutorial so it has a few clear phases that progress a research inquiry with a concrete methodological approach. A single powerful visualization well suited to the dataset would be more than enough to get across some core ideas and skills for how to do this kind of analysis.

@jreades and @jenniewilliams, does it seem feasible to get a revised draft done in the next month or so, by early-mid November?

jreades commented 1 year ago

> after ed3d398 the images are now displaying (to me, at least), though not sure what was wrong with @jreades' earlier attempt: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings

I can see them too now (on my phone; maybe a caching issue on the laptop?). Thanks for investigating!

jreades commented 1 year ago

Realistically, a revision on this timescale is going to be tough by mid-November as I'm in the middle of my main teaching term (thus the delay in responding -- getting 95+ students up and running with Python, Quarto, and so forth being rather... intense) and Jennie is on a short break. I think that (for me at least) one challenge here is that we're being told to reduce the technical content (which I am totally on board with) and the conceptual scope (again, totally on board with that); however, making the explanations clearer when we have to gesture very vaguely towards a tutorial on word embeddings whose content doesn't exist yet seems quite difficult. I'm also unclear as to whether we will be asked to have input into that lesson: the content we did write on WEs seemed to be well received (while allowing for the need to further reduce the technical language), but if we're not, I don't know how confidently I can pull the WE content when some of it might need keeping in order to bridge from that (unwritten) tutorial to this revised, more focussed one.

Sometimes, with the passage of time, making major edits gets a lot easier and goes a lot more quickly than you expect, but this just feels like it is going to need a lot of my attention, and the end of the teaching term feels (right now) like when I can get my head around what's needed to make this really work. I will print out a copy so that I can quickly sketch out the main edits at block level and get what that might look like back to you rather sooner than that, though!

hawc2 commented 1 year ago

@jreades if you need more time to review, that's not a problem. That often happens, and is quite understandable during the semester.

For now, as far as linking your tutorial to any other word embedding tutorial, I see why that's a challenge at the moment. What's important for any PH lesson is to introduce the terminology it uses, including Wikipedia links or other secondary resources for readers to catch up. In this case, that may mean a few introductory paragraphs that get the reader up to speed. You can write that now without worrying about the other Intro to Word Embeddings lesson, and if it becomes available, we can slot it in there and slightly condense your introductory section. You can certainly provide feedback on the Intro to Word Embeddings lesson as well, but I can't promise that it will be published in tandem with yours. In case it's helpful, we also have an old lesson that was never revised to a full draft, which you can look at as another example of someone attempting an Intro to Word Embeddings lesson for PH.

Otherwise, after the introductory sections, I think the rest of your lesson can be edited without any relationship to the forthcoming Intro lesson, as it gets into more intermediate and advanced visualizations that will need to be streamlined regardless of the intro.

jreades commented 1 year ago

I've had a read through the draft and tried to sketch out a revision:

I think that will reorient the tutorial in line with reviewer feedback, though I'll obviously also go through the more detailed feedback as well.

hawc2 commented 1 year ago

That sounds like a good plan to me. It would be a much needed addition to PH in this context to provide useful guidelines on dimensionality reduction. Looking forward to seeing the next draft!

It might also be helpful to look at some of the other lessons we've published, especially the half dozen we've published in the past year on machine learning. Contextualizing this lesson in relation to the other PH lessons can also help onboard the beginner reader and provide signposts for further learning throughout the lesson.

drjwbaker commented 1 year ago

@hawc2 Just checking in as this is a Jisc/TNA article. Is there a timeline here for post-peer review revisions?

hawc2 commented 1 year ago

Thanks @drjwbaker for checking in. @jreades have you found time to do your revisions? I'm available to review drafts/questions as needed.