programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions

Lesson proposal: Clustering and Visualising Documents using Word Embeddings (PH/JISC/TNA) #415

Closed: tiagosousagarcia closed this issue 1 year ago

tiagosousagarcia commented 2 years ago

The Programming Historian has received the following proposal for a lesson on 'Clustering and Visualising Documents using Word Embeddings' by @jreades and @jenniewilliams. The proposed learning outcomes of the lesson are:

In order to promote speedy publication of this important topic, we have agreed to a submission date of no later than April 2022. The author(s) agree to contact the editor in advance if they need to revise the deadline.

If the lesson is not submitted by April 2022, the editor will attempt to contact the author(s). If they do not receive an update, this ticket will be closed. The ticket can be reopened at a future date at the request of the author(s).

The main editorial contact for this lesson is @tiagosousagarcia.

Our dedicated Ombudsperson is Ian Milligan (http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

jreades commented 1 year ago

It’s next on my list after I get through a talk at the Turing today and one other fairly small task that must be finished by the middle of next week. I’ve just been a bit hamstrung by the timing of the reviews and the way they changed what we were doing in the tutorial: arriving just before my teaching-intensive term and requiring me to go back and rethink/rewrite the whole piece has made it… challenging. This tutorial will be the better for it, but it certainly hasn’t helped with turnaround time.

On 9 Jan 2023, Alex Wermer-Colan wrote:

Thanks @drjwbaker for checking in. @jreades have you found time to do your revisions? I'm available to review drafts/questions as needed.

anisa-hawes commented 1 year ago

This is very useful feedback. Thank you, @jreades.

hawc2 commented 1 year ago

That makes sense @jreades. Thanks for taking the time to revise. I'm available to review a draft or discuss any questions you have in the meantime. Looking forward to the updated version!

jreades commented 1 year ago

Quick update: in retrospect I probably should have left well enough alone, but I have changed the analysis to use more Humanities-oriented DDCs and it took quite a bit of mucking about to get 'good results'. I wanted something that balanced discrimination (enough clarity that the effort looked worth it) with complexity (showing that dimensionality reduction and clustering aren't going to solve all your problems). That led to a lot of re-running of code and time-consuming processing. That's now done, and I have also written a new introduction that shifts the focus away from word embeddings. It's going to need a trim, but it helped me get my head around the new focus at least. With luck the latter two-thirds of the tutorial will need less work. 🤞

hawc2 commented 1 year ago

Sounds great @jreades. Have you updated it on GitHub so that it previews here yet? https://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings

Just to clarify, what do you mean by humanities-oriented DDC? What is DDC an acronym for?

Let me know if you'd like me to review anything more preliminary. Once we have an updated draft previewable, I'll ask the reviewers to look over it and we can then give you additional feedback on any necessary revisions.

jreades commented 1 year ago

DDC = Dewey Decimal Classification (the Dewey Decimal system). So: purposive sampling from the data set in order to make the approach easier to apprehend.
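
For illustration, that purposive sampling just means filtering the corpus down to a handful of humanities-oriented DDC classes before clustering. A minimal sketch (the file name, column name, and class labels here are assumptions for illustration, not the tutorial's actual values):

import pandas as pd

df = pd.read_parquet('ph-tutorial-data.parquet')  # hypothetical file of thesis metadata + embeddings
# Keep only records whose Dewey class falls in a few humanities-oriented groups
humanities_ddcs = ['History of ancient world', 'Philosophy', 'Linguistics']  # illustrative labels
sample = df[df['ddc'].isin(humanities_ddcs)].copy()
print(sample['ddc'].value_counts())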


hawc2 commented 1 year ago

@jreades when do you expect to be finished with revisions? I wasn't sure about your timeline and what revisions you still mean to do before we review it again

jreades commented 1 year ago

I’m almost done: I’m on the section called ‘Validation’ and have been updating the figures. The main thing left is to bring back the confusion matrix for a much simpler overview of how well the process performed with 3 and 4 clusters, and I need to update the word clouds for visualising the larger number of clusters. I’ve been updating this in my own repo so as not to throw lots of confusing, partial edits into the PH repo. If you want to see where we’re at, I can copy this back over prior to tidying up the final section.
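
For reference, the confusion matrix here is just a cross-tabulation of the catalogue's DDC labels against the cluster assignments. A minimal sketch with toy data standing in for the real records (the column names are illustrative, not the tutorial's exact code):

import pandas as pd

# Toy stand-in: one row per thesis, with its DDC label and the cluster it was assigned to
df = pd.DataFrame({
    'ddc': ['History', 'History', 'Philosophy', 'Linguistics', 'History', 'Philosophy'],
    'cluster': [0, 0, 1, 2, 1, 1],
})
confusion = pd.crosstab(df['ddc'], df['cluster'])
print(confusion)  # rows: DDC classes, columns: cluster ids; off-diagonal counts show 'misclassified' records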

Jon


hawc2 commented 1 year ago

Sounds good Jon, you can wait to update the PH repo until you're happy with your revisions. No rush on my end, I just want to keep track of things. And great to hear you've made so much progress; it sounds like you've done a lot of work in a relatively short time!

jreades commented 1 year ago

Right, I’ve just pushed the updated tutorial into the PH repo. I still need to update a few of the code blocks from the updated coding notebook, but it’s ‘feature complete’ in the sense of implementing all of the requested changes that I can recall off-hand.

Jon


jreades commented 1 year ago

And that’s the code in too. It might be a good idea for you to take a look before returning to the reviewers: it’s a fairly radical rewrite, in line with what was asked for (I hope), but I want to make sure I’ve covered off everything requested.

hawc2 commented 1 year ago

Thanks @jreades! I'm looking forward to reading through it again. Will try to get back to you in the next week

hawc2 commented 1 year ago

@jreades I found time to read this through this evening. It's looking really good, your hard work definitely paid off, and I think it's ready for a final look-over by the peer reviewers. I expect to list this lesson as Advanced, but even so, once we get to the copyediting stage there may be some work for @anisa-hawes and me to do to make sure terms and concepts are introduced in the way typical of PH lessons (this would include linking to more secondary resources for various algorithmic terms, and connecting more relevant PH tutorials to get readers up to speed for this lesson).

One thing you @jreades can help address now is the 'onboarding' stage of the lesson. It's really only the first few sections that could use a few more sentences of introductory material. Once you get into the Background section, the lesson slows down sufficiently, I think. But your lesson's first sentence jumps right into "dimensionality reduction" and other terms that raise questions for newcomers. I wonder if there's a way to start a little broader in the opening? Clustering is probably a better term to lead with, with dimensionality reduction explained in relation to it. For instance, at the start you could situate for the reader a research question where building a corpus and then conducting dimensionality reduction would be important. You could also use this opportunity to foreshadow the stunning visualizations and takeaways that later parts of the lesson focus on, giving readers a sense of why you would do this and what you get from it.

With that said, I think it's ready for @quinnanya and @BarbaraMcG to read through. Quinn and Barbara, I suspect it's a busy period, so let me know if this is not enough time, but could we ask you to do a final look through this lesson by March 15th?

At this stage, I'd only be asking that as peer reviewers you look over the revised lesson once more, verify your concerns have been sufficiently addressed, and make sure the code itself works for you when you test it. @anisa-hawes and I will do a stage of copyediting after your final review, so you don't need to worry about line edits, although of course any specific feedback for revision is always welcome.

Thanks everyone for your work on this! It's exciting to see this lesson come together so well!

quinnanya commented 1 year ago

Thanks for these revisions -- overall, it reads much more clearly, and I actually have a project I'm curious to try this on!

It looks like the link for the Colab notebook doesn't work, though? I can run the code locally, but wanted to flag that before I dive in.

BarbaraMcG commented 1 year ago

I like the revisions too; I think they've turned the text into a much easier-to-read piece. It would be useful to cite this guide, as it's an accessible introduction to word embeddings: https://kclpure.kcl.ac.uk/portal/en/publications/how-to-use-word-embeddings-for-natural-language-processing(29e7fd42-728f-4315-aee3-3daba07aab8e).html . A green open access version is here: https://kclpure.kcl.ac.uk/portal/files/178190303/How_to_use_word_MCGILLIVRAY_2022_GREEN_AAM.pdf .

hawc2 commented 1 year ago

@jreades can you fix the Colab notebook and share a working link so @quinnanya and I can test it? Thanks!

jreades commented 1 year ago

Yes, will do. I didn’t want to polish that until I was sure the content wouldn’t be changing radically again. We’ll also deal with the reviewer feedback at the same time… probably towards the end of this week.

jreades commented 1 year ago

The Colab notebook is now up-to-date with the code in the tutorial: https://colab.research.google.com/github/jreades/ph-tutorial-code/blob/main/Clustering_Word_Embeddings.ipynb#scrollTo=OGwAfxDfLB0L (and I’ve updated the link in the tutorial as well).

I’ve also revised the introductory section to add a link to Barbara’s “How to use word embeddings for Natural Language Processing”.

Jon

hawc2 commented 1 year ago

Colab notebook worked well for me, thank you @jreades.

@quinnanya can you take a look and share any last thoughts on this lesson?

quinnanya commented 1 year ago

The notebook worked well for me, too -- admittedly I was on Colab Pro, but the completion times were quite short.

Running through the code, though, leaves me with one major concern: how can people run this on any other data set they might have? There's a somewhat breezy note about Parquet, but I can think of exactly one person I know who regularly uses that as a data format. If everything is going to depend on Parquet files, then for this to have a prayer of being reusable it needs, at a very minimum, a pointer to some external tutorial on how one might go about converting, say, a CSV to a Parquet file.

jreades commented 1 year ago

This is something that fell out of streamlining the notebook and tutorial to focus only on the post-embedding stage: the process of creating WEs (from a CSV with a complex encoding) disappeared. In Python, converting from CSV to Parquet can be as simple as:

pd.read_csv(…).to_parquet(…)

The main advantages relate to size (Parquet files are much, much smaller than their equivalent CSVs) and to more complex data structures (lists and dicts can be embedded without the need to deserialise them). So in this case the document embeddings are just ’there’ in the ‘doc2vec’ and ‘word2vec’ columns, rather than something that needs ast.literal_eval to convert. I will also note how to do this and explain what you’d do if given a CSV containing a literal list ("['elem1', 'elem2', 'elem3', …]"); a short sketch follows below.
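
Something like this is what I have in mind for the note (a sketch only, assuming the CSV stores each embedding as a stringified list in an 'embedding' column; file and column names are placeholders):

import ast
import pandas as pd

df = pd.read_csv('theses.csv')  # hypothetical CSV export
# If the embedding column was serialised as a literal list ("[0.1, 0.2, ...]"),
# turn it back into a real Python list before saving
df['embedding'] = df['embedding'].apply(ast.literal_eval)
df.to_parquet('theses.parquet')  # needs pyarrow (or fastparquet) installed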

I would tend to lean towards adding an explanatory note at the start of the notebook together with the illustrative code above. The dependency on pyarrow is installed automatically in Colab and is listed in the requirements.txt file (which I should probably also signpost at the start of the notebook).

Does this work?

Jon

quinnanya commented 1 year ago

Hi Jon,

Got it! Yup, I think a quick "if you've got a CSV you can convert it easily with [code]" insert there would take care of it, thanks!

~Quinn

BarbaraMcG commented 1 year ago

I too have checked the code, and it runs quickly with no issues. I agree that the input format needs some clarification, and that the initial part of the notebook could use some more comments in the code; it gets more verbose later on, but the beginning is a little dense.

hawc2 commented 1 year ago

Thanks @BarbaraMcG and @quinnanya for this useful and precise feedback. @jreades it sounds like the lesson is ready for copy-editing. @anisa-hawes can start working on that next phase.

Thanks again to our reviewers for taking a second look at this lesson, and giving it such a careful eye. It's going to be an excellent lesson, and I look forward to seeing how @quinnanya's in-progress lesson provides an introduction/background useful for this lesson.

Separately, I can work with you @jreades on revising and finalizing the accompanying Google Colab notebook. What you described doing sounds good to me. Since we don't want to replicate commentary available in the Programming Historian lesson, you can focus on minimal commentary and headings in the Colab notebook that make it easy to follow along, and explain any technical divergences from the PH lesson. The Parquet pandas implementation is very elegant!

jreades commented 1 year ago

I added a bit of preamble to the Colab notebook: https://github.com/jreades/ph-tutorial-code/blob/main/Clustering_Word_Embeddings.ipynb

If that’s not where you want it let me know.

Jon

anisa-hawes commented 1 year ago

Thank you @jreades. This lesson /en/drafts/originals/clustering-visualizing-word-embeddings is now being copyedited.

anisa-hawes commented 1 year ago

Hello @jreades,

I hope you are well.

Our copyeditor Iphgenia has prepared the edits for this lesson. I've staged these edits in a Pull Request #554. You can review the changes she's made in the rich diff by navigating to the "Files changed" tab.

Please let me know if you're happy with the adjustments. You'll notice that I have left some small comments/queries, and indicated where a few additions are needed.

With many thanks, Anisa

cc. @hawc2

hawc2 commented 1 year ago

@jreades did you see the pull request #554 awaiting your approval for copyedits on your lesson?

jreades commented 1 year ago

I had -- what I wasn't sure about was whether I was supposed to use the inline commenting function or approve the pull request. I've now done that and added in the requested revisions.

I think this means we're there? 🤞

anisa-hawes commented 1 year ago

Thank you, @jreades!

--

Hello @hawc2,

This lesson is almost ready for your final review.

Sustainability + accessibility actions status:

- name: Jon Reades
  orcid: 0000-0002-1443-9263
  team: false
  bio:
    en: |
      Jon Reades is Associate Professor at the Centre for Advanced Spatial Analysis, University College London.
- name: Jennie Williams
  orcid: 0000-0000-0000-0000
  team: false
  bio:
    en: |
      Jennie Williams is a PhD Student at the Centre for Advanced Spatial Analysis, University College London.

Next steps @hawc2 :

hawc2 commented 1 year ago

Thanks @anisa-hawes for getting everything ready. Yes, we can host these assets either in GitHub Large File Storage or in our Zenodo repository.

@jreades, can you add the link to GitHub where the code is hosted, as Anisa mentioned at line 731?

jreades commented 1 year ago

Done.

anisa-hawes commented 1 year ago

Thank you, @jreades 🙂

hawc2 commented 1 year ago

@jreades I did a final line edit of the lesson, standardized some of the wording, and tried to clarify a few points. The lesson overall is looking really solid; it's impressive and, while difficult, illuminating. It provides a lighthouse around which future PH lessons on embeddings can situate themselves, and it'll be interesting to see how we publish more lessons that go into the weeds of emerging machine learning methods while trying to stay sustainable. I also appreciate how you engage with the Scikit-Learn clustering lesson, and more generally situate the lesson in the context of other PH lessons.

In terms of the publishing timeline, Anisa is working on finalizing how we'll store all the assets for this lesson, and I'm preparing other elements for publication. I'm hoping to publish the lesson in the next couple weeks.

So this would be your last chance to make any edits to it. I had a couple lingering questions I was hoping you could clarify, and two suggestions for additional minor edits.

jreades commented 1 year ago

Hi Alex — replies below (and cc’ing Jennie to see if she has a quick ref for Q1).

Great, that’s nice to hear, thank you!

[ ] Under Configuring the Context, what does this line mean?: "A mixture of experimentation and reading indicated that Euclidean distance with Ward's quality measure is best" - does "reading" here mean "research"? Is there an article to cite for this decision?

Yes, we did mean ‘research’. @. do you have a reference or two handy? (There is a short illustrative sketch of the Euclidean + Ward choice at the end of this reply.)

[ ] I also wasn't sure what this sentence meant?: "Indeed, the assumptions about the theses being swapped between History DDCs are probably more robust, since the number of misclassified records is substantial enough for the differences to be relatively more robust." Can we use different words than robust here?

Ah, good catch. That paragraph is doing two things, which slightly confuses the matter: 1) there’s an argument that there are enough misclassified History theses to suggest substantive overlap between the input classes in the clustering space, so the differences in outcome classifications are significant (in a statistical sense); 2) an argument that the absence of misclassifications between Philosophy/Linguistics and History of the Ancient World suggests a strong separation in the clustering space, which is why we get slightly unhelpful distinguishing words such as ‘Bulgarian’ and ‘Mozambique’ in the TF/IDF plots (these terms are more like artefacts of the statistics). I’ve rewritten this to try to make it easier to navigate.

[ ] One thing I should've flagged earlier is that I'm not a fan of commentary embedded in code blocks, especially when it breaks up functions. In this case, however, your lesson is so complex, and the code blocks so detailed, that I think it works in many places. I tried to condense how many lines some of these commentary sections take up in disrupting the code, but I'd also encourage you to take one last look at this element of the lesson. In a few instances, some of the in-line commentary could be taken out of the code and added as a paragraph before or after the code chunk. Ideally you are explaining in the tutorial prose what each code chunk is about to do or has just done. This might be especially helpful where functions are broken up by commentary. I'll defer to you in the end on how you prefer the in-line commentary to appear case by case, but I just wanted to flag it as something you could alter.

I’ve tidied this up where it seemed sensible to me and left it in place where I thought removing it would make the resulting code more forbidding to less experienced Python programmers.

[ ] It's fine for you to leave that sort of commentary in the Google Colab notebook. The notebook itself works great, and my only ask for edits on the Colab notebook would be for you to incorporate more of the section headings from the PH lesson into the Colab notebook itself. Ideally a reader could switch from the lesson to the Colab notebook and use the outlines to figure out where in the lesson the Colab code fits. It doesn't have to be perfect, but adding some more signposts might help a reader juggle everything.

I’ve done that now. I also realised as a result that the code for comparing clustering algorithms wasn’t in the notebook. I’ve fixed this, and it led to a new sub-head in the tutorial near the end, with one paragraph moving.
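
For reference, the Euclidean-distance-plus-Ward's-linkage choice boils down to something like the sketch below. This is illustrative only, with random numbers standing in for the reduced document embeddings, not the tutorial's exact code:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 50))  # stand-in for 500 dimensionality-reduced document embeddings
# Ward's linkage minimises within-cluster variance and implies Euclidean distance
model = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels = model.fit_predict(X)
print(np.bincount(labels))  # cluster sizes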

If you would like to take a look at what I’ve done, I’ll get the refs from Jennie and we can work out where to put the parquet file.

Jon

hawc2 commented 1 year ago

@jreades thanks for making these edits and additions, including the references.

For the Parquet file you mentioned, is that currently in the directory of assets you link to? I think @anisa-hawes is planning to store all of those files in our Zenodo repository and relink to that location in the lesson.

jreades commented 1 year ago

Just chasing this: I can certainly put the Parquet file in GitHub so that you can access it via the 'assets' directory, but if you're then going to move it elsewhere there's not much point as it will bulk up your repo with a 26MB file that's never actually used.

Let me know where you want it to go and I can do it, or feel free to download using the URL in the code and move it wherever you like.

Is there anything else you need from me?

anisa-hawes commented 1 year ago

Dear @jreades. Thank you for following up.

Apologies for the delay. I am working through a few questions about how we handle lessons that integrate codebooks, and also about how to host large data assets. Both are important to ensuring we can manage and sustain this lesson into the future.

I have downloaded your code, created a .zip file which combines all the data assets, and uploaded it to our PH Zenodo repository. However, I've expressed to Alex that I am a bit unsure about the data .zip having its own DOI. This appears to be automatically assigned by Zenodo unless we supply one (the lesson's own DOI isn't activated until shortly after publication). I've contacted the library, who coordinate and register our DOIs with Crossref, to ask for advice here.

I’m also uncertain about how the download would work within the code. At line 245 of the Markdown, a block of Python specifies df = pd.read_parquet. Would this work to download and save a .zip? For concreteness, the two patterns I have in mind are sketched below. Sorry for all the questions and doubts here.
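
(Placeholder URLs and file names only, to illustrate the question rather than our actual hosting locations:)

import io
import zipfile
import urllib.request
import pandas as pd

# Pattern 1: pandas reads a hosted .parquet file directly from an HTTPS URL
df = pd.read_parquet('https://example.org/path/to/data.parquet')

# Pattern 2: a .zip has to be downloaded and extracted before the Parquet file inside it can be read
with urllib.request.urlopen('https://example.org/path/to/data.zip') as resp:
    with zipfile.ZipFile(io.BytesIO(resp.read())) as zf:
        with zf.open('data.parquet') as f:
            df = pd.read_parquet(f)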

Anisa

hawc2 commented 1 year ago

@anisa-hawes Can we move this lesson forward to publish in the next couple weeks?

anisa-hawes commented 1 year ago

Hello @hawc2 ,

Thank you for your extended patience. All the sustainability + accessibility actions are complete:

As you are in the unusual position of being both Managing Editor and Editor of this lesson, I have prepared the files on Jekyll to help you.

Everything is ready for your review here: https://github.com/programminghistorian/jekyll/pull/2987

(The first thing after publication will be for us to prepare one announcement and two future posts for our social media channels.)

hawc2 commented 1 year ago

Huge congrats @jreades and @jenniewilliams on the publication of this amazing new lesson on word embeddings: https://programminghistorian.org/en/lessons/clustering-visualizing-word-embeddings

It's been my pleasure editing this piece, and I'm grateful to @quinnanya and @BarbaraMcG for their careful review of the lesson. Big thanks to @anisa-hawes too for helping prepare this lesson for publication and developing a new way for us to manage Jupyter and Colab Notebooks going forward.

@jreades and @jenniewilliams, we'll be promoting the published lesson on social media, and we encourage you to share it around as well. I look forward to recommending students read it, and I'm sure I'll make use of it in my own research in the future as well. Thanks for all your work and time on this lesson, and again congratulations!