Understanding and Creating Word Embeddings

yann-ryan commented 1 year ago

The Programming Historian has received the following tutorial on 'Understanding and Creating Word Embeddings' by @blaak-18, @quinnanya, and @saraheconnell. This lesson is now under review and can be read at:

https://programminghistorian.github.io/ph-submissions/en/drafts/originals/understanding-creating-word-embeddings

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, which I include as a comment below this.

Members of the wider community are also invited to offer constructive feedback, which they should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me.

Our dedicated Ombudsperson is (Ian Milligan - http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. Thank you for helping us to create a safe space.

yann-ryan commented 1 year ago

Initial feedback:

Thank you for this contribution - I really enjoyed this lesson which explains the basics of word embeddings very clearly. It is overall extremely well written and easy to follow, and I think it is going to make a fantastic contribution to the Programming Historian. The lesson is in very good shape and almost ready to proceed to peer review. I have a few small suggestions in advance of that. If you’re happy with them (and please feel free to query!), the next stage would be to make the changes (you can do so directly to the draft), and I will begin the search for peer reviewers.

Some general points:

To me a strength of this lesson is that the code feels of secondary importance - much of the value I took from it was the discussion about the theory behind word embeddings, how to interpret them, and how to prepare a corpus for them, etc. With this in mind, I think it could be signposted early in the lesson so that the reader knows what to expect - that the coding steps will be quite minimal in comparison to the discussion.

I thought that some information could have been presented in a different sequence, alongside the practical parts of the lesson where possible. There were some places where as a reader I was unsure why I was being provided with certain information. Paragraph 20 goes into some detail with specific advice on how to mitigate against getting slightly different results from the Word2Vec model- at this point in the lesson, I’m not sure if that will be useful. It may be better moving it to the appropriate section (once the reader has run the model), or giving it a separate box so as not to break the flow of the lesson.

Another general suggestion is that I would advise reducing the information you provide in comments in code blocks and incorporate as much of it as possible within the lesson main text.

Some specific points:

[ ] Paragraph 5: It’s not totally clear what you mean by ‘traditional’ methods. Do you mean, for example, close reading, or more established digital humanities methods?
[ ] Paragraph 10/11: 'Visualization' doesn't feel like the correct word to me - the visual representation of the two-dimensional space has nothing to do with the math we can perform on it. Maybe ‘graph’ instead of ‘visualizations’?
[ ] Paragraph 20 - perhaps this could be moved further down and incorporated within the code of the lesson - I don’t think it’ll be helpful until the reader actually does this part of the process.
[ ] Paragraph 22: this explanation would really benefit from a simple diagram showing some vectors as lines and the corresponding triangle and how the cosine distance is calculated?
[ ] Paragraph 36: If this feature is IDE-dependent, it may be worth removing the part about tab completion or specifying when this feature will be relevant. I would bear in mind throughout that readers may complete the lesson using an IDE other than Jupyter Notebooks.
[ ] Paragraph 41: I think for a lesson at this level, it would be useful to suggest how a reader might learn how to do this extra step of dealing with contractions.
[ ] Paragraph 52: could you expand on what different data types are best suited to the CBOW or skipgram methods?
[ ] In the code block after paragraph 65, perhaps replace ("../../WordVectors/python/") with 'FILL IN YOUR FILE PATH HERE' as in the earlier code block, or explain more clearly that this may need to be changed.
[ ] I found the validation section (paragraph 62 - 64) a bit confusing. It wasn’t clear to me what exactly the evaluation was doing, and how I should interpret the outputs. I was also a bit confused as to why the evaluation output was being saved as a .csv - considering in the rest of the lesson code outputs are interpreted underneath their code cells. It seemed as if this file was being saved for some particular purpose, but I wasn’t sure what that was.
[ ] I also had a small bug in this code block: running the original code I got an error saying ‘key ‘cupcake’ not present’. When I removed the cupcake entry from the test_words list, it ran fine, so I’m guessing it’s because cupcake is not in the vocabulary?
[ ] It felt a little strange to end the lesson with a section headed ‘corpus preparation’. I suggest adding some final concluding remarks, summing up what the reader has learned, etc, before the further reading/next steps.

Small errors/typos:

[ ] Paragraph 3: remove the or a in the first sentence.
[ ] Paragraph 7: unnecessary hyphen after the em dash.
[ ] Paragraph 9: not sure ‘data’ is the appropriate heading here, as the paragraph focuses on the algorithm and some information about the lesson. Perhaps just remove and keep as part of the introduction?
[ ] Paragraph 14: use either dense or condensed
[ ] Paragraph 19: be consistent with capitalisation of Word2Vec, sometimes the first word is capitalised mid-sentence, sometimes not.
[ ] Paragraph 29: standardise use of 19th century and nineteenth-century (used differently in the header and body)
[ ] Paragraph 32: repetition of the information provided in the previous paragraph (the source of the recipes used for the lesson)
[ ] Paragraph 40: the references for these two citations should be provided, for example at the end of the lesson.
[ ] Paragraph 42: the last clause of the first sentence is missing a conjunction word/phrase, such as ‘and’ or ‘and finally’.

Can I ask you to propose a timeframe for carrying out the initial edits? The submission file above can be edited directly. In the meantime, I will find two peer reviewers.

Thanks so much!

Yann.

yann-ryan commented 1 year ago

The first reviewer for this lesson will be @rubenros1795, who aims to complete his review by late August. Thanks Ruben, let me know if you have any questions!

yann-ryan commented 1 year ago

Small update: second reviewer is also confirmed, I'll tag their Github username here over the next week or so.

blaak-18 commented 1 year ago

Paragraph 5: It’s not totally clear what you mean by ‘traditional’ methods. Do you mean, for example, close reading, or more established digital humanities methods?

Added language in para. 5 to clarify

Paragraph 10/11: 'Visualization' doesn't feel like the correct word to me - the visual representation of the two-dimensional space has nothing to do with the math we can perform on it. Maybe ‘graph’ instead of ‘visualizations’?

switched to "graph" in para 10/11

Paragraph 20 - perhaps this could be moved further down and incorporated within the code of the lesson - I don’t think it’ll be helpful until the reader actually does this part of the process.

Moved further down as suggested

Paragraph 22: this explanation would really benefit from a simple diagram showing some vectors as lines and the corresponding triangle and how the cosine distance is calculated?

We decided against including a diagram just so that the readers don't get so hung up on the math part of the lesson, but can discuss more based on reviewer feedback!

Paragraph 36: If this feature is IDE-dependent, it may be worth removing the part about tab completion or specifying when this feature will be relevant. I would bear in mind throughout that readers may complete the lesson using an IDE other than Jupyter Notebooks.

Added language in para 36 and removed tab completion bits

Paragraph 41: I think for a lesson at this level, it would be useful to suggest how a reader might learn how to do this extra step of dealing with contractions.

Added language to para. 41 to explain the contractions further

Paragraph 52: could you expand on what different data types are best suited to the CBOW or skipgram methods?

added language to para. 52 to address this feedback

In the code block after paragraph 65, perhaps replace ("../../WordVectors/python/") with 'FILL IN YOUR FILE PATH HERE' as in the earlier code block, or explain more clearly that this may need to be changed.

replaced as advised

I found the validation section (paragraph 62 - 64) a bit confusing. It wasn’t clear to me what exactly the evaluation was doing, and how I should interpret the outputs. I was also a bit confused as to why the evaluation output was being saved as a .csv - considering in the rest of the lesson code outputs are interpreted underneath their code cells. It seemed as if this file was being saved for some particular purpose, but I wasn’t sure what that was.

added more context to this section and also provided an option for displaying the results in-line if wanted

I also had a small bug in this code block: running the original code I got an error saying ‘key ‘cupcake’ not present’. When I removed the cupcake entry from the test_words list, it ran fine, so I’m guessing it’s because cupcake is not in the vocabulary?

fixed this bug
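For readers who run into the same error with their own test words: one way to avoid it is to check each word against the model's vocabulary before querying. This is a minimal sketch, not necessarily the fix applied to the lesson. The function name and the toy vocabulary are hypothetical; with Gensim 4.x, the real vocabulary would be `model.wv.key_to_index`.

```python
def split_by_vocabulary(test_words, vocabulary):
    """Separate words the model knows from words it has never seen."""
    known = [word for word in test_words if word in vocabulary]
    missing = [word for word in test_words if word not in vocabulary]
    return known, missing


# Hypothetical stand-in for a trained model's vocabulary
# (with Gensim 4.x this would be model.wv.key_to_index).
toy_vocabulary = {"milk": 0, "butter": 1, "cake": 2}
test_words = ["milk", "cupcake", "cake"]

known, missing = split_by_vocabulary(test_words, toy_vocabulary)
print(known)    # ['milk', 'cake']
print(missing)  # ['cupcake']
```

Only the words in `known` would then be passed on to the evaluation, and the `missing` list can be reported to the reader instead of raising an error.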

It felt a little strange to end the lesson with a section headed ‘corpus preparation’. I suggest adding some final concluding remarks, summing up what the reader has learned, etc, before the further reading/next steps.

We moved the section on preparing your own corpus to the end to help ease the lesson to a close

All of the below typos should be fixed now!

Small errors/typos:

Paragraph 3: remove the or a in the first sentence.

Paragraph 7: unnecessary hyphen after the em dash.

Paragraph 9: not sure ‘data’ is the appropriate heading here, as the paragraph focuses on the algorithm and some information about the lesson. Perhaps just remove and keep as part of the introduction?

Paragraph 14: use either dense or condensed

Paragraph 19: be consistent with capitalisation of Word2Vec, sometimes the first word is capitalised mid-sentence, sometimes not.

Paragraph 29: standardise use of 19th century and nineteenth-century (used differently in the header and body)

Paragraph 32: repetition of the information provided in the previous paragraph (the source of the recipes used for the lesson)

Paragraph 40: the references for these two citations should be provided, for example at the end of the lesson.

Paragraph 42: the last clause of the first sentence is missing a conjunction word/phrase, such as ‘and’ or ‘and finally’.

yann-ryan commented 1 year ago

Just to confirm, the second reviewer will be @anneheyer.

rubenros1795 commented 1 year ago

Thank you for this lesson. It is a clearly written and intuitive introduction to word embeddings. I liked the setup, especially the explanation of vectors based on document-term matrices. The choice to first introduce word2vec, then run some code, and add considerations at the end is also a good one.

Here are my comments:

In paragraph 6, some more explanation may be needed on what "semantically similar" means. While this might be intuitive for some, historians who, for example, just finished a lesson on collocation analysis could wonder how semantically similar words in word2vec differ from highly-ranked collocates.
In paragraph 13, a visual example of a DT-matrix might be useful.
In paragraph 14, the explanation of the embedding process is fairly minimal. You could try to add a bit more information on how exactly word2vec embeds. I realize this is hard, though.
The issue of "semantic similarity" comes up again in paragraph 19. True, word2vec is aimed at synonyms, but with smaller data, the associative dimension becomes more important. See: Hill, Felix, Roi Reichart, and Anna Korhonen. "Simlex-999: Evaluating semantic models with (genuine) similarity estimation." Computational Linguistics (2016). I would actually see this as an advantage of word2vec, because there is much more in associations for humanists than in optimally trained synonyms.
The values for cosine similarity range from -1 to 1. Does word2vec use additional normalization?
I'm not sure if the title above paragraph 24 ("Going beyond distance") is good for the point you are trying to make, since even with the vector algebra, it's still about distance. Something in the spirit of "beyond most similars" would be more in line with the argumentative step.
I like the discussion of the parameters. One important one that is missing and definitely should be included is the vector size. This parameter is rather crucial, because, if I'm not mistaken, the larger the vector, the more word relations are about meaning and less about grammatical aspects (the syntagmatic/paradigmatic axis).
The issue of corpus size (relevant in paragraph 7 and from 71 onwards) could be discussed more extensively, as small corpora often produce highly unstable vectors. This article mentions some orders of magnitude that could serve as a threshold for what is "enough" for word2vec. Perhaps you could consider adding such information: Wevers, M., & Koolen, M. (2020). Digital begriffsgeschichte: Tracing semantic change using word embeddings. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 53(4), 226-243.
In the "Next Steps", I would like to see some examples of research that uses word2vec to make a humanistic argument. I think this would give users of the lesson a better understanding of its potential, alongside technical elaboration.
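On the range of cosine similarity raised above, a small worked example may help. This is a generic sketch of the standard formula (dot product divided by the product of the vector lengths), written in plain Python. It makes no claim about how word2vec or Gensim normalizes vectors internally, and the function name and toy vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity is 1
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity is 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite directions: similarity is -1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same, 6), round(orthogonal, 6), round(opposite, 6))
```

Because the measure depends only on the angle between vectors, not their length, the result always falls between -1 and 1.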

yann-ryan commented 1 year ago

Hi @rubenros1795, thank you so much for this! As per the guidelines, I'd ask the authors to hold off making any edits until the second review is posted. At that point, I'll summarise the reviews and provide a list of action points.

anneheyer commented 1 year ago

My apologies for this review coming in a bit late. First of all, let me emphasize how much I enjoyed this lesson. The lesson is very useful, has nice (and entertaining) examples, is clearly structured and generally very well written. In addition, I have also worked in the accompanying notebook that was equally really enjoyable, informative and inspiring for future work. I have a few general comments that will be followed by paragraph-specific suggestions for improvements and I also list a few typos that I found.

General comments

Perhaps it is a good idea to explain a little bit more the relationship between the website text and the notebook? If I see this correctly, the notebook follows a slightly different structure, the main difference being perhaps “The Code” sections. For the user, it might be helpful if you mention in the website text that the code is further explained in the notebook under the sections “The Code”. This might be an additional reason for the user to open the notebook and explore further, even if they are beginners and find programming a bit scary. BTW: the notebook is really well done!

Specific Comments

¶4 perhaps name an example of IDE to help the user to understand? I can imagine less-experienced users being a bit confused with this. Obviously, you don’t want to explain details, but perhaps a little hint, example or link to an explanation would help?

¶5 I noted that you use “humanistic” rather than “historical”. Perhaps it would be good to briefly reflect (behind the scenes, not in the text) on whether this lesson is targeting historians or humanists more generally. I am fine with the latter, but just wanted to make you aware of it.

¶13 Perhaps a graph of a table (and a vector) would help the user to understand the document-term matrix better here? I made a drawing to get my head around this paragraph when reading it on a screen.

¶13 and 14 One might get a bit confused here about the general structure of the argument: is a matrix (a) one way of representing a corpus and sparse vector representation (b) another? And does this mean that embedding models (c) are a sub-form of sparse vector representation? I understand what you are trying to do (and generally everything is very well explained and easy to follow), but being a bit more explicit about the relationship between (a) and (b), and between (b) and (c), would help a beginner.
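A toy example may make the relationship between a matrix and a sparse vector concrete. The sketch below is illustrative only: the two "recipes" and all names are invented, and it uses plain word counts rather than anything from the lesson's code. Each row of the matrix describes one document, and a word's sparse vector is simply its column of counts across documents.

```python
from collections import Counter

# Two invented mini-documents standing in for a corpus
docs = {
    "recipe_1": "boil the milk and add the sugar",
    "recipe_2": "add the butter and the flour",
}

# Every distinct word in the corpus, in alphabetical order
vocabulary = sorted({word for text in docs.values() for word in text.split()})
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Document-term matrix: one row per document, one column per vocabulary word
matrix = [[counts[name][term] for term in vocabulary] for name in docs]


def word_vector(term):
    """Return a word's column of counts, one entry per document."""
    column = vocabulary.index(term)
    return [row[column] for row in matrix]


print(vocabulary)
for name, row in zip(docs, matrix):
    print(name, row)
print("milk:", word_vector("milk"))
```

In a real corpus most entries in such a matrix are zero, which is why these vectors are called sparse. Word embeddings instead represent each word with a much shorter, dense vector of learned values.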

¶31 Perhaps for the notebook, it would be useful to explain how to download the data from GitHub. When I return to coding after months of other work, I always need to search for the right place for downloads on GitHub.

Code between ¶32 and ¶33 I had to install gensim before I could import it. It might be useful to add this to your code to make it more accessible:

!python -m pip install -U gensim

¶34 and ¶35 Excellent point. When I started coding, this was one of the things that took me a while to understand. This will be very useful for beginners and the format also allows more advanced users to skip this quickly. Well done!

Code after ¶37 Does the basic Python course include an explanation of loops? One of the things that I found confusing at the beginning was that “name” in this code could also be “x” or “variable” or something else. To help the user learn to read the code, a “#note that…” comment might be helpful here.
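The point about loop variable names can be shown in a few lines. This is an illustrative sketch with invented file names, not the lesson's code: the two loops differ only in the name chosen for the loop variable and produce the same result.

```python
# Invented file names for illustration
filenames = ["recipe_1.txt", "recipe_2.txt"]

# Loop variable called "name"
upper_a = []
for name in filenames:
    upper_a.append(name.upper())

# Identical loop, but the variable is called "x" instead
upper_b = []
for x in filenames:
    upper_b.append(x.upper())

print(upper_a == upper_b)  # True
```

The name is a label the programmer picks, so a descriptive choice such as `name` or `filename` is mainly there to help human readers.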

¶40 You can probably safely assume that users know what tokenization is, but if you like, you could add a link here to where this is explained for computational linguistics (or the relevant programming historian course paragraph on the website) as a small service to the user.

Code after ¶42 I understand everything except the two lines after #remove punctuation, which I had to google a bit and still find a bit confusing. Could you explain these (compile, escape, sub, '[%s]', %) a bit more in the notebook, in the “code section”?
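For readers with the same question, the pieces named above are commonly combined as in the sketch below. This assumes the lesson's lines follow the usual idiom. The helper function and the sample tokens are invented for illustration.

```python
import re
import string

# string.punctuation is a string of ASCII punctuation: !"#$%&'()*+,-./ and so on.
# re.escape() adds backslashes so characters such as ] or \ are read literally.
# '[%s]' % ... drops the escaped string between square brackets, producing a
# character class that matches any single punctuation character.
# re.compile() turns that pattern into a reusable regular expression object.
re_punc = re.compile('[%s]' % re.escape(string.punctuation))


def strip_punctuation(token):
    """Replace every punctuation character in the token with nothing."""
    return re_punc.sub('', token)


tokens = ["sugar,", "well-beaten", "eggs."]
print([strip_punctuation(token) for token in tokens])
# ['sugar', 'wellbeaten', 'eggs']
```

Note that hyphens count as punctuation here, so "well-beaten" becomes "wellbeaten". Whether that is desirable depends on the corpus.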

¶40 Very helpful and written in such an accessible way!

¶51 Every parameter is clearly explained. Compliments also for indicating default settings and when parameters are optional. But the “Workers” section could be a bit more detailed, even though it seems less relevant.

¶51 In the notebook this section is called Analysis, which I think would also be helpful for the website text. Adding Analysis as a header would also help streamline working on the notebook while reading your explanations here.

¶60-64 Would it be helpful to explain how to interpret the numbers in the output? I think you said something about interpreting the cosine earlier, but I am not sure what I would even call the figures in the output: are these similarity scores, or likelihoods? I guess you say this earlier, but here (or in the notebook) it would be helpful, too.

¶Validation Very well done. One thing to consider is mentioning that you have now provided code for one model and in order to make this work, one would have to come up with a few models. When going through this part of the lesson, I was wondering whether I had created multiple models, without really realizing it. But seeing that my csv file remained empty, I assume that I would do that by changing parameters? Perhaps you could give one example of an additional model that is also meaningful for your example to make this section even more concrete?

¶73 Again perhaps a superfluous comment: I was wondering whether you could add “Spanish” before gato? In some places, like the US, Spanish is a quasi second language, but this might be different in the Asian context or even in (Eastern) Europe, where more people speak French. For “con” you could add in brackets (“with” in Spanish). This makes the text a little less abstract for some groups of users.

¶79 Love the suggestions for further practice. I would also welcome a list of references for the scholarly texts (or blogs) that you mention in the text, or is this against the Programming Historian’s format?

Typos

¶3 “you can use the a Jupyter notebook”

¶5 “Questions such as these are the type of humanistic inquiries that can be prove to be challenging to answer through traditional methods such as close reading”

¶ 6 “Unlike topic models, which rely on word usage to better understand documents, word embeddings are more concerned with how words across a whole a corpus are used.”

¶9 “While word embeddings have been implemented in many different ways using varying algorithms, for this purposes of this lesson,”

¶13 “because most of the vectors for each word”

Again very well done and, as you hopefully notice, all my comments are rather minor. Thanks so much for creating this lesson. This is truly a great service to the community!

yann-ryan commented 1 year ago

Thanks so much for your review @anneheyer! Now that we have both, we can make a plan for revisions. I'll aim to read through both and comment here by the end of the week.

yann-ryan commented 1 year ago

Thanks again for these incredibly helpful and constructive reviews!

@blaak-18, @quinnanya, and @saraheconnell, I think both reviewers are in agreement that this lesson is in a very good state in terms of structure and the overall content. In both cases the comments are minor and relate to specific tweaks or requests for clarification.

I suggest going through the specific reviewer comments and addressing them individually - but there's no need for any very large or organisational changes at this stage. I think in most cases the changes to be made are clear, but here are a few points:

Most of the comments relate to clarifications to the code or method. I think it would improve the lesson to address these as far as possible - particularly as it's aimed at relative beginners. However in a few places, it may be better to point users to other PH tutorials.
It stood out to me that both reviewers agreed that a graph/table of the DTM and vectors would be very helpful. This would be a useful addition to the lesson, if it's possible to add.

At some point, the Jupyter notebook will need to be updated, but I suggest waiting until the text of the lesson is finalised to do that. There is also the possibility of hosting your notebook on the PH's Google colab - are you OK with this?

Generally, it's advised to have a timeframe of about a month for completing this round of edits. Does Friday October 7th work as a deadline for you?

saraheconnell commented 11 months ago

Yes, thank you, we really appreciate the time and care that went into these! We are working on the edits now and will put in every effort to have them ready by October 7. Avery and I are fine with hosting on Colab, as long as that’s also okay with @quinnanya.

yann-ryan commented 10 months ago

Hi @blaak-18, @saraheconnell, and @quinnanya: I've just uploaded your edited version of the markdown file directly to the repository. When you can, would you mind creating a comment here listing the changes you've made in response to the reviews? Thanks so much!

anisa-hawes commented 10 months ago

Hello @yann-ryan. Thank you for adding the authors' updates!

Thank you also for already mentioning that we can host notebooks associated with lessons within our organisational Colab space. When the notebook is finalised, we'd be very happy to receive it for processing + upload.

We're moving towards a new approach for integrating notebooks to support sustainability, future translatability and usability. Ideally, we want our readers to be able to make the choice to work in Google Colab, run the code locally, or opt to work in a different cloud-based development environment.

If authors provide codebooks to accompany their lesson, we ask that:

Notebooks consist of the code + line comments only
Headings and subheadings mirror those of the lesson to support readers' navigation
Notebooks do not extend or replicate commentary from the lesson

Please let Alex @hawc2 or I know if you have any questions. Thank you.

blaak-18 commented 10 months ago

@yann-ryan Below is the list of changes we have made in response to reviewer feedback:

Updated the wording to be more precise: "used in similar contexts" rather than "semantically similar"
We reviewed the lesson and removed all references to "semantic similarity"; there are several places where we discuss or remind readers that models will reveal "closeness" and the different kinds of language usage that might produce such closeness in vector space. We think that this is an appropriate level of detail for an introductory lesson such as this one, but are happy to expand if a more extensive discussion is called for.
Revised "Going Beyond Distance" header to "Using vector math to answer questions"
Added vector_size parameter discussion
Added language about corpus size in the paragraph beginning with "Although smaller corpora..."
Added a section on readings and resources
Added a note when the link to the notebook is introduced, "The notebook contains…"
Revised IDE explanation to include examples and spell out the acronym "an Integrated Development Environment (IDE)—such as IDLE, Spyder, or Jupyter Notebooks"
Added a link to a GitHub release of the notebook to make it easier for downloading everything for those who prefer more structure and language about installing libraries
In the section where we discuss prior knowledge needed for this tutorial, we have added a pointer to an additional tutorial that explicitly covers variable naming & best practices. "or see this very brief.."
Linked to PH lesson on normalizing and tokenizing text in Python “for more on tokenizing text with Python….”
Added explanation of the re_punc and sub functions to the Jupyter Notebook
Added additional language about workers “Increasing this parameter means…”
Added Analysis heading above “exploratory queries”
Added “the code below will return…” in the description of the most similar function
The CSV should have been populated even with only one model. We've updated the code to help prevent this kind of issue.
Added “(in Spanish)” to introduction of Spanish words
Fixed typos

Please let me know if you need anything else in terms of notes!

yann-ryan commented 10 months ago

@blaak-18 Thank you so much for this, this is plenty of detail. I'll now do one final review, and then this is ready to move to the next stage!

yann-ryan commented 10 months ago

Hi @blaak-18, @saraheconnell, and @quinnanya,

Thanks again for this updated version. I've made a few very small edits but think we can begin to progress to the next step and recommend that this be copy-edited before publication.

@rubenros1795 and @anneheyer, would you mind having a look over everything and letting us know here if you have any further comments or suggestions based on this new revised version?

I've sent an email with some further small tasks so we can complete the lesson metadata. Also, can I ask you to create a new version of the notebook according to the guidelines posted by @anisa-hawes above? You can post a link to it here, or email it to me directly if you prefer.

The guidelines are:

Notebooks consist of the code + line comments only
Headings and subheadings mirror those of the lesson to support readers' navigation
Notebooks do not extend or replicate commentary from the lesson

Thanks!

Yann.

rubenros1795 commented 10 months ago

Dear all,

Thank you for these revisions. I have little to add, because all my main points are addressed. I think speaking of "closeness" is a perfect way to avoid theoretical complexities, but also an intuitive concept for people to understand what word2vec measures.

Ruben

anneheyer commented 10 months ago

Dear all, thanks for this very thorough revision process - really impressive work. For me everything looks great. Just one question: when will this be posted for the general public? I might direct some students to this in the upcoming block.
Anne

yann-ryan commented 10 months ago

Thanks @anneheyer and @rubenros1795!

@anisa-hawes and @hawc2, this lesson is more or less ready to be passed to you, I think, besides a few last pieces of metadata.

hawc2 commented 10 months ago

Congrats everyone on getting this lesson revised! Thanks to our reviewers! I'm looking forward to reviewing the lesson as well, and I'm very excited for Programming Historian to publish an introductory lesson on this important subject.

Would it be possible to make a more direct link between this lesson and the Clustering and Visualizing Documents using Word Embeddings lesson that we recently published? I see it cited in the Next Steps section, but it might be nice for the link between the two lessons to stand out more clearly, and maybe say a little more about how they are connected?

I don't see References yet. Were you planning to include a Reference list?

One other minor thought - especially since this is an introductory lesson, could we include more links for key terms and references to sources like Wikipedia? For example, when you reference IDEs or Gensim... The more links to relevant info, the better, in my opinion.

saraheconnell commented 10 months ago

Thanks! We can work on those edits—are there particular connections between the other lesson that we should be trying to draw out, apart from the fact that it's something that people might be interested in after the intro? (And, would corresponding edits need to be made to that lesson to point people toward the more introductory one?) Also, unless I've missed something, I don't think I have write access. Is there a preferred way that we should be making revisions, now that others are more actively working on the draft?

anisa-hawes commented 10 months ago

Hello @saraheconnell,

I've sent you an invitation to join us as an Outside Collaborator. This means that you can make direct edits to your Markdown file /en/drafts/originals/understanding-creating-word-embeddings.md.

Let me know if you need any advice, or if you'd prefer to email Yann or me your collective edits. Thank you, Anisa

saraheconnell commented 10 months ago

Hi @anisa-hawes, many thanks! I'll get started on those links and such as soon as I can and will let you know if any questions arise. Cheers!

anisa-hawes commented 10 months ago

Hello @yann-ryan,

Thank you for sharing the revised .ipynb with me.

I've uploaded this to our PH Colab space and linked it to our repo (https://github.com/programminghistorian/ph-submissions/commit/b4c6dadb1968ca8cf917c177a704123e27edc224). It is now available to review: /assets/understanding-creating-word-embeddings/understanding-creating-word-embeddings.ipynb. If any further amendments are required, please send me an updated version of the file – I'll need to replace our copy in Colab and relink it to our repo.
I've updated the links to this notebook asset at lines 42 and 135 of the lesson: https://github.com/programminghistorian/ph-submissions/commit/66ec2c5e74a2476e023925496399b9796784ae1d

--

While doing this I noticed:

[ ] Line 135 refers to a sample corpus which readers can use to train a model on their own computer. I'd like to suggest that we also consider hosting this corpus on our GitHub repository, or on Zenodo (if the zipped files are too large for GitHub) to support the sustainability of this lesson. What do you think, Avery @blaak-18?
[x] The following sentence of line 135 reads: The notebook contains the same code discussed below, but with some additional comments on the code itself, and with fewer large-scale contextualizations. Now that the notebook has been revised, I just wanted to check with @saraheconnell whether this sentence needs revision or removal? We'll go through the whole lesson carefully during copyedit, but I just wanted to raise this query now as I note it.

Thank you. Anisa

hawc2 commented 10 months ago

@saraheconnell I think it could just be a sentence or two, maybe near the end of the lesson, saying something like: "Now that you've learned how to build and analyze word embeddings, you can see the Clustering and Visualizing Documents with Word Embeddings lesson to learn more about what advanced methods of analysis are possible." I'd mostly just like to make sure that the connection between these lessons on the same topic and both published by ProgHist is more clearly delineated, in distinction from the other next steps and related resources

After this lesson is published, we can reach out to the author of the Clustering lesson and ask them to add something to their lesson which links it more directly to your lesson for prerequisite learning.

@anisa-hawes let's plan to do this over the coming months - I also wonder if there's any more formal way ProgHist can have a little featured text at the bottom of the lesson that makes it clearer what other lessons are directly connected. In this case, I don't think we can call these two lessons a 'series,' but they are very relevant to each other.

saraheconnell commented 10 months ago

Thanks, all! I just committed a few changes to add more informational links, a short references section, and a more direct reference to the other tutorial (making that more visibly the first "next step"). I also deleted that outdated reference to what can be found in the notebook. Let me know if I need to adjust any of this!

anisa-hawes commented 10 months ago

Thank you for your further edits @saraheconnell.

Would you like to re-read this, @hawc2? (You can review Sarah's adjustments in rich-diff here https://github.com/programminghistorian/ph-submissions/commit/1b03a4471b2e5f3752f7b99c3d164f121ddc4f5d) If you're happy that this lesson is ready to move onwards into Phase 6, let me know – we can plan time to start copyediting next week.

Anisa

hawc2 commented 10 months ago

@saraheconnell thanks for all your edits thus far. I've done a complete read through and made a series of line edits to clarify certain sections and standardize some styling. That includes a slight reorg of some of the Introduction section. Before we send this on to copy-edits, I have a set of final, minor revisions I was hoping you could make to the lesson.

[ ] The section entitled “Word2Vec” is a little confusing. Some of the concepts explained here are better explained elsewhere. For example, when you write: “Word2vec collects a sample of contexts around each word throughout the corpus, along with contexts that don’t actually exist in the corpus:” - it makes it sound like this is how Word2Vec functions, rather than that you are choosing to give it this secondary corpus of surreal sentences.
[ ] You also bring up the concept of a ‘neural network’ here but don’t really explain it. Is it a necessary concept to introduce here? If so, say more. What kind of neural network is word2vec?
[ ] One other thought is it may be worth saying a little more about the history of Word2Vec. You say it’s the first word embedding algorithm, but perhaps you could add a sentence or two about how Google created it and then released it as open-source?
[ ] Line 117: "the more two vectors are going in the same direction for the same distance, the nearer they are." I found this a little confusing - “in the same direction for the same distance” - Can you find a way to clarify this a bit?
[ ] There are two sections where you talk about the sample data, maybe it's worth linking your data set the first time you mention it in the Introduction? Just so readers find it right away. On this same topic, the sections, Corpus specs and Accessing the code and data kind of repeat themselves and could be combined and consolidated.
[ ] On line 220, you say “the string.punctuation pre-initialized string” - is there a typo here? Seems like “pre-initialized string” could be deleted.
[ ] One general formatting change I’d suggest is, where possible, try to reduce the commented code. Some comments run a long ways and could be shorter, to declutter the code chunks themselves, and reduce redundancy. One good example is the section now titled “Code for cleaning the corpus”. Your final big chunk of code in Validation also seems like it could be broken up into chunks with a little commentary in between.
[ ] Another example is the code chunk to train and save the model. You could get rid of the commented “# train the model” and just say in the sentence before the code, “The following line of code is where you train the Word2Vec model.” Then provide the line of code, then another paragraph commentary where you say, “You can save your model by running ‘model.save("word2vec.model")’”.
[ ] At the end of the section on Parameters, you have a paragraph that begins “Because word2vec samples skipgrams, you won’t end up with the same result every time.” This paragraph was a little confusing to me, reading back over the lesson I got a little lost on whether it is using the CBOW or skipgram method, and at this particular moment, I felt I needed a refresher on skipgrams to make this paragraph easy to understand in context. Would this point still apply if I set the CBOW method? Maybe this paragraph should be part of the description of Skipgrams a little higher up in the section.
[ ] I found this sentence a little hard to read, could you rephrase? -"This style of evaluation can help you identify which of the models is performing best by identifying which models understand word similarities the way you would expect the"
[ ] I found this sentence hard to read, could you make it simpler to read?: "The point of evaluating a set of models using this method, is that by choosing word pairings which should have very high cosine similarities (i.e. words that are closely related) or even pairings that should have very low cosine similarity, then you can get a sense of how well your model is performing."
[ ] In the section - “Building a corpus for your own research” - are the bullet points necessary? Each point is a long paragraph, it might be easier to read without bullet points.
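The two comments above about vector "closeness" and about evaluating a model with hand-picked word pairs can be made concrete with a small sketch. This is not the lesson's code: the three-dimensional vectors, the word pairs, and the helper names `cosine_similarity` and `score_pairs` are all invented here for illustration, and a real model would supply vectors with many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", made up for this example only.
toy_vectors = {
    "milk": [0.9, 0.1, 0.3],
    "cream": [0.8, 0.2, 0.35],
    "oven": [0.1, 0.9, 0.2],
}

# Pairs the researcher expects to be close, and pairs expected to be distant.
expected_close = [("milk", "cream")]
expected_distant = [("milk", "oven")]


def score_pairs(vectors, pairs):
    """Map each (word, word) pair to its cosine similarity."""
    return {
        pair: cosine_similarity(vectors[pair[0]], vectors[pair[1]])
        for pair in pairs
    }


close_scores = score_pairs(toy_vectors, expected_close)
distant_scores = score_pairs(toy_vectors, expected_distant)

# Cosine similarity depends only on direction, not on vector length.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

If the pairs chosen as "close" score clearly higher than the pairs chosen as "distant", that is one sign the model reflects the similarities you expected. Note that cosine similarity compares direction only, so two vectors of different lengths pointing the same way still score as maximally similar.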

That’s it! I hope these are all quick, minor edits for you. I found this lesson incredibly informative, it’s such a great description of word embeddings and how to use them. I’m excited for it to be published, and to use it for research and teaching!

Once you make these last edits, the lesson can move on to copyedits. We'll aim to publish before end of the year.

hawc2 commented 10 months ago

One other thing to note, early on you talk a bit about how the reader can use the Jupyter Notebook. It is worth adding that ProgHist is making it available as a Colab notebook. I know that is a recent addition we made, but we should make sure it works correctly, and that its availability is mentioned in the lesson when you bring up how the reader can access the code.

quinnanya commented 10 months ago

I think the list of comments in the post above were edits I was supposed to take care of at an earlier point but got lost in the inbox. Let me see what I can do to edit it next week?

quinnanya commented 9 months ago

Thanks for these notes, @hawc2! I think I've addressed all of them, clarifying the text in several places, streamlining the code, and consolidating a few things.

The architecture of Word2Vec is a 2-layer neural net (see here for an explanation), but I don't know if even saying "neural net" is sort of opening a can of worms we then have to deal with, so I deleted it everywhere instead. Happy to take the other tack if you'd prefer, though!

hawc2 commented 9 months ago

Thanks @quinnanya for addressing my comments and making these revisions.

As for the '2-layer neural net' explainer, it's up to you if you want to explain it or not. I agree it's not really necessary to understanding how to interpret word embeddings themselves, and it does open a can of worms. It's also fine to briefly state the fact that it's a two-layer network and link to the explainer you provided, although it's possible that blog you linked won't always be available. There will still be some opportunities for minor tweaks during copy editing as well, but if you plan to add this link let us know now.

@anisa-hawes this lesson is ready for Phase 6!

quinnanya commented 9 months ago

Let's just leave the neural net piece out of it, it feels like it's likely to cause more problems than the value it adds

anisa-hawes commented 9 months ago

Super! Thanks to all.

--

Hello @blaak-18, @quinnanya, and @saraheconnell,

Your lesson will now be copyedited by our Publishing Assistant, Charlotte (@charlottejmc). We aim to complete the work by ~next Friday 8th December.

Please note that you won't have direct access to make further edits to your files during this Phase.

Any further revisions can be discussed with your editor @yann-ryan after copyedits are complete. Thank you for your understanding.

Anisa.

charlottejmc commented 9 months ago

Hello @blaak-18, @quinnanya, @saraheconnell and @yann-ryan, I've prepared a PR with the copyedits for your review.

There, you'll be able to review the 'rich-diff' to see my edits in detail. You'll also find brief instructions for how to reply to any questions or comments which came up during the copyedit.

When you're both happy, we can merge in the PR.

blaak-18 commented 9 months ago

@quinnanya @yann-ryan @charlottejmc @saraheconnell

Thank you so much for these edits! Sarah and I have reviewed the copyedits and everything looks great to us. We're happy for this to go ahead and be merged!

charlottejmc commented 9 months ago

@blaak-18, thank you for the confirmation!

I have now merged the copyedit branch and will move on to the next phase, which is the Typesetting. ✨

charlottejmc commented 8 months ago

Hello @blaak-18, @quinnanya, and @saraheconnell,

Thank you again for reviewing my copyedits, which you saw in the pull request I made previously (now closed). There are, however, a few outstanding points from the comments I made inline, which I thought might be easier to reiterate here:

[ ] In this comment, I asked about the paragraph which I've isolated in an 'info' box at line 167. I think the way the sentences follow each other here made it difficult for me to understand, as a non-expert, whether the advice "keep in mind that these files should always contain some form of plain text. For example, .doc or pdf won't work with this code." refers back to the given code (endswith(.txt)), or follows on from changing the endswith() call.
[ ] In this comment, I was hoping you might clarify the idea of 'embracing textual noise' and whether it is anti or pro-regularization. I have moved the reference to footnote [^3] which you can find at line 497.
[ ] The three endnotes refer to research mentioned in the text. However, I lacked the information to flesh these out and cite the works correctly. Footnote [^3] especially only gives author names and dates, and I could not find the full citation in the original text. I wonder if you might be able to provide more information about these, including perhaps a small discussion.
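To illustrate the first question above, here is a minimal sketch of selecting files by extension with `endswith()`. It is not the lesson's code (which is not reproduced in this thread): the helper name `list_text_files`, the file names, and the temporary folder are assumptions made for the example.

```python
from pathlib import Path
import tempfile


def list_text_files(folder, extension=".txt"):
    """Return sorted paths of files in `folder` whose names end with `extension`."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.name.endswith(extension)
    )


# Build a throwaway folder with a mix of file types (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("recipe_one.txt", "recipe_two.txt", "scan.pdf", "notes.doc"):
        (Path(tmp) / name).write_text("placeholder", encoding="utf-8")
    found = [path.name for path in list_text_files(tmp)]
```

In this sketch only the two `.txt` files are selected. Passing a different `extension` would change which names match, but matching by name does not make a file readable as plain text, which is why formats such as .doc or .pdf would still fail in a later step that reads the contents as text.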

I apologise if these questions were not immediately visible to you from my comment above.

Thank you for your help! Charlotte

charlottejmc commented 8 months ago

Hello @hawc2,

This lesson's sustainability + accessibility checks are in progress.

Preview: http://programminghistorian.github.io/ph-submissions/en/drafts/originals/understanding-creating-word-embeddings

Publisher's sustainability + accessibility actions:

[x] Copyediting
[x] Typesetting
[x] Addition of Perma.cc links
~[ ] Check/resize images~
~[ ] Check/adjust image filenames~
[x] Receipt of author(s) copyright agreement:
- Hello @blaak-18, @quinnanya, and @saraheconnell, our authorial copyright declaration form is an opportunity to acknowledge copyright and grant us permission to publish the lesson. For lessons that are co-authored, we only require one lead author to complete the form. Could one of you download this, complete the details, and email it to me (publishing.assistant [@] programminghistorian.org)? Many thanks.
[x] Request doi

Authorial / editorial input to YAML:

[x] Define difficulty:, based on the criteria set out here
[x] Define the research activity: this lesson supports (acquiring, transforming, analysing, presenting, or sustaining) Choose one
[x] Define the lesson's topics: (api, python, data-management, data-manipulation, distant-reading, get-ready, linked-open-data, mapping, network-analysis, web-scraping, digital-publishing, r, machine-learning, creative-coding, or data-visualization) Choose one or more. Let us know if you'd like us to add a new topic
~[ ] Provide alt-text for all figures~
[x] Provide a short abstract: for the lesson
[x] Agree an avatar (thumbnail image) to accompany the lesson

The image must be:

copyright-free

non-offensive

an illustration (not a photograph)

at least 200 pixels width and height

Image collections of the British Library, Internet Archive Book Images, Library of Congress Maps as well as their Photos/Prints/Drawings or the Virtual Manuscript Library of Switzerland are useful places to search.

[x] Provide avatar_alt: (visual description of that thumbnail image)
[x] Provide author(s) bio for ph_authors.yml using this template:

- name: Forename Surname
  orcid: 0000-0000-0000-0000
  team: false
  bio:
    en: |
      Forename Surname is an Assistant Professor in the Department of Subject at the University of City.

- name: Forename Surname
  orcid: 0000-0000-0000-0000
  team: false
  bio:
    en: |
      Forename Surname is an Assistant Professor in the Department of Subject at the University of City.

- name: Quinn Dombrowski
  team: false
  orcid: 0000-0001-5802-6623
  bio:
    en: |
      Quinn Dombrowski is the Academic Technology Specialist for the Division of Literatures, Cultures, and Languages at Stanford University, and works on non-English digital humanities.
    fr: |
      Quinn Dombrowski est spécialiste des technologies appliquées à la recherche au sein de la Faculté de Littératures, cultures et langues à l'Université Stanford avec un intérêt particulier pour les humanités numériques non-anglophones.
    pt: |
      Quinn Dombrowski é técnica especialista na Divisão de Literaturas, Culturas e Línguas da Stanford University e trabalha em Humanidades Digitais não anglófonas.

Hi @quinnanya, I've pasted in the bio we have on file for you from previous lessons. Please let me know if you'd like any adjustments made!

Files to prepare for transfer to Jekyll:

.md file: /en/drafts/originals/understanding-creating-word-embeddings.md
images: [none]
assets: /assets/understanding-creating-word-embeddings
original avatar: /gallery/originals/understanding-creating-word-embeddings-original
gallery avatar: /gallery/understanding-creating-word-embeddings

Promotion:

[x] Prepare announcement post (using template)
[ ] Prepare x2 posts for future promotion via our social media channels

saraheconnell commented 8 months ago

Hi Charlotte, apologies for missing these! Avery and I should be able to tackle them next week—hopefully by end of day Wednesday.

Cheers,

Sarah

quinnanya commented 8 months ago

Hi Sarah,

Do you want me to try to do a first pass on this stuff tonight? I’ve got a bit of time.

~Quinn

charlottejmc commented 8 months ago

Hello @blaak-18, @quinnanya, and @saraheconnell,

I have had a first look around for a lesson avatar, and suggest to you this drawing from the British Library.

What do you think? I like the feeling of 'embeddedness' it creates, but we can also look for images that suggest networks, vectors, milk, cookbooks, etc.!

saraheconnell commented 8 months ago

Hi @charlottejmc,

Thanks for checking in! We are very charmed by that image—another possibility is this one (https://www.flickr.com/photos/britishlibrary/11139139683/in/album-72157638850077096/), which I’d sent to Yann earlier. We are happy with whichever you think is the best fit; the star is perhaps more evocative of vectors, but the goblin is more memorable.

We also just got off a call together and believe we have resolved the remaining comments—please let us know if we missed anything! (The commit says ‘edits in progress’ but we didn’t have any further changes to make after we checked that in.)

One thing we wanted to flag is that the markdown in the info box at line 167 might need to be reviewed by the typesetter.

Here are the other remaining details: Difficulty: Intermediate

Research activity: Analyzing

Topics: python, distant-reading, machine-learning

Abstract: Word embeddings are a text analysis method that allows you to analyze the usage of different terms in a corpus, by capturing information about their contextual usage. This lesson covers how to create word embeddings and use them to answer humanities research questions.

The topics and abstract are new, but we’d sent the research activity and difficulty to Yann last month—just wanted to flag that in case sending it again would cause any confusion! For the bios, I sent mine in to Yann earlier, and Quinn already has one on file. Avery will get hers in soon.

Thanks so much,

Sarah

blaak-18 commented 8 months ago

And here is my bio!

Avery Blankenship is a PhD candidate in the department of English at Northeastern University. She is a member of the Viral Texts Project and her work is published in The Mark Twain Annual, The Nathaniel Hawthorne Review, and in the Viral Texts book project Going the Rounds: Virality in Nineteenth-Century American Newspapers. Her dissertation, Marginal Spaces, explores the uptake, use, and transformation of nineteenth-century American cookbooks and recipes along the lines of race, gender, and class. Her research focuses on nineteenth-century domesticity, cookbook and recipe circulation, and power dynamics in the nineteenth-century home.

charlottejmc commented 8 months ago

Hello all,

Thank you very much for your response! I've now added all the information into the markdown file, and only made a few minor touch-ups to your additions, which were great.

I decided to go with the 'goblin' avatar in the end, as I agree with you it's more memorable!

Thank you @blaak-18 for your bio – we use a specific format which looks like this:

- name: Forename Surname
  orcid: 0000-0000-0000-0000
  team: false
  bio:
    en: |
      Forename Surname is an Assistant Professor in the Department of Subject at the University of City.

So, may I suggest:

- name: Avery Blankenship
  orcid: 0000-0000-0000-0000
  team: false
  bio:
    en: |
      Avery Blankenship is a PhD candidate in the department of English at Northeastern University.

Unfortunately this means we can only use the first sentence of the bio you wrote. Apologies! I didn't find an ORCID number for you, but do let me know if I'm wrong.

The lesson is getting really close to publication now. Exciting! We are grateful for your patience and your work so far.

anisa-hawes commented 8 months ago

Thanks @saraheconnell, @blaak-18, and @quinnanya for the energy you have given to these revisions, clarifications and adjustments.

We noticed that the corpus central to this lesson is hosted by the Viral Texts Project on your GitHub repository Nineteenth-Century American Recipes.

We'd like to host a copy of this dataset ourselves so that we can ensure the lesson's sustainability.

[x] I've downloaded the corpus and created a .zip file of all the recipes so that readers will be able to download them easily
[x] I've uploaded this .zip to: /assets/understanding-creating-word-embeddings
[x] I've updated the links embedded within the following lines of the lesson: This lesson uses as its case study a relatively small corpus of nineteenth-century recipes (line 33), The corpus we are using in this lesson is built from nineteenth-century American recipes. (line 137), You can download this lesson’s Jupyter notebook and the corpus to train a model on your own computer (line 145), You will provide the file path for the sample corpus which you should have downloaded from Github (line 164) so that we are directing to the .zip file hosted here on PH.

blaak-18 commented 8 months ago

@anisa-hawes I was the one who collected the data for that set so I'm totally fine with the set being hosted through PH to make things easier

anisa-hawes commented 8 months ago

Thank you, @blaak-18. I noticed that you don't explicitly acknowledge this in the lesson, so let's also add an endnote with a clear citation for the dataset:

Blankenship, Avery. “A Dataset of Nineteenth-Century American Recipes,” Viral Texts: Mapping Networks of Reprinting in 19th-Century Newspapers and Magazines. 2021. https://github.com/ViralTexts/nineteenth-century-recipes/.

blaak-18 commented 8 months ago

@anisa-hawes That works for me! I appreciate your work on this!

charlottejmc commented 8 months ago

Hello @hawc2,

This lesson's sustainability + accessibility checks are now complete.

author bios for ph_authors.yml:

Avery Blankenship:

- name: Avery Blankenship
  team: false
  bio:
    en: |
      Avery Blankenship is a PhD candidate in the department of English at Northeastern University.

Quinn Dombrowski:

[Already in ph_authors.yml]

Sarah Connell:

[Sent to @yann-ryan]

.md file: /en/drafts/originals/understanding-creating-word-embeddings.md
images: [none]
assets: /assets/understanding-creating-word-embeddings
original avatar: /gallery/originals/understanding-creating-word-embeddings-original
gallery avatar: /gallery/understanding-creating-word-embeddings

Promotion:

[x] Prepare announcement post (using template)
[ ] Prepare x2 posts for future promotion via our social media channels

programminghistorian / ph-submissions

Understanding and Creating Word Embeddings #555
