Review Ticket: Understanding and Using Common Similarity Measures

walshbr commented 4 years ago

The Programming Historian has received the following tutorial on 'Understanding and Using Common Similarity Measures' by @jrladd. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/common-similarity-measures

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our Ombudsperson (@amandavisconti). Thank you for helping us to create a safe space.

[Permission to Publish]

The editor must also ensure that the author or translator post the following statement to the Submission ticket.

I the author|translator hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English|en français|en español to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

walshbr commented 4 years ago

@jrladd - I'll get to work on this in the next day or two. In the meantime, if you wouldn't mind re-posting the "Permission to Publish" bit above I'll get to it. I'll make any basic syntax corrections and things necessary and ping as things come up. Will probably have suggestions for revisions before sending it out for review.

walshbr commented 4 years ago

@jrladd I think I fixed all the image paths and LaTeX syntax. The syntax for MathJax appeared to be double dollar signs rather than singles. I'll do a sweep today for general comments, but it might be worth double checking the equations to make sure I didn't muck anything up with the find and replace I did.

jrladd commented 4 years ago

@walshbr Thanks! The equations all look right to me. One of the images (for city block distance) still isn't showing up, though.

I want to make some small changes to the document structure: adding an additional top-level heading and moving some headings up a level. That will make the table of contents feel less lopsided and help the reader navigate. But I'll wait to do that until I get your other comments.

Here's my permission to publish:

I the author hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English|en français|en español to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

walshbr commented 4 years ago

@jrladd Whoops I hadn't pushed. That figure should be working now. You can go ahead and push those heading changes if you want - I don't imagine they will interfere with what I do!

walshbr commented 4 years ago

Common Similarity Measures Editing Notes

@jrladd - just finished a preliminary read through. This is great! I knew a little about the topic and I learned a lot. I think this is exactly the sort of thing I’d like to see us do more of. My main comments below pertain to bringing out the case studies / examples you’re using a bit more. I think for Austen/Wharton and the EarlyPrint dataset both, spending more time on the kinds of questions that method can help you to ask will help people understand why they’re doing this in the first place. And it will help encourage people to see the results as more than just numbers but also opportunities for interesting research questions. If you have any examples of people using this in literary or historical research ready to go those might be useful to include as well. But in general I think I would comment more on what the methods are showing you, what questions they provoke, how they relate to the datasets and authors you’re working with, etc.

Point by point things follow. Let me know if I can help with any of them! You should be able to push directly to the repo here.

P1 - Why might people care about whether something is similar? An example from the list with more depth (even just a sentence or two) might help. That might help ground this discussion for a humanist and make the case for why something like this is useful before launching into the stats, which might be a scary lift for humanities folks to dig into without some sense of why it’s useful up front. You could even make the point that something like similar or different might seem abstract, but this is a way of mathematically quantifying those distinctions.

P2 - under the preparation section I would stamp version numbers on all the required libraries.
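One way to gather the version numbers to stamp is sketched here, using only the standard library. The package names are assumptions based on the libraries named in this thread (SciPy and Pandas); the list should be adjusted to whatever the lesson actually imports.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names are assumptions based on the libraries named in this thread.
for name, ver in installed_versions(["scipy", "pandas"]).items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

The printed `name==version` lines can be pasted into the preparation section as they are.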

P3 - for Anaconda installations I would probably actually point the reader to the Anaconda documentation. You might want to get the instructions for installing in a few more steps. Something like…

Install anaconda through X
Install SciPy through Y
Install Pandas through Z

That is, unless Anaconda ships with SciPy and Pandas already installed (I’m less familiar with it). If that’s the case I wouldn’t worry as much about that.

P4-5 - I think a couple more sentences describing TF-IDF would be useful here. It’s fine to point people elsewhere for more info, but I think a sentence summary would be helpful otherwise. Particularly since the TF-IDF dataset is what they’ll use for the rest of the lesson, I think it is worth explaining more. And then maybe offering a sentence about how this sort of dataset connects to the small Wharton and Austen datasets you’re using. Might even be worth screenshotting the first few rows/columns so people know what they’re looking at.

P7 - Might be worth just a note to the effect of how the features that interest you are related to your research question.

P11 - Is it worth trying to make these graphs using Python? Just thinking that if you ever need to change or update them it might be easier to do so if they were not screenshots / things you’d have to draw from scratch each time.

P12 - Might be good to give another “So what?” answer here as to why someone would be interested in this?

P13 - This is the point where people usually lose me when I explain this topic to them :). So I might add a couple more sentences here to explain a bit more. It might help to note that, in this case, if we’re using words, you could have as many dimensions as you have words in each text. I think the phrase “the math is the same” might confuse people because you haven’t actually gotten to the distance part yet, which I think is the math component that you’re talking about.
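To make the "one dimension per word" idea concrete, here is a minimal sketch in plain Python. The two sample texts are invented for illustration and are not drawn from the lesson's Austen or Wharton data, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def count_vectors(texts):
    """Turn texts into word-count vectors over a shared, sorted vocabulary."""
    counts = [Counter(text.lower().split()) for text in texts]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


# Hypothetical sample texts, invented for this example.
texts = ["a house in a garden", "a garden in spring"]
vocabulary, vectors = count_vectors(texts)

print(vocabulary)  # ['a', 'garden', 'house', 'in', 'spring']
print(vectors)     # [[2, 1, 1, 1, 0], [1, 1, 0, 1, 1]]
```

Five distinct words give five dimensions, and each text becomes one point in that five-dimensional space. A real corpus works the same way, only with thousands of dimensions.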

P33 - why brackets around cosine similarity here? Also the distinction between orientation and distance probably merits another sentence.
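The difference between orientation and distance can be seen in a small sketch. The word counts here are hypothetical: the second vector has the same proportions as the first but is ten times longer, as if the same text had been repeated ten times.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (ignores their lengths)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


short_text = [2, 1, 1]    # hypothetical word counts
long_text = [20, 10, 10]  # same proportions, ten times the length

print(euclidean_distance(short_text, long_text))  # large: the points are far apart
print(cosine_similarity(short_text, long_text))   # close to 1: same orientation
```

Euclidean distance reports the two texts as far apart because one is much longer, while cosine similarity reports them as nearly identical because their word proportions match.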

P43 - I think this section might be less confusing by tying it back to the texts. That is to say, beyond saying mathematically if they are similar or not, say that as a humanist might take it. I.e. - “According to the measures we’ve seen so far, these texts are pretty similar to one another.”

P56 - Re: the large asset file - a policy exists now! If you want to upload your dataset to Zenodo, that will create an archived version of the site with a DOI that we can link for the purposes of the lesson. Thanks for hanging with this! Lmk if you have questions.

P61 - I would offer some of the output and then offer preliminary thoughts on it. You start to do that with the discussion of zeroes, but visuals might help. It might be good to have a description of the underlying data and research questions in the EarlyPrint dataset in order for the results to be more meaningful to the reader.

P64 - Again a little bit more information about Cavendish could help ground the reader in what they’re looking at.

I might add in a piece that graphs the distance/similarities in Python as a way of bringing it back to the visualizations that you started with.

jrladd commented 4 years ago

Thank you so much, @walshbr! This is really generous and helpful feedback. I can certainly bring out more about the case studies and what kinds of questions humanists might want to answer with these approaches. I'll get started on this and have a new draft for you soon!

walshbr commented 4 years ago

Sounds good and no problem! Great to work with you again. Do you want to set a tentative date for revisions handed over with the understanding that it might change? Something that works for your schedule?

jrladd commented 4 years ago

Sure thing! My tentative plan is to have the new draft by Friday (work on this dovetails well with other stuff I'm doing this week), and then we'll be ready to go for further reviews after break? I'll let you know if anything changes, but it should be doable.

jrladd commented 4 years ago

@walshbr I just pushed a bunch of edits based on your suggestions. I also uploaded the CSV to Zenodo, and it's here. (Fair warning: that link kept freezing up my browser.)

I added a screenshot of TF-IDF data, and I think that helps to give a better sense of what the reader will be working with. I opted not to change the other images, if that's okay. I actually think it'll be easier to make changes by hand than to figure out how to replicate all the dotted lines and stuff. But if you feel strongly, I can change them.

The only other thing was a visualization for the end of the lesson. I'm not sure how I'd create a graph from the results that would easily refer back to the earlier viz, just because we're now working in so many dimensions. I could certainly do some kind of dimension reduction, but that would add a layer of complexity we probably don't want to explain in this lesson. I can continue to think about it though, and maybe reviewers will have suggestions.

Thanks again for your help! If this looks okay to you, I think it's ready to go out.

walshbr commented 4 years ago

Thanks @jrladd! Taking a look now and tomorrow if I can't finish today.

Yowza that Zenodo situation is complicated…because they generate a preview of the item for the record it basically killed my browser as it was trying to open it. I managed to grab the actual download link by viewing source though - https://zenodo.org/record/3572854/files/1666_tfidf.csv. Going directly to that link should actually download the thing without having to go to the slow page. I'll add it in.

walshbr commented 4 years ago

Actually - @jrladd could you try re-uploading the csv but as a zip file and handing me the new DOI? That might keep the thing from breaking in the preview. And just to standardize things, @mdlincoln suggested naming the zip file with the same slug as used for the .md file and images folder, etc

mdlincoln commented 4 years ago

@jrladd congrats on breaking zenodo with your data https://github.com/zenodo/zenodo/issues/79

walshbr commented 4 years ago

Thanks @jrladd. I read through this all again - I think it’s good to go for sending out. I also think it’s fine to leave the drawings / screenshots in as you say.

re: closing visualization - I think you’re right that it might add on too much to add in something like that. I think it’s fine to leave it the way it is for now! If the reviewers have thoughts they can share them, but it was just a suggestion.

If we wind up using a zip file on the Zenodo link it might make sense to add in a sentence telling people to unzip the thing. Beyond that tweak (and updating the Zenodo link to that zipped record) I think it’s good to go. I’ll start working on reviewers.
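For readers who prefer to unzip from Python rather than by hand, a minimal sketch using the standard library follows. The function name `unzip_dataset` is hypothetical, and the archive name in the usage note is an assumption based on the file attached later in this thread.

```python
import zipfile


def unzip_dataset(zip_path, destination="."):
    """Extract every file in the archive and return the extracted names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
        return archive.namelist()
```

Assuming the archive sits in the working directory, `unzip_dataset("common-similarity-measures.zip")` would extract its contents alongside the lesson code and return the list of file names.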

jrladd commented 4 years ago

Appreciate the help, @walshbr @mdlincoln. Who says humanities data can't cause as many problems as any other kind of data?!

The compressed file is down to just three and a half MB, so I thought it might be easier to skip Zenodo? I've attached it here. If you'd still prefer Zenodo, I'm happy to upload it.

common-similarity-measures.zip

I think this will be fine for sending to reviewers, and I'm curious to see how it works out with the full dataset. But I'm starting to think that it's probably not worth it to have people new to these methods working with such a large dataset. The original idea was to have data identical to what I produced here, but I don't want that goal to get in the way of a good tutorial.

I could easily reconfigure this to work on TF-IDF results for just the top 100 words in the same corpus. That might change the results slightly (so I'd need to update those examples too), but the general takeaway should stay the same. If you and the reviewers agree this is a good idea, I'll be happy to change it over.
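A rough sketch of what restricting the data to the most frequent words could look like, in plain Python. The helper names `top_words` and `keep_columns` are hypothetical, the whitespace tokenization is a simplifying assumption, and the lesson's actual pipeline may differ.

```python
from collections import Counter


def top_words(texts, n=100):
    """Return the n most frequent words across all texts."""
    totals = Counter()
    for text in texts:
        totals.update(text.lower().split())
    return [word for word, _ in totals.most_common(n)]


def keep_columns(vocabulary, matrix, keep):
    """Drop every column of the matrix whose word is not in keep."""
    keep = set(keep)
    indices = [i for i, word in enumerate(vocabulary) if word in keep]
    reduced_vocabulary = [vocabulary[i] for i in indices]
    reduced_matrix = [[row[i] for i in indices] for row in matrix]
    return reduced_vocabulary, reduced_matrix
```

With `n=100`, a table of roughly 170k columns shrinks to 100, which should load quickly even on older machines. The trade-off is that the rare words, which carry the most TF-IDF weight, are discarded.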

mdlincoln commented 4 years ago

FWIW @jrladd I typed out a whole comment flagging the fact that the table has 170k columns and how the very long load times could make the lesson difficult for users without shiny new processors, and that you might want to consider reducing to only the most informative words :) But then I deleted it since I didn't want to overly weigh how this lesson gets edited.

@walshbr should make the final call of course, but I might suggest leaving it as is for the time being and letting the reviewers have a go at it, while keeping this potential data reduction in mind as a possible revision, to be put into the hopper with the many others that reviewers will come up with?

jrladd commented 4 years ago

@mdlincoln I certainly don’t mind if that info is part of the reviewers’ assessment! :) And this all sounds good to me.

I still like the idea of an unabridged “real world” dataset in principle, but that should be balanced against making the lesson as usable and helpful as it can be.

walshbr commented 4 years ago

Yeah I think let's leave it the way it is for now - I'll update the lesson to link to the zip file (and maybe add a phrase indicating that people will need to unzip it). That's a good idea though! I'll mention to the reviewers that they should think about this question. At the very least, it might be worth adding a couple sentences on limitations with processing speed for opening and working with medium to large datasets, but you can think about that for the post-review phase. More soon!

walshbr commented 4 years ago

@saraheconnell and @statsmaths have agreed to serve as reviewers, and they'll aim to get their reviews back around the end of the month (with some flexibility given that we're coming off a long holiday). Thanks, team! Let me know if you have any questions - otherwise I'll check back in later in the month.

statsmaths commented 4 years ago

Hi @jrladd. I had a jump on reading this before the New Year and just finished a second pass through the lesson. I really like the idea of having an article that describes different ways to do bag-of-words based distances between texts. I have seen many DHers doing text analysis who do not realise the assumptions behind the specific way they are choosing to do document similarity (or even when they do, not having a solid basis for making that choice).

From my perspective, overall the article is clear and well-written. There are a few spots where I would recommend making minor changes to tighten certain technical claims, but these are quite minor. My bigger recommendation, however, is to suggest some editing/restructuring of the material being covered. The title and abstract seem to suggest that you are looking to offer a general introduction to similarity metrics. However, this article really only covers one "type" of similarity score. It also does not give any introduction to similarity scores as a general class. Namely, you are introducing scores based on distances (L1 and L2 are just distances measured along different paths; cosine similarity is equivalent to the L2 distance metric if you scale the documents in order to account for their size). There are many other similarity scores that are quite different, such as those based on weighted edit distances (used in genomics), supervised trees (which deform the input vector space based on a classification task), UMAP/tSNE and friends (that try to make the distribution of similarity scores more uniform), and the use of transfer learning through reusable embeddings. There are even more specific models and approaches for handling the specific stylistic and topical challenges of document similarity within the domain of NLP.
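The link between cosine similarity and L2 distance mentioned above can be checked numerically. Once both vectors are scaled to unit length, the squared L2 distance between them equals 2 × (1 − cosine similarity), so the two measures rank texts in the same order. The vectors in this sketch are arbitrary illustrative values.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [3, 0, 1, 2]  # arbitrary illustrative vectors
b = [1, 4, 0, 2]

squared_l2 = euclidean_distance(l2_normalize(a), l2_normalize(b)) ** 2
from_cosine = 2 * (1 - cosine_similarity(a, b))

# The two values agree up to floating-point rounding.
print(squared_l2, from_cosine)
```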

My suggestion would be to reframe the article to make it specifically about similarity scores for describing word usage across a corpus of texts. Most of what you've written is already organised this way, and would require a lot less work than the alternative (making the text match the title and abstract). Here are my thoughts/recommendations for how to tighten the article to make it very clearly about word-based textual similarity, and hopefully make it a bit more clear how your article can be useful for practitioners within DH:

As already said, adjust the title and abstract to make it focused specifically on textual distances.
I would completely remove the L1 (Manhattan distance) metric from the article. I don't know of anyone who uses this in text analysis, nor why someone would prefer it to an L2 distance. The only rationale currently given is that it "is useful because it employs simple arithmetic" (P25), but I am not sure that this is a good argument. Nobody is suggesting that these metrics should be done by hand. It is not as though there is a conceptual simplicity to the L1 distance metric. The complication of L2 distances is only mathematical. L2 distance, in fact, is what almost everyone means when they colloquially talk about distance. I think adding this third case needlessly complicates the actually important distinction between cosine similarity and Euclidean distances.
There is an important technical point that I think is missing here between the theory and the application. All of the examples at the top of the page use term-frequencies, but the application uses tf-idf. The distinction is quite important because a lot of your descriptions about how L2 distances favour long texts are not precisely correct. I mean, L2 does to some degree, but it also favours texts that use rare words because they have a higher IDF weight. To solve this, see the next comment.
I find it quite unsatisfying that this tutorial just takes a pre-made TF-IDF matrix from another lesson. I think it's great to build off of another lesson and to use the same dataset, and I don't think you need to fully repeat what another fantastic lesson has done. However, to me it would be preferable to have your code also start with the raw texts and create the TF-IDF matrix directly (it's really only a few lines of code). This has a number of advantages: (i) easier for people to repeat with their own data, (ii) you can include the code that spits out the nearby texts within the lesson, allowing people to choose other base texts and play around with the method, (iii) you could also form the raw TF matrix and use that in your examples in place of or in addition to the TF-IDF matrix to address point 3.
It would be nice to find a better example (I don't mean a different corpus, just a better starting text) that really highlights the difference between cosine similarity and L2 distances. I think this will be much easier if you start with a TF matrix and drop the L1 metric (which really isn't useful in this domain as far as I know). Perhaps start with a very long text and show that L2 just returns other long texts, whereas cosine similarity returns more sensible results?
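As an illustration of how little code the TF-IDF step needs, here is a minimal sketch in plain Python. It assumes one common weighting, tf = raw count and idf = ln(N / df) + 1, together with whitespace tokenization. Library implementations typically apply their own smoothing and normalization, so their numbers will differ from these, and this is not the lesson's own code.

```python
import math
from collections import Counter


def tf_idf(texts):
    """Build a small TF-IDF matrix: tf = raw count, idf = ln(N / df) + 1."""
    counts = [Counter(text.lower().split()) for text in texts]
    vocabulary = sorted(set().union(*counts))
    n_docs = len(texts)
    # Document frequency: in how many texts does each word appear?
    doc_freq = {w: sum(1 for c in counts if w in c) for w in vocabulary}
    idf = {w: math.log(n_docs / doc_freq[w]) + 1 for w in vocabulary}
    matrix = [[c[w] * idf[w] for w in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical two-text corpus, invented for this example.
vocabulary, matrix = tf_idf(["plague fire plague", "fire sermon"])
print(vocabulary)
for row in matrix:
    print(row)
```

In this toy corpus "fire" appears in both texts and gets an IDF of 1, while "plague" and "sermon" each appear in only one text and get a higher weight. That is the effect described above: with TF-IDF input, distances are pulled toward texts that use rare words, not only toward long texts.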

Sorry, I know that that's a lot and I probably started to ramble. To summarise the summary: Focus on text distances, drop L1 (it is distracting because I don't think it is useful), and include the code that works from the raw data.

Please let me know if you have any questions at this point. I think @walshbr will want you to wait for @saraheconnell before editing anything in the document, but I believe it's okay, though of course not required, to offer comments/questions here in the GH issues in the meantime.... I will do a more exact line-edit check after your next pass through the document.

walshbr commented 4 years ago

Thanks @statsmaths! Yep - it'd be good to wait for @saraheconnell before editing the document just so she doesn't have a moving document as she's working on it. But it's fine to chat here if you'd like to correspond about the review.

jrladd commented 4 years ago

Many thanks for this thoughtful response, @statsmaths! I will hold off on making any changes until the other review comes in, but your comments are very much in line with thoughts and concerns I had, especially regarding the sample data/code, whether city block is helpful or distracting, and the scope of the piece and its discussion of distance vs. the broader subject of similarity. It's worthwhile to be clearer about the small slice of similarity that I'm covering here, and maybe I could list off a few other similarity scores.

Your comments also got me thinking: I hadn't thought about using the terms L1 and L2, but it may be useful to briefly introduce them as a researcher is likely to encounter them in something like SKLearn.
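For reference, the L1 and L2 labels correspond to city block (Manhattan) and Euclidean distance respectively. A minimal sketch with two illustrative points:

```python
import math


def l1_distance(a, b):
    """City block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [0, 0]  # illustrative points, not lesson data
b = [3, 4]

print(l1_distance(a, b))  # 7   (3 across plus 4 up)
print(l2_distance(a, b))  # 5.0 (the straight-line hypotenuse)
```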

saraheconnell commented 4 years ago

Apologies that this has taken me so long—I’ve been buried under a writing deadline and the usual start-of-term chaos. I should also mention that this is the first review I’ve written for PH so I hope that it’s helpful! I’ve divided this into substantive and language-focused suggestions, and also offered a few thoughts on things that I think are already working quite well.

Substantive suggestions:

The first review has already made some really useful points and I particularly agree that the lesson could be more effective if it included the code for learners to start with raw texts and generate the TF-IDF matrix themselves. I think it is important to give people direct access to texts & methods so that they can try different configurations and more easily apply what they are learning to their own data. This would also help address what to me seemed the most significant way that this already-excellent lesson could be improved: it covers the key concepts admirably, but it would be even stronger with additional entry-points for people to experiment with and apply those concepts. This could be addressed through some more signposting around potential modifications and applications, and also by making the "next steps" section more detailed and concrete.
The abstract could be more precise about what the "take home" from the lesson is. When I read through this the first time, I was expecting more specifics on configuring, applying, and modifying these different difference measurements, while the lesson is currently focused more on core concepts. The final version of the abstract will, of course, depend on how the contents of the lesson are revised, but I did want to flag that there’s a need for greater specificity on what the reader can expect to get from the lesson in any case.
Is it possible to provide a link to the metadata for the files in the 1666 dataset? (I did scan back up through the lesson to see if I could find one, so sorry if I missed this; if the link is elsewhere in the lesson, I think it should also be added again around P66.) Just having the ID numbers doesn’t give readers a chance to explore with other titles, or even look further down in the results. This would also help with the point above about providing more next steps and opportunities to engage actively with the lesson. When people are just presented with code to run without modification, I’ve found that they tend to gloss over things a bit and don’t really internalize what is going on; asking them to change even a few small things and see how their results differ can make a real difference. It would also be great to have a direct pointer to the texts themselves, or a footnote on how readers could most quickly get access to the full text of any files they are interested in, rather than just linking out to EarlyPrint in general.
I think the question of evaluating similarity in results needs to be addressed more directly. This is gestured to in P90, but I suspect that most readers would find themselves wondering how they would determine exactly how close the closest texts are, and what that would mean for their analyses. I know that this would vary quite a lot from one situation to another, but even a paragraph on general best practices and things to look out for would be useful.
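One simple starting point for judging how close the closest texts are is to rank every other text by its distance from a chosen base text and look at the gaps between neighbours. The sketch below uses cosine distance (1 minus cosine similarity); the text IDs and vectors are hypothetical placeholders, not records from the 1666 dataset.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def rank_by_distance(base_id, vectors):
    """Return (id, distance) pairs for all other texts, nearest first."""
    base = vectors[base_id]
    others = [
        (other_id, cosine_distance(base, vec))
        for other_id, vec in vectors.items()
        if other_id != base_id
    ]
    return sorted(others, key=lambda pair: pair[1])


# Hypothetical IDs and vectors, invented for this example.
vectors = {
    "text_a": [1, 0, 0],
    "text_b": [1, 1, 0],
    "text_c": [0, 0, 1],
}

for other_id, distance in rank_by_distance("text_a", vectors):
    print(other_id, round(distance, 3))
```

Reading the whole ranked list, rather than only the top result, shows whether the nearest text is much closer than the rest or just marginally ahead of many others.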

Aspects I wanted to flag as already effective:

I really like that the opening situates the question of difference within a broader critical lens; I think it’s useful for people to consider how these technical approaches are on a critical continuum with the broad range of scholarship that relies on examinations of difference. Overall, I think that the structure of the piece works well for giving readers a solid grounding in these concepts—there is a good balance between examples and explanations of core concepts, and it is effectively paced. The lesson nicely anticipates where additional definitions or explanations are needed, without slowing down the overall flow.
I thought the work-through with Austen and Wharton was really effective—when discussing abstract concepts operating at a relatively large scale, it helps to start out with something concrete and manageable.
I was able to run through all the code provided without any difficulty (Anaconda, Mac).
The explanation on the significance of choosing between different distance measurements is so key, and I very much appreciate that this was treated so thoughtfully and thoroughly!

Language & readability-focused suggestions:

There’s some need for restructuring/additional clarity in P9 and 10. I think this section would be clearer if it first discussed the different kinds of things that could be features, including non-textual things, and then went on to explain that the features selected will depend on the research project and address the potential for selecting large numbers of features. I also think that it might be clearer to have “these textual examples” rather than “the text example,” since the section that precedes is about all the different kinds of textual analyses that are possible, rather than the two examples from Wharton and Austen.
It’s a small thing, and perhaps not worth bothering about, but a screenshot from the TF-IDF dataset (P12) that included more common word forms (for example, full words, and without the specific notations marking gaps and the like) would make what is going on here easier for people unfamiliar with EEBO-TCP texts to understand. That said, it works very well that there’s a smaller and more concrete example followed by a larger and more complex one in this section.
Along the same lines, it might be worth redrawing the graph in P14 with a clearer “a” character on the y axis (especially since this is the first one).
It’s worth revisiting the bolding and italics in P15 and 16; I’m not sure I can see a clear pattern in why the key terms are bolded sometimes and not others. Using both bolding and italicization for emphasis in this section is also potentially confusing.
I think it would be worth doing a read through for the voice of the piece and the ways it addresses the reader; I’m struggling to articulate this, but there are times when the tone feels somewhat lecturing, or as if there’s a distance being established between the author and the reader. I’m thinking about phrases like: “I hope you will remember this caution” or “particularly for those of you who are working with data derived from text.” I don't mean the uses of the first and second person in general, but just particular phrasings in which the reader is being addressed in ways that seem to be putting up a barrier between the “I” and the “you,” especially in cases where “we” might make more sense.

I hope that helps and I’d be glad to answer questions about any of the above. This was a really well-structured and thoughtful lesson and I’m personally looking forward to being able to point people to it. Thanks!

walshbr commented 4 years ago

Thanks to you both @statsmaths and @saraheconnell! Just wanted to acknowledge that I see both reviews. I'll take a look at them both and offer a thought or two about working through them in the next couple days when back in the office.

jrladd commented 4 years ago

Thanks so much for these helpful comments, @saraheconnell! I can certainly offer more ways for the reader to adapt these methods to their own work, and be more clear about how to interpret results. I should definitely be linking directly to the metadata, so thank you for catching that!

@walshbr, I'll wait for your comments before getting started, but I can see some helpful overlaps between both reviews. I'm more convinced now of the need to trim the TF-IDF data (particularly if I'll be guiding readers through the data collection). Using the 100 most frequent words in the subcorpus, for instance, would make the whole thing more readable. And one more question: I think I went a little too far in trying to adhere to the author guideline about using the second person, and got away from the more important guideline of using an accessible tone. Using "we" is my instinct anyway: is it okay to use a few "we"s throughout to help with the tone? I can certainly fix the tonal issues without "we," but I thought I'd ask.

walshbr commented 4 years ago

Go for it @jrladd! I just wanted to double check that there wasn't anything in conflict between the two reviews. I'm happy to have you bounce ideas off me if you have them or participate in a conversation about the revisions, but both of these reviews seem really helpful and clear to me. Let me know if you run into questions. Thanks again @statsmaths and @saraheconnell!

Let me check with the editorial board about the question of second person. Are you up for roughly February 21st as a deadline for the next round of revisions @jrladd?

Speaking of abstracts, there will be one at the top of the page on the PH site. So when you settle on language that you like you might add that to the metadata at the top of the file. I can also come up with one, but it's usually going to be better if the author gives language they want.

walshbr commented 4 years ago

Update on the person question - all new lessons are going through copy editing. And those copy editors are likely going to try to revert any clear exceptions to be in keeping with those rules, which were developed with an international audience in mind. So if there are particular instances that feel especially off it might be worth rewriting the sentence to avoid the personhood entirely. I'm happy to help workshop possibilities for sentences that seem difficult.

jrladd commented 4 years ago

Thanks @walshbr! That definitely clarifies things. I'm certain I can fix the tone and still keep to the rules.

Feb. 21st will be fine as a deadline. Should I push changes directly to this repo?

I'll add a new abstract that better describes the lesson than the original, and I'll definitely let you know as questions/issues come up. Thanks again to everyone for the help and advice.

walshbr commented 4 years ago

Yep! Go ahead and push here @jrladd. We won't move to the production repo until things are pretty finalized.

jrladd commented 4 years ago

Thanks again, all, for your helpful comments! I've just pushed some revisions. Here's an incomplete list of what I've done:

changed the title and introduction to focus on distance for text analysis
removed the city block distance from the coding section (I still think it's a good way to introduce the concept of distance, so I've left it in the explanatory Austen/Wharton section. But I agree that it muddles things when we get to the coding, so I've left it out and simply mention that the results will be similar to Euclidean distance.)
added original XML files and a section with code where I explain how to calculate TF-IDF from the raw files; this adds a fair bit of material, but I think it's worth it (@walshbr The zipped XML file just exceeds the recommended limit for GitHub, but it let me push it to assets anyway. It's still half the size the original CSV was!)
redid the screenshot of data with real words
changed my example to only use the top 1000 words in the corpus, which speeds up calculation and makes everything a bit more manageable
completely redid results section with better example using Boyle instead of Cavendish
restructured section explaining samples and features
went through with lots of little changes for tone and clarity

That's most of it! Unfortunately, as I was finishing all this, I started thinking more about @statsmaths's advice about the difference between using raw term frequency and TF-IDF. Now that I see it all written out, I'm wondering if I should simply eliminate TF-IDF altogether. There's a lot of extra explanation (and some extra code) to get readers up to speed with it, and I don't think it adds much in terms of teaching people about distance.

Right now I've put in an explanation that you can do distance just over raw term frequency and then move on to TF-IDF. I could leave it as is, but here are two additional options:

I could add the term frequency calculations in addition to TF-IDF. This would lengthen the tutorial a bit, and we might risk getting caught up in the term frequency vs. TF-IDF distinction at the expense of the Euclidean/cosine distinction. Probably my least preferred option of the three, but I can imagine how it might work.
I could drop TF-IDF more or less entirely. I could have a mention toward the end that says "A good next step would be to try this with TF-IDF as your features" and point people to Matt Lavin's lesson. But we could let go of the other lengthier explanations. The downside of this is that the results (which I've so far glanced at) simply aren't as good or as striking with term frequencies alone. I'd have to rewrite the results section completely, and perhaps find a different example again. I don't mind the extra work, and as I was finishing the revisions this started to seem like the best option.

What do you think of these options? I haven't added the official abstract or the similarity.py file yet, because a big change like the one I've mentioned above would likely affect these. I also need to redraw the graph that @saraheconnell pointed out. Again, many thanks for the feedback. I think the lesson has improved a lot.

statsmaths commented 4 years ago

Thanks for taking the time to address all of the feedback @jrladd! I think the way you modified the discussion of the Manhattan distance metric (keeping it in, but downplaying whether it should be used over Euclidean distance) is excellent. With regard to the TF vs. TF-IDF differences, I could also see it going any of the three ways that you suggested. The current approach works because it goes straight to the "best" (more or less) approach; using just TF would work too because it better connects to the introduction; doing both balances both benefits but at the risk of being more complicated.
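To make the TF vs. TF-IDF distinction concrete, here is a minimal plain-Python sketch. The toy corpus, the function names, and the unsmoothed `log(n_docs / df)` weighting are all assumptions chosen for illustration; library implementations such as scikit-learn's use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus: illustrative sentences, not the lesson's texts.
docs = {
    "a": "the king and the queen",
    "b": "the king and the court",
    "c": "the air pump and the vacuum",
}

def term_frequencies(text):
    """Raw term counts for one document (simple whitespace tokenization)."""
    return Counter(text.lower().split())

tf = {name: term_frequencies(text) for name, text in docs.items()}

# Document frequency: in how many documents does each term appear?
df = Counter(term for counts in tf.values() for term in counts)

def tf_idf(counts, df, n_docs):
    """Weight each raw count by log(n_docs / df), one common textbook variant."""
    return {term: count * math.log(n_docs / df[term])
            for term, count in counts.items()}

weighted = {name: tf_idf(counts, df, len(docs)) for name, counts in tf.items()}

# "the" is the most frequent term by raw count, but it appears in every
# document, so its TF-IDF weight drops to zero; "queen" appears in only
# one document and keeps a positive weight.
print(tf["a"]["the"], weighted["a"]["the"], weighted["a"]["queen"])
```

With raw counts, very common words dominate every vector; with TF-IDF, words shared by all documents contribute nothing, which is one reason the two feature sets can give different distance results.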

The one difficulty I still see is that (if I were using this as a resource in the classroom) I would really prefer to be able to work directly with the texts and metadata. The new version includes the relatively complex XML parsing code (making it look less accessible), but still does not give access to all of the metadata or the code for others to work directly from their raw texts. One way to simplify both of these things would be to have the tutorial work from a CSV file that contains the texts and metadata. I put one together based on the texts you were using:

metadata_and_texts.csv

What do you think about reading this CSV file into Python (pd.read_csv) as a starting point? You could then use sklearn.feature_extraction.text.CountVectorizer to produce the term frequency matrix (it is actually much easier to do it this way, because reducing the number of words to the top 1000 is handled by sklearn), and carry on with the rest of the lesson as written. I see this having a number of benefits: (1) it removes the need to introduce a lot of XML parsing code that is never really explained and is not the focus of the lesson, (2) it allows you to give the code for producing the nearest-neighbour results along with the metadata, (3) it gives readers the code for working with raw text, which makes it much easier to apply to other collections, (4) the function CountVectorizer has a lot of features that can be useful to play around with, and (5) this data is about half the size of the XML files, and safely fits under GitHub's file size limits. Related to (4), if you do decide to convert the entire lesson to just TFs, you may find that the results on the raw frequencies are much better if you remove the most frequent terms (i.e., max_df=0.7 in CountVectorizer), at least for the cosine similarity.
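The idea behind this workflow can be sketched without any third-party libraries. The block below is a plain-Python approximation, not scikit-learn's implementation: the column names (`id`, `title`, `text`), the toy rows, and the helper functions are assumptions for illustration, and CountVectorizer tokenizes differently. It only mimics the effect of the `max_features` and `max_df` parameters on a tiny corpus.

```python
import csv
import io
import math
from collections import Counter

# Stand-in for a file like metadata_and_texts.csv; the column names
# and rows are assumptions, not the real file's contents.
raw = io.StringIO(
    "id,title,text\n"
    "A1,First,the air and the pump\n"
    "A2,Second,the air and the vacuum\n"
    "A3,Third,the court and the king\n"
)
rows = list(csv.DictReader(raw))

# Raw term counts per document (simple whitespace tokenization).
counts = [Counter(row["text"].lower().split()) for row in rows]

def build_vocabulary(counts, max_features, max_df):
    """Keep the max_features most frequent terms, skipping any term that
    occurs in more than a max_df share of the documents."""
    n_docs = len(counts)
    doc_freq = Counter(term for c in counts for term in c)
    totals = Counter()
    for c in counts:
        totals.update(c)
    kept = [t for t in totals if doc_freq[t] / n_docs <= max_df]
    # Most frequent first; ties broken alphabetically for stable output.
    kept.sort(key=lambda t: (-totals[t], t))
    return sorted(kept[:max_features])

def to_matrix(counts, vocab):
    """One row per document, one column per vocabulary term."""
    return [[c[t] for t in vocab] for c in counts]

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

full = to_matrix(counts, build_vocabulary(counts, 1000, 1.0))
trimmed_vocab = build_vocabulary(counts, 1000, 0.7)
trimmed = to_matrix(counts, trimmed_vocab)

# Distance between the first and third documents, before and after
# dropping terms that appear in more than 70% of documents.
before = cosine_distance(full[0], full[2])
after = cosine_distance(trimmed[0], trimmed[2])
print(trimmed_vocab, round(before, 3), after)
```

In this toy case the first and third documents share only "the" and "and", so they look fairly close on raw counts and completely dissimilar once those ubiquitous terms are filtered out. Whether the same effect holds on a real corpus would need to be checked against the actual data.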

Hopefully that's helpful. I think it's almost there!

walshbr commented 4 years ago

Thanks for all your work on this! Let me know if I can do anything to help, but this all seems like sensible advice to me. Let me know if you have any questions for me @jrladd, and just ping me when you're all set and ready for me to take a look again.

jrladd commented 4 years ago

Thanks for this, @statsmaths! I see the point about having plaintext and metadata side-by-side. And it would be nice to use CountVectorizer (or even TfidfVectorizer, which combines CountVectorizer with TfidfTransformer).

I'm a bit torn about presenting the texts in the way you demonstrate. The added complexity of wrangling the early modern texts into the right form for sklearn was the reason my first instinct was to start with pre-processed TF-IDF data. Your CSV would be really convenient, but a little far from what the real-world process would be for someone trying to replicate the process with different texts. As you point out, though, most readers are going to be working with plaintext of some sort rather than XML in this very particular form, so the XML handling isn't really relevant to the tutorial.

How would this work for a compromise? I can convert the XML documents to a series of plaintext files (and point at an EarlyPrint tutorial if anyone is really curious about the XML stuff). Then I can still use CountVectorizer in the way you describe, and probably also skip TF-IDF. This will greatly simplify things while being more like a lot of humanities text collections I've worked with. With the space I'm saving, I can include a CSV of the metadata and a brief codeblock describing how to match the data to the IDs.
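A codeblock for matching metadata to IDs could look roughly like the following. This is only a sketch under stated assumptions: the column names, the placeholder rows, and the convention that each plaintext file is named after its ID are hypothetical, not taken from the lesson's actual files.

```python
import csv
import io

# Hypothetical metadata CSV; the real file's columns may differ.
metadata_csv = io.StringIO(
    "id,author,title\n"
    "T001,Author One,First Title\n"
    "T002,Author Two,Second Title\n"
)

# Hypothetical plaintext filenames, one per document, named by ID.
filenames = ["T002.txt", "T001.txt"]

# Index the metadata rows by ID for constant-time lookup.
metadata = {row["id"]: row for row in csv.DictReader(metadata_csv)}

def lookup(filename, metadata):
    """Return the metadata row for a text file, or None if the ID is unknown."""
    text_id = filename.rsplit(".", 1)[0]
    return metadata.get(text_id)

# Human-readable labels in the same order as the files (and therefore in
# the same order as the rows of any matrix built from those files).
labels = [lookup(name, metadata)["title"] for name in filenames]
print(labels)
```

Keeping the labels in the same order as the files means a row index in a distance matrix can be translated straight back to an author or title.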

I can experiment with this a little today to make sure it will give results that'll be a good illustration. Your main point is really well-taken: the last thing I want is for the code to be overly complex and distract from the lesson about choosing among distance measures. @walshbr does this sound good to you?

walshbr commented 4 years ago

That sounds good to me @jrladd!

jrladd commented 4 years ago

Okay, @walshbr, I've made the changes, and I think it's ready for you to have a look.

Many thanks, again, all!

walshbr commented 4 years ago

Ok! I'll take a look @jrladd thanks. And I'll let you know if I see anything else that needs doing before passing it off to our managing editor.

walshbr commented 4 years ago

I'm still looking at this and will pass on further thoughts when I can @jrladd, but as I'm slowed down at work I thought I'd pass along a couple questions that I have already:

Any updates to your bio? It currently reads “John Ladd is a PhD candidate in literature at Washington University in St. Louis and postdoctoral fellow for Six Degrees of Francis Bacon at Carnegie Mellon.” Also note that on the old lesson you didn’t use a middle initial. If you want it, we’ll need to update the author bio accordingly. They’ll need to be consistent. Let me know.

Thoughts on a difficulty selection? There are currently three tiers, which can be set with the following numerical codes: 1 (Beginning), 2 (Intermediate), 3 (Advanced).

For the abstract - I think it might be worth thinking about the audience. This is from our editorial guidelines: “Try to avoid technical vocabulary when possible, as these summaries can help scholars without technical knowledge to try out something new.”

I was just wondering if there might be a way to make this just a tad more legible to a novice audience? "This lesson introduces three common distance measures for text analysis: city block distance, Euclidean distance, and cosine distance. You will learn the general principles behind similarity, the different advantages of these measures, and how to calculate each of them using the SciPy Python library."

Maybe it could be as easy as saying "three common distance measures for determining how similar texts are to one another:……" or something like that?

jrladd commented 4 years ago

Thanks for catching these, @walshbr. Yes, let's update the bio to "John R. Ladd is a postdoctoral fellow in digital humanities at Northwestern University, where he teaches early modern studies and computational approaches to literature." If it's not too much trouble, it would be nice to add the middle initial, for consistency with stuff outside ProgHist.

Maybe the difficulty is a 2? Your judgment on this will be better than mine, but that seems right since it's focused on general concepts rather than a lot of code. But I could see how it might be a 3 as well.

Your proposed change to the abstract sounds good. Here's a new version:

"This lesson introduces three common measures for determining how similar texts are to one another: city block distance, Euclidean distance, and cosine distance. You will learn the general principles behind similarity, the different advantages of these measures, and how to calculate each of them using the SciPy Python library."

Should I update it in the file?

walshbr commented 4 years ago

Nah I can do the updating! I'll write back more when I get a chance. Was doing the easy stuff before I actually go through and do a final read.

acrymble commented 4 years ago

Just a note that “block” is best understood by North Americans. People in the UK know it more from TV than as a term used locally, where roads are often hundreds or thousands of years old and predate modern city planning.

walshbr commented 4 years ago

Thanks @acrymble. In this case, "city block distance or 'Manhattan'" is the actual name of the distance metric, so I think that's probably the best way to describe it. And in the context of the lesson @jrladd mentions the name's regional specificity.

@jrladd - do you have any thoughts? I think the only point for editing here might be to explicitly say something like "The graph here resembles the grid-like layout of North American city streets, in particular, which is where the name comes from." That way you're addressing the implied knowledge about NYC, which not everyone might have in a global readership.
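The grid intuition can be sketched in a few lines of plain Python. The points and function names below are illustrative assumptions, not code from the lesson: city block distance sums the movement along each axis, as if walking along gridded streets, while Euclidean distance measures the straight line between the two points.

```python
import math

def city_block(a, b):
    """Sum of absolute differences along each axis (walking the grid)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two points on a grid: 3 blocks east and 4 blocks north of each other.
start, end = (0, 0), (3, 4)

print(city_block(start, end))  # 7: three blocks one way, four the other
print(euclidean(start, end))   # 5.0: the diagonal "as the crow flies"
```

The same functions work unchanged on longer vectors, such as word counts with one value per word.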

jrladd commented 4 years ago

Yes, I think defining it in the way you suggest would be a good solution. Thanks, @acrymble, I hadn't thought of that.

walshbr commented 4 years ago

Cool thanks! @jrladd would you mind adding that component in so that it is your language / in a way you're happy with? I finished all the proghist metadata / avatar / plumbing stuff just now. I'll read through the lesson and run the code blocks one last time to check things in the next few days before handing any final revisions / pieces I noticed back to you.

jrladd commented 4 years ago

@walshbr I've just pushed a change with more language about city blocks (and also fixed a typo I found). Let me know whenever there's more to do!

walshbr commented 4 years ago

Hey @jrladd - I finished reading through this just now. It looks good to me! I also ran the code blocks again, and everything appears to be working. My only note was that your code comments are sometimes punctuated with a period and sometimes not. It might be good to have them be consistent, but, if that were the case, they would be the only example of grammatically correct code comments I've ever seen.

@svmelton - this is ready to advance. Here are my notes for you and the files to move forward. Let me know when you're ready to merge so I can stick the new tweets on the bot spreadsheet.

The author has a bio in ph_authors.yml. He wants to update the name on it from “John Ladd” to “John R. Ladd.” This will also require updating the metadata for the “exploring and analyzing network data with python” lesson to be “John R. Ladd,” and then the bio in the ph_authors.yml file should be updated to “John R. Ladd is a postdoctoral fellow in digital humanities at Northwestern University, where he teaches early modern studies and computational approaches to literature.”

assets/common-similarity-measures - asset files
images/common-similarity-measures - image files
lessons/common-similarity-measures.md - the lesson file
gallery/common-simiarlity-measures.png - the modified avatar
gallery/originals/common-similarity-measures.png - the original avatar

jrladd commented 4 years ago

Great! Many thanks to you, @walshbr, and to everyone for the time you’ve put into this. Can’t wait to share it with people.

svmelton commented 4 years ago

Thank you @walshbr! I'll take a look in the next few days.

svmelton commented 4 years ago

Just a heads up that I've reached out to get the copyediting process started. I'll keep you posted once we have someone confirmed and the process is underway! @walshbr @jrladd

svmelton commented 4 years ago

Hi @walshbr and @jrladd! Thanks for your patience—the copyedits are up.

walshbr commented 4 years ago

Thanks @svmelton! @jrladd let me know if you have any questions as you work on these.

jrladd commented 4 years ago

Thanks very much, @svmelton and @walshbr! I've just pushed all the changes from the copyedits. Let me know if there's more I can do.
