Closed walshbr closed 4 years ago
@jrladd - I'll get to work on this in the next day or two. In the meantime, if you wouldn't mind re-posting the "Permission to Publish" bit above I'll get to it. I'll make any basic syntax corrections and things necessary and ping as things come up. Will probably have suggestions for revisions before sending it out for review.
@jrladd I think I fixed all the image paths and LaTeX syntax. The syntax for MathJax appeared to be double dollar signs rather than singles. I'll do a sweep today for general comments, but it might be worth double-checking the equations to make sure I didn't muck anything up with the find and replace I did.
@walshbr Thanks! The equations all look right to me. One of the images (for city block distance) still isn't showing up, though.
I want to make some small changes to the document structure: adding an additional top-level heading and moving some headings up a level. That will make the table of contents feel less lopsided and help the reader navigate. But I'll wait to do that until I get your other comments.
Here's my permission to publish:
I the author hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English|en français|en español to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.
@jrladd Whoops, I hadn't pushed. That figure should be working now. You can go ahead and push those heading changes if you want - I don't imagine they will interfere with what I do!
Common Similarity Measures Editing Notes
@jrladd - just finished a preliminary read through. This is great! I knew a little about the topic and I learned a lot. I think this is exactly the sort of thing I’d like to see us do more of. My main comments below pertain to bringing out the case studies / examples you’re using a bit more. I think for Austen/Wharton and the EarlyPrint dataset both, spending more time on the kinds of questions that method can help you to ask will help people understand why they’re doing this in the first place. And it will help encourage people to see the results as more than just numbers but also opportunities for interesting research questions. If you have any examples of people using this in literary or historical research ready to go those might be useful to include as well. But in general I think I would comment more on what the methods are showing you, what questions they provoke, how they relate to the datasets and authors you’re working with, etc.
Point by point things follow. Let me know if I can help with any of them! You should be able to push directly to the repo here.
P1 - Why might people care about whether something is similar? An example from the list with more depth (even just a sentence or two) might help. That might help ground this discussion for a humanist and make the case for why something like this is useful before launching into the stats, which might be a scary lift for humanities folks to dig into without some sense of why it’s useful up front. You could even make the point that something like similar or different might seem abstract, but this is a way of mathematically quantifying those distinctions.
P2 - under the preparation section I would stamp version numbers on all the required libraries.
P3 - for Anaconda installations I would probably actually point the reader to the Anaconda documentation. You might want to get the instructions for installing in a few more steps. Something like…
That is, unless Anaconda ships with SciPy and Pandas already installed (I’m less familiar with it). If that’s the case I wouldn’t worry as much about that.
P4-5 - I think a couple more sentences describing TF-IDF would be useful here. It’s fine to point people elsewhere for more info, but I think a sentence summary would be helpful otherwise. Particularly since the TF-IDF dataset is what they’ll use for the rest of the lesson, I think it is worth explaining more. And then maybe offering a sentence about how this sort of dataset connects to the small Wharton and Austen datasets you’re using. Might even be worth screenshotting the first few rows/columns so people know what they’re looking at.
P7 - Might be worth just a note to the effect of how the features that interest you are related to your research question.
P11 - Is it worth trying to make these graphs using Python? Just thinking that if you ever need to change or update them it might be easier to do so if they were not screenshots / things you’d have to draw from scratch each time.
P12 - Might be good to give another “So what?” answer here as to why someone would be interested in this?
P13 - This is the point where people usually lose me when I explain this topic to them :). So I might add a couple more sentences here to explain a bit more. It might help to note that, in this case, if we’re using words, you could have as many dimensions as you have words in each text. I think the phrase “the math is the same” might confuse people because you haven’t actually gotten to the distance part yet, which I think is the math component that you’re talking about.
P33 - why brackets around cosine similarity here? Also the distinction between orientation and distance probably merits another sentence.
P43 - I think this section might be less confusing by tying it back to the texts. That is to say, beyond saying mathematically if they are similar or not, say that as a humanist might take it. I.e. - “According to the measures we’ve seen so far, these texts are pretty similar to one another.”
P56 - Re: the large asset file - a policy exists now! If you want to upload your dataset to Zenodo, that will create an archived version of the site with a DOI that we can link for the purposes of the lesson. Thanks for hanging with this! Lmk if you have questions.
P61 - I would offer some of the output and then offer preliminary thoughts on it. You start to do that with the discussion of zeroes, but visuals might help. It might be good to have a description of the underlying data and research questions in the EarlyPrint dataset in order for the results to be more meaningful to the reader.
P64 - Again a little bit more information about Cavendish could help ground the reader in what they’re looking at.
I might add in a piece that graphs the distance/similarities in Python as a way of bringing it back to the visualizations that you started with.
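The orientation-versus-distance point raised under P33 can be illustrated with a minimal sketch. The word counts below are invented for illustration and are not taken from the lesson's Austen/Wharton or EarlyPrint data. The three helper functions are hand-written stand-ins meant to expose the arithmetic; SciPy offers equivalents in scipy.spatial.distance (cityblock, euclidean, cosine).

```python
import math

# Hypothetical word counts for two texts over the same three-word vocabulary.
# text_b uses the words in the same proportions as text_a but is three times longer.
text_a = [2, 4, 1]
text_b = [6, 12, 3]


def cityblock(u, v):
    """Sum of the absolute differences along each dimension (L1)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Straight-line distance between the two points (L2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)


print(cityblock(text_a, text_b))        # 14
print(euclidean(text_a, text_b))        # about 9.17
print(cosine_distance(text_a, text_b))  # approximately 0
```

Because the two vectors point in the same direction, cosine distance treats them as essentially identical, while city block and Euclidean distance both register the difference in length.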
Thank you so much, @walshbr! This is really generous and helpful feedback. I can certainly bring out more about the case studies and what kinds of questions humanists might want to answer with these approaches. I'll get started on this and have a new draft for you soon!
Sounds good and no problem! Great to work with you again. Do you want to set a tentative date for revisions handed over with the understanding that it might change? Something that works for your schedule?
Sure thing! My tentative plan is to have the new draft by Friday (work on this dovetails well with other stuff I'm doing this week), and then we'll be ready to go for further reviews after break? I'll let you know if anything changes, but it should be doable.
@walshbr I just pushed a bunch of edits based on your suggestions. I also uploaded the CSV to Zenodo, and it's here. (Fair warning: that link kept freezing up my browser.)
I added a screenshot of TF-IDF data, and I think that helps to give a better sense of what the reader will be working with. I opted not to change the other images, if that's okay. I actually think it'll be easier to make changes by hand than to figure out how to replicate all the dotted lines and stuff. But if you feel strongly, I can change them.
The only other thing was a visualization for the end of the lesson. I'm not sure how I'd create a graph from the results that would easily refer back to the earlier viz, just because we're now working in so many dimensions. I could certainly do some kind of dimension reduction, but that would add a layer of complexity we probably don't want to explain in this lesson. I can continue to think about it though, and maybe reviewers will have suggestions.
Thanks again for your help! If this looks okay to you, I think it's ready to go out.
Thanks @jrladd! Taking a look now and tomorrow if I can't finish today.
Yowza that Zenodo situation is complicated…because they generate a preview of the item for the record it basically killed my browser as it was trying to open it. I managed to grab the actual download link by viewing source though - https://zenodo.org/record/3572854/files/1666_tfidf.csv. Going directly to that link should actually download the thing without having to go to the slow page. I'll add it in.
Actually - @jrladd could you try re-uploading the csv but as a zip file and handing me the new DOI? That might keep the thing from breaking in the preview. And just to standardize things, @mdlincoln suggested naming the zip file with the same slug as used for the .md file and images folder, etc
@jrladd congrats on breaking zenodo with your data https://github.com/zenodo/zenodo/issues/79
Thanks @jrladd. I read through this all again - I think it’s good to go for sending out. I also think it’s fine to leave the drawings / screenshots in as you say.
re: closing visualization - I think you’re right that it might add on too much to add in something like that. I think it’s fine to leave it the way it is for now! If the reviewers have thoughts they can share them, but it was just a suggestion.
If we wind up using a zip file on the Zenodo link it might make sense to add in a sentence telling people to unzip the thing. Beyond that tweak (and updating the Zenodo link to that zipped record) I think it’s good to go. I’ll start working on reviewers.
Appreciate the help, @walshbr @mdlincoln. Who says humanities data can't cause as many problems as any other kind of data?!
The compressed file is down to just three and a half MB, so I thought it might be easier to skip Zenodo? I've attached it here. If you'd still prefer Zenodo, I'm happy to upload it.
common-similarity-measures.zip
I think this will be fine for sending to reviewers, and I'm curious to see how it works out with the full dataset. But I'm starting to think that it's probably not worth it to have people new to these methods working with such a large dataset. The original idea was to have data identical to what I produced here, but I don't want that goal to get in the way of a good tutorial.
I could easily reconfigure this to work on TF-IDF results for just the top 100 words in the same corpus. That might change the results slightly (so I'd need to update those examples too), but the general takeaway should stay the same. If you and the reviewers agree this is a good idea, I'll be happy to change it over.
FWIW @jrladd I typed out a whole comment flagging the fact that the table has 170k columns and how the very long load times could make the lesson difficult for users without shiny new processors, and that you might want to consider reducing to only the most informative words :) But then I deleted it since I didn't want to overly weigh how this lesson gets edited.
@walshbr should make the final call of course, but I might suggest leaving it as is for the time being and letting the reviewers have a go at it, while keeping this potential data reduction in mind as a possible revision, to be put into the hopper with the many others that reviewers will come up with?
@mdlincoln I certainly don’t mind if that info is part of the reviewers’ assessment! :) And this all sounds good to me.
I still like the idea of an unabridged “real world” dataset in principle, but that should be balanced against making the lesson as usable and helpful as it can be.
Yeah I think let's leave it the way it is for now - I'll update the lesson to link to the zip file (and maybe add a phrase indicating that people will need to unzip it). That's a good idea though! I'll mention to the reviewers that they should think about this question. At the very least, it might be worth adding a couple sentences on limitations with processing speed for opening and working with medium to large datasets, but you can think about that for the post-review phase. More soon!
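For readers wondering what the unzip step might look like in practice, here is a minimal sketch. It assumes pandas is installed, and the archive name, CSV name, and column values are all hypothetical stand-ins built in a temporary folder so the example runs on its own; they are not the lesson's real files. pandas can read a CSV straight from a zip archive when the archive contains a single file, which may spare readers a manual unzip.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in archive (hypothetical names and values).
    zip_path = Path(tmp) / "example-data.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(
            "example_tfidf.csv",
            "id,king,queen\nA001,0.1,0.0\nA002,0.0,0.3\n",
        )

    # pandas infers zip compression from the ".zip" extension,
    # provided the archive holds exactly one file.
    table = pd.read_csv(zip_path, index_col="id")

print(table.shape)
```

With a table that has many thousands of columns, loading will still be slow regardless of compression, which is the processing-speed limitation mentioned above.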
@saraheconnell and @statsmaths have agreed to serve as reviewers, and they'll aim to get them back around the end of the month (with some flexibility given that we're coming off a long holiday). Thanks, team! Let me know if you have any questions - otherwise I'll check back in later in the month.
Hi @jrladd. I had a jump on reading this before the New Year and just finished a second pass through the lesson. I really like the idea of having an article that describes different ways to do bag-of-words based distances between texts. I have seen many DHers doing text analysis who do not realise the assumptions behind the specific way they are choosing to do document similarity (or, even when they do, do not have a solid basis for making that choice).
From my perspective, overall the article is clear and well-written. There are a few spots where I would recommend making minor changes to tighten certain technical claims, but these are quite minor. My bigger recommendation, however, is to suggest some editing/restructuring of the material being covered. The title and abstract seem to suggest that you are looking to offer a general introduction to similarity metrics. However, this article really only covers one "type" of similarity score. It also does not give any introduction to similarity scores as a general class. Namely, you are introducing scores based on distances (L1 and L2 are just distances measured along different paths; cosine similarity is equivalent to the L2 distance metric if you scale the documents in order to account for their size). There are many other similarity scores that are quite different, such as those based on weighted edit distances (used in genomics), supervised trees (which deform the input vector space based on a classification task), UMAP/tSNE and friends (that try to make the distribution of similarity scores more uniform), and the use of transfer learning through reusable embeddings. There are even more specific models and approaches for handling the specific stylistic and topical challenges of document similarity within the domain of NLP.
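One way to read the remark about cosine similarity and L2 is the standard identity that, once each vector is scaled to unit length, the squared Euclidean distance between them equals twice the cosine distance. The sketch below checks this numerically; the count vectors are invented for illustration and the helper functions are hand-written assumptions rather than code from the lesson.

```python
import math


def normalize(v):
    """Scale a vector to unit length so document size no longer matters."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(u, v):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)


# Hypothetical word counts for two documents of different lengths.
doc_a = [3, 0, 1, 2]
doc_b = [1, 5, 0, 2]

squared_l2 = euclidean(normalize(doc_a), normalize(doc_b)) ** 2
twice_cosine = 2 * cosine_distance(doc_a, doc_b)

# The two printed values agree up to floating-point rounding.
print(squared_l2, twice_cosine)
```

Since one quantity is a monotonic function of the other, ranking documents by Euclidean distance on length-normalized vectors gives the same ordering as ranking them by cosine distance.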
My suggestion would be to reframe the article to make it specifically about similarity scores for describing word usage across a corpus of texts. Most of what you've written is already organised this way, and would require a lot less work than the alternative (making the text match the title and abstract). Here are my thoughts/recommendations for how to tighten the article to make it very clearly about word-based textual similarity, and hopefully make it a bit more clear how your article can be useful for practitioners within DH:
Sorry, I know that that's a lot and I probably started to ramble. To summarise the summary: Focus on text distances, drop L1 (it is distracting because I don't think it is useful), and include the code that works from the raw data.
Please let me know if you have any questions at this point. I think @walshbr will want you to wait for @saraheconnell before editing anything in the document, but I believe it's okay, but of course not required, to offer comments/questions here in the GH issues in the meantime.... I will do a more exact line-edit check after your next pass through the document.
Thanks @statsmaths! Yep - it'd be good to wait for @saraheconnell before editing the document just so she doesn't have a moving document as she's working on it. But it's fine to chat here if you'd like to correspond about the review.
Many thanks for this thoughtful response, @statsmaths! I will hold off on making any changes until the other review comes in, but your comments are very much in line with thoughts and concerns I had, especially regarding the sample data/code, whether city block is helpful or distracting, and the scope of the piece and its discussion of distance vs. the broader subject of similarity. It's worthwhile to be clearer about the small slice of similarity that I'm covering here, and maybe I could list off a few other similarity scores.
Your comments also got me thinking: I hadn't thought about using the terms L1 and L2, but it may be useful to briefly introduce them as a researcher is likely to encounter them in something like SKLearn.
Apologies that this has taken me so long—I’ve been buried under a writing deadline and the usual start-of-term chaos. I should also mention that this is the first review I’ve written for PH so I hope that it’s helpful! I’ve divided this into substantive and language-focused suggestions, and also offered a few thoughts on things that I think are already working quite well.
Substantive suggestions:
Aspects I wanted to flag as already effective:
Language & readability-focused suggestions:
I hope that helps and I’d be glad to answer questions about any of the above. This was a really well-structured and thoughtful lesson and I’m personally looking forward to being able to point people to it. Thanks!
Thanks to you both @statsmaths and @saraheconnell! Just wanted to acknowledge that I see both reviews. I'll take a look at them both and offer a thought or two about working through them in the next couple days when back in the office.
Thanks so much for these helpful comments, @saraheconnell! I can certainly offer more ways for the reader to adapt these methods to their own work, and be more clear about how to interpret results. I should definitely be linking directly to the metadata, so thank you for catching that!
@walshbr, I'll wait for your comments before getting started, but I can see some helpful overlaps between both reviews. I'm more convinced now of the need to trim the TF-IDF data (particularly if I'll be guiding readers through the data collection). Using the 100 most frequent words in the subcorpus, for instance, would make the whole thing more readable. And one more question: I think I went a little too far in trying to adhere to the author guideline about using the second person, and got away from the more important guideline of using an accessible tone. Using "we" is my instinct anyway: is it okay to use a few "we"s throughout to help with the tone? I can certainly fix the tonal issues without "we," but I thought I'd ask.
Go for it @jrladd! I just wanted to double check that there wasn't anything in conflict between the two reviews. I'm happy to have you bounce ideas off me if you have them or participate in a conversation about the revisions, but both of these reviews seem really helpful and clear to me. Let me know if you run into questions. Thanks again @statsmaths and @saraheconnell!
Let me check with the editorial board about the question of second person. Are you up for roughly February 21st as a deadline for the next round of revisions @jrladd?
Speaking of abstracts, there will be one at the top of the page on the PH site. So when you settle on language that you like you might add that to the metadata at the top of the file. I can also come up with one, but it's usually going to be better if the author gives language they want.
Update on the person question - all new lessons are going through copy editing. And those copy editors are likely going to try to revert any clear exceptions to be in keeping with those rules, which were developed with an international audience in mind. So if there are particular instances that feel especially off it might be worth rewriting the sentence to avoid the personhood entirely. I'm happy to help workshop possibilities for sentences that seem difficult.
Thanks @walshbr! That definitely clarifies things. I'm certain I can fix the tone and still keep to the rules.
Feb. 21st will be fine as a deadline. Should I push changes directly to this repo?
I'll add a new abstract that better describes the lesson than the original, and I'll definitely let you know as questions/issues come up. Thanks again to everyone for the help and advice.
Yep! Go ahead and push here @jrladd. We won't move to the production repo until things are pretty finalized.
Thanks again, all, for your helpful comments! I've just pushed some revisions. Here's an incomplete list of what I've done:
That's most of it! Unfortunately, as I was finishing all this, I started thinking more about @statsmaths advice about the difference between using raw term frequency and TF-IDF. Now that I see it all written out, I'm wondering if I should simply eliminate TF-IDF altogether. There's a lot of extra explanation (and some extra code) to get readers up to speed with it, and I don't think it adds much in terms of teaching people about distance.
Right now I've put in an explanation that you can do distance just over raw term frequency and then move on to TF-IDF. I could leave it as is, but here are two additional options:
What do you think of these options? I haven't added the official abstract or the similarity.py file yet, because a big change like the one I've mentioned above would likely affect these. I also need to redraw the graph that @saraheconnell pointed out. Again, many thanks for the feedback. I think the lesson has improved a lot.
Thanks for taking the time to address all of the feedback @jrladd ! I think the way you modified the discussion of the Manhattan distance metric (keeping it in, but downplaying whether it should be used over Euclidian distance) is excellent. In regards to the TF vs TFIDF differences, I could also see it going any of the three ways that you suggested. The current approach works because it goes straight to the "best" (more or less) approach; using just TF would work too because it better connects to the introduction; doing both balances both benefits but at the risk of being more complicated.
The one difficulty I still see is that (if I were using this as a resource in the classroom) I would really prefer to be able to work directly with the texts and metadata. The new version includes the relatively complex XML parsing code (making it look less accessible), but still does not give access to all of the metadata or the code for others to work directly from their raw texts. One way to simplify both of these things would be to have the tutorial work from a CSV file that contains the texts and metadata. I put one together based on the texts you were using:
What do you think about reading this CSV file into Python (pd.read_csv) as a starting point? You could then use sklearn.feature_extraction.text.CountVectorizer to produce the term frequency matrix (it is actually much easier to do it this way, because reducing the number of words to the top 1000 is handled by sklearn), and carry on with the rest of the lesson as written. I see this having a number of benefits: (1) it removes the need to introduce a lot of XML parsing code that is never really explained and is not the focus of the lesson, (2) it allows you to give the code for producing results of the nearest neighbours along with the metadata, (3) it gives readers the code for working with raw text, which makes it much easier to apply to other collections, (4) the function CountVectorizer has a lot of features that can be useful to play around with, and (5) this data is about half the size of the XML files, and safely fits under GitHub's file size limits. Related to (4), if you do decide to convert the entire lesson to just TFs, you may find that the results on the raw frequencies are much better if you remove the most frequent terms (i.e., max_df=0.7 in CountVectorizer), at least for the cosine similarity.
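The workflow suggested here could be sketched roughly as follows, assuming pandas, SciPy, and scikit-learn are installed. The three short texts and their IDs are invented stand-ins for the CSV, which is not reproduced in this thread; in practice the DataFrame would come from pd.read_csv, and the column names "id" and "text" are assumptions.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for pd.read_csv(...): hypothetical IDs and texts, one row per document.
corpus = pd.DataFrame({
    "id": ["A001", "A002", "A003"],
    "text": [
        "the king and the queen ruled the land",
        "the queen wrote a letter to the king",
        "the ship sailed from the harbour to the sea",
    ],
})

# Keep at most the 1000 most frequent terms, and drop any term that
# appears in more than 70% of the documents (here, "the").
vectorizer = CountVectorizer(max_features=1000, max_df=0.7)
counts = vectorizer.fit_transform(corpus["text"])

# Pairwise cosine distances between documents, labelled by ID.
distances = pd.DataFrame(
    squareform(pdist(counts.toarray(), metric="cosine")),
    index=corpus["id"],
    columns=corpus["id"],
)
print(distances.round(2))
```

In this toy corpus the first two texts share "king" and "queen", so they end up closer to each other than either is to the third. Swapping metric="cosine" for "euclidean" or "cityblock" gives the other two measures discussed in the lesson.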
Hopefully that's helpful. I think it's almost there!
Thanks for all your work on this! Let me know if I can do anything to help, but this all seems like sensible advice to me. Let me know if you have any questions for me @jrladd, and just ping me when you're all set and ready for me to take a look again.
Thanks for this, @statsmaths! I see the point about having plaintext and metadata side-by-side. And it would be nice to use CountVectorizer (or even TfidfVectorizer, which combines CountVectorizer with TfidfTransformer).
I'm a bit torn about presenting the texts in the way you demonstrate. The added complexities of wrangling the early modern texts into the right form for sklearn was the reason my first instinct was to start with pre-processed TF-IDF data. Your CSV would be really convenient, but a little far from what the real-world process would be for someone trying to replicate the process with different texts. As you point out though, most readers are going to be working with plaintext of some sort rather than XML in this very particular form, so it's not relevant to the tutorial.
How would this work for a compromise? I can convert the XML documents to a series of plaintext files (and point at an EarlyPrint tutorial if anyone is really curious about the XML stuff). Then I can still use CountVectorizer in the way you describe, and probably also skip TF-IDF. This will greatly simplify things while being more like a lot of humanities text collections I've worked with. With the space I'm saving, I can include a CSV of the metadata and a brief codeblock describing how to match the data to the IDs.
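A brief code block matching metadata to IDs might look something like the sketch below, assuming pandas is installed. The IDs, authors, titles, and distance values are placeholders invented for illustration; the real metadata CSV and its column names are not shown in this thread.

```python
import pandas as pd

# Hypothetical cosine distances from one text to three others, keyed by ID.
distances = pd.Series(
    {"A002": 0.60, "A003": 0.82, "A004": 0.35},
    name="cosine_distance",
)

# Stand-in for a metadata CSV loaded with pd.read_csv(..., index_col="id").
metadata = pd.DataFrame({
    "id": ["A001", "A002", "A003", "A004"],
    "author": ["Author A", "Author B", "Author C", "Author D"],
    "title": ["Title A", "Title B", "Title C", "Title D"],
}).set_index("id")

# Attach each distance to its metadata row and list the closest texts first.
nearest = metadata.join(distances, how="inner").sort_values("cosine_distance")
print(nearest)
```

Joining on the shared ID keeps the distance calculation separate from the metadata, so the same step works whichever distance measure produced the values.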
I can experiment with this a little today to make sure it will give results that'll be a good illustration. Your main point is really well-taken: the last thing I want is for the code to be overly complex and distract from the lesson about choosing among distance measures. @walshbr does this sound good to you?
That sounds good to me @jrladd!
Okay, @walshbr, I've made the changes, and I think it's ready for you to have a look.
Many thanks, again, all!
Ok! I'll take a look @jrladd thanks. And I'll let you know if I see anything else that needs doing before passing it off to our managing editor.
I'm still looking at this and will pass on further thoughts when I can @jrladd, but as I'm slowed down at work I thought I'd pass along a couple questions that I have already:
Any updates to your bio? It currently reads “John Ladd is a PhD candidate in literature at Washington University in St. Louis and postdoctoral fellow for Six Degrees of Francis Bacon at Carnegie Mellon.” Also note that on the old lesson you didn’t use a middle initial. If you want it, we’ll need to update the author bio accordingly. They’ll need to be consistent. Let me know.
Thoughts on a difficulty selection? There are currently three tiers, which can be set with the following numerical codes: 1 (Beginning), 2 (Intermediate), 3 (Advanced).
For the abstract - I think it might be worth thinking about the audience. This is from our editorial guidelines: “Try to avoid technical vocabulary when possible, as these summaries can help scholars without technical knowledge to try out something new.”
I was just wondering if there might be a way to make this just a tad more legible to a novice audience? "This lesson introduces three common distance measures for text analysis: city block distance, Euclidean distance, and cosine distance. You will learn the general principles behind similarity, the different advantages of these measures, and how to calculate each of them using the SciPy Python library."
Maybe it could be as easy as saying "three common distance measures for determining how similar texts are to one another:……" or something like that?
Thanks for catching these, @walshbr. Yes, let's update the bio to "John R. Ladd is a postdoctoral fellow in digital humanities at Northwestern University, where he teaches early modern studies and computational approaches to literature." If it's not too much trouble, it would be nice to add the middle initial, for consistency with stuff outside ProgHist.
Maybe the difficulty is a 2? Your judgment on this will be better than mine, but that seems right since it's focused on general concepts rather than a lot of code. But I could see how it might be a 3 as well.
Your proposed change to the abstract sounds good. Here's a new version:
"This lesson introduces three common measures for determining how similar texts are to one another: city block distance, Euclidean distance, and cosine distance. You will learn the general principles behind similarity, the different advantages of these measures, and how to calculate each of them using the SciPy Python library."
Should I update it in the file?
Nah I can do the updating! I'll write back more when I get a chance. Was doing the easy stuff before I actually go through and do a final read.
Just a note that “block” is best understood by North Americans. People in the UK know it from TV more than as a term used locally, since roads there are often hundreds or thousands of years old and predate modern city planning.
--
Adam Crymble Chair, ProgHist Ltd. ProgHist Ltd is a Not for Profit Company Limited by Guarantee, Registered in England, Company Number 12192946 https://programminghistorian.org/
Thanks @acrymble. In this case, "city block distance or 'Manhattan'" is the actual name of the distance metric, so I think that's probably the best way to describe it. And in the context of the lesson @jrladd mentions the name's regional specificity.
@jrladd - do you have any thoughts? I think the only point for editing here might be to explicitly say something like "The graph here resembles the grid-like layout of North American city streets, in particular, which is where the name comes from." That way you're addressing the implied knowledge about NYC, which not everyone might have in a global readership.
Yes, I think defining it in the way you suggest would be a good solution. Thanks, @acrymble, I hadn't thought of that.
Cool thanks! @jrladd would you mind adding that component in so that it is your language / in a way you're happy with? I finished all the proghist metadata / avatar / plumbing stuff just now. I'll read through the lesson and run the code blocks one last time to check things in the next few days before handing any final revisions / pieces I noticed back to you.
@walshbr I've just pushed a change with more language about city blocks (and also fixed a typo I found). Let me know whenever there's more to do!
Hey @jrladd - I finished reading through this just now. It looks good to me! I also ran the code blocks again, and everything appears to be working. My only note was that your code comments are sometimes punctuated with a period and sometimes not. It might be good to have them be consistent, but, if that were the case, they would be the only example of grammatically correct code comments I've ever seen.
@svmelton - this is ready to advance. Here are my notes for you and the files to move forward. Let me know when you're ready to merge so I can stick the new tweets on the bot spreadsheet.
The author has a bio in ph_authors.yml. He wants to update the name on it from “John Ladd” to “John R. Ladd.” This will also require updating the metadata for the “exploring and analyzing network data with python” lesson to be “John R. Ladd” and then the ph_authors.yml file bio should be updated to “John R. Ladd is a postdoctoral fellow in digital humanities at Northwestern University, where he teaches early modern studies and computational approaches to literature”
assets/common-similarity-measures - asset files
images/common-similarity-measures - image files
lessons/common-similarity-measures.md - the lesson file
gallery/common-simiarlity-measures.png - the modified avatar
gallery/originals/common-similarity-measures.png - the original avatar
Great! Many thanks to you, @walshbr, and to everyone for the time you’ve put into this. Can’t wait to share it with people.
Thank you @walshbr! I'll take a look in the next few days.
Just a heads up that I've reached out to get the copyediting process started. I'll keep you posted once we have someone confirmed and the process is underway! @walshbr @jrladd
Hi @walshbr and @jrladd! Thanks for your patience—the copyedits are up.
Thanks @svmelton! @jrladd let me know if you have any questions as you work on these.
Thanks very much, @svmelton and @walshbr! I've just pushed all the changes from the copyedits. Let me know if there's more I can do.
The Programming Historian has received the following tutorial on 'Understanding and Using Common Similarity Measures' by @jrladd. This lesson is now under review and can be read at:
http://programminghistorian.github.io/ph-submissions/lessons/common-similarity-measures
Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.
I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.
Members of the wider community are also invited to offer constructive feedback which should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.
I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @amandavisconti if you feel there's a need for an ombudsperson to step in.
Anti-Harassment Policy
This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.
[Permission to Publish]
The editor must also ensure that the author or translator post the following statement to the Submission ticket.