programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions
135 stars 112 forks source link

Review Ticket: Analyzing Documents with Tf-idf #206

Closed ZoeLeBlanc closed 5 years ago

ZoeLeBlanc commented 5 years ago

The Programming Historian has received the following tutorial on 'Analyzing Documents with Tf-idf' by @mjlavin80 . This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/tf-idf

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum.

Members of the wider community are also invited to offer constructive feedback which should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to requests for clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

ZoeLeBlanc commented 5 years ago

Thanks @mjlavin80 for this great submission! I've gone through the lesson now twice and each time it gets better each time. I have very few suggestions (two really) and we can discuss them more in depth if you have questions. I've also fixed the endnotes that weren't rendering properly, but I had to reorder them to the end of the lesson, so fyi. Once you've gone through my suggestions and we've agreed to the changes, I'll solicit some reviewers to get their feedback on the lesson. Do you think you would be able to get these changes in by the end of January?

Suggestions:

  1. I think this tutorial would benefit from a short précis ( maybe just two-three sentences) before the preparation section, describing why someone might want to read the lesson and the materials you cover. Maybe something along the lines of you've heard of TF-IDF or topic modeling and are interested in text mining, this lesson covers these topics, etc... A great example is the recent Temporal Network Analysis with R lesson we published, though that's a bit longer than what I was imagining. We can talk about this suggestion more if you have questions or thoughts.

  2. Completely optional, but I might add a citation to Ben Schmidt's Do Digital Humanists Need to Understand Algorithms? from Debates in the Digital Humanities to your line at P16 Mathematical equations like these can be a bit bewildering if you’re not used to them. for those interested in the question of how much digital humanists need to understand algorithms

  3. I don't love the variables names (X and myarray) at P25 and 26 though I realize they are kind of standard in text mining tutorials. Could we potentially use something more descriptive, like transformed_docs and matrix_to_array ?

Some things I especially enjoyed in your lesson:

  1. Absolutely love your initial description of TF-IDF, as well as your walkthrough and comparison of term frequencies vs TF-IDF term weights. It's one of the clearest and interpretable I've seen, so thank you!
  2. Appreciated at P30 how you show the code and then go through explaining each line's function.
  3. Your sections on Interpreting Word Lists: Best Practices and Cautionary Notes and Some Ways Tf-idf Can Be Used in Computational History are so good that I won't be surprised if they end up on a lot of syllabi once this is published.

Let me know if anything is unclear and if you need longer to work on the suggestions I posted. Thanks again for submitting such a great lesson and looking forward to getting it out to reviewers!

ZoeLeBlanc commented 5 years ago

Heard from @mjlavin80 on January 3, 2018 via email and just moving his reply here for posterity and project management. Matt's response:

Matt agreed to the end of January deadline. Once his changes are pushed up, we'll start soliciting reviewers 🎉

Matt

mjlavin80 commented 5 years ago

I have now completed the changes @ZoeLeBlanc recommended in her Dec. 22 response. I pushed the update to the gh-pages branch under the commit message "editor-recommended changes." Cheers, and looking forward to the peer reviewers' responses!

ZoeLeBlanc commented 5 years ago

Thanks @mjlavin80 for implementing these changes, everything looks great to me! I'm currently soliciting reviewers and I'll update here once they've agreed.

ZoeLeBlanc commented 5 years ago

@quinnanya and @broomgrass have agreed to serve as our lesson reviewers 🎉 . They've agreed to a submission date for their reviews of April 1, 2019.

Just a reminder reviewer guidelines can be found here https://programminghistorian.org/en/reviewer-guidelines, and that we're following an open review process. All reviews should be posted in this ticket thread and feel free to look at previous submission tickets for examples. Also please post here or email me any questions you have about the process.

This ticket will be open to any other feedback from our community while we wait for our reviewers to submit. Once we have received both reviewers' responses, the open review period will then be closed. I will summarize the feedback from both reviewers as well as any other input from our community, and work with @mjlavin80 to decide what further editing the lesson will need.

Please do not make any edits to the lesson draft @mjlavin80 until we've received both reviews and I've written up my summary - that way I have a chance to prioritize comments and work out any conflicting advice.

Thanks to our reviewers and looking forward to working together 😊

quinnanya commented 5 years ago

Thanks for writing this, @mjlavin80! I can imagine it being really helpful to point students to for a thorough explanation both of what tf-idf refers to, and one way to implement it. I really like the examples at the end, and the explanation of the implications for stemming or lemmatization. The comparison to other text analysis methods is also really helpful, particularly the section on topic modeling.

I wasn't able to find the Jupyter notebook or set of scraped documents to actually run this, but I'd be happy to try it out if you can point me to them and I can write up my experience doing it.

On the text itself, I've got three different sets of feedback: substantive things, organizational tweaks, and typos. I'll go through them in that order, even though that means jumping around a bit in the tutorial itself. Vis-a-vis organizational tweaks, there were multiple points where I made a note to myself with a question, only to discover that you'd answered it below the table, code, etc., I was looking at. Moving some of those explanations above the breaks in the text might be effective for preempting confusion.

Substantive stuff

Rearranging

Typos / phrasing

ZoeLeBlanc commented 5 years ago

Thanks to @quinnanya for her extensive review 👏. Once @broomgrass submits her review, I'll summarize both and provide some guidance on revisions @mjlavin80, and then we can discuss changes to the lesson.

quinnanya commented 5 years ago

@ZoeLeBlanc I wasn't actually able to find the notebook or code to run the lesson itself. Do you have a pointer, or is that not usually ready at this stage?

ZoeLeBlanc commented 5 years ago

@quinnanya I think the repository Matt is referring to is this one https://github.com/mjlavin80/tf-idf-programming-historian-draft . I think the idea is to create a jupyter notebook in that repo though I still need to clarify with him. I think the ideal solution would be to move data and a jupyter notebook example into the assets folder here. Thanks for bringing this up!

mjlavin80 commented 5 years ago

@quinnanya, thanks for your response! From what I've read so far, you've provided really thoughtful and constructive feedback. @ZoeLeBlanc is right that the lesson files are located at https://github.com/mjlavin80/tf-idf-programming-historian-draft. This repo is linked in the draft lesson, but I suspect the link needs to be more prominent. However, if I recall correctly, I didn't put a Jupyter Notebook into the repo because I assumed the reader would make their own empty notebook and paste code blocks as they went through the lesson. That said, I'm happy to make a notebook and add it to the repo.

quinnanya commented 5 years ago

@ZoeLeBlanc Thanks! Would it help for me me to create a draft notebook with the code mentioned and try running the lesson on the data in that directory, and post more feedback?

Oops, @mjlavin80, just saw your comment. I usually prepackage notebooks myself, so wasn't thinking about copying and pasting. Might be worth staring explicitly?

I'll try that and add any comments. Thanks!

broomgrass commented 5 years ago

Hi all! Just working through the final draft of my review, and @quinnanya covered many of the similar things!

I had similar questions about the repo - I ended up downloading it and making a new Jupyter notebook in the folder, which worked, but I think making the process a little more explicit at an earlier part of the lesson would be helpful.

Should have my review up and posted within a couple hours! :)

broomgrass commented 5 years ago

Hi everyone! Overall, I think this is a very clear and concise description of tf-idf, and I’m thankful to have it for my own reference! In particular, I think it works well to introduce someone to tf-idf both for its own sake and as a precursor/part of other methods – or, in other words, it certainly fulfills the stated purpose of the tutorial.

I also appreciate that @mjlavin80 identifies not just where the methods can be used for exploration, but also some of the assumptions and pitfalls to watch out for. There is more praise for particular elements that jumped out at me in the running commentary below, interspersed with some typos that I noticed and some other suggestions that may help improve the tutorial.

Happy to answer any follow-up questions, etc, if anything is unclear!


ZoeLeBlanc commented 5 years ago

Thanks @quinnanya & @broomgrass for these in depth reviews 👏 . I'll need a bit of time to summarize them but should have that done by next Monday April 8th. I'll ping you @mjlavin80 once I post them here, and then we can discuss the recommendations and determine a timeline for you to incorporate these changes into the lesson.

mjlavin80 commented 5 years ago

Roger that. Thank you @quinnanya, @broomgrass, and @ZoeLeBlanc!

ZoeLeBlanc commented 5 years ago

Just a quick update! Apologies for the delay I came down with a case of bad allergies the last few days. Currently working on consolidating both reviews and should have it here by Saturday April 13. Sorry again!

mjlavin80 commented 5 years ago

No problem @ZoeLeBlanc. Next week is our last week of classes, so I probably won't be able to look at it until Friday, April 19 anyway.

ZoeLeBlanc commented 5 years ago

Thanks @broomgrass and @quinnanya for such comprehensive reviews! I've compiled and consolidated both their reviews, and organized them by paragraph. @mjlavin80 if you have any questions or concerns feel free to message me here. I realize there's a lot of feedback to incorporate, so would a deadline of early June to address these suggestions and fix any typos make sense? If you need more time please let me know. Looking forward to getting this lesson published 🎉

To Dos:

mjlavin80 commented 5 years ago

@ZoeLeBlanc I just pushed a round of changes. I'm not sure I've resolved everything, but I want to ask you to look at what's there before I do more. Here's a summary of which edits I left incomplete and why:

Code for this lesson is written in Python 3.6, but you can run tf-idf several different versions of Python, using one of several packages, and you can find tf-idf implementations in various other programming languages.

I also updated the bibliography and re-numbered the endnotes. I'll want to double check those once everything else is set. Please do let me know what you think of all this. Thanks!

mjlavin80 commented 5 years ago

@ZoeLeBlanc I'm just pinging this thread in case you didn't notice my last post. I'm holding off on doing any more edits before I hear back from you. If you're just busy doing other things, apologies for nagging.

ZoeLeBlanc commented 5 years ago

@mjlavin80 sorry and thanks for pinging me! For some reason didn't see your reply in my Github notifications. Let me go through your response and I'll post an updated reply today. Thanks for getting back to me so quickly 👍

ZoeLeBlanc commented 5 years ago

Here's my feedback @mjlavin80 . Let me know what you think and if you have questions.

  1. Re: Jupyter Notebook. Let me ask the editorial board about whether we want to have a Jupyter notebook with the code already pasted in the folder. I agree that copying and pasting is more likely to encourage going through the whole tutorial, but I'm not sure if we have a set policy already on this.
  2. Prior skills looks good to me
  3. Really like the changes to p3/p4 and think it's much clearer now with the footnote too. I would go ahead and tick off the remaining to-do boxes with these changes.
  4. Think p6 reads great, and had no idea about term specificity, but actually really like that term too. Also appreciated the additions in p9 to the restaurant analogy, think that's a lot more clear now.
  5. p20 agree that skew might be misleading here. I think part of the issue is trying to briefly explain normalization and all its tradeoffs is difficult. I was trying to help you word something that explains how when multiplying a number smaller than one it changes the overall distribution of the data. Maybe something along the lines of "which preserves the original distribution of the data". Happy to discuss the exact wording though if you have thoughts.
  6. p23 👍
  7. p32 "We can use the TFIDFVectorizer class's get_feature_names() method" sounds good to me.
  8. I'll get back to you on Windows (along with Jupyter notebooks)
  9. p42 agree that detailing how to test for stability is probably too much to cover here (might be a topic for separate lesson even). Perhaps you could just add a parentheses in p43 after "robust results should be stable enough to appear with various settings" add "(Some of these settings are covered in the "Scikit-Learn Settings" section.)
  10. Numbering seems to be fixed. Github markdown is so weird sometimes but thanks for finding that bug.
  11. p 58 👍
  12. I think the additional footnotes and references are really comprehensive so thank you for adding them!

Final thoughts reading over:

These were the only changes I would recommend at the moment, but I'll read it closely again before we hit publish. I think the lesson is looking really great though and thank you for all your hard work!

Let me know if you have any additional questions, but I would go ahead and make these changes, check off the rest of the todos, and I'll get back to you about the Windows and Jupyter notebooks policy.

mjlavin80 commented 5 years ago

@ZoeLeBlanc I just pushed changes. I've checked off the rest of the to-do items, all except for those relating to a Jupyter Notebook and handling paths in Windows. Thanks again for all your help!

mjlavin80 commented 5 years ago

@ZoeLeBlanc one quick follow-up: if you or others are still mulling a solution for Windows and Mac file path compatibility in Python, has anyone looked at pathlib? It’s part of Python 3's standard library, and it seems very straightforward. I'd be happy to add it to my code.

ZoeLeBlanc commented 5 years ago

Thanks so much @mjlavin80 for checking off these todos. I'm going to read through the lesson once more today but I think we pretty close to having this go live.

I consulted with the board and I think the conclusion is that we would like a Jupyter Notebook to be included in your lessons files zip that includes the code of the lesson. Having just checked out pathlib (which I didn't know about so thanks for the rec!), it seems like a much more interoperable library than os. I think if you just change the following code blocks in the lesson and push up a notebook with all the code, we should be good to go.

Let me know anything doesn't look right, and thanks for all your hard work!

P25

import os
all_txt_files =[]
for root, dirs, files in os.walk("txt"):
    for file in files:
        if file.endswith(".txt"):
            all_txt_files.append(os.path.join(root, file))
# counts the length of the list
n_files = len(all_txt_files)
print(n_files)

Should become 👇

from pathlib import Path
all_txt_files =[]
for file in Path("txt").rglob("*.txt"):
     all_txt_files.append(os.path.join(file.parent, file.name))
# counts the length of the list
n_files = len(all_txt_files)
print(n_files)

and P33

import pandas as pd
import os

# make the output folder if it doesn't already exist
if not os.path.exists("tf_idf_output"):
    os.makedirs("tf_idf_output")

[rest of code be fine]

Should become 👇

import pandas as pd

# make the output folder if it doesn't already exist
Path("./tf_idf_output").mkdir(parents=True, exist_ok=True)
mjlavin80 commented 5 years ago

@ZoeLeBlanc I tweaked one thing in your suggested code blocks above, as year p25 edit still contained a reference to os.path.join(). os can otherwise be removed entirely from the lesson, so I did the following:

from pathlib import Path

all_txt_files =[]
for file in Path("txt").rglob("*.txt"):
     all_txt_files.append(file.parent / file.name)
# counts the length of the list
n_files = len(all_txt_files)
print(n_files)

I tested this is a Jupyter Notebook on my mac, but I'm guessing someone should try it all on a windows machine.

Lastly, I changed the text of my lesson to refer to pathlib instead of os.walk(), and I changed how I described using a Jupyter Notebook to make reference to the one in the lesson files '.zip' file. I believe that's everything! Thanks!

ZoeLeBlanc commented 5 years ago

Two quick questions @mjlavin80:

  1. Do you have a short bio I could use for your author information? I could also just condense your bio from here http://www.english.pitt.edu/person/matthew-j-lavin but wanted to ask your preference first.

  2. We need a short abstract from the lesson. What do you think of this? 👇 "This lesson focuses on a foundational natural language processing and information retrieval method called Term Frequency - Inverse Document Frequency (tf-idf). This lesson explores the foundations of tf-idf, and will also introduce you to some of the questions and concepts of computationally oriented text analysis."

This is an example abstract from a recent lesson: "Machine learning and API extensions by HathiTrust and Internet Archive are making it easier to extract page regions of visual interest from digitized volumes. This lesson shows how to efficiently extract those regions and, in doing so, prompt new, visual research questions."

Feel free to propose changes to the abstract and I'll be pushing up the branch tonight on the main jekyll site 🎉 Thanks!

mjlavin80 commented 5 years ago

@ZoeLeBlanc the abstract looks great. For my bio, since they tend to run shorter than what's on my department page, how about this: "Matthew J. Lavin is a Clinical Assistant Professor of English and Director of the Digital Media Lab at the University of Pittsburgh. His current scholarship and teaching focus on book history, cultural analytics, turn-of-the-twentieth-century U.S. literature and culture."

ZoeLeBlanc commented 5 years ago

Perfect! Thanks so much Matt!

I'm adding a few links and copy edits but your lesson should hopefully be live by the end of this week 🎉

ZoeLeBlanc commented 5 years ago

Sorry for delay everyone! My computer died this last week and so it's been a frantic haze of data recovery.

But great news lesson is officially live and tweets go out tomorrow!!!! Thank you so much @mjlavin80 for writing this excellent lesson, and @quinnanya & @broomgrass for being such generous reviewers! Also HT to @acrymble for giving me guidance throughout this process.

Official link is here https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf and please feel free to tweet & retweet 🎉

ZoeLeBlanc commented 5 years ago

@mjlavin80 do you still have a twitter account? If you could send your handle that would be great but no worries if you don't have one at the moment. Just wanted to check before I started tweeting! Thanks!!!

mjlavin80 commented 5 years ago

@ZoeLeBlanc I'm not on twitter anymore, but I'm happy to help promote the piece in other ways!

ZoeLeBlanc commented 5 years ago

No worries @mjlavin80! Totally understand and just wanted to double check in case your handle had changed. I'm happy to tweet about this lesson since I think it needs to go on every DH syllabus ASAP 😄

acrymble commented 5 years ago

@ZoeLeBlanc is this ready to close? Is the editorial checklist fully complete?