Review Ticket: Analyzing Documents with Tf-idf

ZoeLeBlanc commented 5 years ago

The Programming Historian has received the following tutorial on 'Analyzing Documents with Tf-idf' by @mjlavin80 . This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/tf-idf

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum.

Members of the wider community are also invited to offer constructive feedback, which they should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

ZoeLeBlanc commented 5 years ago

Thanks @mjlavin80 for this great submission! I've gone through the lesson twice now, and it gets better each time. I have very few suggestions (two really) and we can discuss them more in depth if you have questions. I've also fixed the endnotes that weren't rendering properly, but I had to reorder them to the end of the lesson, so fyi. Once you've gone through my suggestions and we've agreed to the changes, I'll solicit some reviewers to get their feedback on the lesson. Do you think you would be able to get these changes in by the end of January?

Suggestions:

1. I think this tutorial would benefit from a short précis (maybe just two or three sentences) before the preparation section, describing why someone might want to read the lesson and the materials you cover. Maybe something along the lines of "you've heard of TF-IDF or topic modeling and are interested in text mining, this lesson covers these topics," etc. A great example is the recent Temporal Network Analysis with R lesson we published, though that's a bit longer than what I was imagining. We can talk about this suggestion more if you have questions or thoughts.
2. Completely optional, but I might add a citation to Ben Schmidt's "Do Digital Humanists Need to Understand Algorithms?" from Debates in the Digital Humanities to your line at P16, "Mathematical equations like these can be a bit bewildering if you’re not used to them," for those interested in the question of how much digital humanists need to understand algorithms.
3. I don't love the variable names (X and myarray) at P25 and 26, though I realize they are kind of standard in text mining tutorials. Could we potentially use something more descriptive, like transformed_docs and matrix_to_array?

Some things I especially enjoyed in your lesson:

Absolutely love your initial description of TF-IDF, as well as your walkthrough and comparison of term frequencies vs TF-IDF term weights. It's one of the clearest and interpretable I've seen, so thank you!
Appreciated at P30 how you show the code and then go through explaining each line's function.
Your sections on Interpreting Word Lists: Best Practices and Cautionary Notes and Some Ways Tf-idf Can Be Used in Computational History are so good that I won't be surprised if they end up on a lot of syllabi once this is published.

Let me know if anything is unclear and if you need longer to work on the suggestions I posted. Thanks again for submitting such a great lesson and looking forward to getting it out to reviewers!

ZoeLeBlanc commented 5 years ago

Heard from @mjlavin80 on January 3, 2018 via email and just moving his reply here for posterity and project management. Matt's response:

[x] items 1 and 2, I'll stick to your guidance. I think both revisions will improve the lesson.
[x] in terms of variable names (item 3), I will change them to names that are more descriptive of what they do, and I'll also add a brief note about why I've chosen the variable names I've used. I think that's an opportunity to discuss (briefly) a subject that coders should think about and discuss more often, so I really appreciate you flagging that.

Matt agreed to the end of January deadline. Once his changes are pushed up, we'll start soliciting reviewers 🎉

Matt

mjlavin80 commented 5 years ago

I have now completed the changes @ZoeLeBlanc recommended in her Dec. 22 response. I pushed the update to the gh-pages branch under the commit message "editor-recommended changes." Cheers, and looking forward to the peer reviewers' responses!

ZoeLeBlanc commented 5 years ago

Thanks @mjlavin80 for implementing these changes, everything looks great to me! I'm currently soliciting reviewers and I'll update here once they've agreed.

ZoeLeBlanc commented 5 years ago

@quinnanya and @broomgrass have agreed to serve as our lesson reviewers 🎉 . They've agreed to a submission date for their reviews of April 1, 2019.

Just a reminder that the reviewer guidelines can be found at https://programminghistorian.org/en/reviewer-guidelines, and that we're following an open review process. All reviews should be posted in this ticket thread, and feel free to look at previous submission tickets for examples. Also, please post here or email me any questions you have about the process.

This ticket will be open to any other feedback from our community while we wait for our reviewers to submit. Once we have received both reviewers' responses, the open review period will then be closed. I will summarize the feedback from both reviewers as well as any other input from our community, and work with @mjlavin80 to decide what further editing the lesson will need.

Please do not make any edits to the lesson draft @mjlavin80 until we've received both reviews and I've written up my summary - that way I have a chance to prioritize comments and work out any conflicting advice.

Thanks to our reviewers and looking forward to working together 😊

quinnanya commented 5 years ago

Thanks for writing this, @mjlavin80! I can imagine it being really helpful to point students to for a thorough explanation both of what tf-idf refers to, and one way to implement it. I really like the examples at the end, and the explanation of the implications for stemming or lemmatization. The comparison to other text analysis methods is also really helpful, particularly the section on topic modeling.

I wasn't able to find the Jupyter notebook or set of scraped documents to actually run this, but I'd be happy to try it out if you can point me to them and I can write up my experience doing it.

On the text itself, I've got three different sets of feedback: substantive things, organizational tweaks, and typos. I'll go through them in that order, even though that means jumping around a bit in the tutorial itself. Vis-a-vis organizational tweaks, there were multiple points where I made a note to myself with a question, only to discover that you'd answered it below the table, code, etc., I was looking at. Moving some of those explanations above the breaks in the text might be effective for preempting confusion.

Substantive stuff

p2: I was a little confused about the quote from Ted Underwood upfront. I like the idea of framing this as a point of entry into an area of discussion in computationally oriented disciplines. Maybe it's just the wording of that quote, but I wasn't 100% sure that what the quote was referring to was the same thing as the area of discussion you were referencing earlier in that paragraph. It might help to paraphrase the quote and append it to the previous sentence, e.g. "a bigger subject in computationally oriented disciplines: how to identify words or phrases that characterize an author or genre" (and maybe include a citation to the Underwood post if you want there).
p3/4: I had to read through these paragraphs multiple times, starting with: "many of the historic figures are well known, which suggests a self-conscious effort to look back at the history of The New York Times and select obituaries based on some criteria. In short, this isn’t a representative sample of historic obituaries, it’s a recent collection." I'm not totally sure this follows. Maybe NY Times primarily publishes obituaries of famous people, so a representative sample could turn up familiar names? Also, "representative sample" and "recent collection" aren't necessarily mutually exclusive things. I got more befuddled in p4, when you mention that the dataset hasn't been updated since 2011, and it's been replaced by a sleeker blog on another page. Does the sleeker blog use the same data set (not updated since 2011) but just looks nicer? And I really got lost with "It represents, on some level, how the questions inclusion and representation might affect both the decision to publish an obituary, and the decision to highlight a particular obituary many years later. " I wasn't sure what the "it" was -- the new site? The original data set? The fact that they're different (if they are different)? And this is the first introduction of the issues of inclusion and representation -- which become clearer in the following sentences, but seem a little out of nowhere when they're introduced. I think it might help to be a little more straightforward rather than suggestive with this paragraph. If I'm correctly reading between the lines, what you've scraped is a curated set of obituaries of more or less famous people, dating to 2011 (or earlier -- but not later) with few women and fewer non-white people. If you state that clearly upfront, it sets the stage for the reflections on questions of inclusion and representation, and the quote from Padnani and Bennett.
p5: I found your initial definition/explanation of tf-idf really clear and understandable. More understandable, at least for me, than the quote in p6, to the point where I might consider leaving the quote out, and just having a reference to Spärck Jones for further background.
p6: I see where you're going with the restaurant analogy, but like with the quote, I found it a little more difficult to follow than the initial explanation (though this might be me, and it might resonate for someone else.) One thing that'd help me parse the restaurant analogy is making it more concrete -- maybe comparing a chain restaurant that gets good reviews (e.g. a McDonald's or Subway with fresh produce, quick service, friendly staff, etc.) but you can have the same meal anywhere, vs. a local restaurant with food personally made by a local chef.
p13: "For example, Scikit-Learn’s implementation represents N as N+1, calculates the natural logarithm of (N+1)/dfi, and then adds 1 to the final result." -- this leaves me wondering why this is, or what sort of normalization it accomplishes. Relatedly from p19, "Different normalization schemes would produce different scales." That one leaves me wondering "what are the other options?" , "are some scales better for some kinds of use cases than others?", "can you choose which normalization happens in the Scikit-Learn algorithm or any other?". It might be worth providing a pointer, if not a quick explanation, for some of those questions. For a specific situation where it might be relevant, I'm not sure if HTRC provides tf-idf for its in-copyright books, but if you have your own texts that you want to compare to a corpus where you can't run tf-idf yourself (but have those values), it'd be important to use the same normalization to be able to compare them.
p28: "Text processing steps like tokenization and removing punctuation will happen automatically when we use Scikit-Learn’s TfidfVectorizer to convert documents from a list of strings to tf-idf scores." -- Are there easy add-ons for languages with more complicated tokenization? (e.g. Chinese, Arabic, Japanese?) Do you need to preprocess your text that way (punctuation stripping and tokenization) first before running tf-idf?
p50: "it’s best to understand exactly what each setting does so that you can describe and defend the choices you’ve made" -- this amounts to a feature request, and it's totally fair to say it's outside the scope of this piece, but I'd love to see an example of this (either your own, or a reference to an article where someone does it).
p51: "preselected list of high frequency function words": if there's a list of these in the Scikit-Learn settings documentation, it might be worth mentioning that (or where else you can look to see what specifically is being filtered out)
p54: I love the pointer for more explanation of what's going on with the first thing you mention, and it'd be great to have similar pointers for the other two. Why would you want an extra pretend document? What's the benefit of log(tf)?
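For anyone following the normalization questions above (p13, p19, p54), the variants can be sketched in plain Python. This is only a minimal illustration, not the lesson's code: the function names and the example count of 5 are hypothetical, and the smoothed formula reflects my reading of the scikit-learn documentation for smooth_idf=True, where one is added to both the document total and the document frequency (the "extra pretend document"). The sketch leaves out the L2 row normalization that TfidfVectorizer also applies by default.

```python
import math


def idf_plain(n_docs, doc_freq):
    """Textbook inverse document frequency: ln(N / df)."""
    return math.log(n_docs / doc_freq)


def idf_plus_one(n_docs, doc_freq):
    """ln(N / df) + 1, so the weight never drops below 1."""
    return math.log(n_docs / doc_freq) + 1


def idf_smoothed(n_docs, doc_freq):
    """ln((N + 1) / (df + 1)) + 1, i.e. one extra pretend document
    that contains every term (assumed to match smooth_idf=True)."""
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1


def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(tf), for counts of 1 or more."""
    return 1 + math.log(count)


n_docs = 366  # size of the obituary corpus described in the lesson
count = 5     # hypothetical raw count of a term in one obituary

for doc_freq in (1, 50, 366):
    print(
        f"df={doc_freq:>3}",
        f"plain={count * idf_plain(n_docs, doc_freq):.3f}",
        f"plus_one={count * idf_plus_one(n_docs, doc_freq):.3f}",
        f"smoothed={count * idf_smoothed(n_docs, doc_freq):.3f}",
        f"smoothed_sublinear={sublinear_tf(count) * idf_smoothed(n_docs, doc_freq):.3f}",
    )
```

A term found in all 366 documents gets an idf of 0 under the plain formula, so its tf-idf collapses to 0 whatever its count. Adding 1 keeps the weight at or above the raw count, and taking 1 + ln(tf) dampens terms that are very frequent within one document. Scores are only comparable across corpora when the same variant is used.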

Rearranging

In p1, I think it'd be good to expand the tf-idf acronym upfront, even if you wait until later to spell out what all it means. At least for myself, even if I know what an acronym means conceptually, I want to know what it stands for early on in case someone asks me.
"Suggested prior skills" point 1 -- you mention being "comfortable with basic types and operations", but I wasn't 100% sure what all you were counting under that. An enumeration of those things comes in point 2; it might be clearer to just put those together in point 1, and then provide the pointer to the intro to Python materials if people aren't familiar.
p12 notes "this version has a second column that represents the number of documents in which each term can be found" -- the first questions that come to mind, from a hypothetical newcomer, are "why Df?" You get into that immediately after the table, but moving even just the first sentence of p13 up above the table could head off that question.
p29 "convert the sparse matrices to a numpy array" -- numpy arrays get explained after the code block. Moving p30 to the end of p29 might help alleviate confusion.
p39: "Each list is dominate by individualized words (proper names, geographic places, companies, etc.) but I could screen these out using my tf-idf settings" -- how would you do that? Is NER built in? There's another reference made to tf-idf settings later, in p42. With both of those, it'd help to have a flag pointing you to the section on relevant tf-idf settings; when I was reading, it felt like an omission until I got to the bottom. You've done a great job thinking through what additional information is needed, it just needs a few signposts.
p40 "I’ve used boldface to indicate terms that seem overtly related to authorship or writing" -- it'd help to have this before the table, rather than after it, so people know what the bold is about when they're looking at the table.

Typos / phrasing

"Suggested prior skills" point 1 -- you say "tf-idf is available in many versions of Python and other programming languages". I know what you're getting at, but I can see it being a little confusing if someone is totally new to this, because it reads a little like a brand name. (e.g. something like "the Google Translate API is available for many versions of Python".) Consider rephrasing to something like "there are packages that enable you to run tf-idf in many versions of Python and other programming languages"?
p4: typo "This obituary corpus also an historical object in its own right."
p4: typo "how the questions inclusion and representation might affect"
p4: typo "In march 2018"
p25: typo "The to most common variable naming"
p36: phrasing: "converts each merged term/score pairs to a pandas dataframe" -- it might be worth putting in a brief explanation of what that is
p37: typo "If you the code excerpts above"
p38: typo "these terms lists will be helpful"
p38: typo or confusing wording "definitive in terms to defending claims"
p46-49: the formatting of this is off, they're all noted as 1 rather than increasing numbers
p49: not totally sure what "item 2" refers to; is it the previous thing in this same set of examples?
p49: "As I will show in the next section, tf-idf can also be used to cull machine learning feature lists and, often, building a model with fewer features is desirable...." The next section I'm seeing is settings, and I'm not sure this shows up explicitly later, but maybe I'm missing something. A more specific pointer might help.
p58: typo "measures that attempt to indicated"
p58: typo "A Chi-square test, for example, we can evaluate the relationship"
p58: typo "a P-value indicating probability of encountering" (the probability?)
p59: typo "Tf-idf is espeically appropriate"

ZoeLeBlanc commented 5 years ago

Thanks to @quinnanya for her extensive review 👏. Once @broomgrass submits her review, I'll summarize both and provide some guidance on revisions @mjlavin80, and then we can discuss changes to the lesson.

quinnanya commented 5 years ago

@ZoeLeBlanc I wasn't actually able to find the notebook or code to run the lesson itself. Do you have a pointer, or is that not usually ready at this stage?

ZoeLeBlanc commented 5 years ago

@quinnanya I think the repository Matt is referring to is this one https://github.com/mjlavin80/tf-idf-programming-historian-draft . I think the idea is to create a jupyter notebook in that repo though I still need to clarify with him. I think the ideal solution would be to move data and a jupyter notebook example into the assets folder here. Thanks for bringing this up!

mjlavin80 commented 5 years ago

@quinnanya, thanks for your response! From what I've read so far, you've provided really thoughtful and constructive feedback. @ZoeLeBlanc is right that the lesson files are located at https://github.com/mjlavin80/tf-idf-programming-historian-draft. This repo is linked in the draft lesson, but I suspect the link needs to be more prominent. However, if I recall correctly, I didn't put a Jupyter Notebook into the repo because I assumed the reader would make their own empty notebook and paste code blocks as they went through the lesson. That said, I'm happy to make a notebook and add it to the repo.

quinnanya commented 5 years ago

@ZoeLeBlanc Thanks! Would it help for me to create a draft notebook with the code mentioned and try running the lesson on the data in that directory, and post more feedback?

Oops, @mjlavin80, just saw your comment. I usually prepackage notebooks myself, so wasn't thinking about copying and pasting. Might be worth stating explicitly?

I'll try that and add any comments. Thanks!

broomgrass commented 5 years ago

Hi all! Just working through the final draft of my review, and @quinnanya covered many of the same things!

I had similar questions about the repo - I ended up downloading it and making a new Jupyter notebook in the folder, which worked, but I think making the process a little more explicit at an earlier part of the lesson would be helpful.

Should have my review up and posted within a couple hours! :)

broomgrass commented 5 years ago

Hi everyone! Overall, I think this is a very clear and concise description of tf-idf, and I’m thankful to have it for my own reference! In particular, I think it works well to introduce someone to tf-idf both for its own sake and as a precursor/part of other methods – or, in other words, it certainly fulfills the stated purpose of the tutorial.

I also appreciate that @mjlavin80 identifies not just where the methods can be used for exploration, but also some of the assumptions and pitfalls to watch out for. There is more praise for particular elements that jumped out at me in the running commentary below, interspersed with some typos that I noticed and some other suggestions that may help improve the tutorial.

Happy to answer any follow-up questions, etc, if anything is unclear!

P1 – expanded definition of acronym tf-idf earlier?
P1 - “informational retrieval” sounds awkward to me – would information retrieval not work instead?
P2 – The flow between the first and second sentence is a bit fuzzy. What exactly is “This longstanding issue”? This ambiguity also makes the “summary” quotation unclear.
Suggested Prior Skills – I wonder if it is worth noting here, or perhaps (probably?) later in the tutorial, how useful pandas is, either for the tutorial or for doing more work with the results. For example, if someone wanted to try creating their own top 20 tf-idf terms chart, it would be much faster to use pandas than to just use Excel (which is listed as a suggested prior skill). Perhaps this would mean linking to another tutorial / the pandas documentation as a resource? This might be going beyond the scope of the tutorial, but I wanted to float it as an option for people who get a bunch of csvs and then want to figure out how to do comparisons.
P3 – which year was scraped? (I know they're not from that year, so to speak, but still interesting - what if the tf-idf changed from year to year? I'm also always very curious as to where exactly the data comes from)
P4 – typo? It represents, on some level, how the questions inclusion and representation
P4 – typo? In march 2018,
P3 / 4 – structure – The final sentences of P3 seem to flow naturally into the later sentences of P4 – ie, “It represents on some level, how the questions [of?] inclusion and representation might affect both the decision…” The restructuring might help with the vague “it” at the beginning of that sentence, which at first I thought implied the replacement with the sleeker blog rather than the corpus more generally. Perhaps what is throwing me off is the second sentence of P4, about the version, which seems to break up the otherwise natural flow of the paragraphs.
P4 – The end of the paragraph about the contents of the dataset (“The dataset includes”) should be a new paragraph, to make it easier for people to find when skimming/reviewing the lesson.
P4 – For “The lesson files are located at https://github.com/mjlavin80/tf-idf-programming-historian.”, is it best to download the whole repo/project or just certain files? What exactly is meant by “the lesson files”?
P4 – Name the folders explicitly – ie, “The obits folder consists of .html files”, the “txt folder…”, rather than just saying “it includes a folder of .html files”
P6 – The first part of the Spärck Jones quotation made sense to me, but after the ellipses it actually started to muddle me up a bit. Also, is Spärck Jones the first significant proponent of tf-idf? Would it be relevant to say something like, “tf-idf became popular (?) after Spärck Jones pointed out in a 1972 paper that etc etc”?
P7 – conciseness - weekend in a city that you’ve never visited before > new city?
P7 – typo - that’s just find
P11 – typo - her investigate journalism (though I did not know much about Nellie Bly before this and then went and read her Wikipedia page! Wow! Also, appreciate the foregrounding of women and African American folks as examples)
P 19 – “and the second new column multiplies the Count column to derive the final tf-idf score. – clarification – “multiplies the count column and idf column”
P 13 / 19 - The discussion of the effect of normalization is helpful (“adding 1 to the final idf value ensures…”), but something (even just a sentence) about the purpose of normalizing (either at the end of P 13 or 19) would help audience understand why this algorithm and others normalize (a quick google search doesn’t bring up an appropriate relevant answer for an intro audience).
P20 – following on the above, “adding 1 to the final idf value ensures that we will never multiply our Count columns by a number smaller than one.” < which is bad because?
P22 – typo - from the inside the lesson folder
P23 – typo - method from Python’s to generate a list
P25 – typo - The to most common variable naming patterns
P26 – for the code block and expected result, the print does not return ‘0101.txt’, but 'txt\0101.txt', which may confuse some beginners (my notebook is in the root of the master folder)
P28 – At “specific parameters (I’ll say more about these settings later)”, could refer to the title of the section (“Scikit-Learn Settings”)
P28 – by “The stored variable X is output of the fit_transform() method.” - is transformed_documents meant here?
P29 – LOVE building in printing length to make sure everything shipshape!
P32 – “We can use the get_feature_names() method to” – perhaps worth specifying that this is a TfidfVectorizer method? (though considering it has vectorizer.get_feature_names() in the code block, perhaps unnecessary)
P33 – I’m unsure how Programming Historian handles differences between Windows/Mac/etc in its tutorials, but I’m running Windows, so the code block just wrote all the csvs into my txt folder. output_filenames = [txt_file.replace(".txt", ".csv").replace("txt/", "tf_idf_output/") for txt_file in all_txt_files]; instead, had to go in and modify txt\\ and tf_idf_output\\.
P37 – typo – if you the code excerpts
P39 – comment about bolding more helpful before the table
P42 – would love to see an example (even just a sentence, not even necessarily a full chart) of how the settings could produce different effects!
P42 – great example of close reading and Du Bois’s obit
P46 etc – all of the three methods are numbered as (1)
P50 – the following numbered list is all (1)
P53 – typo - It cn
P54 – another place where a brief mention of normalization earlier would help with clarification
P56 – greatly appreciated the references to how these methods are being used in more public-facing writing and projects
P57 – numbers all (1)
P58 – the keyness description is a little confusing – is tf-idf also a keyness test? Having the difference (That keyness gives a numerical indicator of atypical usage? Or that it is numerical?) earlier in the paragraph might be more clear.
P58 – grammar - A Chi-square test, for example, we can evaluate the relationship of a term frequency to an established norm
59 - espeicially
P59 – often want to run topic modeling on a corpus as a first step and, in at least some of those cases, tf-idf would be preferable. – as in, scholars should run tf-idf instead of topic modeling? Or as a pre-processing step for topic modeling somehow? By saying “in at least some of those cases”, does that specifically mean the bird’s eye view / exploration? A bit more clarity here might be helpful, especially for folks who are unfamiliar with topic modeling / new to these methods.
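On the Windows path issue raised at P26 and P33, one possible approach is to build the output filenames with pathlib rather than replacing "txt/" by hand. This is a hedged sketch rather than the lesson's code: the helper name csv_output_path and the two sample filenames are hypothetical, while the tf_idf_output folder name comes from the code quoted above.

```python
from pathlib import Path, PureWindowsPath


def csv_output_path(txt_file, output_dir="tf_idf_output"):
    """Map e.g. txt/0101.txt to tf_idf_output/0101.csv.

    Path joins with the separator of whatever operating system
    runs the code, so no "/" or "\\" is ever replaced by hand.
    """
    return Path(output_dir) / Path(txt_file).with_suffix(".csv").name


# Hypothetical stand-in for the lesson's list of text files.
all_txt_files = ["txt/0101.txt", "txt/0102.txt"]
output_filenames = [csv_output_path(txt_file) for txt_file in all_txt_files]
print([path.name for path in output_filenames])  # ['0101.csv', '0102.csv']

# PureWindowsPath parses backslash-separated paths on any operating system,
# which shows why the printed name differs between Windows and Mac/Linux.
print(PureWindowsPath("txt\\0101.txt").name)  # 0101.txt
```

If the file list is itself built with pathlib (for example with Path("txt").glob("*.txt")), every entry is already a Path object, and printing path.name gives the same bare filename on every platform.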

ZoeLeBlanc commented 5 years ago

Thanks @quinnanya & @broomgrass for these in depth reviews 👏 . I'll need a bit of time to summarize them but should have that done by next Monday April 8th. I'll ping you @mjlavin80 once I post them here, and then we can discuss the recommendations and determine a timeline for you to incorporate these changes into the lesson.

mjlavin80 commented 5 years ago

Roger that. Thank you @quinnanya, @broomgrass, and @ZoeLeBlanc!

ZoeLeBlanc commented 5 years ago

Just a quick update! Apologies for the delay; I came down with a bad case of allergies over the last few days. Currently working on consolidating both reviews and should have it here by Saturday April 13. Sorry again!

mjlavin80 commented 5 years ago

No problem @ZoeLeBlanc. Next week is our last week of classes, so I probably won't be able to look at it until Friday, April 19 anyway.

ZoeLeBlanc commented 5 years ago

Thanks @broomgrass and @quinnanya for such comprehensive reviews! I've compiled and consolidated both their reviews, and organized them by paragraph. @mjlavin80 if you have any questions or concerns feel free to message me here. I realize there's a lot of feedback to incorporate, so would a deadline of early June to address these suggestions and fix any typos make sense? If you need more time please let me know. Looking forward to getting this lesson published 🎉

To Dos:

[x] Move example code (and potentially a Jupyter notebook with the code) into a folder in the assets directory
p1:
- [x] the first time using TF-IDF use the full Term Frequency - Inverse Document Frequency for clarity
- [x] replace "informational retrieval" with "information retrieval"
- [x] I would recommend cutting the line "If you’ve gotten this far,"
p2: it sounds like the opening quote is a bit confusing. I'm a bit more agnostic on this point, so if you feel strongly about keeping the quote you can, though I think Quinn's suggestion is helpful here:
- [x] "It might help to paraphrase the quote and append it to the previous sentence, e.g. "a bigger subject in computationally oriented disciplines: how to identify words or phrases that characterize an author or genre" (and maybe include a citation to the Underwood post if you want there)."
Suggested prior skills:
- [x] replace "but tf-idf is available in many versions of Python and other programming languages." with "but there are packages [or could say libraries] that enable you to run tf-idf in many versions of Python and other programming languages"
- [x] Move "The precise level of code literacy or familiarity recommended is hard to estimate, but you will want to be comfortable with basic types and operations." to after "Code for this lesson is written in Python 3.6, but tf-idf is available in many versions of Python and other programming languages." and replace "basic types and operations" with "Python’s basic types (string, integer, float, list, tuple, dictionary), working with variables, writing loops in Python, and working with object classes/instances." then remove the second bullet point.
- [x] after "Experience with Excel or an equivalent spreadsheet application if you wish to examine the linked spreadsheet files." add "You can also use the pandas library in python to view the CSVs (footnote: See the sample jupyter notebook for examples.)"
p3/p4: Both Quinn and Catherine were a bit confused with the wording here and wanted more details. So I'm recommending the following edits, but there's a lot here so happy to discuss more in depth:
- [x] remove "relatively small" from the first line of p3: "Tf-idf, like many computational operations, is best understood by example. To this end, I’ve prepared a ~~relatively small~~ dataset of 366 New York Times historic obituaries scraped from https://archive.nytimes.com/www.nytimes.com/learning/general/onthisday/."
- [x] reference selected year for clarity in p3 "For each day of the year , The New York Times featured an obituary of someone born on that day. (There are 366 obituaries because of February 29 on the Leap Year.)"
- [x] change "The version of The New York Times “On This Day” website used for the dataset hasn’t been updated since 2011, and it has been replaced by a newer, sleeker blog located at https://learning.blogs.nytimes.com/on-this-day/." to "This dataset is from a version of The New York Times “On This Day” website which hasn’t been updated since 2011, "and it has been replaced by a newer, sleeker blog located at https://learning.blogs.nytimes.com/on-this-day/." move into a footnote for the first line of p3
- [x] Move to the beginning of p4: "This dataset is available at [PH repo]. The folder includes the central metadata.csv file with each obituary’s title and publication date. The original data is also available in the obituaries folder, containing the .html files downloaded from the 2011 “On This Day Website” and a folder of .txt files that represent the body of each obituary. These text files were generated using a Python library called BeautifulSoup, which is covered in another Programming Historian (see Intro to BeautifulSoup). The lesson files are located at [PH repo]."
- [x] Move to the second half of p4: "This obituary corpus is ~~also~~ an historical object in its own right. It represents, on some level, how the questions of inclusion and representation might affect both the decision to publish an obituary, and the decision to highlight a particular obituary many years later. The significance of such decisions has been further highlighted in recent months by The New York Times itself. In March 2018, the newspaper began publishing obituaries for “overlooked women”.1 In the words of Amisha Padnani and Jessica Bennett, “who gets remembered — and how — inherently involves judgment. To look back at the obituary archives can, therefore, be a stark lesson in how society valued various achievements and achievers. In short, this isn’t a representative sample of historic obituaries, it’s a recent collection [think this could be instead: "it's a sample of who the New York Times in X considered worthy of an obituary"].
[x] p6: Quinn suggests potentially dropping the quote from Spärck Jones and Catherine suggests dropping the section after the ellipsis. You could drop the paragraph and add Spärck Jones' name to p5. "First proposed by Karen Spärck Jones in 1972, Tf-idf stands..." and then move p6 to a footnote. Let's discuss these changes because if you feel strongly about keeping the quote in the main body of the lesson, I think we can make it work.
p7:
- [x] replace "in a city that you’ve never visited before" with "in a new city, called Idf City."
- [x] replace "that’s just find for your first goal" with "that's just fine"
- [x] I think this analogy needs an additional sentence or two to explain how it relates to TF-IDF, which is about trying to find something good and distinctive. I also like Quinn's idea of using a chain restaurant versus a local one to make this analogy a bit more concrete.
p11:
- [x] replace "her investigate journalism" with "her investigative journalism"
p12:
- [x] Quinn recommends moving the first line from p13 "Document frequency (df) is a count of how many documents from the corpus each word appears in. (Document frequency for a particular word can be represented as dfi.)" to the end of p12 for clarity of the table. Alternatively you could rename the table column Document Frequency (df).
p13: Quinn and Catherine both recommend giving a further explanation of normalization in TF-IDF here. I think we need to be careful to not get overly deep into the weeds but I do think explaining very briefly what normalization means in TF-IDF will help. I tried my hand at that below, but feel free to rewrite that sentence.
- [x] After "However, many implementations..." add the following sentence "Normalization is a statistical term for manipulating the distribution of data to account for outliers that might skew the results. In TF-IDF, normalization is generally used in two ways: first, to prevent bias in term frequency from terms in shorter or longer documents; second, to calculate each term's idf value (inverse document frequency)" then "For example..."
- [x] could also add a note that normalization parameters will be discussed later in the lesson
p19:
- [x] replace "multiplies the Count column" with “multiplies the count column and idf columns”
p20: Catherine recommended a bit more information about why scikit-learn adds 1 to the final idf value
- [x] maybe at end of "This effect is also the result of our normalization method; adding 1 to the final idf value ensures that we will never multiply our Count columns by a number smaller than one", add "which would potentially lead to skewing in our distribution of term frequencies". I realize this is a bit more complicated and part of the issue is dealing with Zipf's law and trying to normalize tf values for short and long documents, so if you want to give a different rationale feel free.
p22:
- [x] replace "from the inside the lesson folder" with "from the inside of the lesson folder"
p23:
- [x] replace "method from Python’s to generate" with "method from Python to generate"
p25:
- [x] replace "The to most common variable naming" with "The two most common variable naming"
p26:
- [x] replace "the first file to make sure it’s ‘0101.txt’." with "the first file to make sure it’s ‘txt/0101.txt’." to make sure students aren't confused by the root directory being part of the file name
p28:
- [x] Quinn asks about tokenizing and stop word removal for other languages besides English. You could add a line here that if using a language besides English you can specify the tokenizer and stop list in TF-IDF. Also noticed that in the code block you're using stop_words=None, but you only explain that under Scikit-Learn Settings, so maybe link to that section for explanation.
- [x] replace "so I instantiate it with specific parameters (I’ll say more about these settings later)" with "so I instantiate it with specific parameters (I’ll say more about these settings in the Scikit-Learn Settings section)"
p29: Quinn recommends moving the description of numpy array at the first line of p30 to before the code block in p29. I think it works either way because sometimes having the code first and then explaining can be helpful, rather than having explanation first and someone wondering what you're talking about. So I'll leave it to your discretion.
p32:
- [x] in "We can use the get_feature_names() method" add ", which is built in to the TF-IDF class,"
p33:
- [ ] Catherine rightly points out that some of the os commands won't work properly on a Windows machine. Let me double check with the rest of the editorial board over whether we want to include instructions for both, or just specify that this lesson is for Mac.
- [ ] Here's Catherine's comment: "I’m unsure how Programming Historian handles differences between Windows/Mac/etc in its tutorials, but I’m running Windows, so the code block just wrote all the csvs into my txt folder. output_filenames = [txt_file.replace(".txt", ".csv").replace("txt/", "tf_idf_output/") for txt_file in all_txt_files]; instead, had to go in and modify txt\\ and tf_idf_output\\."
p36: Quinn recommends explaining what a pandas dataframe is here. I think we want to avoid covering too much ground again, so we could either add Pandas to the list of required knowledge or just footnote to this lesson https://programminghistorian.org/en/lessons/visualizing-with-bokeh which explains pandas a bit more in depth.
p37:
- [x] replace "If you the code excerpts above," with "After you run the code excerpts above,"
p38:
- [x] replace "that these terms lists" with "that these term lists"
- [x] Quinn finds "but will not necessarily be definitive in terms to defending claims." a bit confusing. You could reword this to be "but will not necessarily produce definitive claims."
p39:
- [x] Quinn asks about NER to screen out individualized words and points out that you could link to the TF-IDF Settings section. Maybe add "using my tf-idf settings, or named entity recognition."
- [x] Catherine and Quinn recommend moving up the first line of p40 to before the table "I’ve used boldface to indicate terms that seem overtly related to authorship or writing." I prefer descriptions after figures so I'll leave this your discretion again
p42:
- [x] Catherine suggests adding an example of how changing the settings would produce different results. Quinn also requests something similar for p50 when you write "Instead, it’s best to understand exactly what each setting does so that you can describe and defend the choices you’ve made." I think this is a good idea, but this is an introductory lesson, so we should be careful in how much detail we include. Maybe you could add something about which types of parameters of the TF-IDF model you would alter to test the stability of your findings?
p46 & p50 & p57:
- [x] For some reason the examples are all listed as 1. Usually markdown handles incrementing numbers if you use 1, so maybe this is a github markdown issue?
p49:
- [x] Clarify what you mean by "Item 2"
- [x] Clarify and maybe directly reference what section you mean in "As I will show in the next section"
p51:
- [x] Add a sentence and link about where to find more information about the preset list of stopwords in scikit-learn
p53:
- [x] replace "It cn be" with "It can be"
p54:
- [x] Quinn and Catherine both recommend potentially adding more links to explain both smooth_idf and sublinear_tf. I'm not sure there are great resources for these settings besides Stack Overflow answers. If you can't find anything, I would say we can put this off for now and hopefully a future PH lesson will offer a more in-depth dive into normalization in stats.
p58:
- [x] Catherine recommends clarifying the relationship between tf-idf and keyness. Tbh I'm not familiar with keyness so I'm just posting her comment, here: "the keyness description is a little confusing – is tf-idf also a keyness test? Having the difference (That keyness gives a numerical indicator of atypical usage? Or that it is numerical?) earlier in the paragraph might be more clear."
- [x] replace "measures that attempt to indicated" with "measures that attempt to indicate" or just "measures that indicate"
- [x] replace "A Chi-square test, for example, we can evaluate" with "For example, with a Chi-Square Test we can evaluate..."
- [x] replace "P-value indicating probability of encountering" with "P-value indicating the probability of encountering"
p59:
- [x] replace "Tf-idf is espeically appropriate" with "Tf-idf is especially appropriate"
- [x] Quinn asks for some clarification for "I find that newcomers to digital humanities often want to run topic modeling on a corpus as a first step and, in at least some of those cases, tf-idf would be preferable." and whether you mean here that TF-IDF should be used instead of topic modeling or as a first step in topic modeling. I believe both are true, but that the point you're making is the former rather than the latter. I'm not exactly sure how we can clarify this further here without getting into the weeds of topic modeling, but I'll post the rest of Quinn's comment here too: "By saying “in at least some of those cases”, does that specifically mean the bird’s eye view / exploration? A bit more clarity here might be helpful, especially for folks who are unfamiliar with topic modeling / new to these methods."

mjlavin80 commented 5 years ago

@ZoeLeBlanc I just pushed a round of changes. I'm not sure I've resolved everything, but I want to ask you to look at what's there before I do more. Here's a summary of which edits I left incomplete and why:

Re: Jupyter Notebook. To "run a Jupyter Notebook from the inside of the lesson folder" I added ", copy/pasting blocks of code from this tutorial as you go." My assumption was that readers would begin with a clean notebook and fill it in. I think this is more likely to encourage going through the code step by step.
Suggested prior skills: replace "but tf-idf is available in many versions of Python and other programming languages." with "but there are packages [or could say libraries] that enable you to run tf-idf in many versions of Python and other programming languages"
Response: Here I was worried about clarity ... I mean to suggest that tf-idf is available in any combination of Python versions and several Python libraries, and also in other languages entirely. Here's how I rewrote it:

Code for this lesson is written in Python 3.6, but you can run tf-idf in several different versions of Python, using one of several packages, and you can find tf-idf implementations in various other programming languages.

p3/p4 I tried my best to do all the recommended changes here, but I got a bit lost. I also had to write a longer answer for what year the corpus represents (all this is now in the footnote)
p6 For the edit re: Karen Spärck Jones, I changed your version slightly because Jones didn't ever use the name tf-idf, but she did recommend the operations that became tf-idf
p20 I'm a bit worried about the word skew here since it has a specific meaning in statistics.
p23 "method from Python’s to generate" ... I cut "from Python's" entirely
p32 this suggestion might be redundant since a method is definitionally part of a class. How about "We can use the TFIDFVectorizer class's get_feature_names() method"?
p33 will defer to house style on issues of Windows compatibility
p42 re: moving explanation of boldface. I left this as is. It seems more like an interpretive move, so I think it works better after the table. The other p42 recommendation seems like a big edit to me.
p46 & p50 & p57: I think this is caused by using #### (h4) with numbered lists. I tried to fix it but I'll wait to see how the transformation looks.
p 58 I left "for example" in the middle of the sentence but added "with" as suggested
Other

I also updated the bibliography and re-numbered the endnotes. I'll want to double check those once everything else is set. Please do let me know what you think of all this. Thanks!

mjlavin80 commented 5 years ago

@ZoeLeBlanc I'm just pinging this thread in case you didn't notice my last post. I'm holding off on doing any more edits before I hear back from you. If you're just busy doing other things, apologies for nagging.

ZoeLeBlanc commented 5 years ago

@mjlavin80 sorry and thanks for pinging me! For some reason didn't see your reply in my Github notifications. Let me go through your response and I'll post an updated reply today. Thanks for getting back to me so quickly 👍

ZoeLeBlanc commented 5 years ago

Here's my feedback @mjlavin80 . Let me know what you think and if you have questions.

Re: Jupyter Notebook. Let me ask the editorial board about whether we want to have a Jupyter notebook with the code already pasted in the folder. I agree that copying and pasting is more likely to encourage going through the whole tutorial, but I'm not sure if we have a set policy already on this.
Prior skills looks good to me
Really like the changes to p3/p4 and think it's much clearer now with the footnote too. I would go ahead and tick off the remaining to-do boxes with these changes.
Think p6 reads great, and had no idea about term specificity, but actually really like that term too. Also appreciated the additions in p9 to the restaurant analogy, think that's a lot more clear now.
p20 agree that skew might be misleading here. I think part of the issue is that trying to briefly explain normalization and all its tradeoffs is difficult. I was trying to help you word something that explains how multiplying by a number smaller than one changes the overall distribution of the data. Maybe something along the lines of "which preserves the original distribution of the data". Happy to discuss the exact wording though if you have thoughts.
p23 👍
p32 "We can use the TFIDFVectorizer class's get_feature_names() method" sounds good to me.
I'll get back to you on Windows (along with Jupyter notebooks)
p42 agree that detailing how to test for stability is probably too much to cover here (might be a topic for a separate lesson even). Perhaps you could just add a parenthetical in p43: after "robust results should be stable enough to appear with various settings", add "(Some of these settings are covered in the "Scikit-Learn Settings" section.)"
Numbering seems to be fixed. Github markdown is so weird sometimes but thanks for finding that bug.
p 58 👍
I think the additional footnotes and references are really comprehensive so thank you for adding them!
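
On the p20 point above, a minimal sketch may help show why the added 1 matters. It assumes the smoothed idf formula given in the scikit-learn documentation (smooth_idf=True), and the function name smoothed_idf is only illustrative, not part of the lesson's code:

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Assumed formula (scikit-learn docs, smooth_idf=True): ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 366  # number of obituaries in the lesson's corpus

# idf for a rare term, a fairly common term, and a term found in every document
for doc_freq in (1, 183, 366):
    print(doc_freq, round(smoothed_idf(n_docs, doc_freq), 4))

# Even a term that appears in every document keeps a multiplier of at least 1
idf_values = [smoothed_idf(n_docs, df) for df in range(1, n_docs + 1)]
print(min(idf_values))
```

Under this assumption the multiplier never drops below 1, so a raw count is never scaled down, only left alone or scaled up.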

Final thoughts reading over:

[x] p1 I would add 'text mining' or 'nlp' to 'measures and models', just to specify what you mean here.
[x] p2. This sentence is now really long and I think a bit confusing. Instead would suggest the following edits: "Looking closely at tf-idf will leave you with an immediately applicable text analysis method. This lesson will also introduce you to some of the questions and concepts of computationally oriented text analysis. Namely, if you want to know what words or phrases characterize a set of documents, how can you isolate one document’s most important words from the kinds of words that tend to be highly frequent in all documents in that language. This topic is too large to cover in one lesson, and I highly recommend Ted Underwood's 2011 blog post in addition to this lesson.1"
[x] p4 "which is covered in another Programming Historian" add "lesson" here
[x] p5 "how the questions [of] inclusion and representation"
[x] p6 "The procedure later called" remove "later"
[x] p13 "These steps parallel scikit learn’s tf-idf implementation" change "Scikit-Learn" to be consistent in the lesson

These were the only changes I would recommend at the moment, but I'll read it closely again before we hit publish. I think the lesson is looking really great though and thank you for all your hard work!

Let me know if you have any additional questions, but I would go ahead and make these changes, check off the rest of the todos, and I'll get back to you about the Windows and Jupyter notebooks policy.

mjlavin80 commented 5 years ago

@ZoeLeBlanc I just pushed changes. I've checked off the rest of the to-do items, all except for those relating to a Jupyter Notebook and handling paths in Windows. Thanks again for all your help!

mjlavin80 commented 5 years ago

@ZoeLeBlanc one quick follow-up: if you or others are still mulling a solution for Windows and Mac file path compatibility in Python, has anyone looked at pathlib? It’s part of Python 3's standard library, and it seems very straightforward. I'd be happy to add it to my code.
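
For reference, a minimal sketch of what file discovery with pathlib might look like. The temporary folder and the file names below are stand-ins built for illustration, not the lesson's data, and the explicit sort is an assumption added so that the first file is predictable:

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the lesson's txt folder, built in a temp directory
with tempfile.TemporaryDirectory() as tmp:
    txt_dir = Path(tmp) / "txt"
    txt_dir.mkdir()
    for name in ("0102.txt", "0101.txt", "notes.md"):
        (txt_dir / name).write_text("placeholder", encoding="utf-8")

    # rglob() does not promise any particular order, so sort explicitly
    all_txt_files = sorted(txt_dir.rglob("*.txt"))

    # as_posix() renders forward slashes regardless of operating system
    relative_names = [p.relative_to(tmp).as_posix() for p in all_txt_files]

print(relative_names)
```

The same calls should behave identically on Windows and Mac, since pathlib picks the right separator for the platform.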

ZoeLeBlanc commented 5 years ago

Thanks so much @mjlavin80 for checking off these todos. I'm going to read through the lesson once more today but I think we're pretty close to having this go live.

I consulted with the board and I think the conclusion is that we would like a Jupyter Notebook to be included in your lessons files zip that includes the code of the lesson. Having just checked out pathlib (which I didn't know about so thanks for the rec!), it seems like a much more interoperable library than os. I think if you just change the following code blocks in the lesson and push up a notebook with all the code, we should be good to go.

Let me know if anything doesn't look right, and thanks for all your hard work!

P25

import os
all_txt_files =[]
for root, dirs, files in os.walk("txt"):
    for file in files:
        if file.endswith(".txt"):
            all_txt_files.append(os.path.join(root, file))
# counts the length of the list
n_files = len(all_txt_files)
print(n_files)

Should become 👇

from pathlib import Path
all_txt_files =[]
for file in Path("txt").rglob("*.txt"):
     all_txt_files.append(os.path.join(file.parent, file.name))
# counts the length of the list
n_files = len(all_txt_files)
print(n_files)

and P33

import pandas as pd
import os

# make the output folder if it doesn't already exist
if not os.path.exists("tf_idf_output"):
    os.makedirs("tf_idf_output")

[rest of code should be fine]

Should become 👇

import pandas as pd
from pathlib import Path  # needed here if this block is run on its own

# make the output folder if it doesn't already exist
Path("./tf_idf_output").mkdir(parents=True, exist_ok=True)

mjlavin80 commented 5 years ago

@ZoeLeBlanc I tweaked one thing in your suggested code blocks above, as your p25 edit still contained a reference to os.path.join(). os can otherwise be removed entirely from the lesson, so I did the following:

from pathlib import Path

all_txt_files = []
# walk the txt folder recursively and collect every .txt file as a Path object
for file in Path("txt").rglob("*.txt"):
    all_txt_files.append(file.parent / file.name)
# counts the length of the list
n_files = len(all_txt_files)
print(n_files)

I tested this in a Jupyter Notebook on my Mac, but I'm guessing someone should try it all on a Windows machine.
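
One way to sanity-check the path handling without a Windows machine is pathlib's pure path classes, which only manipulate path text and never touch the file system. The sketch below is an assumption about how the output filenames could be built, not the lesson's actual code, and output_name is a hypothetical helper:

```python
from pathlib import PurePosixPath, PureWindowsPath


def output_name(txt_path, out_dir="tf_idf_output"):
    """Map txt/0101.txt to tf_idf_output/0101.csv in the same path flavour."""
    # type(txt_path) keeps the flavour (Posix or Windows) of the input path
    return type(txt_path)(out_dir) / txt_path.with_suffix(".csv").name


posix_result = output_name(PurePosixPath("txt/0101.txt"))
windows_result = output_name(PureWindowsPath("txt\\0101.txt"))

print(posix_result)    # forward-slash separator
print(windows_result)  # backslash separator
```

Because the folder and file name are handled as path parts rather than as substrings, there is no "txt/" versus "txt\\" string to replace by hand.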

Lastly, I changed the text of my lesson to refer to pathlib instead of os.walk(), and I changed how I described using a Jupyter Notebook to make reference to the one in the lesson files '.zip' file. I believe that's everything! Thanks!

ZoeLeBlanc commented 5 years ago

Two quick questions @mjlavin80:

Do you have a short bio I could use for your author information? I could also just condense your bio from here http://www.english.pitt.edu/person/matthew-j-lavin but wanted to ask your preference first.
We need a short abstract for the lesson. What do you think of this? 👇 "This lesson focuses on a foundational natural language processing and information retrieval method called Term Frequency - Inverse Document Frequency (tf-idf). This lesson explores the foundations of tf-idf, and will also introduce you to some of the questions and concepts of computationally oriented text analysis."

This is an example abstract from a recent lesson: "Machine learning and API extensions by HathiTrust and Internet Archive are making it easier to extract page regions of visual interest from digitized volumes. This lesson shows how to efficiently extract those regions and, in doing so, prompt new, visual research questions."

Feel free to propose changes to the abstract and I'll be pushing up the branch tonight on the main jekyll site 🎉 Thanks!

mjlavin80 commented 5 years ago

@ZoeLeBlanc the abstract looks great. For my bio, since they tend to run shorter than what's on my department page, how about this: "Matthew J. Lavin is a Clinical Assistant Professor of English and Director of the Digital Media Lab at the University of Pittsburgh. His current scholarship and teaching focus on book history, cultural analytics, turn-of-the-twentieth-century U.S. literature and culture."

ZoeLeBlanc commented 5 years ago

Perfect! Thanks so much Matt!

I'm adding a few links and copy edits but your lesson should hopefully be live by the end of this week 🎉

ZoeLeBlanc commented 5 years ago

Sorry for the delay everyone! My computer died this last week and so it's been a frantic haze of data recovery.

But great news: the lesson is officially live and tweets go out tomorrow!!!! Thank you so much @mjlavin80 for writing this excellent lesson, and @quinnanya & @broomgrass for being such generous reviewers! Also HT to @acrymble for giving me guidance throughout this process.

Official link is here https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf and please feel free to tweet & retweet 🎉

ZoeLeBlanc commented 5 years ago

@mjlavin80 do you still have a twitter account? If you could send your handle that would be great but no worries if you don't have one at the moment. Just wanted to check before I started tweeting! Thanks!!!

mjlavin80 commented 5 years ago

@ZoeLeBlanc I'm not on twitter anymore, but I'm happy to help promote the piece in other ways!

ZoeLeBlanc commented 5 years ago

No worries @mjlavin80! Totally understand and just wanted to double check in case your handle had changed. I'm happy to tweet about this lesson since I think it needs to go on every DH syllabus ASAP 😄

acrymble commented 5 years ago

@ZoeLeBlanc is this ready to close? Is the editorial checklist fully complete?
