Review Ticket: Exploratory Data Analysis of a Correspondence Text Corpus

acrymble commented 7 years ago

The Programming Historian has received the following tutorial on 'Exploratory Data Analysis of a Correspondence Text Corpus' by @zoews. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/exploratory-data-analysis-with-nlp

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I will first read through the lesson and provide feedback, to which the author will respond.

Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @ianmilligan1 or @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

With this review ticket, I close the corresponding 'proposal' ticket: #91

acrymble commented 7 years ago

@zoews I've had a chance to read through the lesson, but not try the code (I can't download 2GB at the moment because my internet is running off my phone).

I really liked this tutorial, and I think it's a fantastic example of using digital approaches to help directly with the research process. I also hope we'll see your published research based on these approaches in the near future.

I've got a few comments and questions, and a number of (often minor) suggestions that I'd like you to address before we get the formal reviews.

Firstly, regarding the ethics of the corpus:

1) What is the legal status of the Enron corpus? This material should be under copyright based on my understanding. Is there something specific that has occurred that would allow us to host a subset of this material without infringing on copyright?

2) Similarly, these are presumably in most cases living individuals. What are the ethical implications of sharing and using this corpus as you have done?

Secondly, regarding the scope of the lesson:

3) This feels like 2 lessons to me. The first teaches users how to conduct and interpret a sentiment analysis on a single or a small number of texts. The second teaches the user how to combine network analysis and sentiment analysis to understand how variables affect sentiment. I suggest splitting the lesson at paragraph 74. It would also address my queries about the opening paragraph being too broad (below).

More specifically:

[x] The opening paragraph needs work. At the moment you're trying to do too many things. You might consider cutting the sentences in p1 entirely, and leading with p5. You can explore data in so many ways (summary statistics of various kinds, etc). From reading the tutorial, I felt I learned how to conduct sentiment analysis, which is useful for exploratory data analysis. I don't feel that I learned how to conduct exploratory data analysis (which could be any number of things).
[x] p1 - You allude to the Enron crisis, but better to be explicit and link to Wikipedia. Many people won't know what it is.
[x] p1 & p104 - please drop the footnote and convert this to an 'acknowledgements' at the end.
[x] p1 - thinking about sustainability, can we get a clear list & links here at the end of the first paragraph that outlines all dependencies and things people will be able to download?
[x] p2 - interesting examples. Are there real published research examples that you could link to rather than hypotheticals? Or even blog posts of work in progress?
[x] p3 - what are my three academic personas? I found this paragraph confusing. Are you saying that you're going to help me understand what questions might be worth asking of my corpus?
[x] p4 - can you link to any of these tutorials?
[x] p5-p6 - this would be a better p1. Good explanation.
[x] p8 - can you link to the arXiv in addition to your reference?
[ ] p10 - can you link to some of this historiography? Just a couple of examples would be awesome.
[x] p30 - I love your example of Vader working. But can you problematise sentiment analysis? Last time I tried something like this, it was terrible at double-negatives and unusual phrasing. I'm not not happy; I don't not hate her; My name is Mr. Frown etc. It's also crap at sarcasm. Can you show it failing too? Just so people don't blindly trust them? Or link to essential reading?
[x] p20 - links to where I can find this stuff would be good. How do I know I've done it right?
[x] p34 - I interpret this as a negative message overall. Is that what I'm supposed to be thinking?
[x] p39 - I'm not clear how it's a positive overall sentiment from the outputs. It looks more neutral to me. What do the numbers mean and how am I to interpret them? Sorry if I missed this.
[x] p41 - link to more info on english.pickle? Do I have it installed?
[x] p54 - you lost me on the 3rd example. Too much jargon.
[x] p55 - you haven't talked about the directory structure of the dataset. Flip 56 and 55?
[x] p59 - a couple of examples of these colon separated values? Just to refresh people.
[x] p60 - The code is well commented, but can you just briefly talk us through the steps the code will do before you show it to us? Talk out the algorithm.
[x] p62 - the output is really just a matrix, and conceptually (as far as I can tell) no different from a CSV or a spreadsheet. Can you help readers make that mental link?
[x] p63/64 - recap where we are in lesson progress before starting the new section. What's been done and what still needs doing?
[x] p67 - tuple will be jargon for most people. So will be key/value. Maybe beef up the conceptual idea that the data isn't in the format we want it, so we're going to change it. We do this through a process known as 'mapping'.
[x] p69 - same with slices. If you need people to be familiar with python terminology as a precondition of being able to do this work, then that needs to be explicit in the introduction. In either case, link these types of terms to Wikipedia or python website.
[x] p74 - should this now be a new lesson? It looks like a substantial new development that brings in network analysis as well.
[x] p79 - you really need to be confident in python to understand this lesson, based on this code block. That's fine but needs to be explicit.
[x] p86 - you will have a 'data' folder to store these types of things. You could also just introduce this as a code block, noting that you've already done the work. Have the reader save it themselves.
[x] p89 - error != break.
[x] p91 - you used the word 'substantial'. That raises the question: how does one measure significance? Is this a statistically different sentiment? That's important if we're translating this to publishable research.
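
One way to probe the p30 and p39 points is to run VADER on deliberately awkward phrasing and look at the raw numbers. The sketch below is illustrative only: the helper names and the fourth, sarcastic sentence are additions, not taken from the lesson. The 0.05 and -0.05 cut-offs are an assumption that follows the convention suggested by VADER's authors for reading the compound score, which runs from -1 (most negative) to +1 (most positive). It also assumes NLTK and its vader_lexicon data are installed, and prints a hint instead of a traceback if they are not.

```python
# Sentences a human reads easily but a lexicon-based scorer may misjudge.
TRICKY_SENTENCES = [
    "I'm not not happy.",
    "I don't not hate her.",
    "My name is Mr. Frown.",
    "Oh great, another three-hour meeting.",  # added sarcasm example
]


def label_from_compound(compound, threshold=0.05):
    """Turn a compound score (-1 to +1) into a rough label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def score_sentences(sentences):
    """Return (sentence, compound score, label) for each sentence."""
    # Imported here so label_from_compound can be used without NLTK installed.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sentiment_analyzer = SentimentIntensityAnalyzer()
    results = []
    for sentence in sentences:
        scores = sentiment_analyzer.polarity_scores(sentence)
        compound = scores["compound"]
        results.append((sentence, compound, label_from_compound(compound)))
    return results


if __name__ == "__main__":
    try:
        for sentence, compound, label in score_sentences(TRICKY_SENTENCES):
            print(f"{compound:+.3f}  {label:<8}  {sentence}")
    except (ImportError, LookupError) as error:
        # ImportError: NLTK itself is missing. LookupError: its data is missing.
        print("NLTK or the vader_lexicon data is not available:", error)
```

Comparing each printed label with a human reading of the sentence gives a quick sense of where the tool should not be trusted blindly.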

--

Can you take a look at these issues and get back to me, so we can come up with the best way forward?

cassws commented 7 years ago

@acrymble : thank you so much for your feedback! I'm glad to hear of your overall impression and happy to elaborate on the issues you raised:

The Federal Energy Regulatory Commission initially released the e-mails as a public disclosure in its Enron investigation, intended for public record/public domain data. There were a number of privacy issues in the initial release, and after a back-and-forth process they were re-released, cleaned again, and distributed via CMU in 2004. This release appears to be the closest thing to that original "public domain release" available online (the original uploads appear broken). The CMU release is the source of its use in a number of studies and hosting on mirror sites, as researchers commonly cite the release in their methods and works cited. CMU describes the dataset here: http://www.cs.cmu.edu/~enron/

According to CMU, "I am distributing this dataset as a resource for researchers who are interested in improving current email tools, or understanding how email is currently used." I think this is the standard that our use would have to fall under, and I think it's appropriate to both use the Enron emails generally and host the subset on our site for this purpose. The lesson would also function as is without a hosted subset, which I am 100% comfortable with, but would require sustained stewardship to make sure the URL remains active - CMU has maintained their hosting for 13 years, but that isn't a guarantee for the future of course.

One factor that would support the subset hosting use-case: we are not hosting a searchable raw-text version of the corpus that could be indexed by a search engine, but rather a compressed file version that must be downloaded. I noticed a comment thread on the blog for Enron-Mail.com, the primary browsable/searchable mirror site for the corpus, where the CMU team lead expresses disapproval towards hosting indexable text versions that could be picked up by search engines (but CMU continues to link to enron-mail.com on their site, and it doesn't seem to be a legal issue, fyi): https://www.blogger.com/comment.g?blogID=1117717936174889592&postID=2942910070405514601

It would also be possible to contact CMU directly to discuss their thoughts if appropriate.

I definitely agree this part is important to consider carefully. I tried to limit my discussions of identifiable individuals in the lesson generally, and when I did discuss somebody, I picked folks who had been central to the investigation. I also generally limited analysis to the 'sent' folders of individuals who had been singled out by the FERC, again to avoid folks who appear only incidentally/peripherally in the corpus.

There's also a large lineage of articles and studies that identify individuals to varying degrees, and have passed a wide variety of IRBs. If there's an ethical issue, saying "others have done it too!" is not sufficient of course, but from the perspective of minimizing harms I do not feel there is anything in this lesson that would cause new damages to individuals discussed.

If we did want to further minimize discussion of actual individuals, we could take steps to remove the identifying information I've included in the text of the lesson when I've described the job title of somebody involved, etc.

Splitting into two lessons/a two-part lesson works for me if that makes sense for the Programming Historian lesson format. I can help with that process, either now or at a later point, just let me know!

Thank you, Zoë

cassws commented 7 years ago

I will also go through and make the paragraph-by-paragraph edits but it will take me a few more days -- I'll plan to complete them by end of Monday of next week.

acrymble commented 7 years ago

I do think 2 lessons here is a better idea. I can see value for some people wanting one skill and not the other. I hope that isn't too much extra work. It looks like a fairly clean break to me. We can either progress with one at a time or both concurrently if you'd like. I'll have to think about what that means for reviewers.

cassws commented 7 years ago

Okay definitely, that makes a lot of sense. That would also help me be a little more deliberate about the difficulty/complexity level for each part, as you noted with Python code along the way.

It sounds like, regardless of whether you decide on concurrent or sequential editing processes, the next step is still to make the edits you've mentioned above -- I will follow up end of Monday with those and alert you when they're ready!

acrymble commented 7 years ago

Just wanted to check in on progress. Can you give me an update on how you're making out?

cassws commented 7 years ago

Hi @acrymble ,

I finished making my way through your edits above.

I also divided the lesson into a Part 1 and Part 2, but I'm not sure whether it works split out in that way. Is the goal to allow a reader to start with Part 2 if they choose to? I don't feel like there will be enough context to jump in halfway. However, if the goal is to help make the lesson more readable and approachable for a reader who still expects to complete both parts, or expects to complete just the first part, then I do feel the separation would work well.

Regardless, I feel that it's ready for the next step of review, pending any other modification you would like me to make. Happy to move ahead in whatever way makes sense!

acrymble commented 7 years ago

Are these both available at the same link, or did you upload a new file for the second one?

cassws commented 7 years ago

@acrymble Both at the same link. For now I just separated them with Part 1/Part 2 headers and some explanation in the introduction.

acrymble commented 7 years ago

I'm just reaching out to reviewers. I'll post an update when I've got them committed.

acrymble commented 7 years ago

Two reviewers have agreed to offer formal reviews of the lesson by 11 October 2017. Once they have been posted I will summarise a path forwards and we'll go from there @zoews

cassws commented 7 years ago

Oh fantastic! Yes please let me know if there's anything I can help with along the way, and otherwise I'll plan on jumping back into the project then. Thanks @acrymble !

alsalin commented 7 years ago

Many thanks for providing me the opportunity to review this lesson! Before getting started, please let me know if anything is unclear. First of all, I think the tone of the lesson is fabulous. It hits the mark of explaining a method without relying too much on the "how to" over the "why."

Just a few comments with respect to sustainability and clarity.

I feel that for a user new to exploratory data analysis, the section in P7-12 would be better served up front. As someone new to this form of working with data, I wanted to know why I should be reading this lesson upfront. The first few paragraphs on specifics of software caused me to get lost on what exactly I'd be doing with the lesson. Again, this is from someone distinctly unfamiliar with NLP. While it may not make sense to move up the entire section, it might be helpful to provide a little more of an introduction to the applications within the first paragraph or so.
In P21, it's unclear what "step 2" refers to.
For skimmability, it might serve the reader best to use natural language processing (NLP) on first mention in each section and then NLP thereafter within a section
"natural language processing" is sometimes lowercase and sometimes in title case
P26 "NLTK toolkit" should be spelled out (it's been a while since we've seen this acronym)
P27 "Natural Language Toolkit, or NLTK," -> Natural Language Toolkit (NLTK)
P27 While you have the version number for Python, can you also provide your working version numbers for the toolkit and for pandas? This information should be repeated at the beginning of the lesson as a note to the user/reader.
In P36 we jump directly into coding - it might be helpful to direct the user to basic Python documentation they might need to get started if they are completely new to working with Python.
P58 yay for the note about permanently hosting the files on PH
P74 typo in "Succes"
Part 1 is a strong, methods-based lesson. It does not read as a quick read-through for tips on working with the toolkit, and I think that's fine. Great even! It's a deep dive into why you would work with such methods and the ways to do so.
Not sure that splitting the lesson at P86 works with the flow - there's so much of the lesson before p86 and part 2 seems to be more of a wrap up and next steps. I wonder if it might be possible to add a TOC to this lesson (similar to https://programminghistorian.org/lessons/basic-text-processing-in-r)? Rather than splitting it into 2.

From a sustainability perspective, this lesson is great and does not have many issues other than remarking about version specifics here and there. My only main concern was getting the reader introduced to the "why" sooner in the introductory section.

acrymble commented 7 years ago

Thanks @alsalin, for such a prompt review. We're just waiting on one other. In order to make sure both reviewers get a chance to address the same piece of writing, I'll ask that you don't make any changes yet @zoews

acrymble commented 6 years ago

Just an update that our second reviewer has asked for a couple of extra weeks to write the review. We are hoping for a report in early December.

CzarinaChalid71 commented 6 years ago

Hi Zoe,

Sorry for the delay in submitting this review. This write-up of yours is a commendable one in text analysis studies. However, I would like to suggest certain things as follows:

It would be clearer for the uninitiated readers to have a brief explanation on what the various technical jargons mean. Your opening paragraph is rather broad. Segmenting the paragraph to perhaps one main idea in each would be more helpful to readers.
Since this is a two-lesson tutorial, the lessons should be separated and taught in two distinct sessions.
Exploratory studies should still be guided with a motive on the outset. It would serve as a helpful guide to the readers in following this tutorial. The motivation for the deployment of this tool is rather too general and lacks direction.
para 1 - a bit more detailed description about the Sentiment Analysis, being the key analytical tool, would be helpful.
Since this is a tutorial, a more elaborated self-explanatory step-by-step procedure for performing the analysis would provide readers with a clearer understanding. Additionally, links to the tutorials may assist the readers better.
The corpus that you made the subject matter of your paper i.e. Enron crisis may be elusive or even unknown to some. Perhaps, once again, either you provide some links or description about the corpus or take some other more common corpus or cases.
para 27 - the mentioning of Python 3, NLTK and Pandas is rather too abrupt. I assume this tutorial is meant for advanced learners of these tools, isn't it?
para 36 - Abrupt discussion on coding. Newbies to Python would appreciate it more if they are guided to work with the fundamentals of Python before dealing with somewhat more advanced material.
para 62 - the initiation into the sub-topic 'Data Structuring' is too abrupt. Here, the layout of the entire tutorial would be more organised if readers are supplied with signals like numbering for each topic and sub-topic.

Review summary: Overall, this paper is interesting. Presentable and informative.

acrymble commented 6 years ago

Thanks so much for the review. @zoews please give me a few days to consider these and I'll write up a suggested path forward that you can respond to. I will hope to have this by the end of the week!

cassws commented 6 years ago

Sounds great -- thank you so much for your edits @CzarinaChalid71 (and yours as well @alsalin !)

acrymble commented 6 years ago

Thanks to our reviewers. It looks like the advice is fairly straightforward and practical to implement. I'll let @zoews respond to each point in the final draft.

I am still of the opinion that this should be 2 tutorials, because I think people will want to know about sentiment analysis more than exploratory data analysis. From a Programming Historian perspective it also raises problems of where we would put a 'Sentiment Analysis' lesson since this one exists (but isn't framed as obviously a skills lesson in Sentiment Analysis).

If you want to discuss that further, I think it's worth us talking via Skype so we can have a conversation. Otherwise, I think at this stage we should drop Part 2 entirely, and polish up this lesson and publish it. Then return to part 2 as a new lesson with a new intro, and cross-reference the two so that readers who want to learn more can read on. But we can get Part 1 out fairly soon, so no need to do both at the same time.

Let me know what you think. I realise we're coming up on the holiday season, but if you can get this done before 15 January that would be great. Let me know if that's unrealistic.

cassws commented 6 years ago

@acrymble I definitely agree with you about 2 tutorials, especially with a little distance from the first draft (it's a lot to jump into as one continuous lesson!)

The timing is actually quite good for me -- I will be able to focus on this from the 11th-20th of this month, and I will have extra time early January if necessary. My goal will be to get it to you by 20 December if at all possible, and certainly by 15 January at the latest.

Perhaps we could have a Skype conversation next week (the week of the 11th-15th) just to make sure I define the scope of the sentiment analysis part appropriately? If that works please shoot me an e-mail and we can narrow down a time. But yes, this is a great starting point and I will update as I go.

cassws commented 6 years ago

Hi all! @acrymble @alsalin @CzarinaChalid71

Thank you again for your detailed feedback. I decided to rewrite significant portions of the lesson in order to address some of the big-picture feedback that came up in all the comments. Here are the big picture changes I made followed by specific changes as per your comments:

Split into two sub-lessons

I reorganized the lesson into the following two lessons (with links to drafts):

Exploratory Data Analysis with Natural Language Processing (NLP), Part 1: Sentiment Analysis (link to second draft on Programming Historian)
Exploratory Data Analysis with Natural Language Processing (NLP), Part 2: Modeling and Exploring a Text Corpus (link to second draft on Programming Historian)

Introduce the concept of exploratory data analysis and its relevance more effectively

Both reviewers also mention issues with the explanation of exploratory data analysis and the significance of these methods (e.g. "I feel that for a user new to exploratory data analysis, the section in P7-12 would be better served up front.", "Exploratory studies should still be guided with a motive on the outset. It would serve as a helpful guide to the readers in following this tutorial. The motivation for the deployment of this tool is rather too general and lacks direction.").

As a result, I rewrote most of the introductory sections of both parts entirely. My goal was to make the goals of the lessons much clearer, and introduce exploratory data analysis in a way that leads naturally into these particular NLP methods. As @alsalin recommended, I tried to pose and answer the "Why?" question earlier and more often in both lessons.

Make language clearer and more beginner-friendly

The issue of clarity and sufficient explanation came up repeatedly (e.g. "It would be clearer for the uninitiated readers to have a brief explanation on what the various technical jargons mean.").

I addressed this in several ways. First, I rewrote explanations of coding choices (e.g. discussions about implementing sentiment analysis, setting up the DataFrame, etc.) with an emphasis on using more plain writing and also giving more time to explain and contextualize the design choices.

I also broke up the code with more comments, especially in Part 1. I tried to ensure a beginner in coding could complete Part 1 successfully, and that Part 2 gave enough context for a beginner-to-intermediate coder to understand and succeed in, as opposed to requiring an advanced-level understanding of Python. Part 2 is still hard, but hopefully there is enough context for the reader to make it through. I may need to add more comments, however, so please let me know if you'd recommend that still.

Specific changes

I also tried to address each individual change not covered above:

Used abbreviations (NLTK, NLP) more consistently
Added version number information for all Python libraries
Linked to installation directions, as well as other Python lessons on PH
Wrote more elaborate comments for code and separated code into smaller segments at times.
Uploaded small Enron excerpt to PH GitHub. Still currently linking full Enron dataset to an external site, but will explore long-term hosting solutions before final publication

I hope these changes bring parts 1 and 2 closer to publishable! Please let me know if there are any additional changes I can make, and I'll also keep rereading and thinking about these lessons as well. Many thanks for your effort to help make these lessons better!

Warmly, Zoë

acrymble commented 6 years ago

Thanks @zoews I will try to read asap.

acrymble commented 6 years ago

Ok @zoews I've had a chance to go through and do copy-edits on the first lesson. I'm a heavy-handed copy editor, which I hope you won't take personally. The new file is available at:

http://programminghistorian.github.io/ph-submissions/lessons/sentiment-analysis

If there are any bits I've broken or otherwise made more confusing or wrong, let me know.

There are a few outstanding issues I need you to help overcome.

[x] LN 32 - please provide full reference to Tukey quote at end of document.
[x] LN 52 - we need proper referencing for these Enron history claims. These are living people so libel laws apply if the text is wrong.
[x] LN 56 - Harkin quote, full reference please at end of document.
[x] LN 60 - can you give specific references to studies in case someone wants to go read them? Could be footnoted and added to the end of document.
[x] LN 94 - is this appropriate for non-English text? (we do Spanish translations) Medieval text? Early modern text? 19th century text? early 20th century text? Just a brief warning or comment is fine.
[x] LN 108 - I don't see any advantage to shortening SentimentIntensityAnalyzer() to sid. This is just an extra thing people have to understand. Please keep the long version and update the text and code appropriately. Also please make sure variable names are meaningful. I don't know what 'sid' or 'ss' stand for. Please make updates.
[x] LN 143 - when I run this I get an error:

File "sentiment.py", line 2, in <module>
from nltk.sentiment.vader import SentimentIntensityAnalyzer
ImportError: No module named nltk.sentiment.vader

I followed the instructions on the NLTK website as instructed.

[x] LN 153 - the word 'terminal' will be jargon for most people.
[x] LN 154 - please don't use non-meaningful variable names such as 'k'. Make the code human readable.
[x] LN 178 - 'LINK HERE' presumably should have a link there.
[x] LN 178 - you're assuming knowledge of using the command line to navigate the file structure and execute python files.
[x] LN 178 - by this point I've already failed to get the code working. If other users get this far and then it doesn't work they'll be angry and unable to debug. Make sure they're executing at each step along the way so they know exactly what causes an issue (if anything)

cassws commented 6 years ago

Hi @acrymble -- I just wanted to quickly jump in to help debug the module not found issue. I forgot this would create an error!

NLTK requires you to download some modules/corpora/tools manually. To do this, you actually need to use Python 3 in the terminal (by typing python3 or python depending on how Python is installed, or using an IDE if relevant; this is another thing I'll need to figure out how to explain). Once the terminal is running Python 3, the commands you use would be:

import nltk
nltk.download()

This second command should open up a GUI window that lets you pick which packages to install. I believe vader_lexicon is the correct one.

(You could also write a script just containing those two lines and run it with Python 3 if you don't want to mess with running Python 3 in the terminal.)
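
For readers who would rather not use the GUI, the script mentioned above can name the package directly. This is a minimal sketch, assuming vader_lexicon is indeed the package needed; the function name and the PACKAGES tuple are hypothetical, and the check for a missing NLTK install is an addition meant to give a clearer message than a raw ImportError.

```python
import importlib.util

# Assumption: vader_lexicon is the only NLTK data package needed at this step.
PACKAGES = ("vader_lexicon",)


def download_nltk_data(packages=PACKAGES):
    """Download each named NLTK data package; return True only if all succeed."""
    if importlib.util.find_spec("nltk") is None:
        print("NLTK is not installed, so there is nothing to download into yet.")
        return False

    # Imported late so the message above is shown instead of a traceback.
    import nltk

    # nltk.download(name) fetches one package without opening the GUI window
    # and returns True or False depending on whether it succeeded.
    results = [nltk.download(name) for name in packages]
    return all(results)
```

Adding a final line that calls download_nltk_data() and running the file with Python 3 performs the download; it needs an internet connection the first time.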

I hope this helps debug this part! I agree, it would be incredibly frustrating if the installation step breaks down entirely.

acrymble commented 6 years ago

Ok I'll leave you to integrate that into the lesson and to address the other points and then I'll read it again. I think this is the home stretch for this first lesson.

cassws commented 6 years ago

@acrymble sounds good -- I'll try to turn this around quickly (next day or two) and ping you back when done.

cassws commented 6 years ago

@acrymble I've finished making edits on this draft.

I had accidentally placed the Works Cited in Part 2 without moving the citations over to Part 1 as well, so I moved the relevant citations here.

I provided more context for the type of English texts for which VADER is appropriate to use, and also included a paragraph explaining options for multilingual sentiment analysis. The solutions tend to be fairly complicated and involve using a sentiment tool in conjunction with a translation tool, but I provided some context and links to studies that would hopefully give non-English researchers something to work with.

I added in additional examples and links to studies where appropriate.

After some trial and error, I realized that the "sid = SentimentIntensityAnalyzer()" step is actually necessary because SentimentIntensityAnalyzer is itself a class, and we must create an object that follows the class blueprint in order for our Python program to access the features. I don't want to go too far into object-oriented programming in this lesson, but thought it would be helpful to explain the core idea that Python requires us to create our Sentiment Analyzer tool as a single object, and then that object has all of the functions we need. Hopefully this explanation I added is not too jargony!
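
The class-versus-object point can be illustrated without NLTK at all. The toy class below is a hypothetical stand-in, not NLTK's implementation: its name, its three-word lexicon, and its scoring rule are invented purely to show why an object has to be created before its functions can be used.

```python
class TinySentimentScorer:
    """Hypothetical stand-in for a class such as SentimentIntensityAnalyzer."""

    def __init__(self):
        # Set up once, when the object is created from the class blueprint.
        self.lexicon = {"good": 1.0, "great": 2.0, "bad": -1.0}

    def polarity_score(self, text):
        """Add up the scores of any known words in the text."""
        return sum(self.lexicon.get(word, 0.0) for word in text.lower().split())


# Step 1: create an object from the class.
sentiment_analyzer = TinySentimentScorer()

# Step 2: call the object's functions.
message_score = sentiment_analyzer.polarity_score("good good bad")

# Skipping step 1 fails: Python binds the string to `self` and then
# complains that the `text` argument is missing.
try:
    TinySentimentScorer.polarity_score("good good bad")
    called_without_object = True
except TypeError:
    called_without_object = False
```

With NLTK the same two steps would read sentiment_analyzer = SentimentIntensityAnalyzer() followed by sentiment_analyzer.polarity_scores(text).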

I removed unnecessary mentions of the terminal/other jargon whenever possible and made the variable names human readable (great feedback).

I want beginner coders to be successful with this lesson, and I'm struggling to figure out how much of the process of writing code, saving it as a "script.py" file, and using python in the command line to explain here. I added the caveat that readers who don't have this background might want to complete an earlier Python tutorial first, as this is something that takes more explanation than a couple of sentences or paragraphs. If this was of interest, I could even write this as a new (mini-) lesson for PH -- the process is complicated as it is dependent on a lot of factors that I think are best handled with a longer guided step-by-step process to account for the variability. I did try to explicitly link to external tutorial links for learners struggling to follow along, and do so as clearly as possible.

Finally, I added the directions for installing the relevant VADER code for NLTK as a single three-line installation script the user can write and run. I realized that the lesson was already providing these instructions for downloading the english.pickle code later on in the lesson, and so I simply moved this code snippet and instructions up earlier and added the nltk.download('vader_lexicon') line as well.

I think that's it -- please let me know what next steps I can take, and thanks again for the feedback. ~z

acrymble commented 6 years ago

Thanks @zoews. I've gone over this again and can confirm I got all the code to work. I've also done a bit more copy editing, as a few new errors crept in during your latest edits (mostly replacing ss with scores), and adjusted some grammar.

The image isn't properly linked. Can I ask you to adjust that please? I've made you a new folder 'sentiment-analysis' to put it into in the /images folder.

I'll then look for an image for our icon and start working on the YAML header and publication bits and pieces so we can go live.

cassws commented 6 years ago

@acrymble Hmm I don't see the sentiment-analysis subfolder in images. Could you create it again? I uploaded the image file to images for now and can move it over then. Thanks!

acrymble commented 6 years ago

I managed to do it.

We'll use this image for the icon: https://www.flickr.com/photos/britishlibrary/11000942385/

@zoews can you send me a one-sentence bio that I can add? If you want to do a final read-through of this, let me know. Otherwise I think we can close this one.

Since it references the other lesson I think we'll have to wait to publish it officially so that internal references don't confuse readers.

acrymble commented 6 years ago

@zoews before I get stuck into the second one, can you go back over it and revise it based on the feedback and changes I made to the first one? In particular, look for jargon and non-human-readable variable names, and add in LOTS of Wikipedia links.

That will save me a lot of time in copy editing.

cassws commented 6 years ago

@acrymble Oh I love that icon! That's brilliant haha.

Here's my 1 sentence bio: Zoë Wilkinson Saldaña is a graduate student at the University of Michigan School of Information, where she focuses on the intersection of critical data literacies, academic libraries, and data-informed learning.

As for Part 2, I will start making those changes today and ping you when complete. I think I will also use this opportunity to slightly rewrite the code. The "average of sentence-based sentiment scores" approach I took is unnecessary and confusing, and not very typical relative to other studies I've read, so I think I'll revert to sentiment analysis of the entire message to make this clearer. This process may end up taking me a couple of days, as I'll be testing the code too. More soon,

Zoë

cassws commented 6 years ago

@acrymble I'm done with the newest draft of Part 2!

I ended up making pretty extensive edits to simplify the language and code, as per your suggestions. I broke up many code sections, changed variable names, and tried to explain the programming choices wherever possible.

I also made a few substantial content changes:

I simplified the mapSentimentScores function so that it now generates a single set of scores for each e-mail rather than averaging by line. This now reads much more clearly and also seems to have generated better results. I changed all Outputs to reflect this change.
I removed the gender analysis part (using possible_gender, calculating average sentiment by gender of recipient-sender). This approach made me uncomfortable for a variety of reasons, and also involved the clunkiest and most difficult to parse code of the entire tutorial.
Instead, I added a section where the reader does a deep dive into the most positive/negative relationships, discovers e-mails for each extreme, and explores the full e-mail texts directly. Moving from the big picture to a deep dive into specifics communicates the spirit of exploratory text analysis a little better, I feel, and also helps reinforce the concept of a pandas DataFrame and how to directly output interesting information once we discover it through big-picture methods. I feel much better about the arc of Part 2 with these changes, but please let me know what you think.
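
As a rough illustration of why scoring the whole e-mail can differ from averaging per-sentence scores, here is a minimal sketch. The scorer, the lexicon, and the function names (`toy_score`, `score_whole_message`, `score_by_sentence_average`) are all hypothetical and are not the lesson's mapSentimentScores code or VADER; splitting sentences on periods is also a deliberate simplification.

```python
# Toy lexicon, made up purely for illustration.
TOY_LEXICON = {"great": 1.0, "thanks": 0.5, "late": -0.5, "terrible": -1.0}


def toy_score(text):
    """Return the mean toy-lexicon weight per word, or 0.0 for empty text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(TOY_LEXICON.get(w, 0.0) for w in words) / len(words)


def score_whole_message(message):
    """Score the entire message in one pass."""
    return toy_score(message)


def score_by_sentence_average(message):
    """Score each sentence separately, then average the sentence scores."""
    sentences = [s for s in message.split(".") if s.strip()]
    if not sentences:
        return 0.0
    return sum(toy_score(s) for s in sentences) / len(sentences)


email = "Thanks. The report was terrible and late."
whole = score_whole_message(email)
averaged = score_by_sentence_average(email)
print(whole, averaged)
```

With this toy scorer, the one-word opening sentence counts as much as the longer sentence when averaging, so the averaged score comes out positive while the whole-message score is negative. That kind of distortion is one reason a single score per e-mail can be easier to interpret.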

Otherwise, I think it should be mostly ready! Many thanks for your time and efforts. I'm excited for this to get out there in the world.

Warmly, Zoë

acrymble commented 6 years ago

I've had a chance to read through the second lesson, and there are a number of things I think we need to discuss and work on. So I think the best way forward here is for us to slightly revise the first lesson to remove mentions of a series, publish that, and open a new review ticket where we can discuss lesson 2.

acrymble commented 6 years ago

This lesson is now ready for publication. It includes the following files:

/lessons/sentiment-analysis.md
/images/sentiment-analysis/sentiment-analysis1.png

I will make a pull request to the main site, at which point the lesson will be published.

acrymble commented 6 years ago

Thank you @zoews @CzarinaChalid71 and @alsalin, this lesson has now been published:

https://programminghistorian.org/lessons/sentiment-analysis

Please take a moment to share it with colleagues and on social media. This is a valuable contribution, and anything you can do to help us get the word out is greatly appreciated.

programminghistorian / ph-submissions

Review Ticket: Exploratory Data Analysis of a Correspondence Text Corpus #108
