programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions

Review Ticket: Exploratory Data Analysis of a Correspondence Text Corpus #108

Closed acrymble closed 6 years ago

acrymble commented 6 years ago

The Programming Historian has received the following tutorial on 'Exploratory Data Analysis of a Correspondence Text Corpus' by @zoews. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/exploratory-data-analysis-with-nlp

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I will first read through the lesson and provide feedback, to which the author will respond.

Members of the wider community are also invited to offer constructive feedback, which they should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @ianmilligan1 or @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

With this review ticket, I close the corresponding 'proposal' ticket: #91

acrymble commented 6 years ago

@zoews I've had a chance to read through the lesson, but not try the code (I can't download 2GB at the moment because my internet is running off my phone).

I really liked this tutorial, and I think it's a fantastic example of using digital approaches to help directly with the research process. I also hope we'll see your published research based on these approaches in the near future.

I've got a few comments and questions, and a number of (often minor) suggestions that I'd like you to address before we get the formal reviews.

Firstly, regarding the ethics of the corpus:

1) What is the legal status of the Enron corpus? This material should be under copyright based on my understanding. Is there something specific that has occurred that would allow us to host a subset of this material without infringing on copyright?

2) Similarly, these are presumably in most cases living individuals. What are the ethical implications of sharing and using this corpus as you have done?

Secondly, regarding the scope of the lesson:

3) This feels like 2 lessons to me. The first teaches users how to conduct and interpret a sentiment analysis on a single text or a small number of texts. The second teaches the user how to combine network analysis and sentiment analysis to understand how variables affect sentiment. I suggest splitting the lesson at paragraph 74. It would also address my queries about the opening paragraph being too broad (below).

More specifically:

--

Can you take a look at these issues and get back to me, so we can come up with the best way forward?

cassws commented 6 years ago

@acrymble : thank you so much for your feedback! I'm glad to hear of your overall impression and happy to elaborate on the issues you raised:

  1. The Federal Energy Regulatory Commission initially released the e-mails as a public disclosure in its Enron investigation, intended for public record/public domain data. There were a number of privacy issues in the initial release, and after a back-and-forth process they were re-released, cleaned again, and distributed via CMU in 2004. This release appears to be the closest thing to that original "public domain release" available online (the original uploads appear broken). The CMU release is the source of its use in a number of studies and hosting on mirror sites, as researchers commonly cite the release in their methods and works cited. CMU describes the dataset here: http://www.cs.cmu.edu/~enron/

According to CMU, "I am distributing this dataset as a resource for researchers who are interested in improving current email tools, or understanding how email is currently used." I think this is the standard that our use would have to fall under, and I think it's appropriate to both use the Enron emails generally and host the subset on our site for this purpose. The lesson would also function as is without a hosted subset, which I am 100% comfortable with, but would require sustained stewardship to make sure the URL remains active - CMU has maintained their hosting for 13 years, but that isn't a guarantee for the future of course.

One factor that would support the subset hosting use-case: we are not hosting a searchable raw-text version of the corpus that could be indexed by a search engine, but rather a compressed file version that must be downloaded. I noticed a comment thread on the blog for Enron-Mail.com, the primary browsable/searchable mirror site for the corpus, where the CMU team lead expresses disapproval towards hosting indexable text versions that could be picked up by search engines (but CMU continues to link to enron-mail.com on their site, and it doesn't seem to be a legal issue, fyi): https://www.blogger.com/comment.g?blogID=1117717936174889592&postID=2942910070405514601

It would also be possible to contact CMU directly to discuss their thoughts if appropriate.

  2. I definitely agree this part is important to consider carefully. I tried to limit my discussions of identifiable individuals in the lesson generally, and when I did discuss somebody, I picked folks who had been central to the investigation. I also generally limited analysis to the 'sent' folders of individuals who had been singled out by the FERC, again to avoid folks who appear only incidentally/peripherally in the corpus.

There's also a large lineage of articles and studies that identify individuals to varying degrees, and have passed a wide variety of IRBs. If there's an ethical issue, saying "others have done it too!" is not sufficient of course, but from the perspective of minimizing harms I do not feel there is anything in this lesson that would cause new damages to individuals discussed.

If we did want to further minimize discussion of actual individuals, we could take steps to remove the identifying information I've included in the text of the lesson when I've described the job title of somebody involved, etc.

  3. Splitting into two lessons/a two-part lesson works for me if that makes sense for the Programming Historian lesson format. I can help with that process, either now or at a later point, just let me know!

Thank you, Zoë

cassws commented 6 years ago

I will also go through and make the paragraph-by-paragraph edits but it will take me a few more days -- I'll plan to complete them by end of Monday of next week.

acrymble commented 6 years ago

I do think 2 lessons here is a better idea. I can see value for some people wanting one skill and not the other. I hope that isn't too much extra work. It looks like a fairly clean break to me. We can either progress with one at a time or both concurrently if you'd like. I'll have to think about what that means for reviewers.

cassws commented 6 years ago

Okay definitely, that makes a lot of sense. That would also help me be a little more deliberate about the difficulty/complexity level for each part, as you noted with Python code along the way.

It sounds like, regardless of whether you decide on concurrent or sequential editing processes, the next step is still to make the edits you've mentioned above -- I will follow up end of Monday with those and alert you when they're ready!

acrymble commented 6 years ago

Just wanted to check in on progress. Can you give me an update on how you're making out?

cassws commented 6 years ago

Hi @acrymble ,

I finished making my way through your edits above.

I also divided the lesson into a Part 1 and Part 2, but I'm not sure whether it works split out in that way. Is the goal to allow a reader to start with Part 2 if they choose to? I don't feel like there will be enough context to jump in halfway. However, if the goal is to help make the lesson more readable and approachable for a reader who still expects to complete both parts, or expects to complete just the first part, then I do feel the separation would work well.

Regardless, I feel that it's ready for the next step of review, pending any other modification you would like me to make. Happy to move ahead in whatever way makes sense!

acrymble commented 6 years ago

Are these both available at the same link, or did you upload a new file for the second one?

cassws commented 6 years ago

@acrymble Both at the same link. For now I just separated them with Part 1/Part 2 headers and some explanation in the introduction.

acrymble commented 6 years ago

I'm just reaching out to reviewers. I'll post an update when I've got them committed.

acrymble commented 6 years ago

Two reviewers have agreed to offer formal reviews of the lesson by 11 October 2017. Once they have been posted I will summarise a path forwards and we'll go from there @zoews

cassws commented 6 years ago

Oh fantastic! Yes please let me know if there's anything I can help with along the way, and otherwise I'll plan on jumping back into the project then. Thanks @acrymble !

alsalin commented 6 years ago

Many thanks for providing me the opportunity to review this lesson! Before getting started, please let me know if anything is unclear. First of all, I think the tone of the lesson is fabulous. It hits the mark of explaining a method without relying too much on the "how to" over the "why."

Just a few comments with respect to sustainability and clarity.

From a sustainability perspective, this lesson is great and does not have many issues other than remarking about version specifics here and there. My only main concern was getting the reader introduced to the "why" sooner in the introductory section.

acrymble commented 6 years ago

Thanks @alsalin, for such a prompt review. We're just waiting on one other. In order to make sure both reviewers get a chance to address the same piece of writing, I'll ask that you don't make any changes yet @zoews

acrymble commented 6 years ago

Just an update that our second reviewer has asked for a couple of extra weeks to write the review. We are hoping for a report in early December.

CzarinaChalid71 commented 6 years ago

Hi Zoe,

Sorry for the delay in submitting this review. This write-up of yours is a commendable one in text analysis studies. However, I would like to suggest certain things as follows:

  1. It would be clearer for the uninitiated readers to have a brief explanation on what the various technical jargons mean. Your opening paragraph is rather broad. Segmenting the paragraph to perhaps one main idea in each would be more helpful to readers.

  2. Since this is a two-lesson tutorial, the lessons should be separated and taught in two distinct sessions.

  3. Exploratory studies should still be guided with a motive on the outset. It would serve as a helpful guide to the readers in following this tutorial. The motivation for the deployment of this tool is rather too general and lacks direction.

  4. para 1 - a bit more detailed description about the Sentiment Analysis, being the key analytical tool, would be helpful.

  5. Since this is a tutorial, a more elaborated self-explanatory step-by-step procedure for performing the analysis would provide readers with a clearer understanding. Additionally, links to the tutorials may assist the readers better.

  6. The corpus that you made the subject matter of your paper i.e. Enron crisis may be elusive or even unknown to some. Perhaps, once again, either you provide some links or description about the corpus or take some other more common corpus or cases.

  7. para 27 - the mentioning of Python 3, NLTK and Pandas is rather too abrupt. I assume this tutorial is meant for advanced learners of these tools, isn't it?

  8. para 36 - Abrupt discussion on coding. Newbies to Python would appreciate it more if they are guided to work with the fundamentals of Python before dealing with a more somewhat advanced matter.

  9. para 62 - the initiation into the sub-topic 'Data Structuring' is too abrupt. Here, the layout of the entire tutorial would be more organised if readers are supplied with signals like numbering for each topic and sub-topic.

Review summary: Overall, this paper is interesting. Presentable and informative.

acrymble commented 6 years ago

Thanks so much for the review. @zoews please give me a few days to consider these and I'll write up a suggested path forward that you can respond to. I will hope to have this by the end of the week!

cassws commented 6 years ago

Sounds great -- thank you so much for your edits @CzarinaChalid71 (and yours as well @alsalin !)

acrymble commented 6 years ago

Thanks to our reviewers. It looks like the advice is fairly straightforward and practical to implement. I'll let @zoews respond to each point in the final draft.

I am still of the opinion that this should be 2 tutorials, because I think people will want to know about sentiment analysis more than exploratory data analysis. From a Programming Historian perspective it also raises problems of where we would put a 'Sentiment Analysis' lesson since this one exists (but isn't framed as obviously a skills lesson in Sentiment Analysis).

If you want to discuss that further, I think it's worth us talking via Skype so we can have a conversation. Otherwise, I think at this stage we should drop Part 2 entirely, and polish up this lesson and publish it. Then return to part 2 as a new lesson with a new intro, and cross-reference the two so that readers who want to learn more can read on. But we can get Part 1 out fairly soon, so no need to do both at the same time.

Let me know what you think. I realise we're coming up on the holiday season, but if you can get this done before 15 January that would be great. Let me know if that's unrealistic.

cassws commented 6 years ago

@acrymble I definitely agree with you about 2 tutorials, especially with a little distance from the first draft (it's a lot to jump into as one continuous lesson!)

The timing is actually quite good for me -- I will be able to focus on this from the 11th-20th of this month, and I will have extra time early January if necessary. My goal will be to get it to you by 20 December if at all possible, and certainly by 15 January at the latest.

Perhaps we could have a Skype conversation next week (the week of the 11th-15th) just to make sure I define the scope of the sentiment analysis part appropriately? If that works please shoot me an e-mail and we can narrow down a time. But yes, this is a great starting point and I will update as I go.

cassws commented 6 years ago

Hi all! @acrymble @alsalin @CzarinaChalid71

Thank you again for your detailed feedback. I decided to rewrite significant portions of the lesson in order to address some of the big-picture feedback that came up in all the comments. Here are the big picture changes I made followed by specific changes as per your comments:

Split into two sub-lessons

I reorganized the lesson into the following two lessons (with links to drafts):

Introduce the concept of exploratory data analysis and its relevance more effectively

Both reviewers also mention issues with the explanation of exploratory data analysis and the significance of these methods (e.g. "I feel that for a user new to exploratory data analysis, the section in P7-12 would be better served up front.", "Exploratory studies should still be guided with a motive on the outset. It would serve as a helpful guide to the readers in following this tutorial. The motivation for the deployment of this tool is rather too general and lacks direction.").

As a result, I rewrote most of the introductory sections of both parts entirely. My goal was to make the goals of the lessons much clearer, and introduce exploratory data analysis in a way that leads naturally into these particular NLP methods. As @alsalin recommended, I tried to pose and answer the "Why?" question earlier and more often in both lessons.

Make language clearer and more beginner-friendly

The issue of clarity and sufficient explanation came up repeatedly (e.g. "It would be clearer for the uninitiated readers to have a brief explanation on what the various technical jargons mean.").

I addressed this in several ways. First, I rewrote explanations of coding choices (e.g. discussions about implementing sentiment analysis, setting up the DataFrame, etc.) with an emphasis on using more plain writing and also giving more time to explain and contextualize the design choices.

I also broke up the code with more comments, especially in Part 1. I tried to ensure a beginner in coding could complete Part 1 successfully, and that Part 2 gave enough context for a beginner-to-intermediate coder to understand and succeed in, as opposed to requiring an advanced-level understanding of Python. Part 2 is still hard, but hopefully there is enough context for the reader to make it through. I may need to add more comments, however, so please let me know if you'd recommend that still.

Specific changes

I also tried to address each individual change not covered above:

I hope these changes bring parts 1 and 2 closer to publishable! Please let me know if there are any additional changes I can make, and I'll also keep rereading and thinking about these lessons as well. Many thanks for your effort to help make these lessons better!

Warmly, Zoë

acrymble commented 6 years ago

Thanks @zoews I will try to read asap.

acrymble commented 6 years ago

Ok @zoews I've had a chance to go through and do copy-edits on the first lesson. I'm a heavy-handed copy editor, which I hope you won't take personally. The new file is available at:

http://programminghistorian.github.io/ph-submissions/lessons/sentiment-analysis

If there are any bits I've broken or otherwise made more confusing or wrong, let me know.

There are a few outstanding issues I need you to help overcome.

I followed the instructions on the NLTK website as instructed.

cassws commented 6 years ago

Hi @acrymble -- I just wanted to quickly jump in to help debug the module not found issue. I forgot this would create an error!

NLTK requires you to download some modules/corpora/tools manually. To do this, you actually need to use Python 3 in the terminal (by typing python3 or python depending on how Python is installed, or using an IDE if relevant; this is another thing I'll need to figure out how to explain). Once the terminal is running Python 3, the commands you use would be:

(You could also write a script just containing those two lines and run it with Python 3 if you don't want to mess with running Python 3 in the terminal.)

I hope this helps debug this part! I agree, it would be incredibly frustrating if the installation step breaks down entirely.
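The exact commands are not preserved in this thread. As a hedged sketch only, based on the later mention of `nltk.download('vader_lexicon')` and the `english.pickle` tokenizer file, a small download script might look like the following. The helper name `download_nltk_resources` is hypothetical, and `"punkt"` is an assumption about which NLTK resource provides `english.pickle`; the resource name may differ between NLTK versions.

```python
# Resources the lesson appears to need. "vader_lexicon" is named in the
# thread; "punkt" is an assumption (the tokenizer package that ships
# english.pickle) and may be named differently in newer NLTK releases.
REQUIRED_RESOURCES = ["vader_lexicon", "punkt"]


def download_nltk_resources(download=None):
    """Download each required resource and report what the downloader returned.

    `download` is a hypothetical hook so the function can be exercised
    without network access; by default it uses nltk.download.
    """
    if download is None:
        import nltk  # imported lazily so the hook can be used without NLTK
        download = nltk.download
    return {name: download(name) for name in REQUIRED_RESOURCES}
```

Saving this in a file and calling `download_nltk_resources()` once with Python 3 should fetch the resources, assuming NLTK is installed and the machine is online.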

acrymble commented 6 years ago

Ok I'll leave you to integrate that into the lesson and to address the other points and then I'll read it again. I think this is the home stretch for this first lesson.

cassws commented 6 years ago

@acrymble sounds good -- I'll try to turn this around quickly (next day or two) and ping you back when done.

cassws commented 6 years ago

@acrymble I've finished making edits on this draft.

I had accidentally placed the Works Cited in Part 2 without moving the citations over to Part 1 as well, so I moved the relevant citations here.

I provided more context for the type of English texts for which VADER is appropriate to use, and also included a paragraph explaining options for multilingual sentiment analysis. The solutions tend to be fairly complicated and involve using a sentiment tool in conjunction with a translation tool, but I provided some context and links to studies that would hopefully give non-English researchers something to work with.

I added in additional examples and links to studies where appropriate.

After some trial and error, I realized that the "sid = SentimentIntensityAnalyzer()" step is actually necessary because SentimentIntensityAnalyzer is itself a class, and we must create an object that follows the class blueprint in order for our Python program to access the features. I don't want to go too far into object-oriented programming in this lesson, but thought it would be helpful to explain the core idea that Python requires us to create our Sentiment Analyzer tool as a single object, and then that object has all of the functions we need. Hopefully this explanation I added is not too jargony!
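The class-versus-object idea can be illustrated with a toy stand-in. This is a sketch, not NLTK's code: `ToySentimentAnalyzer`, its four-word lexicon, and its `score` method are all hypothetical and exist only to show why an object has to be created before its methods can be used.

```python
class ToySentimentAnalyzer:
    """A hypothetical, drastically simplified analyzer (not NLTK's VADER)."""

    def __init__(self):
        # Each object carries its own copy of the word scores.
        self.lexicon = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}

    def score(self, text):
        # Sum the scores of known words, ignoring surrounding punctuation.
        words = text.lower().split()
        return sum(self.lexicon.get(word.strip(".,!?"), 0.0) for word in words)


# The class is only a blueprint; creating an object from it gives us
# something we can actually call methods on.
analyzer = ToySentimentAnalyzer()
example_score = analyzer.score("A great day, not bad!")
```

Calling `ToySentimentAnalyzer.score("some text")` directly on the class, without first creating an object, raises a `TypeError`, which mirrors the reason the `sid = SentimentIntensityAnalyzer()` line is needed in the lesson.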

I removed unnecessary mentions of the terminal/other jargon whenever possible and made the variable names human readable (great feedback).

I want beginner coders to be successful with this lesson, and I'm struggling to figure out how much of the process of writing code, saving it as a "script.py" file, and using python in the command line to explain here. I added the caveat that readers who don't have this background might want to complete an earlier Python tutorial first, as this is something that takes more explanation than a couple of sentences or paragraphs. If this was of interest, I could even write this as a new (mini-) lesson for PH -- the process is complicated as it is dependent on a lot of factors that I think are best handled with a longer guided step-by-step process to account for the variability. I did try to explicitly link to external tutorial links for learners struggling to follow along, and do so as clearly as possible.

Finally, I added the directions for installing the relevant VADER code for NLTK as a single three-line installation script the user can write and run. I realized that the lesson was already providing these instructions for downloading the english.pickle code later on in the lesson, and so I simply moved this code snippet and instructions up earlier and added the nltk.download('vader_lexicon') line as well.

I think that's it -- please let me know what next steps I can take, and thanks again for the feedback. ~z

acrymble commented 6 years ago

Thanks @zoews I've gone over this again and can confirm I got all the code to work. I've also done a bit more copy editing, as a few new errors crept in during your latest edits (mostly replacing ss with scores) and adjusting some grammar.

The image isn't properly linked. Can I ask you to adjust that please? I've made you a new folder 'sentiment-analysis' to put it into in the /images folder.

I'll then look for an image for our icon and start working on the YAML header and publication bits and pieces so we can go live.

cassws commented 6 years ago

@acrymble Hmm I don't see the sentiment-analysis subfolder in images. Could you create it again? I uploaded the image file to images for now and can move it over then. Thanks!

acrymble commented 6 years ago

I managed to do it.

We'll use this image for the icon: https://www.flickr.com/photos/britishlibrary/11000942385/

@zoews can you send me a one-sentence bio that I can add? If you want to do a final read through of this let me know. Otherwise I think we can close this one.

Since it references the other lesson I think we'll have to wait to publish it officially so that internal references don't confuse readers.

acrymble commented 6 years ago

@zoews before I get stuck into the second one, can you go back over it and revise it based on the feedback and changes I made to the first one? In particular, looking for jargon, non-human-readable variable names, and adding in LOTS of wikipedia links?

That will save me a lot of time in copy editing.

cassws commented 6 years ago

@acrymble Oh I love that icon! That's brilliant haha.

Here's my 1 sentence bio: Zoë Wilkinson Saldaña is a graduate student at the University of Michigan School of Information, where she focuses on the intersection of critical data literacies, academic libraries, and data-informed learning.

As for changes to Part 2, I will start making those changes today and ping you when complete. I think I will also use this opportunity to slightly rewrite the code (I think the "average of sentence-based sentiment scores" approach I took is unnecessary and confusing, and not very typical relative to other studies I've read, and so I think I'll revert to entire message sentiment analysis to make this clearer.) This process may end up taking me a couple of days, as I'll be testing the code/etc too. More soon,

Zoë
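The difference between the two approaches mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the lesson's code: `toy_score`, its word lists, the regex-based `split_sentences`, and both wrapper functions are hypothetical stand-ins for a real sentiment tool and tokenizer.

```python
import re


def split_sentences(message):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", message.strip())
    return [part for part in parts if part]


def toy_score(text):
    # Hypothetical scorer: (positive hits - negative hits) / word count.
    positive = {"great", "thanks", "happy"}
    negative = {"problem", "late", "angry"}
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(w in positive for w in words) - sum(w in negative for w in words)
    return hits / len(words)


def sentence_average_score(message, score=toy_score):
    # Score each sentence separately, then average the sentence scores.
    sentences = split_sentences(message)
    if not sentences:
        return 0.0
    return sum(score(sentence) for sentence in sentences) / len(sentences)


def whole_message_score(message, score=toy_score):
    # Score the entire message in a single pass.
    return score(message)
```

With the toy scorer, the two functions generally return different numbers for the same message, because averaging gives every sentence equal weight regardless of its length. That is one reason the choice between them needs explaining to readers.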

cassws commented 6 years ago

@acrymble I'm done with the newest draft of Part 2!

I ended up making pretty extensive edits to simplify the language and code, as per your suggestions. I broke up many code sections, changed variable names, and tried to explain the programming choices wherever possible.

I also made a few substantial content changes:

I think it should be mostly ready, otherwise! Many thanks for your time and efforts and excited for this to get out there in the world.

Warmly, Zoë

acrymble commented 6 years ago

I've had a chance to read through the second lesson, and there are a number of things I think we need to discuss and work on. So I think the best way forward here is for us to slightly revise the first lesson to remove mentions of a series, publish that, and open a new review ticket where we can discuss lesson 2.

acrymble commented 6 years ago

This lesson is now ready for publication. It includes the following files:

I will make a pull request to the main site, at which point the lesson will be published.

acrymble commented 6 years ago

Thank you @zoews @CzarinaChalid71 and @alsalin, this lesson has now been published:

https://programminghistorian.org/lessons/sentiment-analysis

Please take a moment to share it with colleagues and on social media. This is a valuable contribution, and anything you can do to help us get the word out is greatly appreciated.