@zoews I've had a chance to read through the lesson, but not try the code (I can't download 2GB at the moment because my internet is running off my phone).
I really liked this tutorial, and I think it's a fantastic example of using digital approaches to help directly with the research process. I also hope we'll see your published research based on these approaches in the near future.
I've got a few comments and questions, and a number of (often minor) suggestions that I'd like you to address before we get the formal reviews.
Firstly, regarding the ethics of the corpus:
1) What is the legal status of the Enron corpus? This material should be under copyright based on my understanding. Is there something specific that has occurred that would allow us to host a subset of this material without infringing on copyright?
2) Similarly, these are presumably in most cases living individuals. What are the ethical implications of sharing and using this corpus as you have done?
Secondly, regarding the scope of the lesson:
More specifically:
[x] The opening paragraph needs work. At the moment you're trying to do too many things. You might consider cutting the sentences in p1 entirely, and leading with p5. You can explore data in so many ways (summary statistics of various kinds, etc). From reading the tutorial, I felt I learned how to conduct sentiment analysis, which is useful for exploratory data analysis. I don't feel that I learned how to conduct exploratory data analysis (which could be any number of things).
[x] p1 - You allude to the Enron crisis, but better to be explicit and link to Wikipedia. Many people won't know what it is.
[x] p1 & p104 - please drop the footnote and convert this to an 'acknowledgements' at the end.
[x] p1 - thinking about sustainability, can we get a clear list & links here at the end of the first paragraph that outlines all dependencies and things people will be able to download?
[x] p2 - interesting examples. Are there real published research examples that you could link to rather than hypotheticals? Or even blog posts of work in progress?
[x] p3 - what are my three academic personas? I found this paragraph confusing. Are you saying that you're going to help me understand what questions might be worth asking of my corpus?
[x] p4 - can you link to any of these tutorials?
[x] p5-p6 - this would be a better p1. Good explanation.
[x] p8 - can you link to the arXiv in addition to your reference?
[ ] p10 - can you link to some of this historiography? Just a couple of examples would be awesome.
[x] p30 - I love your example of VADER working. But can you problematise sentiment analysis? Last time I tried something like this, it was terrible at double-negatives and unusual phrasing. I'm not not happy; I don't not hate her; My name is Mr. Frown etc. It's also crap at sarcasm. Can you show it failing too? Just so people don't blindly trust them? Or link to essential reading? (There's a rough sketch of the kind of failure cases I mean just after this list.)
[x] p20 - links to where I can find this stuff would be good. How do I know I've done it right?
[x] p34 - I interpret this as a negative message overall. Is that what I'm supposed to be thinking?
[x] p39 - I'm not clear how it's a positive overall sentiment from the outputs. It looks more neutral to me. What do the numbers mean and how am I to interpret them? Sorry if I missed this.
[x] p41 - link to more info on english.pickle? Do I have it installed?
[x] p54 - you lost me on the 3rd example. Too much jargon.
[x] p55 - you haven't talked about the directory structure of the dataset. Flip 56 and 55?
[x] p59 - a couple of examples of these colon separated values? Just to refresh people.
[x] p60 - The code is well commented, but can you just briefly talk us through the steps the code will do before you show it to us? Talk out the algorithm.
[x] p62 - the output is really just a matrix, and conceptually (as far as I can tell) no different from a CSV or a spreadsheet. Can you help readers make that mental link?
[x] p63/64 - recap where we are in lesson progress before starting the new section. What's been done and what still needs doing?
[x] p67 - tuple will be jargon for most people. So will key/value. Maybe beef up the conceptual idea that the data isn't in the format we want, so we're going to change it. We do this through a process known as 'mapping'.
[x] p69 - same with slices. If you need people to be familiar with python terminology as a precondition of being able to do this work, then that needs to be explicit in the introduction. In either case, link these types of terms to Wikipedia or python website.
[x] p74 - should this now be a new lesson? It looks like a substantial new development that brings in network analysis as well.
[x] p79 - you really need to be confident in python to understand this lesson, based on this code block. That's fine but needs to be explicit.
[x] p86 - you will have a 'data' folder to store these types of things. You could also just introduce this as a code block, noting that you've already done the work. Have the reader save it themselves.
[x] p89 - error != break.
[x] p91 - you used the word 'substantial'. That raises the question: how does one measure significance? Is this a statistically different sentiment? That's important if we're translating this to publishable research.
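To make the p30 point concrete, something like the following would do it. This is only a sketch of the kind of test I mean, not code from the lesson, and it assumes NLTK with the vader_lexicon resource installed:

```python
# Hypothetical failure cases for VADER: stacked negation and sarcasm.
# Not code from the lesson; assumes nltk and the vader_lexicon resource.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
tricky_sentences = [
    "I'm not not happy.",
    "I don't not hate her.",
    "Oh great, another all-day meeting. Just what I needed.",
]
for sentence in tricky_sentences:
    # compare each score against your own reading of the sentence
    print(sentence, sid.polarity_scores(sentence))
```

Comparing the scores against a human reading of each sentence is the quickest way to show readers where the tool struggles.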
--
Can you take a look at these issues and get back to me, so we can come up with the best way forward?
@acrymble : thank you so much for your feedback! I'm glad to hear of your overall impression and happy to elaborate on the issues you raised:
According to CMU, "I am distributing this dataset as a resource for researchers who are interested in improving current email tools, or understanding how email is currently used." I think this is the standard that our use would have to fall under, and I think it's appropriate both to use the Enron emails generally and to host the subset on our site for this purpose. The lesson would also function as-is without a hosted subset, which I am 100% comfortable with, but that would require sustained stewardship to make sure CMU's URL remains active -- CMU has maintained their hosting for 13 years, but that isn't a guarantee for the future, of course.
One factor that would support the subset hosting use-case: we are not hosting a searchable raw-text version of the corpus that could be indexed by a search engine, but rather a compressed file version that must be downloaded. I noticed a comment thread on the blog for Enron-Mail.com, the primary browsable/searchable mirror site for the corpus, where the CMU team lead expresses disapproval towards hosting indexable text versions that could be picked up by search engines (but CMU continues to link to enron-mail.com on their site, and it doesn't seem to be a legal issue, fyi): https://www.blogger.com/comment.g?blogID=1117717936174889592&postID=2942910070405514601
It would also be possible to contact CMU directly to discuss their thoughts if appropriate.
There's also a large lineage of articles and studies that identify individuals to varying degrees, and have passed a wide variety of IRBs. If there's an ethical issue, saying "others have done it too!" is not sufficient of course, but from the perspective of minimizing harms I do not feel there is anything in this lesson that would cause new damages to individuals discussed.
If we did want to further minimize discussion of actual individuals, we could take steps to remove the identifying information I've included in the text of the lesson when describing the job title of somebody involved, etc.
Thank you, Zoë
I will also go through and make the paragraph-by-paragraph edits but it will take me a few more days -- I'll plan to complete them by end of Monday of next week.
I do think 2 lessons here is a better idea. I can see value for some people wanting one skill and not the other. I hope that isn't too much extra work. It looks like a fairly clean break to me. We can either progress with one at a time or both concurrently if you'd like. I'll have to think about what that means for reviewers.
Okay definitely, that makes a lot of sense. That would also help me be a little more deliberate about the difficulty/complexity level for each part, as you noted with Python code along the way.
It sounds like, regardless of whether you decide on concurrent or sequential editing processes, the next step is still to make the edits you've mentioned above -- I will follow up end of Monday with those and alert you when they're ready!
Just wanted to check in on progress. Can you give me an update on how you're making out?
Hi @acrymble ,
I finished making my way through your edits above.
I also divided the lesson into a Part 1 and Part 2, but I'm not sure whether it works split out in that way. Is the goal to allow a reader to start with Part 2 if they choose to? I don't feel like there will be enough context to jump in halfway. However, if the goal is to help make the lesson more readable and approachable for a reader who still expects to complete both parts, or expects to complete just the first part, then I do feel the separation would work well.
Regardless, I feel that it's ready for the next step of review, pending any other modification you would like me to make. Happy to move ahead in whatever way makes sense!
Are these both available at the same link, or did you upload a new file for the second one?
@acrymble Both at the same link. For now I just separated them with Part 1/Part 2 headers and some explanation in the introduction.
I'm just reaching out to reviewers. I'll post an update when I've got them committed.
Two reviewers have agreed to offer formal reviews of the lesson by 11 October 2017. Once they have been posted I will summarise a path forwards and we'll go from there @zoews
Oh fantastic! Yes please let me know if there's anything I can help with along the way, and otherwise I'll plan on jumping back into the project then. Thanks @acrymble !
Many thanks for providing me the opportunity to review this lesson! Before getting started, please let me know if anything is unclear. First of all, the tone of the lesson, I think, is fabulous. It hits the mark of explaining a method without relying too much on the "how to" over the "why."
Just a few comments with respect to sustainability and clarity.
From a sustainability perspective, this lesson is great and does not have many issues, other than needing to note version specifics here and there. My only main concern was getting the reader introduced to the "why" sooner in the introductory section.
Thanks @alsalin, for such a prompt review. We're just waiting on one other. In order to make sure both reviewers get a chance to address the same piece of writing, I'll ask that you don't make any changes yet @zoews
Just an update that our second reviewer has asked for a couple of extra weeks to write the review. We are hoping for a report in early December.
Hi Zoe,
Sorry for the delay in submitting this review. This write-up of yours is a commendable contribution to text analysis studies. However, I would like to suggest a few things, as follows:
It would be clearer for uninitiated readers to have a brief explanation of what the various technical jargon means. Your opening paragraph is rather broad. Segmenting it so that each paragraph carries perhaps one main idea would be more helpful to readers.
Since this is a two-lesson tutorial, the lessons should be separated and taught in two distinct sessions.
Exploratory studies should still be guided by a motive at the outset; this would serve as a helpful guide to readers following this tutorial. The motivation for deploying this tool is rather too general and lacks direction.
para 1 - a more detailed description of sentiment analysis, as the key analytical tool, would be helpful.
Since this is a tutorial, a more elaborate, self-explanatory, step-by-step procedure for performing the analysis would give readers a clearer understanding. Additionally, links to other tutorials may assist readers further.
The corpus that you made the subject matter of your paper, i.e. the Enron crisis, may be elusive or even unknown to some. Perhaps, once again, either provide some links to or a description of the corpus, or use some other more common corpus or cases.
para 27 - the mention of Python 3, NLTK and Pandas is rather abrupt. I assume this tutorial is meant for advanced learners of these tools -- is that right?
para 36 - the discussion of coding is abrupt. Newcomers to Python would appreciate it more if they were guided through the fundamentals of Python before dealing with somewhat more advanced matters.
para 62 - the initiation into the sub-topic 'Data Structuring' is too abrupt. Here, the layout of the entire tutorial would be better organised if readers were supplied with signposts such as numbering for each topic and sub-topic.
Review summary: overall, this paper is interesting, presentable, and informative.
Thanks so much for the review. @zoews please give me a few days to consider these and I'll write up a suggested path forward that you can respond to. I will hope to have this by the end of the week!
Sounds great -- thank you so much for your edits @CzarinaChalid71 (and yours as well @alsalin !)
Thanks to our reviewers. It looks like the advice is fairly straightforward and practical to implement. I'll let @zoews respond to each point in the final draft.
I am still of the opinion that this should be 2 tutorials, because I think people will want to know about sentiment analysis more than exploratory data analysis. From a Programming Historian perspective it also raises the problem of where we would put a 'Sentiment Analysis' lesson, since this one exists (but isn't obviously framed as a skills lesson in sentiment analysis).
If you want to discuss that further, I think it's worth us talking via Skype so we can have a conversation. Otherwise, I think at this stage we should drop Part 2 entirely, and polish up this lesson and publish it. Then return to part 2 as a new lesson with a new intro, and cross-reference the two so that readers who want to learn more can read on. But we can get Part 1 out fairly soon, so no need to do both at the same time.
Let me know what you think. I realise we're coming up on the holiday season, but if you can get this done before 15 January that would be great. Let me know if that's unrealistic.
@acrymble I definitely agree with you about 2 tutorials, especially with a little distance from the first draft (it's a lot to jump into as one continuous lesson!)
The timing is actually quite good for me -- I will be able to focus on this from the 11th-20th of this month, and I will have extra time early January if necessary. My goal will be to get it to you by 20 December if at all possible, and certainly by 15 January at the latest.
Perhaps we could have a Skype conversation next week (the week of the 11th-15th) just to make sure I define the scope of the sentiment analysis part appropriately? If that works please shoot me an e-mail and we can narrow down a time. But yes, this is a great starting point and I will update as I go.
Hi all! @acrymble @alsalin @CzarinaChalid71
Thank you again for your detailed feedback. I decided to rewrite significant portions of the lesson in order to address some of the big-picture feedback that came up in all the comments. Here are the big picture changes I made followed by specific changes as per your comments:
Split into two sub-lessons
I reorganized the lesson into the following two lessons (with links to drafts):
Introduce the concept of exploratory data analysis and its relevance more effectively
Both reviewers also mention issues with the explanation of exploratory data analysis and the significance of these methods (e.g. "I feel that for a user new to exploratory data analysis, the section in P7-12 would be better served up front.", "Exploratory studies should still be guided by a motive at the outset; this would serve as a helpful guide to readers following this tutorial. The motivation for deploying this tool is rather too general and lacks direction.").
As a result, I rewrote most of the introductory sections of both parts entirely. My aim was to make the goals of the lessons much clearer, and to introduce exploratory data analysis in a way that leads naturally into these particular NLP methods. As @alsalin recommended, I tried to pose and answer the "Why?" question earlier and more often in both lessons.
Make language clearer and more beginner-friendly
The issue of clarity and sufficient explanation came up repeatedly (e.g. "It would be clearer for uninitiated readers to have a brief explanation of what the various technical jargon means.").
I addressed this in several ways. First, I rewrote explanations of coding choices (e.g. discussions about implementing sentiment analysis, setting up the DataFrame, etc.) with an emphasis on using more plain writing and also giving more time to explain and contextualize the design choices.
I also broke up the code with more comments, especially in Part 1. I tried to ensure a beginner in coding could complete Part 1 successfully, and that Part 2 gave enough context for a beginner-to-intermediate coder to understand and succeed in, as opposed to requiring an advanced-level understanding of Python. Part 2 is still hard, but hopefully there is enough context for the reader to make it through. I may need to add more comments, however, so please let me know if you'd recommend that still.
Specific changes
I also tried to address each individual change not covered above:
I hope these changes bring parts 1 and 2 closer to publishable! Please let me know if there are any additional changes I can make, and I'll also keep rereading and thinking about these lessons as well. Many thanks for your effort to help make these lessons better!
Warmly, Zoë
Thanks @zoews I will try to read asap.
Ok @zoews I've had a chance to go through and do copy-edits on the first lesson. I'm a heavy-handed copy editor, which I hope you won't take personally. The new file is available at:
http://programminghistorian.github.io/ph-submissions/lessons/sentiment-analysis
If there are any bits I've broken or otherwise made more confusing or wrong, let me know.
There are a few outstanding issues I need you to help overcome.
I followed the instructions on the NLTK website, as the lesson instructed.
Hi @acrymble -- I just wanted to quickly jump in to help debug the module not found issue. I forgot this would create an error!
NLTK requires you to download some modules/corpora/tools manually. To do this, you actually need to use Python 3 in the terminal (by typing python3 or python, depending on how Python is installed, or by using an IDE if relevant -- this is another thing I'll need to figure out how to explain). Once the terminal is running Python 3, the commands you use would be:
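Presumably the two lines were along these lines (the exact resource name depends on what the lesson step calls for; vader_lexicon is the one mentioned later in this thread):

```python
# Presumed commands: download the NLTK resource the lesson relies on.
# The resource name here (vader_lexicon) is indicative, not definitive.
import nltk
nltk.download('vader_lexicon')
```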
(You could also write a script just containing those two lines and run it with Python 3 if you don't want to mess with running Python 3 in the terminal.)
I hope this helps debug this part! I agree, it would be incredibly frustrating if the installation step breaks down entirely.
Ok I'll leave you to integrate that into the lesson and to address the other points and then I'll read it again. I think this is the home stretch for this first lesson.
@acrymble sounds good -- I'll try to turn this around quickly (next day or two) and ping you back when done.
@acrymble I've finished making edits on this draft.
I had accidentally placed the Works Cited in Part 2 without moving the citations over to Part 1 as well, so I moved the relevant citations here.
I provided more context for the type of English texts for which VADER is appropriate to use, and also included a paragraph explaining options for multilingual sentiment analysis. The solutions tend to be fairly complicated and involve using a sentiment tool in conjunction with a translation tool, but I provided some context and links to studies that would hopefully give non-English researchers something to work with.
I added in additional examples and links to studies where appropriate.
After some trial and error, I realized that the "sid = SentimentIntensityAnalyzer()" step is actually necessary because SentimentIntensityAnalyzer is itself a class, and we must create an object that follows the class blueprint in order for our Python program to access the features. I don't want to go too far into object-oriented programming in this lesson, but thought it would be helpful to explain the core idea that Python requires us to create our Sentiment Analyzer tool as a single object, and then that object has all of the functions we need. Hopefully this explanation I added is not too jargony!
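A minimal sketch of what that boils down to (assuming NLTK and the vader_lexicon resource are installed; the example sentence is illustrative, not from the lesson):

```python
# SentimentIntensityAnalyzer is a class: first create an analyzer object,
# then call its methods on whatever text you want to score.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
scores = sid.polarity_scores("Thanks so much for the quick turnaround!")
print(scores)  # a dict with 'neg', 'neu', 'pos' and 'compound' keys
```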
I removed unnecessary mentions of the terminal/other jargon whenever possible and made the variable names human readable (great feedback).
I want beginner coders to be successful with this lesson, and I'm struggling to figure out how much of the process of writing code, saving it as a "script.py" file, and running it with Python on the command line to explain here. I added a caveat that readers without this background might want to complete an earlier Python tutorial first, as this is something that takes more explanation than a couple of sentences or paragraphs. If this were of interest, I could even write it up as a new (mini-) lesson for PH -- the process is complicated because it depends on a lot of factors that are best handled with a longer, guided, step-by-step approach to account for the variability. I did try to link explicitly to external tutorials for learners struggling to follow along, and to do so as clearly as possible.
Finally, I added the directions for installing the relevant VADER code for NLTK as a single three-line installation script the user can write and run. I realized that the lesson was already providing these instructions for downloading the english.pickle code later on in the lesson, and so I simply moved this code snippet and instructions up earlier and added the nltk.download('vader_lexicon') line as well.
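That installation snippet is presumably of roughly this shape (the resource names here are a reconstruction -- punkt supplies english.pickle and vader_lexicon supplies the VADER tool -- so check the published lesson for the exact lines):

```python
# Run once to fetch the NLTK resources the lesson relies on.
# Resource names are a best guess, not copied from the lesson.
import nltk
nltk.download('punkt')            # tokenizer models, including english.pickle
nltk.download('vader_lexicon')    # lexicon used by SentimentIntensityAnalyzer
```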
I think that's it -- please let me know what next steps I can take, and thanks again for the feedback. ~z
Thanks @zoews I've gone over this again and can confirm I got all the code to work. I've also done a bit more copy editing, as a few new errors crept in during your latest edits (mostly replacing ss with scores) and adjusting some grammar.
The image isn't properly linked. Can I ask you to adjust that please? I've made you a new folder 'sentiment-analysis' to put it into in the /images folder.
I'll then look for an image for our icon and start working on the YAML header and publication bits and pieces so we can go live.
@acrymble Hmm I don't see the sentiment-analysis subfolder in images. Could you create it again? I uploaded the image file to images for now and can move it over then. Thanks!
I managed to do it.
We'll use this image for the icon: https://www.flickr.com/photos/britishlibrary/11000942385/
@Zoews can you send me a 1 sentence bio that I can add? If you want to do a final read through of this let me know. Otherwise I think we can close this one.
Since it references the other lesson I think we'll have to wait to publish it officially so that internal references don't confuse readers.
@zoews before I get stuck into the second one, can you go back over it and revise it based on the feedback and changes I made to the first one? In particular, could you look out for jargon and non-human-readable variable names, and add in LOTS of Wikipedia links?
That will save me a lot of time in copy editing.
@acrymble Oh I love that icon! That's brilliant haha.
Here's my 1 sentence bio: Zoë Wilkinson Saldaña is a graduate student at the University of Michigan School of Information, where she focuses on the intersection of critical data literacies, academic libraries, and data-informed learning.
As for changes to Part 2, I will start making those changes today and ping you when complete. I think I will also use this opportunity to slightly rewrite the code (I think the "average of sentence-based sentiment scores" approach I took is unnecessary and confusing, and not very typical relative to other studies I've read, and so I think I'll revert to entire message sentiment analysis to make this clearer.) This process may end up taking me a couple of days, as I'll be testing the code/etc too. More soon,
Zoë
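(For readers comparing the two approaches mentioned above, the difference is roughly the following. This is only an illustrative sketch, not code from either version of the lesson, and it assumes NLTK with the punkt and vader_lexicon resources installed.)

```python
# Illustrative comparison: score a whole message at once vs. average the
# scores of its individual sentences.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize

sid = SentimentIntensityAnalyzer()
message = ("Thanks for the update. The numbers look worse than expected. "
           "Let's regroup tomorrow.")

# Option A: whole-message sentiment (the simpler approach)
whole_score = sid.polarity_scores(message)["compound"]

# Option B: average of per-sentence compound scores
sentence_scores = [sid.polarity_scores(s)["compound"]
                   for s in sent_tokenize(message)]
average_score = sum(sentence_scores) / len(sentence_scores)

print(whole_score, average_score)
```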
@acrymble I'm done with the newest draft of Part 2!
I ended up making pretty extensive edits to simplify the language and code, as per your suggestions. I broke up many code sections, changed variable names, and tried to explain the programming choices wherever possible.
I also made a few substantial content changes:
I think it should be mostly ready, otherwise! Many thanks for your time and efforts and excited for this to get out there in the world.
Warmly, Zoë
I've had a chance to read through the second lesson, and there are a number of things I think we need to discuss and work on. So I think the best way forward here is for us to slightly revise the first lesson to remove mentions of a series, publish that, and open a new review ticket where we can discuss lesson 2.
This lesson is now ready for publication. It includes the following files:
I will make a pull request to the main site, at which point the lesson will be published.
Thank you @zoews @CzarinaChalid71 and @alsalin, this lesson has now been published:
https://programminghistorian.org/lessons/sentiment-analysis
Please take a moment to share it with colleagues and on social media. This is a valuable contribution, and anything you can do to help us get the word out is greatly appreciated.
The Programming Historian has received the following tutorial on 'Exploratory Data Analysis of a Correspondence Text Corpus' by @zoews. This lesson is now under review and can be read at:
http://programminghistorian.github.io/ph-submissions/lessons/exploratory-data-analysis-with-nlp
Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.
I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I will first read through the lesson and provide feedback, to which the author will respond.
Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.
I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @ianmilligan1 or @amandavisconti if you feel there's a need for an ombudsperson to step in.
Anti-Harassment Policy
This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.
With this review ticket, I close the corresponding 'proposal' ticket: #91