programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions

Proposed Lesson: Exploratory Data Analysis of a Correspondence Text Corpus #91

Closed acrymble closed 7 years ago

acrymble commented 7 years ago

The Programming Historian has received the following proposal for a lesson on 'Exploratory Data Analysis of a Correspondence Text Corpus with NLTK and NetworkX' by @zoews. The proposed learning outcomes of the lesson are:

We recognize that there is a potential for overlap with a current proposed lesson (#70) and I will speak to the other team to ensure an understanding and a clear space for both sets of authors.

In order to promote speedy publication of this important topic, we have agreed to a submission date of no later than 31 July 2017. The author agrees to contact the editor in advance if they need to revise the deadline.

If the lesson is not submitted by 31 July 2017, the editor will attempt to contact the author. If they do not receive an update, this ticket will be closed. The ticket can be reopened at a future date at the request of the author(s).

The main editorial contact for this lesson is @acrymble. If there are any concerns from the authors they can contact the Ombudsperson @ianmilligan1 or @amandavisconti.

acrymble commented 7 years ago

Note that the draft of the other group's lesson is available at #92. @zoews please make sure you take a look at it as you write, to avoid overlap.

cassws commented 7 years ago

@acrymble Will do, thanks!

cassws commented 7 years ago

@acrymble I wanted to provide an update and outline of the draft so far. If you have any feedback or revisions at this point please let me know -- otherwise I'm about 1-2 weeks away from submitting the draft.

In developing the lesson, I relied upon the Pandas library to bring text data into a DataFrame structure for robust exploratory analysis and to apply transformations to the data. I feel that this process is at the heart of learning goals two and three listed above, and thus I would like to slightly shift the focus of the lesson to "NLTK and Pandas" and away from "NLTK and NetworkX". I would still like to include a snippet of code that draws upon NetworkX to generate additional fields within a DataFrame, but as an example of one possible method of data transformation/analysis that I will pre-compute for users. This will encourage any users motivated to understand and implement network analysis to pursue lesson #92, and thus address the potential issue of overlap between the lessons.
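As a rough illustration of the approach described here (the records, column names, and choice of network measure below are invented for the example, not taken from the lesson draft), a DataFrame with a pre-computed NetworkX field might be sketched like this:

```python
import pandas as pd
import networkx as nx

# Hypothetical correspondence records (sender, recipient, body) standing
# in for the parsed email corpus described above.
emails = [
    ("alice", "bob", "Please review the quarterly report."),
    ("bob", "alice", "Looks good, thanks!"),
    ("alice", "carol", "Meeting moved to 3pm."),
]

# Load the records into a DataFrame for exploratory analysis.
df = pd.DataFrame(emails, columns=["sender", "recipient", "body"])

# Build a directed graph of who writes to whom, then pre-compute a
# network measure (out-degree) as a new DataFrame column.
graph = nx.DiGraph()
graph.add_edges_from(df[["sender", "recipient"]].itertuples(index=False))
df["sender_out_degree"] = df["sender"].map(dict(graph.out_degree()))

print(df[["sender", "sender_out_degree"]])
```

In this sketch the NetworkX step runs once and its result simply becomes one more column, which is the "pre-compute for users" idea: readers can work with the column without running the network code themselves.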

Here is an outline of topics:

Thank you! Zoë

acrymble commented 7 years ago

Thanks @zoews for this outline. I've got a few initial queries that I'll throw to you for comment, particularly in terms of disambiguating tools from skills, and sustainability.

You know the workflow better than me, but from what I can tell, you plan to:

  1. structure some data
  2. conduct a sentiment analysis
  3. look for some patterns across the corpus based on some variables

One of our big challenges is making sure people understand what they're learning, rather than just how to do it. Our second challenge is to do everything we can to make sure lessons don't break over time, so that people can continue to use this lesson for 10+ years from now.

With that in mind, I'll suggest what looks to me like the bits that might creak with age and you can let me know what you think. It looks like for your lesson to keep working, we're reliant on the following not changing substantially:

If any one of those changes in a non-backwards compatible way, we've got a lesson that doesn't work anymore. It looks like NLTK is pretty central to the lesson, as is Python.

Can we achieve enough without NetworkX to be meaningful, or is that core to what you want to achieve?

Secondly, I note you say you want to use Pandas dataframes. I'm not really familiar with them (which suggests I'm a bit out of date with my skills, but also that this is a newish thing) but from a quick look online, they look like matrices to me. Even though it might not be the most efficient/coolest/up-to-date, could this data structuring use a CSV file instead, as this is likely to be more sustainable long-term? This also has the advantage of being conceptually quite basic as everyone will know what it is, whereas a 'Pandas DataFrame' is one more thing to explain to a learner.
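For comparison, the CSV-based route suggested here needs only the standard library. A minimal sketch, with hypothetical column names:

```python
import csv
import io

# Hypothetical correspondence metadata as plain CSV text; in practice
# this would be read from a file on disk instead of a string.
data = "sender,recipient,word_count\nalice,bob,120\nbob,alice,80\n"

# csv.DictReader yields each row as a dict keyed by the header line,
# with no third-party dependency to age or break.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["sender"], rows[0]["word_count"])
```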

Let me know your thoughts. And thanks for the outline.

cassws commented 7 years ago

Hi @acrymble , thank you for these notes! I definitely hear you w/r/t long-term sustainability and focusing on teaching critical thinking and research competencies over just the rote process to get there.

I think using CSV data is a great idea for Task 3. And I could certainly exclude the code snippet that generates network analysis calculations (which will themselves be plugged into that pre-calculated CSV file along with gender and some other data to play with) if there's a worry about that network analysis code going out of date. I could, in that case, simply mention the calculation/network analysis principle used (such as indegree and outdegree). The benefit I imagine of showing a snippet of code with NetworkX library calls would be to create an explicit connection in the reader's head of "oh, if I want to calculate this myself, I have to figure out how to use NetworkX or a similar tool!" but the downside, as you say, is that the code/syntax could very well go out of date over the years. It's not the most important element of what's going on here, so I'm totally happy to leave it out or include it with a caveat/commented out/etc.
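A sketch of the kind of pre-computation described here (the sender/recipient pairs are invented for illustration): compute indegree and outdegree with NetworkX once, then ship the results as plain CSV so readers never have to run the network code themselves.

```python
import csv
import io
import networkx as nx

# Hypothetical sender -> recipient pairs standing in for the corpus.
pairs = [("alice", "bob"), ("bob", "alice"), ("alice", "carol")]
g = nx.DiGraph(pairs)

# Pre-compute indegree and outdegree per person and write them to CSV,
# so the lesson's readers get the measures without the NetworkX step.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["person", "indegree", "outdegree"])
for person in sorted(g.nodes):
    writer.writerow([person, g.in_degree(person), g.out_degree(person)])

print(buffer.getvalue())
```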

With regards to using DataFrames in Task 2 as opposed to a CSV and associated Python data structures: that came about because the Enron email corpus is formatted as a series of raw text files rather than a tidy CSV file. I was initially hoping to use CSV from the beginning, but realized I needed some kind of tool to wrangle the data effectively, and that's where I realized the power of DataFrames to easily apply transformations to a whole column of data, give quick statistical summaries of the data, and so on. Pandas strikes me as a really powerful tool to teach the data analysis competencies I hope to cover in this tutorial. I could convert the Enron data to a tidier CSV format myself for Task 2, but I think there's value in using Pandas to wrangle the original formatting because 1) this is how the data was initially released in the mid-2000s, and that format continues to persist on the Internet and form the basis of hundreds of other studies, and 2) I think this step of wrangling messy data will be of high interest to historians and others who will need to use Pandas or a similar tool to get the data to a workable point.

Pandas also enables the powerful "map any function to your data and generate a new column based on the results" pattern that I use for Task 2, and I think this principle is really helpful for communicating the possibilities of exploratory data analysis (as in, I'd love the readers to feel emboldened to try writing their own map functions and generating new columns based on whatever research questions capture their interest!)
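The "map any function to your data and generate a new column" idea mentioned above can be sketched as follows; the scoring function here is a toy stand-in (not NLTK's actual sentiment analyzer), and the column names are hypothetical:

```python
import pandas as pd

# Toy scorer standing in for a real NLTK sentiment call; it just counts
# words from a small positive-word list.
def toy_score(text):
    positive = {"good", "thanks", "great"}
    return sum(word.strip("!.,").lower() in positive for word in text.split())

df = pd.DataFrame({"body": ["Looks good, thanks!", "Meeting moved to 3pm."]})

# Map the function over an existing column to derive a new one --
# the core exploratory move described in the comment above.
df["score"] = df["body"].map(toy_score)
print(df)
```

Readers could swap `toy_score` for any function of their own and get a new column keyed to their research question, which is the habit the lesson hopes to encourage.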

I can definitely conduct some further research to understand how proprietary the DataFrame concept and approach are -- from some cursory review of other tutorials out there, it looks like the NLTK+Pandas toolchain is fairly well represented, which makes me hopeful it is robust and widely adopted enough to get past that 10-year mark! But all of the functionality mentioned above can be implemented in plain Python if need be. I can reflect further, and simplify and pare down as needed.

So my tl;dr would be that I'd like to use Python, NLTK, and Pandas together, and that I will reflect on the best ways to really emphasize the relevant research processes (using sentiment analysis and data structuring in the service of exploratory analysis). As I finish up, I will especially heed the points where code could get creaky and overly dependent on packages that change.

Please let me know if I should further modify my approach at this point, and otherwise I will keep going -- looking forward to working with folks on the draft!

acrymble commented 7 years ago

This sounds fine. Just note that #95 may take care of your goal to teach DataFrames.

cassws commented 7 years ago

Ah, good to know, thank you. I'll proceed cautiously with Pandas but keep those code portions modular & think about using straight Python instead. More in a couple of weeks!

cassws commented 7 years ago

@acrymble and all: The draft submission is almost complete, but I'm still finalizing some of the language. Is it acceptable to submit it by Wednesday morning instead of today? Thank you for your patience.

cassws commented 7 years ago

@acrymble I'm ready to submit the draft! However, it doesn't appear I have permissions to upload to the https://github.com/programminghistorian/ph-submissions/tree/gh-pages/lessons directory. Could I be granted access to that folder, or alternatively should I upload the files elsewhere or e-mail them? Thank you.

mdlincoln commented 7 years ago

@zoews I've just given you write access to this repository, so you should be able to add your draft now.

cassws commented 7 years ago

@mdlincoln thank you for the permissions invite!

The lesson is now up and running on the preview site: http://programminghistorian.github.io/ph-submissions/lessons/exploratory-data-analysis-with-nlp

Please note that, in addition to installing a few dependencies along the way, the tutorial will ask readers to download a copy of the Enron email corpus (which is ~2GB and currently stored off-site) and an additional small .py module (which is currently stored in the /lessons/ folder). Directions are included in the tutorial, but I wanted to flag that right now. I will ultimately be preparing a pared-down version of the Enron email corpus (limited to sent folders, which constitutes ~5% of the data) for long-term hosting on PH as per @acrymble's suggestion, and will upload it once it's ready.

Please let me know if you have any questions at this stage, and otherwise very much looking forward to any and all feedback on the project.

Many thanks, Zoë

acrymble commented 7 years ago

This has been moved to a review ticket: #108