Closed jtkiley closed 6 years ago
Thanks, Jason. This is great. Added to "data selection" bullet in outline.
I would also:
(1) Have a clear visual of the overall process and overlay our workshop items as the day's agenda (I can do that).
(2) Talk to AOM about starting the discussion informally during lunch, similar to an OMT Cafe, to extend our time and reach a bit (Tim?).
(3) Identify what we need to install (Anaconda, possibly; Laura, thoughts here?) and how, with one or two of us getting there early to help, if need be.
A first cut at workshop PPT (including process steps). Feel free to edit.
Folks: so sorry I missed the online chat the other week. It accepted the invitation and then it never made it into my Google calendar. Technology failed me. I do have the next meeting (March 7?) on my calendar.
I've been traveling but I'm back now and can devote time again to this. The outline is looking great. My worry is it's too much for a workshop if people don't already know Python. Some specific thoughts:
1) I do still recommend Anaconda, although I have been having great experiences with Binder lately, so it's possible we could avoid installing things altogether. Anaconda is relatively easy though, and then they would have the software on their machines.
2) I just did a tutorial on the NYTimes API in class this week, and it was a bit of a disaster. It requires pretty extensive and nuanced knowledge of data types in Python, in particular lists and dictionaries, and the specific features of both of these. NYT returns a JSON file, so they need to know what a dictionary is, how to traverse a dictionary of dictionaries and then a list of dictionaries, and how to turn a list of dictionaries into a dataframe. The public API is also a bit finicky, and often fails. We had to run our GET request multiple times to get it to run successfully all the way through. But the end product was a graph of the number of articles that mention a particular keyword per year, and that was a satisfying outcome.
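For anyone weighing how much Python this step assumes, here is a minimal sketch of the traversal described above: a dictionary of dictionaries, then a list of dictionaries, then flat rows counted per year. The `payload` below is hand-written for illustration, and its field names (`response`, `docs`, `headline`, `main`, `pub_date`) are assumptions rather than a guaranteed schema, so check them against a real reply before relying on them.

```python
from collections import Counter

# Hand-written stand-in for a parsed API reply (what json.loads would return).
# The field names are assumptions for illustration, not a guaranteed schema.
payload = {
    "status": "OK",
    "response": {                      # a dictionary inside a dictionary
        "docs": [                      # a list of dictionaries, one per article
            {"headline": {"main": "First example story"}, "pub_date": "2010-06-29T00:00:00Z"},
            {"headline": {"main": "Second example story"}, "pub_date": "2010-11-18T00:00:00Z"},
            {"headline": {"main": "Third example story"}, "pub_date": "2012-03-01T00:00:00Z"},
        ]
    },
}


def flatten(reply):
    """Turn the nested reply into a list of flat dictionaries, one row per article."""
    rows = []
    for doc in reply["response"]["docs"]:
        rows.append({
            "headline": doc["headline"]["main"],   # nested dictionary lookup
            "year": int(doc["pub_date"][:4]),      # first four characters are the year
        })
    return rows


rows = flatten(payload)

# Count articles per year, the quantity behind the graph mentioned above.
articles_per_year = Counter(row["year"] for row in rows)
print(sorted(articles_per_year.items()))  # [(2010, 2), (2012, 1)]
```

A list of flat dictionaries like `rows` is the shape that `pandas.DataFrame(rows)` accepts, which covers the last step (list of dictionaries to dataframe).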
Thanks, Laura. Very helpful. Yes, our next meeting is next Wed., 11-12noon EST.
For a short workshop, I wonder if we're not better off making the front end (software install and API access) less of a hassle, since our focus is on data curation / cleaning.
Anaconda is pretty easy to download. But pandas won't be intuitive for everyone. So, I'd vote for a pre-loaded dataset (perhaps NYTimes, like we had said) and Jupyter or Binder. But I'm also ok with Anaconda, if we decide to go that way.
Hovig Tchalian | Assistant Professor of Practice
Peter F. Drucker and Masatoshi Ito Graduate School of Management
Claremont Graduate University
1021 N. Dartmouth Avenue, Claremont, CA 91711 T 909.607.9203 | hovig.tchalian@cgu.edu Kathy Holden | Support T 909.607.9061 | kathy.holden@cgu.edu
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/jtkiley/curation_workshop/issues/1#issuecomment-370068865, or mute the thread https://github.com/notifications/unsubscribe-auth/Aa3KRMBojIDfUgaJM3lgtFscunSKqn_Mks5tacTsgaJpZM4SQC1O .
Hi all. I added a new workshop philosophy/implications document to the repo. Check it out and let the feedback flow.
I can do some NYT API work if someone (Hovig or Tim?) points me to a few firms that make sense to compare.
Thanks, Laura.
If we're looking for firms in general, might look at coverage of GM and Tesla, say in 2010 (Tesla IPO in mid-year). I mention GM, because I have a few projects around EVs that use the two firms as a basis of comparison. So I could provide contextual info.
If we're looking at female CEOs, like we were discussing, it might be interesting to pick two of the three companies below, who have (or had) female CEOs:
All are well-known, and it might be interesting to compare Mayer with Rometty (same industry, tech) or Mayer with Barra (across industries).
I had said I would look up the Northwestern paper - abridged Proceedings version attached. Requested a copy of the full paper, which I'll send along. I'm happy to come back with analyses we might run, based on the full version.
File attachment, for above... AOM Best Paper Submission # 13997.pdf
Awesome. I'll work up a GM/Tesla comparison. Amount of coverage and simple sentiment analysis? We'll have to hand-wave a bit over JSON as a data structure, but we can point them to tutorials for that.
@jtkiley I assume we could merge that with your data (sorry if I sound naive with this question, this is quite outside my area of expertise :( )
If it seems like too much for the tutorial we don't have to include it.
Amount of coverage and simple sentiment analysis sounds great.
Our dataset is pretty large - 80k+ articles (1985-2014), across several types (PR, reviews, and newspaper articles, including NYT), from Factiva. May be too much to incorporate and analyze. But happy to share, if we need more volume.
If we can use an API pull from 2010 to 2014, that might work as well. Those five years include the Tesla IPO and also coincide with the period of institutional 'lift' (increasing volume of discussion + 25-fold increase in
Added an example comparing Tesla and GM using the NYT API. Looks like polarity and subjectivity went up in 2012 for both. Does that make sense to you, @tchalian ? Also subjectivity is perfectly correlated with the number of articles published for GM. That's likely a fluke, but maybe not?
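On the "fluke" question, one cheap check is how many yearly points feed the correlation. Pearson's r is always exactly +1 or -1 when only two distinct points are compared, and it can still look extreme with a handful. The sketch below uses made-up numbers, not the notebook's output, and a hand-rolled `pearson` helper that is an assumption for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up yearly values for illustration only.
counts_two = [120, 310]             # articles per year, two years
subjectivity_two = [0.41, 0.44]     # mean subjectivity, same two years
r_two = pearson(counts_two, subjectivity_two)

counts_five = [120, 310, 150, 290, 200]
subjectivity_five = [0.41, 0.44, 0.45, 0.40, 0.43]
r_five = pearson(counts_five, subjectivity_five)

# Two points always lie on a straight line, so r_two is 1 up to rounding.
print(round(r_two, 6), round(r_five, 3))
```

If the GM series only has two or three yearly values, a perfect correlation says little; with more years it would be worth a closer look.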
The notebook is in the scripts folder. If you want to reproduce it I'd appreciate it if you change the API key to your own. Don't want to tax my key.
My general feeling is that this is way too complicated for the workshop, but the example is there. It's easy to change to different keywords if anyone wants to play around with it. Note that the function 'grab_data' will save a copy of the data on your hard drive.
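For readers who have not opened the notebook, a minimal sketch of the save-a-copy idea: fetch once, write the result to disk, and read the file on later runs. This is not the notebook's `grab_data`; the helper name `load_or_fetch`, the file name, and the fake fetch function are all hypothetical.

```python
import json
import os
import tempfile


def load_or_fetch(path, fetch):
    """Return cached JSON from `path` if it exists; otherwise call `fetch()` and save the result."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    data = fetch()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


# Stand-in for a real API call, so the sketch runs without a network or a key.
calls = []


def fake_fetch():
    calls.append(1)
    return {"docs": [{"headline": "placeholder"}]}


cache_path = os.path.join(tempfile.mkdtemp(), "example_cache.json")
first = load_or_fetch(cache_path, fake_fetch)    # calls fake_fetch and writes the file
second = load_or_fetch(cache_path, fake_fetch)   # reads the file, no second call
print(len(calls))  # 1
```

Caching like this also means participants hit the API once rather than on every rerun of a cell.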
Also note that this is just a draft. If we wanted to do this in the workshop I would clean it up. In particular, I would separate the API function into its own separate script/notebook. Then I would create a second script/notebook for cleaning and graphing. But I don't want to spend more time if this isn't the direction we want to go.
I like the direction we're heading in here with Tesla and GM as comparators. I wonder if the query could be slightly more focused with "GM" & "electric"? Also, I'm wondering if we want to include subjectivity as a subset of sentiment? Although TextBlob includes it, is it one extra thing we would need to explain?
Further, @lknelson do we want to say anything about the broader practice of obtaining an API key?
There are many potential problems with this example.
- TextBlob is not included in the base Anaconda distribution, so we would have to coach people through conda install (bless our hearts, this is tricky in a short workshop!). We could instead use NLTK to do sentiment? That's not too tricky to change.
- The public NYT API probably can't handle all the requests we'll send it. But it usually works eventually if people keep trying.
- People would need to get an API key. I don't want to publicly release mine, of course. One option I've seen is to provide a set of API keys that we produce ahead of time (maybe 5 or so different keys), and then let the participants choose one. This has been successful in the past.
I'd vote NLTK. A few catch-up points, on my end:
Laura, haven't forgotten about your request to take a look at the data you pulled. Will get to it in a bit.
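On the point that the public API often fails but usually works if people keep trying, here is a minimal sketch of doing the "keep trying" automatically, with a growing pause between attempts. The helper name `fetch_with_retries`, its defaults, and the simulated failure are assumptions for illustration; a real version would wrap the actual GET request.

```python
import time


def fetch_with_retries(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds or `max_attempts` is reached, doubling the wait each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            # Broad catch keeps the sketch short; a real version would catch
            # only network or HTTP errors.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for a flaky API call: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated API failure")
    return {"status": "OK"}


waits = []  # record the pauses instead of actually sleeping
result = fetch_with_retries(flaky_fetch, sleep=waits.append)
print(result, waits)  # {'status': 'OK'} [1.0, 2.0]
```

Spacing the retries out also goes a little easier on a shared key than rerunning the cell by hand.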
Ok, NLTK sounds great! We need to let the organizers know by tomorrow what we want them to send out for "a software download for your ALW session". I think it's clear that we'll be asking participants to just download Anaconda (which includes NLTK).
I like the idea of a set of shared public API keys to use for the session.
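If we go with a small pool of shared keys, one way to keep any single key from taking all the traffic is to hand them out round-robin rather than letting everyone pick the first one. A minimal sketch, assuming placeholder key strings and a hypothetical `assign_keys` helper (nothing here is a real key or an existing script):

```python
from collections import Counter

# Placeholder strings, not real API keys.
SHARED_KEYS = ["KEY_A", "KEY_B", "KEY_C", "KEY_D", "KEY_E"]


def assign_keys(participants, keys):
    """Assign keys round-robin so the load is spread as evenly as possible."""
    return {name: keys[i % len(keys)] for i, name in enumerate(participants)}


participants = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
assignment = assign_keys(participants, SHARED_KEYS)

# How many participants share each key.
load = Counter(assignment.values())
print(assignment["p6"], dict(load))
```

With seven participants and five keys, no key ends up with more than two people on it.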
Closing. We obviously iterated on it and figured it out.
Let's firm up our outline as a first step. I'll throw out a starting place and edit the issue as we decide on things. Once we've firmed it up, we can divide and conquer.
Things for us to think about:
What ideas and practices do we want to advocate and demonstrate?
Workshop outline (adapted from HT's 2018-02-21 notes)
df.to_ and df.read_ methods for multiple types)