jtkiley commented 6 years ago

Let's firm up our outline as a first step. I'll throw out a starting place and edit the issue as we decide on things. Once we've firmed it up, we can divide and conquer.

Things for us to think about:

We only have so much time, so knowing what we want to foreground (i.e. demonstrate directly) and what we can background (i.e. mention without demo) would help.
If something is going to be covered in one of the later sessions, we can mention it. Providing a road map to choosing a next session would be a value add.
Is there anything that we want to get across that's not there in our current design?
If there are other things that we want to get across, can we cover them before the demo portion? It's an active learning workshop, so we should spend a good chunk of time hands on, but some things are too complex for that.
Anything else that I'm not thinking of that we should think of (meta-thinking?)

What ideas and practices do we want to advocate and demonstrate?

Data access
- This can be a lot of work, so think critically about what you need to examine your theoretical question.
- APIs provide access to unique data in an efficient manner.
Data selection [HT: moved up one]
- Types of data available (news, financials / Compustat, databases, e.g., Lexis-Nexis and Factiva)
- Importance of metadata
- Combining different data sources is not a trivial task
Data cleaning
- Save raw data directly. Avoid exposure to time-based or silent data updates.
- Make all changes in code. This gives you an audit trail, easy ability to change anything, and potential for code reuse.
Data analysis
- Show something simple, like a visualization.
- Otherwise, our "finish line" is having data ready for analysis.
Open source tools
- Powerful tools that exceed the functionality and performance of some frequently used commercial software.
- The heavy use of Python in industry means many more results for common searches and help.
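
The two data-cleaning practices above (save raw data directly, make all changes in code) could look roughly like the sketch below. It uses only the standard library so it runs anywhere; in the workshop the same idea would presumably go through pandas. The file name, the sample records, and the `save_raw` and `clean` helpers are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical raw records, standing in for whatever an API or database export returns.
raw_records = [
    {"firm": " Tesla ", "date": "2010-06-29", "headline": "Example headline one "},
    {"firm": "GM", "date": "2010-11-18", "headline": "Example headline two"},
]


def save_raw(records, path):
    """Write the records exactly as received; refuse to overwrite an existing archive."""
    path = Path(path)
    if path.exists():
        raise FileExistsError(f"{path} already exists; raw archives are write-once")
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return path


def clean(records):
    """Return cleaned copies; the input list and its dictionaries are left untouched."""
    return [
        {key: value.strip() if isinstance(value, str) else value
         for key, value in rec.items()}
        for rec in records
    ]


workdir = Path(tempfile.mkdtemp())
raw_path = save_raw(raw_records, workdir / "raw_articles.json")

# Every later step starts from the archived file, not from a fresh download.
loaded = json.loads(raw_path.read_text(encoding="utf-8"))
cleaned = clean(loaded)

print(repr(loaded[0]["firm"]))   # raw value, whitespace intact
print(repr(cleaned[0]["firm"]))  # cleaned copy
```

Because `clean` is a function rather than a manual edit, rerunning the script on the archived file reproduces the cleaned data, which is the audit trail the bullet describes.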

Workshop outline (adapted from HT's 2018-02-21 notes)

Framing: data curation as a process – workflow
- Series of hands-on practices
- Walk-thru of our steps – clean workflow
- Connection to our central narrative
Introduction
- What is Python?
- How do you deal with unstructured data (vs. numerical data)?
- How do you define a Research Q (how to pull right kind of data)?
Workflow steps
1. Fetch the data (NYT API and Coinbase API?)
2. Save an archive of the raw data (point out the pandas df.to_* methods and pd.read_* functions for multiple file types)
3. Create pandas dataframes for each
4. Clean the data, keeping the original data untouched
5. Show a summarize and merge (depending on the time data, we may edit live to show day/hour/minute resolution)
6. Show some simple content analysis (TextBlob)
7. Visualize it
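
Step 5 (summarize, then merge at a chosen time resolution) is the least obvious of the steps above, so here is a rough standard-library sketch of the idea. The sample articles and prices, the `RESOLUTIONS` table, and the `bucket` and `summarize_and_merge` helpers are hypothetical; in the workshop this would more likely be done with pandas.

```python
from collections import Counter
from datetime import datetime

# Hypothetical cleaned inputs: article timestamps and price observations.
articles = [
    {"published": "2018-03-01T09:15:00", "headline": "a"},
    {"published": "2018-03-01T17:40:00", "headline": "b"},
    {"published": "2018-03-02T08:05:00", "headline": "c"},
]
prices = [
    {"time": "2018-03-01T16:00:00", "price": 10900.0},
    {"time": "2018-03-02T16:00:00", "price": 11050.0},
]

# strftime patterns that truncate a timestamp to the chosen resolution.
RESOLUTIONS = {
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%d %H:00",
    "minute": "%Y-%m-%d %H:%M",
}


def bucket(timestamp, resolution):
    """Map an ISO timestamp to the label of the period it falls in."""
    return datetime.fromisoformat(timestamp).strftime(RESOLUTIONS[resolution])


def summarize_and_merge(articles, prices, resolution="day"):
    """Count articles per period, take the last price per period, then inner-join."""
    counts = Counter(bucket(a["published"], resolution) for a in articles)
    last_price = {}
    for p in sorted(prices, key=lambda p: p["time"]):
        last_price[bucket(p["time"], resolution)] = p["price"]
    # Inner join: keep only the periods that appear in both sources.
    return [
        {"period": period, "n_articles": counts[period], "price": last_price[period]}
        for period in sorted(counts.keys() & last_price.keys())
    ]


daily = summarize_and_merge(articles, prices, "day")
hourly = summarize_and_merge(articles, prices, "hour")
print(daily)
print(hourly)
```

With this sample, the daily merge keeps both days while the hourly merge comes back empty, because no article shares an hour with a price observation. That is the kind of effect a live edit of the resolution would show.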
Reflections
- Does what we’ve presented seem doable?
- Other questions, concerns?
- How might you use this approach & workflow in your own research?
Conclusion
- Parting words from presenters: next steps for participants
- Explicit shoutouts for 4p sessions
  - Introduction to Data Science using Python (34MS01 Business Insights Lab)
  - Natural Language Processing 101 Turning Text into Data and Insights (72MS03 Classroom; same as our room)
  - (Others?)

tchalian commented 6 years ago

Thanks, Jason. This is great. Added to "data selection" bullet in outline.

I would also:

(1) Have a clear visual of the overall process and overlay our workshop items as the day's agenda (I can do that).
(2) Talk to AOM about starting the discussion informally during lunch, similar to an OMT Cafe, to extend our time and reach a bit (Tim?).
(3) Identify what we need to install (Anaconda, possibly; Laura, thoughts here?) and how, with one or two of us getting there early to help if need be.

tchalian commented 6 years ago

A first cut at workshop PPT (including process steps). Feel free to edit.

AOM Data Curation Workshop (2.23.18).pptx

lknelson commented 6 years ago

Folks: so sorry I missed the online chat the other week. It accepted the invitation and then it never made it into my Google calendar. Technology failed me. I do have the next meeting (March 7?) on my calendar.

I've been traveling but I'm back now and can devote time again to this. The outline is looking great. My worry is it's too much for a workshop if people don't already know Python. Some specific thoughts:

1) I do still recommend Anaconda, although I have been having great experiences with Binder lately, so it's possible we could avoid installing things altogether. Anaconda is relatively easy though, and then they would have the software on their machines.
2) I just did a tutorial on the NYTimes API in class this week, and it was a bit of a disaster. It requires pretty extensive and nuanced knowledge of data types in Python, in particular lists and dictionaries, and the specific features of both of these. NYT returns a JSON file, so they need to know what a dictionary is, how to traverse a dictionary of dictionaries and then a list of dictionaries, and how to turn a list of dictionaries into a dataframe. The public API is also a bit finicky, and often fails. We had to run our GET request multiple times to get it to run successfully all the way through. But the end product was a graph of the number of articles that mention a particular keyword per year, and that was a satisfying outcome.
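
For anyone who wants to see the traversal described above in miniature, here is a rough sketch. The JSON is a hand-written sample, and its nesting and field names (`response`, `docs`, `headline`, `main`, `pub_date`) are assumptions for illustration rather than a guaranteed description of what the API returns.

```python
import json
from collections import Counter

# Hand-written sample; the structure and field names are assumptions.
sample = """
{
  "response": {
    "docs": [
      {"headline": {"main": "Example headline one"}, "pub_date": "2010-01-29T00:00:00Z"},
      {"headline": {"main": "Example headline two"}, "pub_date": "2010-06-29T00:00:00Z"},
      {"headline": {"main": "Example headline three"}, "pub_date": "2012-10-01T00:00:00Z"}
    ]
  }
}
"""

payload = json.loads(sample)        # the outer dictionary
docs = payload["response"]["docs"]  # dictionary of dictionaries, then a list of dictionaries

# Flatten each nested article into one flat record (one row per article).
records = [
    {
        "headline": doc.get("headline", {}).get("main"),
        "year": int(doc["pub_date"][:4]),
    }
    for doc in docs
]

# Number of articles per year, the quantity behind the graph mentioned above.
articles_per_year = Counter(rec["year"] for rec in records)
print(sorted(articles_per_year.items()))  # [(2010, 2), (2012, 1)]
```

A list of flat dictionaries like `records` is the shape that `pandas.DataFrame(records)` accepts, which covers the last step of turning a list of dictionaries into a dataframe.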

tchalian commented 6 years ago

Thanks, Laura. Very helpful. Yes, our next meeting is next Wed., 11-12noon EST.

For a short workshop, I wonder if we're not better off making the front end (software install and API access) less of a hassle, since our focus is on data curation / cleaning.

Anaconda is pretty easy to download. But pandas won't be intuitive for everyone. So, I'd vote for a pre-loaded dataset (perhaps NYTimes, like we had said) and Jupyter or Binder. But I'm also ok with Anaconda, if we decide to go that way.

jtkiley commented 6 years ago

Hi all. I added a new workshop philosophy/implications document to the repo. Check it out and let the feedback flow.

lknelson commented 6 years ago

I can do some NYT API work if someone (Hovig or Tim?) points me to a few firms that make sense to compare.

tchalian commented 6 years ago

Thanks, Laura.

If we're looking for firms in general, might look at coverage of GM and Tesla, say in 2010 (Tesla IPO in mid-year). I mention GM, because I have a few projects around EVs that use the two firms as a basis of comparison. So I could provide contextual info.

If we're looking at female CEOs, like we were discussing, it might be interesting to pick two of the three companies below, who have (or had) female CEOs:

Yahoo (Marissa Mayer)
GM (Mary Barra)
IBM (Ginni Rometty)

All are well-known, and it might be interesting to compare Mayer with Rometty (same industry, tech) or Mayer with Barra (across industries).

I had said I would look up the Northwestern paper - abridged Proceedings version attached. Requested a copy of the full paper, which I'll send along. I'm happy to come back with analyses we might run, based on the full version.

tchalian commented 6 years ago

File attachment, for above... AOM Best Paper Submission # 13997.pdf

lknelson commented 6 years ago

Awesome. I'll work up a GM/Tesla comparison. Amount of coverage and simple sentiment analysis? We'll have to hand-wave a bit over JSON as a data structure, but we can point them to tutorials for that.

@jtkiley I assume we could merge that with your data (sorry if I sound naive with this question, this is quite outside my area of expertise :( )

If it seems like too much for the tutorial we don't have to include it.

tchalian commented 6 years ago

Amount of coverage and simple sentiment analysis sounds great.

Our dataset is pretty large - 80k+ articles (1985-2014), across several types (PR, reviews, and newspaper articles, including NYT), from Factiva. May be too much to incorporate and analyze. But happy to share, if we need more volume.

If we can use an API pull from 2010 to 2014, that might work as well. Those last five years include the Tesla IPO and also coincide with the period of institutional 'lift' (increasing volume of discussion + 25-fold increase in the number of U.S. charging stations).

lknelson commented 6 years ago

Added an example comparing Tesla and GM using the NYT API. Looks like polarity and subjectivity went up in 2012 for both. Does that make sense to you, @tchalian ? Also subjectivity is perfectly correlated with the number of articles published for GM. That's likely a fluke, but maybe not?

The notebook is in the scripts folder. If you want to reproduce it I'd appreciate it if you change the API key to your own. Don't want to tax my key.

My general feeling is that this is way too complicated for the workshop, but the example is there. It's easy to change to different keywords if anyone wants to play around with it. Note that the function 'grab_data' will save a copy of the data on your hard drive.

Also note that this is just a draft. If we wanted to do this in the workshop I would clean it up. In particular, I would separate the API function into its own separate script/notebook. Then I would create a second script/notebook for cleaning and graphing. But I don't want to spend more time if this isn't the direction we want to go.

timhannigan commented 6 years ago

I like the direction we're heading in here with Tesla and GM as comparators. I wonder if the query could be slightly more focused with "GM" & "electric"? Also, I'm wondering if we want to include subjectivity as a subset of sentiment? Although TextBlob includes it, is this one extra thing we need to explain?

timhannigan commented 6 years ago

Further, @lknelson do we want to say anything about the broader practice of obtaining an API key?

lknelson commented 6 years ago

There are many potential problems with this example.

- TextBlob is not included in the base Anaconda distribution, so we would have to coach people through conda install (bless our hearts this is tricky in a short workshop!). We could instead use NLTK to do sentiment? That's not too tricky to change.
- The public NYT API probably can't handle all the requests we'll send it. But it usually works eventually if people keep trying.
- People would need to get an API key. I don't want to publicly release mine, of course. One option I've seen is to provide a set of API keys that we produce ahead of time (maybe 5 or so different keys), and then let the participants choose one. This has been successful in the past.
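
On the "keep trying" point, a retry wrapper along these lines might take some of the pain out of a flaky API. This is only a sketch: `fetch_with_retry` and `flaky_fetch` are hypothetical names, the fake request stands in for a real GET, and catching `ConnectionError` is an assumption (a real HTTP client raises its own exception types).

```python
import random
import time


def fetch_with_retry(fetch, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out, waiting longer each time."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts:
                raise
            # Exponential backoff with a little jitter, so a room full of
            # participants does not retry in lockstep.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))


# Hypothetical stand-in for a flaky request: fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated API failure")
    return {"status": "OK"}


waits = []
# Record the waits instead of actually sleeping, so the demo runs instantly.
result = fetch_with_retry(flaky_fetch, sleep=waits.append)
print(result, calls["n"], len(waits))
```

Spacing the retries out also matters if many participants share a handful of keys, since hammering the API in a tight loop is likely to make the failures worse.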

tchalian commented 6 years ago

I'd vote NLTK. A few catch-up points, on my end:

- Yes, NLTK may be the simplest way to stay in the Python / Jupyter universe. Like it.
- Maybe we mention NYT APIs and point folks to database downloads (e.g., Lexis-Nexis, ProQuest, Factiva), i.e., avoid the complexity of API keys and say that what we did with an API, you can do with your library's database pulls. If you'd like more info about APIs, we'll be around after the session.
- Agree with Tim's point re Tesla and GM pulls including another term (we used "electric vehicle()" and "EV").

Laura, haven't forgotten about your request to take a look at the data you pulled. Will get to it in a bit.

timhannigan commented 6 years ago

Ok, NLTK sounds great! We need to let the organizers know by tomorrow what we want them to send out for "a software download for your ALW session". I think it's clear that we'll be asking participants to just download Anaconda (which includes NLTK).

I like the idea of a set of shared public API keys to use for the session.
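
One way the shared-key idea could work in a notebook, sketched under assumptions: the key strings are placeholders, and the `NYT_API_KEY` environment variable and the `pick_api_key` helper are hypothetical names. Participants who have their own key use it; everyone else is spread across the shared pool by seat number.

```python
import os

# Placeholder strings, not real keys; the actual pool would be created ahead of time.
SHARED_KEYS = ["WORKSHOP-KEY-1", "WORKSHOP-KEY-2", "WORKSHOP-KEY-3"]


def pick_api_key(seat_number, env_var="NYT_API_KEY"):
    """Prefer the participant's own key from the environment, else a shared key by seat."""
    own_key = os.environ.get(env_var)
    if own_key:
        return own_key
    # Spread participants evenly across the pool so no single key is overtaxed.
    return SHARED_KEYS[seat_number % len(SHARED_KEYS)]


print(pick_api_key(4))
```

Reading the key from the environment also keeps personal keys out of the notebook itself, so nobody commits one to the repo by accident.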

jtkiley commented 6 years ago

Closing. We obviously iterated on it and figured it out.

jtkiley / curation_workshop

Workshop outline #1
