sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 15: Assignment 3 #67

Closed adamingwersen closed 8 years ago

adamingwersen commented 8 years ago

title: 'Decay rate of hashtag usage on Twitter: Terrorist Attacks, Natural Disasters and Celebrity Scandals'
subtitle: 'Assignment 3 - Project Description'
author: "Group 15"
date: "23. nov. 2015"
output: html_document

Please use link for nice layout & images

Idea & Motivation

In response to the terrorist attacks in Paris on 13 November 2015, a surge of social media activity was directed towards expressing compassion for the victims. This took the form of, for example, tweeting with the hashtags #PrayForParis and #parisattacks and overlaying the French national colours on posted pictures. However, once the message has resonated throughout the social media community, interest wears off and attention turns towards politicians and decision makers, with calls to punish and prevent. Meanwhile, another attack, carried out by Boko Haram in Nigeria, claimed the lives of 32 people; this incident did not see even a fraction of the media and social media coverage of the Paris attacks. Interestingly, both episodes rapidly lost the interest of the internet community as a whole. At the other end of the spectrum, Charlie Sheen disclosed that he is HIV-positive, a topic that also received a lot of attention. This event, however, seems to show a higher degree of persistence in the attention given by the online community.


This raises the question: which factors determine the rate of decay of interest on social media, in particular Twitter, triggered by a real-life event? A similar idea was discussed in The Economist[^1], which assesses differences in perceived interest. The article concludes that cultural proximity is a driver of foreign attention towards a terrorist attack. It also concludes that, even when controlling for factors such as geography and cultural idiosyncrasies, there still seems to be a gap in “empathy” between internet users. In investigating the social media response to the Woolwich attack, Burnap, Williams et al. conclude that tweets with positive sentiment have a higher probability of being retweeted than negative ones. Researchers from MIT, Penn and the University of Washington have developed a mathematical model for predicting the rate of retweets. This model (Twouija) bears on the question raised above, since retweets are by definition captured in the overall attention metric: the usage frequency of a particular hashtag.

Data

Ultimately we are interested in obtaining a cross-sectional data frame. It should contain information on each event/hashtag: date, geography of the event, type of event, number of tweets/retweets, number of casualties and (linear) rate of decay, amongst other variables. To build it, we need to investigate each event as a time series and derive constant metrics from the data to plug into the cross-section. The first step is to collect relevant hashtags for events of interest throughout 2015. We will gather data from Twitter's API via the R package twitteR, in particular the function searchTwitter(). We will pair the data obtained from Twitter with categorical data for “type of event”, with e.g. the three options listed in the title; the way to go about this is to construct two dummy variables. One serious limitation of this approach is that we are cherry-picking observations/events based on our own perception of “important” events. This perception will be inherently biased by local media coverage of global events and perhaps by the geographic reach of tweets. We will attempt to combat these issues by picking out events based on the most objective metric we can identify.
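The step from a per-event time series to one cross-sectional row can be sketched as follows. This is a minimal illustration in Python, not the group's R code: the function names, the choice to fit the linear decay from the peak day onward, the baseline category and the daily counts are all assumptions made for the example, and the counts are made up rather than collected.

```python
from statistics import mean

EVENT_TYPES = ("terrorist_attack", "natural_disaster", "celebrity_scandal")


def linear_decay_rate(daily_counts):
    """Least-squares slope of tweets per day, fitted from the peak day onward.

    Fitting from the peak is an assumption of this sketch.
    """
    peak = daily_counts.index(max(daily_counts))
    tail = daily_counts[peak:]
    if len(tail) < 2:
        raise ValueError("need at least two observations from the peak onward")
    days = list(range(len(tail)))
    day_mean, count_mean = mean(days), mean(tail)
    sxy = sum((d - day_mean) * (c - count_mean) for d, c in zip(days, tail))
    sxx = sum((d - day_mean) ** 2 for d in days)
    return sxy / sxx


def event_type_dummies(event_type):
    """Encode three event types as two dummies; terrorist_attack is the baseline."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    return {
        "is_natural_disaster": int(event_type == "natural_disaster"),
        "is_celebrity_scandal": int(event_type == "celebrity_scandal"),
    }


def cross_section_row(hashtag, event_type, daily_counts):
    """Collapse one hashtag's time series into a single cross-sectional row."""
    return {
        "hashtag": hashtag,
        "total_tweets": sum(daily_counts),
        "decay_rate": linear_decay_rate(daily_counts),
        **event_type_dummies(event_type),
    }


# Made-up daily counts, for illustration only.
row = cross_section_row("#example", "natural_disaster", [10, 100, 80, 60, 40])
```

A negative `decay_rate` means the hashtag loses that many tweets per day on average after its peak. Two dummies suffice for three event types because the baseline type is identified by both being zero.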

Methodology

Once the data is obtained, we are interested in prediction. Precise inference will not be the focus, as we will not acquire sufficient observations; however, some postulates without external validity may be presented in conjunction with the predictive analysis. In constructing a prediction model for the rate at which attention to Twitter hashtags depreciates, we will rely on statistical learning techniques. We will try the linear regression and classification methods developed in chapters 3 and 4[^2].
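With only a handful of events, predictive performance is more informative when measured out of sample. The sketch below is in Python rather than the group's R, and it assumes a single predictor and leave-one-out cross-validation, a choice made here for illustration and not stated in the project description. The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variation")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def leave_one_out_mse(xs, ys):
    """Mean squared prediction error, holding out one event at a time."""
    squared_errors = []
    for i in range(len(xs)):
        # Refit on every event except the i-th, then predict the i-th.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        prediction = intercept + slope * xs[i]
        squared_errors.append((ys[i] - prediction) ** 2)
    return sum(squared_errors) / len(squared_errors)
```

Here `xs` could hold one event-level variable (for example number of casualties) and `ys` the decay rates. The same hold-one-out loop would apply to a regression with several predictors.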

[^1]: The Global Empathy Gap Between Paris and Beirut, The Economist, November 19th 2015.
[^2]: Hastie, Tibshirani, Friedman: The Elements of Statistical Learning.