Project Description - Group 26

title: 'Project Description: Social Networking Phenomena on US Presidential Candidates, Based on Twitter Data'
author: "by Group 26"
date: "Nov 23 2015"
output: html_document

Summary

By exploiting both tweet- and user-specific data, we want to explore whether there are geographical or temporal trends among, or clear distinctions between, the likely supporters of the two most prominent US presidential candidates.

Motivation

At a time when people with internet-connected devices are overwhelmed by a constant flood of information and noise on social media, and by the complexity of the issues that arise from it, many prefer to look for easy answers and form polarized prejudices. Yet this does not make us smarter, either individually or as a society, which is why we want to take advantage of the data available to us. With it, we want to provide an objective overview of the online conversation about the US presidential candidacy. After all, as many examples show, conversations held online and the sentiments expressed in them often differ substantially from what traditional media suggest. For example, even though Donald Trump is ridiculed by newspapers, he can still count on the support of ordinary people, or so the thinking goes.

Data Description

The twitteR package allows us to retrieve data on actual tweets based on specific dates, locations, and, most importantly, topics as expressed through hashtags. The data contain 16 variables, among them the tweet text, its location (if not protected), its retweet count, and the respective user. From there, it is also easy to get user-specific data such as follower count, total number of tweets, the date the profile was created, the self-reported location, and so on. It is also possible to retrieve individual user timelines. Whether a tweet contains a link can also be observed and used. Finally, where links lead to online (newspaper) articles, the text of those articles can be used to assess their readability.
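
The description does not say which readability measure would be used. As one possible illustration, the sketch below computes the Flesch Reading Ease score in Python using only the standard library. The choice of this particular measure, the function names, and the crude vowel-group syllable counter are all assumptions made for the example, not part of the project plan.

```python
import re


def count_syllables(word):
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores indicate text that is easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical example texts, not taken from any real article.
simple_text = "The cat sat. The dog ran."
complex_text = (
    "Presidential candidates articulate complicated macroeconomic considerations."
)

print(flesch_reading_ease(simple_text) > flesch_reading_ease(complex_text))
```

Because the syllable counter is only a heuristic, scores from this sketch are best used to compare texts with each other rather than read as exact values.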

Data Analysis and Desired Outcomes

The analysis will focus on finding patterns in the gathered data that illustrate general characteristics of the Twitter users who use hashtags supporting either Donald Trump (#DT) or Hillary Clinton (#HC) as the next president of the United States.

We plan to carry out the steps listed below; however, these may change as work on the project progresses.

We will make a graphical analysis in which we look for geographical patterns by plotting the location data (longitude and latitude) on a map of the US to see whether there are locations where many users use #DT or #HC.
- This will (if possible) be combined with data from prior elections to see whether the state a tweet was sent from is a typically Republican or Democratic state.
We will make a semantic analysis in which we search for the most commonly used words in tweets containing either #DT or #HC. This will allow us to describe which political issues and topics the users of #DT or #HC are most concerned about.
- Using the individual Twitter users' timelines, we can extract the hashtags these users use most often, which will also give us some general characteristics of users in favor of DT or HC.
- We also plan to use a function that evaluates the sentiment of tweets, which, like the other text analysis tools, will add further information on the characteristics of typical users in favor of either DT or HC.
- We also want to investigate whether there is a difference in the number of influential users who use #DT or #HC. We consider a Twitter user influential if they have, among other characteristics, a high number of followers and a high total number of tweets.
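
The text-analysis steps above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions: the sample tweets, the stop-word list, and the tiny sentiment lexicon are all made up for the example, and the real analysis would work on tweets retrieved from Twitter with a proper lexicon or sentiment function.

```python
import re
from collections import Counter

# Hypothetical sample tweets; real data would come from the Twitter API.
TWEETS = [
    {"hashtag": "#DT", "text": "Great rally today, the economy needs jobs #DT"},
    {"hashtag": "#DT", "text": "Jobs and the border are what matter #DT"},
    {"hashtag": "#HC", "text": "Healthcare is a right, great debate tonight #HC"},
    {"hashtag": "#HC", "text": "Terrible attack ads, but healthcare wins #HC"},
]

# Tiny illustrative stop-word list and sentiment lexicon (assumptions).
STOP_WORDS = {"a", "the", "and", "are", "is", "but", "what", "today", "tonight"}
POSITIVE = {"great", "wins"}
NEGATIVE = {"terrible", "attack"}


def tokenize(text):
    """Lower-case the text, drop hashtags, and return the non-stop words."""
    text = re.sub(r"#\w+", " ", text.lower())
    return [w for w in re.findall(r"[a-z']+", text) if w not in STOP_WORDS]


def top_words(tweets, hashtag, n=3):
    """Return the n most common words among tweets tagged with `hashtag`."""
    counts = Counter()
    for tweet in tweets:
        if tweet["hashtag"] == hashtag:
            counts.update(tokenize(tweet["text"]))
    return counts.most_common(n)


def sentiment_score(text):
    """Count positive words minus negative words (a very crude lexicon score)."""
    words = tokenize(text)
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


print(top_words(TWEETS, "#DT", n=1))
print(top_words(TWEETS, "#HC", n=1))
print([sentiment_score(t["text"]) for t in TWEETS])
```

The same counting approach applies to the hashtags in user timelines: replacing the tokenizer with one that keeps only `#`-prefixed tokens would yield the most common hashtags per group.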

We also want to address the potential problem of trolling in tweets that use #DT or #HC. One way to mitigate this problem could be to look for users whose characteristics make them outliers. We will also investigate other ways to make the problem of trolling less severe.
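
The outlier idea could, for instance, be sketched as below. The description does not commit to a specific method, so the modified z-score based on the median absolute deviation (MAD), the 3.5 cutoff, the feature (tweets per day), and the user names are all assumptions chosen for illustration.

```python
from statistics import median

# Hypothetical per-user feature: average number of tweets per day.
TWEETS_PER_DAY = {
    "alice": 4.0,
    "bob": 6.0,
    "carol": 5.0,
    "dave": 7.0,
    "bot123": 400.0,
}


def mad_outliers(values, threshold=3.5):
    """Return the keys whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation from the median.
    """
    data = list(values.values())
    med = median(data)
    mad = median([abs(v - med) for v in data])
    if mad == 0:
        # All deviations are (mostly) zero; the score is undefined, flag nothing.
        return []
    return [
        user
        for user, v in values.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


print(mad_outliers(TWEETS_PER_DAY))
```

A median-based score is used here rather than a mean-based one because a single extreme account would otherwise inflate the mean and standard deviation and partly hide itself. A flagged account is only a candidate for closer inspection, not proof of trolling.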

Lastly, on the basis of the gathered characteristics, we will build a predictive model that predicts whether a user is likely to be a DT or HC follower based on their geographical location, the words they most commonly use in tweets, their timeline of prior hashtags, and further features (sentiments in tweets, number of followers, etc.). We have not yet decided on an actual model framework.
