sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Project Description - Group 26 #66

Closed kasperwerlauff closed 8 years ago

kasperwerlauff commented 8 years ago

title: 'Project Description: Social Networking Phenomena on US Presidential Candidates, Based on Twitter Data'
author: "by Group 26"
date: "Nov 23 2015"

output: html_document

Summary

By exploiting both tweet- and user-specific data, we want to explore whether there are geographical or temporal trends among, or clear distinctions between, likely supporters of the two most prominent US presidential candidates.

Motivation

At a time when people with internet-connected devices are overwhelmed by a constant flow of information and noise on social media, and by the complexity of the issues that result, it is tempting to look for easy answers and form polarized prejudices. Yet this does not make us smarter, either individually or as a society, which is why we try to take advantage of the data that is available to us. With it, we want to provide an objective overview of the online conversation about the US presidential candidacy. After all, as many examples show, conversations held online and the sentiments expressed in them often differ substantially from what traditional media suggest. For example, even though Donald Trump is ridiculed by newspapers, he can still count on the support of ordinary people, or so the thinking goes.

Data Description

The twitteR package allows us to retrieve data on actual tweets based on specific dates, locations, and, most importantly, topics as expressed by hashtags. The data contains 16 variables, among them the tweet text, its location (if not protected), its retweet count, and the respective user. From there, it is also easy to get user-specific data, such as their follower count, their total number of tweets, the date their profile was created, their self-reported location, and so on. It is also possible to retrieve individual user timelines. Whether a tweet contains a link can also be observed and used as a variable. Finally, where a link leads to an online (newspaper) article, the text of that article can be used to assess its readability.
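The description does not name a readability measure. As one possible choice, the sketch below computes the Flesch Reading Ease score for a piece of article text. It is written in plain Python for illustration only and does not use the twitteR package; the function names and the vowel-group syllable heuristic are assumptions made for this example, and a real analysis would likely want a more careful syllable counter.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )

# Made-up example sentences, not taken from any collected article.
simple_text = "The cat sat. The dog ran."
complex_text = (
    "Presidential candidates communicate increasingly "
    "sophisticated political narratives."
)
print(flesch_reading_ease(simple_text) > flesch_reading_ease(complex_text))
```

Short sentences built from short words score high, while a single long sentence of polysyllabic words scores low, so the two example texts end up far apart on the scale.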

Data Analysis and Desired Outcomes

The analysis will focus on finding patterns in the gathered data that illustrate general characteristics of the Twitter users who use hashtags supporting either Donald Trump (#DT) or Hillary Clinton (#HC) as the next president of the United States.
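A first step toward such patterns is grouping users by which candidate's hashtags they use. The following is a minimal sketch in plain Python, assuming tweets have already been collected into records with a `user` and a `text` field. The record layout, the function name `label_users`, and the use of the literal tags `dt` and `hc` (taken from the #DT and #HC shorthand above) are all assumptions for illustration. Hashtag use alone does not establish support, so the labels are only a starting point.

```python
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#(\w+)")

def label_users(tweets, dt_tags=frozenset({"dt"}), hc_tags=frozenset({"hc"})):
    """Label each user by the candidate whose hashtags they use more often."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        # Compare hashtags case-insensitively.
        for tag in HASHTAG.findall(tweet["text"].lower()):
            if tag in dt_tags:
                counts[tweet["user"]]["DT"] += 1
            elif tag in hc_tags:
                counts[tweet["user"]]["HC"] += 1
    labels = {}
    for user, c in counts.items():
        if c["DT"] > c["HC"]:
            labels[user] = "DT"
        elif c["HC"] > c["DT"]:
            labels[user] = "HC"
        else:
            labels[user] = "mixed"  # equal use of both candidates' tags
    return labels

# Made-up records for illustration only.
toy_tweets = [
    {"user": "a", "text": "Rally tonight #DT"},
    {"user": "a", "text": "Again #dt #HC"},
    {"user": "a", "text": "One more #DT"},
    {"user": "b", "text": "Debate night #HC"},
    {"user": "c", "text": "Watching both #DT #HC"},
    {"user": "d", "text": "No candidate tags here #news"},
]
print(label_users(toy_tweets))
```

Users who never use either tag are left out of the result, and users with equal counts are marked `mixed` rather than forced into one group.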

We plan to do the things listed below; however, the plan may change as the project progresses.

We also want to address the potential problem of trolling, where users post with #DT or #HC without actually supporting the candidate. One way to mitigate this problem could be to look for users whose characteristics make them outliers. We will also investigate other solutions that could make the problem of trolling less severe.
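As one possible way to look for outlier users, the sketch below flags unusually large or small values of a single user characteristic using the modified z-score, which is based on the median absolute deviation (MAD) and is less affected by extreme values than a mean-based z-score. It is plain Python for illustration; the choice of method, the conventional 3.5 cutoff, and the `tweets_per_day` feature are assumptions, not something the project has settled on.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return one flag per value: True if its modified z-score exceeds threshold."""
    values = list(values)
    med = median(values)
    # Median absolute deviation: the typical distance from the median.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; the score is undefined here.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

# Made-up tweets-per-day figures for seven hypothetical users.
tweets_per_day = [3, 5, 4, 6, 5, 4, 400]
print(mad_outliers(tweets_per_day))
```

A flagged user is not necessarily a troll. The flag only marks accounts that deserve a closer look, and in practice several characteristics would need to be considered together.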

Lastly, based on the gathered characteristics, we will build a predictive model that predicts whether users are likely to be DT or HC followers from their geographical location, the words they use most often in tweets, and their timeline of prior hashtags (sentiments in tweets, number of followers, etc.). We have not yet decided on an actual model framework.
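Since no model framework has been chosen, the following is only an illustration of what a simple baseline on the "most common words" feature could look like: a multinomial naive Bayes classifier with add-one smoothing, written in plain Python. The class name, the training data, and the decision to use naive Bayes are all assumptions made for this sketch; location, hashtag history, and the other characteristics would need to be added as further features.

```python
import math
from collections import Counter

class NaiveBayesWords:
    """Multinomial naive Bayes over word lists, with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        # Log prior: share of training users in each class.
        self.log_prior = {
            c: math.log(labels.count(c) / len(labels)) for c in self.classes
        }
        self.word_counts = {c: Counter() for c in self.classes}
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict(self, words):
        def score(c):
            denom = self.totals[c] + len(self.vocab)
            # Words never seen in training are ignored.
            return self.log_prior[c] + sum(
                math.log((self.word_counts[c][w] + 1) / denom)
                for w in words
                if w in self.vocab
            )
        return max(self.classes, key=score)

# Made-up word lists per user, for illustration only.
toy_docs = [
    ["wall", "great", "america"],
    ["great", "again"],
    ["women", "rights", "health"],
    ["health", "care", "together"],
]
toy_labels = ["DT", "DT", "HC", "HC"]
model = NaiveBayesWords().fit(toy_docs, toy_labels)
print(model.predict(["great", "wall"]), model.predict(["health", "care"]))
```

A baseline like this is mainly useful as a reference point: whatever model is eventually chosen should at least beat it on held-out users.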