sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 15: Assignment 2 #49

Closed adamingwersen closed 8 years ago

adamingwersen commented 8 years ago

title: "Assignment 2 - Social Data Science" author: "Group 15" date: "November 9, 2015"

output: html_document

Scrape & Analysis of www.ipaidabribe.com

We scrape the website and clean the gathered data, primarily using the R packages stringr, plyr, dplyr and rvest. We then turn to a light econometric analysis paired with geo-spatial data analysis using ggplot2, countrycode, etc. This assignment can be viewed in html here

Scraping the website: Steps 1-2

The website is constructed such that each page of recorded bribes contains 10 posts. Selector-gadget did not prove useful for identifying an underlying html-table. However, each page of 10 posts has a unique URL element: http://www.ipaidabribe.com/reports/paid?page={**any integer**}. We therefore took the following approach: create a vector of integers from 0 to 1000 in steps of 10, and insert 0, 10, 20, etc. into the URL element that determines the page number via a loop. Looping with plyr was deemed useful here, as the result is simple to coerce into a dataframe and the loop is rather fast.

library("knitr")
knitr::opts_chunk$set(cache=TRUE)

library("plyr")
library("rvest")
library("stringr") 

#1.1) 
#Need to create values of 0:1000 by 10 in order construct list of viable URLS to scrape for the appropriate css.selector for each post.
x1000 <- c(0:1000)
n1000 <- length(x1000)
var1000 = x1000[seq(1, n1000, 10)]

#Looking at the structure of the URL's we see that /reports/paid?page= defines the span for subsets of 10 posts
# Thus, if we can insert 0:1000 by intervals of 10 into "LANK", we actually have all 100 pages, each consisting of 10 posts 
linksub.li = "http://www.ipaidabribe.com/reports/paid?page=LANK"

#1.2) FUNCTION : 
#Create function that runs through all numbers in var1000 and replaces LANK with the 0:1000 by 10s

link_str_replace = function(var1000){
  link.sub = gsub("\\LANK", var1000, linksub.li)
}

#1.3) LOOP : 
# Using plyr (ld) for simple, efficient looping - list-to-list, llply works, but not optimally as transformations has to be made afterwards
num.link.li = ldply(var1000, link_str_replace)
num.link.li2 = num.link.li$V1
head(num.link.li2, 5)

We have now obtained 100 pages, each containing 10 posts. Directly from these pages, the post information we are interested in is accessible given the appropriate CSS selectors:

## Let's identify the relevant css-selectors

as2.css.selector_1 = ".heading-3 a"       #URL and/or TITLE depends on 'html_attr(name = href/title)'
as2.css.selector_2 = "span.date"          #date
as2.css.selector_3 = "li.paid-amount"     #amount paid
as2.css.selector_4 = "div.key > a"        #location
as2.css.selector_5 = "li.name > a"        #department
as2.css.selector_6 = "li.transaction > a" #transaction details
as2.css.selector_7 = "li.views"           #number of views
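
Before looping over all pages, the selectors can be sanity-checked on a single page. A minimal sketch, using the first generated URL (the object test.page exists only for this check):

# Sketch: test one selector on the first page before running the full loop
test.page = read_html(num.link.li2[1], encoding = "UTF-8")
test.page %>%
  html_nodes(css = as2.css.selector_2) %>%
  html_text() %>%
  head(3)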

We define a function and create a loop iterating through all elements in num.link.li2, making use of rvest to fetch the data.

# Function requesting the CSS-selectors for each webpage

scrape_post_bribe = function(num.link.li2){
  post.url = read_html(num.link.li2, encoding = "UTF-8")
  post.title = post.url %>%
    html_nodes(css = as2.css.selector_1) %>%
    html_attr(name = 'title')
  post.date = post.url %>%
    html_nodes(css = as2.css.selector_2) %>%
    html_text()
  post.paid = post.url %>%                     # amount paid (extracted here, but the bribe amount is later parsed from the title)
    html_nodes(css = as2.css.selector_3) %>%
    html_text()
  post.location = post.url %>%
    html_nodes(css = as2.css.selector_4) %>%
    html_attr(name = 'title')
  post.dept = post.url %>%
    html_nodes(css = as2.css.selector_5) %>%
    html_attr(name = 'title')
  post.trans = post.url %>%
    html_nodes(css = as2.css.selector_6) %>%
    html_attr(name = 'title')
  post.views = post.url %>%
    html_nodes(css = as2.css.selector_7) %>%
    html_text()
  # Bind the character vectors together; the page URL string (not the parsed document) is kept for reference
  return(cbind(post.title, post.date, post.location, post.dept, post.trans, post.views, post.url = num.link.li2))
}

# Loop - sleep-timer set to 0.01 

post.bribe.df = list()
for(i in num.link.li2){
  print(paste("processing", i, sep = " "))
  post.bribe.df[[i]] = scrape_post_bribe(i)
  Sys.sleep(0.01)
  cat("done!\n")
}
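
Since the loop hits a live website, individual requests can occasionally fail. A hedged variant of the same loop that simply skips failed pages might look like this (a sketch, not what was actually run):

# Sketch: same loop, but a failed request is skipped instead of stopping the run
post.bribe.df = list()
for(i in num.link.li2){
  post.bribe.df[[i]] = tryCatch(scrape_post_bribe(i), error = function(e) NULL)  # NULL drops the failed page
  Sys.sleep(0.01)
}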

Cleaning gathered data, preparation & external data: Steps 3-4

Once the 1,000 posts have been obtained, we coerce the list into a dataframe using ldply([data], data.frame). From here we do basic data cleaning, primarily using functions from stringr, in order to obtain a tidy, interpretable dataframe.

#DATA :
# Now that the data has been gathered, we need to do a little cleaning - first step is to collapse the
# list of per-page results into a single dataframe; ldply(data, data.frame) does this for us
IN.Bribe.df = ldply(post.bribe.df, data.frame)

It is, however, obvious that this dataframe needs further manipulation/cleaning:

#3.1) Manipulation of IN.Bribe.df - preparation for analysis
IN.Bribe.df$post.views = gsub("\\views.*$", "", IN.Bribe.df$post.views)                           # Separating the numeric part from "views"
IN.Bribe.df$region = gsub(".*,", "", IN.Bribe.df$post.location)                                   # Using regex with gsub to keep the part after the comma (region)
IN.Bribe.df$post.city = gsub("\\,.*", "", IN.Bribe.df$post.location)                              # ... and the part before the comma (city)
IN.Bribe.df$bribe.paid.INR = as.numeric(gsub("[^\\d]+", "", IN.Bribe.df$post.title, perl=TRUE))   # Extracting the numeric value of the bribe from the title using a PERL-type regular expression

# Dates: replace month names with numbers and parse with strptime

IN.Bribe.df$post.date = gsub("\\,", "", IN.Bribe.df$post.date)
IN.Bribe.df$post.date = gsub("November", "11", IN.Bribe.df$post.date)
IN.Bribe.df$post.date = gsub("October", "10", IN.Bribe.df$post.date)
IN.Bribe.df$num.date = strptime(IN.Bribe.df$post.date, "%m %d %Y")
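
To illustrate what the string operations above do, here is a minimal sketch on made-up strings (example.title and example.date are assumptions about the post format, not actual scraped values):

# Illustration on made-up strings (assumed post format, not actual scraped data)
example.title = "Paid a bribe of INR 500 for getting a passport verified"
as.numeric(gsub("[^\\d]+", "", example.title, perl = TRUE))   # 500
example.date = "November 9, 2015"
example.date = gsub(",", "", example.date)
example.date = gsub("November", "11", example.date)
strptime(example.date, "%m %d %Y")                            # "2015-11-09"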

#3.2) Deleting obsolete variables/columns
IN.Bribe3.df = data.frame(lapply(IN.Bribe.df, as.character), stringsAsFactors=FALSE)
IN.Bribe3.df$post.url = NULL
IN.Bribe3.df$post.location = NULL
IN.Bribe3.df$.id = NULL
IN.Bribe3.df$post.region = NULL
IN.Bribe3.df$town = IN.Bribe3.df$post.city

It may prove insightful to add additional data to the existing dataframe, such as literacy. Literacy may affect the number of bribes in a given city, and it will certainly affect the number of bribes reported on ipaidabribe, since literacy is a prerequisite for actually adding a post to the page, especially as the webpage is in English. One could therefore imagine that cities with a high literacy rate report relatively few bribes; on the other hand, one could argue that highly literate cities might at the same time see many bribes, since people there may not question the bribes as much. Here, data from Wikipedia on literacy in India is used, imported as .csv:

#4)
Ext1.list = read.csv("https://raw.githubusercontent.com/adamingwersen/Data.for.ass2_SDS/master/India.Region.Literacy.csv.csv", sep = ";")
# Remove trailing/leading whitespace in the existing dataframe for the merge/join:
IN.Bribe3.df$region = gsub("^\\s+|\\s+$", "", IN.Bribe3.df$region)  # Trim trailing/leading whitespace
Ext1.list$region = as.character(Ext1.list$region)

# Join dataframes by region (right join keeps all scraped posts)
library("dplyr")
combi.df = right_join(Ext1.list, IN.Bribe3.df, by = "region", copy = TRUE)
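
As a quick check on the join (a sketch; the object unmatched exists only for this check), anti_join can report which scraped regions found no match in the literacy table:

# Sketch: posts whose region did not match any row in the literacy table
unmatched = anti_join(IN.Bribe3.df, Ext1.list, by = "region")
nrow(unmatched)
head(table(unmatched$region))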

Analysis

Exploring geographical and time-dependent relationships within the generated data-frame

Mapping geo-spatial data: Step 5

In order to visualize discrepancies in post views and bribes paid across cities in India, a map seems ideal. The ggmap package allows for visualizations using Google's Maps service, with overlays plotted on top. Using the maps package we fetch the geo-coordinates needed.

#MAP DATA
library("maps")
data(world.cities)

world.cities$name = gsub("\\'", "", world.cities$name)
world.cities$town = world.cities$name
world.cities1 = world.cities[world.cities$country.etc == "India",]

library("dplyr")
india.spatial.df = inner_join(world.cities1, combi.df, by = "town", all.y = TRUE)
india.spatial.df$town <- as.character(india.spatial.df$town)
india.spatial.df$pop = as.numeric(india.spatial.df$pop) 
india.spatial.df$bribe.paid.INR = as.numeric(india.spatial.df$bribe.paid.INR)
india.spatial.df$post.views = as.numeric(india.spatial.df$post.views)

# GGMAP-PLOT 
library("ggmap")
map <- get_map("India", zoom = 5, maptype = "terrain")
p <- ggmap(map)
ggsave(p, file = "map1.png", width = 5, height = 5, type = "cairo-png")

This map illustrates population and views of posts by city:

ind = p + geom_point(aes(x=long, y=lat), data=india.spatial.df, col="orange", alpha=0.4, size = log(india.spatial.df$citiziens)) +  
  geom_point(aes(x = long, y = lat), data = india.spatial.df, col = "blue", alpha = 0.4, size = log(india.spatial.df$post.views)) + scale_size(name = "Population") +
  labs(x = "Longitude", y = "Latitude") + ggtitle("Population and post views in Indian cities")
plot(ind)

From the graph, there seems to be a correlation between large cities (the less transparent orange dots) and the number of post views for those cities (the filled blue dots). This can stem from several different sources: for instance, larger cities have more posts and hence more aggregated post views, or people are more interested in reading about bribes in their own city, which is why larger cities have more post views, or some other explanation. If there were room and time for deeper visualization, these things would definitely be worth looking into, e.g. plotting the average number of views in a city against its population, or against the literacy rate we have gathered for our dataframe, or other suitable plots.
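
As a sketch of one of the ideas mentioned above (average post views per city against the regional literacy ratio; views.lit is just a name for this sketch, and ratio is converted with as.numeric as in the regressions below):

# Sketch: mean post views per city against the regional literacy ratio
library("dplyr")
library("ggplot2")
views.lit = india.spatial.df %>%
  group_by(town, ratio) %>%
  summarise(mean.views = mean(post.views, na.rm = TRUE))
ggplot(views.lit, aes(x = as.numeric(ratio), y = mean.views)) +
  geom_point(alpha = 0.5) +
  labs(x = "Literacy ratio", y = "Mean post views per city")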

top10=india.spatial.df %>% 
  group_by(town, ratio)  %>% 
  summarise(bribe.paid=n())
top10=arrange(top10, desc(bribe.paid))
head(top10, 10)

One could imagine that more literate cities would experience fewer bribes than less literate ones, but as this table shows, there is no clear relationship between literacy and the number of bribes paid. If we look at the cities with the highest literacy ratio, i.e. the most literate cities, there are only a few bribes in each city. If we instead look at the top 10 cities by number of bribes, the picture is not as clear as the former table suggests: the cities in the bribe top 10 are neither the most nor the least literate. We therefore have no clear evidence from the collected data that there will be more bribes in a less literate city, and since the number of views must be closely related to the degree of literacy, there is no clear connection between views and number of bribes either.
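
The two orderings referred to above can be produced directly from the same summary table (a sketch):

# Sketch: the same summary ordered by literacy ratio rather than by number of bribes
head(arrange(top10, desc(as.numeric(ratio))), 10)   # most literate cities
head(arrange(top10, as.numeric(ratio)), 10)         # least literate cities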

Other data visualizations: Step 6

Another interesting plot is the distribution of posted bribes over weekdays; in other words, we want to see if there is any time-dependency in the usage of the website. We extract the weekdays and set the OS-derived locale to English in order to get English weekday names.

india.spatial.df$POSIXct = as.POSIXct(india.spatial.df$num.date)
Sys.setlocale("LC_TIME", "English")  # Windows locale name; on Unix-alikes e.g. "en_US.UTF-8"
india.spatial.df$wday = weekdays(india.spatial.df$POSIXct)
#6.2) Factoring weekdays for order in plot
india.spatial.df$wday <- factor(india.spatial.df$wday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

#6.3) Bar plot of posts on a given weekday
library("ggplot2")
library("scales")
wp = ggplot(india.spatial.df, aes(x = wday))
wp = wp + geom_bar(aes(group = wday, colour = wday, fill = wday), alpha = 0.4)
wp = wp + labs(x = "Weekday", y = "Posts", title = "Posts on weekdays")
wp = wp + theme_minimal()
plot(wp)

Apparently Mondays are popular. We will investigate what drives this tendency: geography or department.

ep = ggplot(india.spatial.df, aes(x = wday))
ep = ep + geom_bar(aes(group = wday, colour = wday, fill = wday), alpha = 0.4)
ep = ep + labs(x = "Weekday", y = "Posts", title = "Posts on weekdays by region")
ep = ep + theme(panel.grid.major = element_blank(), 
                panel.grid.minor = element_blank(), 
                panel.background = element_blank(), 
                axis.line = element_line(colour = "blue"),
                axis.text.x = element_text(angle = 90))
ep = ep + facet_wrap(~region, scales = "free_y")
plot(ep)

Some regions seem to follow the overall trend, whereas others only post on, e.g., Saturdays. We should perhaps pick out the most populous regions, as the differences in population between regions are rather large; the larger regions appear to be more representative of this particular tendency.

spat2 = subset(india.spatial.df, citiziens > 50000000)  # keep regions with more than 50M inhabitants
pl = ggplot(spat2, aes(x = region, y = citiziens/1000000))
pl = pl + geom_bar(stat="identity")
pl = pl + theme_minimal()
pl = pl + labs(x = "Region", y = "Inhabitants (millions)", title = "Regional population in millions")
pl = pl + theme(panel.grid.major = element_blank(), 
                panel.grid.minor = element_blank(), 
                panel.background = element_blank(), 
                axis.line = element_line(colour = "blue"),
                axis.text.x = element_text(angle = 50))
plot(pl)

We pick out all regions with a population above 50M and run a simple regression of post views on the regional literacy rate:
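
In symbols, with ratio denoting the regional literacy rate, the model is simply

$$\text{post.views}_i = \beta_0 + \beta_1 \cdot \text{ratio}_i + \varepsilon_i$$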

reg3 = lm(as.numeric(spat2$post.views)~as.numeric(spat2$ratio))
plot(as.numeric(spat2$post.views)~as.numeric(spat2$ratio), main = "Post Views ~ Literacy Rate", xlab = "Literacy Rate", ylab = "Post Views")
abline(reg3$coefficients)
res3 = resid(reg3)
hist(res3, freq = FALSE)

The relationship seems dubious, though it is stronger than when the dataframe is not restricted to regions with 50M+ inhabitants.
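
For comparison, the same regression on the full dataframe (a sketch, so the two slopes can be put side by side; reg3.full is just a name for this sketch):

# Sketch: same regression without the 50M+ population restriction, for comparison
reg3.full = lm(as.numeric(india.spatial.df$post.views) ~ as.numeric(india.spatial.df$ratio))
summary(reg3)$coefficients
summary(reg3.full)$coefficients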

combi.df$logbribe = log(as.numeric(combi.df$bribe.paid.INR))
reg4 = lm(as.numeric(combi.df$logbribe)~as.numeric(combi.df$post.views))
plot(as.numeric(combi.df$logbribe)~as.numeric(combi.df$post.views), main = "log(Bribe) ~ Post Views", xlab = "Post Views", ylab = "log(Bribe)")
abline(reg4$coefficients)
res4 = residuals(reg4)
hist(res4, freq = FALSE)
curve(dnorm, add = TRUE)  # standard normal density as a rough reference for the residual distribution
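
For completeness, a sketch of the same final plot done entirely in ggplot2 (geom_point plus geom_smooth with a linear fit):

# Sketch: ggplot2 version of the final regression plot
library("ggplot2")
ggplot(combi.df, aes(x = as.numeric(post.views), y = logbribe)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm") +
  labs(x = "Post Views", y = "log(Bribe)", title = "log(Bribe) ~ Post Views")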
sebastianbarfort commented 8 years ago

Really good assignment!

You cover a lot of ground which is really cool. I like the map and also appreciate that you do the regression type plot in the end, although I would recommend doing everything in ggplot2 (using stat_geom).

Keep up the good work!

APPROVED