sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Assignment 2, group 5 #83

Closed MDjernes closed 8 years ago

MDjernes commented 8 years ago

Hi, this is a re-submission, as agreed in today's lecture, since our first issue did not work for some reason. Best regards


Title: 'Assignment 2: Social Data Science: Web Scraping + Analysis'
Authors: "Group 5: Mathias Dalsten, Anne Dammeyer, Kristina Poulsen and Morten Djernes"
Date: "9 November 2015"

output: html_document

The data needed for assignment 2 is scraped from www.ipaidabribe.com in order to extract the following: 1) title, 2) amount, 3) name of department, 4) transaction detail, 5) number of views, and 6) city. A date variable is added later.

In this paper we analyse data from www.ipaidabribe.com, a crowdsourced attempt to measure corruption in India with emphasis on various kinds of retail corruption. We look at the latest 1000 reports and analyse the bribes by department.

First, the needed packages are loaded:

library(rvest)
library(ggplot2)
library(dplyr)
library(stringr)
library(plyr)      # note: plyr masks several dplyr verbs; it is detached again below
library(lubridate)
library(zoo)
library(maps)
library(knitr)

Web Scraping

Given the structure of this particular webpage, the CSS selector approach is used instead of XPath. A CSS selector covering the relevant fields is therefore built with the 'SelectorGadget' app. In order to collect all the relevant data in a single vector (not only the 10 most recent reports on a given link), the structure of the page URLs is exploited, which combined with a loop yields the needed data. It is worth mentioning that the most recent page has a distinct URL fragment (#) and is thus hard to fit into the loop structure. To bypass this, the initial page is addressed as "0#".

#http://www.ipaidabribe.com/reports/paid?page=#gsc.tab=0
#http://www.ipaidabribe.com/reports/paid?page=10#gsc.tab=0
#http://www.ipaidabribe.com/reports/paid?page=20#gsc.tab=0
#http://www.ipaidabribe.com/reports/paid?page=30#gsc.tab=0
#...
#http://www.ipaidabribe.com/reports/paid?page=[i]0#gsc.tab=0

# Creating "link"-vector to loop over:
link <- rep(NA, 100)

for (i in 1:100){
    link[i] <- paste("http://www.ipaidabribe.com/reports/paid?page=",i-1, "0#gsc.tab=0", sep="")
}
print(link[1:10])
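The same vector can also be built in a single vectorised call; a minimal equivalent of the loop above, producing identical strings:

# vectorised equivalent of the loop above (pages 00, 10, ..., 990)
link <- paste0("http://www.ipaidabribe.com/reports/paid?page=", 0:99, "0#gsc.tab=0")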

The following piece of code loops over the link vector in order to scrape the 10 separate reports appearing on each page.

css.selector <- ".name a , .transaction a , .overview .views , .paid-amount span , .unique-reference , .location , .date , .heading-3 a"
data_empty <- list()

for (i in link[1:100]){
print(paste("Processing", i, sep=" "))
  data_empty[[i]] <- read_html(i) %>%
                   html_nodes(css = css.selector) %>%
                   html_text()
# waiting one secound between hits
cat("Done!\n")
Sys.sleep(1)
}
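If any one of the 100 requests fails, the loop above aborts halfway. A hedged sketch of a more defensive variant, wrapping the request in tryCatch so a failed page is logged and skipped rather than stopping the whole scrape (same selector and politeness pause):

for (i in link[1:100]) {
  data_empty[[i]] <- tryCatch(
    read_html(i) %>%
      html_nodes(css = css.selector) %>%
      html_text(),
    error = function(e) {
      message("Failed on ", i, ": ", conditionMessage(e))
      NULL   # returning NULL means the failed page is simply skipped
    }
  )
  Sys.sleep(1)   # wait one second between hits
}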

Data Manipulation

First, the list is transformed into an 'untidy' data frame.

dat <- as.data.frame(data_empty)
dat.vector <- as.vector(as.matrix(dat)) # flatten to a vector, then rebuild a data frame so the result has the structure of 'tidy' data

# Creating a matrix, then a data frame from the vector above.
# Each report consists of 8 fields, so the matrix is filled by row with 8 columns.
scrape.matrix <- matrix(dat.vector, ncol = 8, byrow = TRUE)
df.untidy <- data.frame(scrape.matrix)
names(df.untidy) <- c("views", "title", "department", "section", "bribe", "date", "location", "reference.no")
df.untidy[1:30, ]
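The reshape with ncol = 8, byrow = TRUE silently assumes that every page delivered exactly 8 fields for each of its 10 reports, i.e. 80 strings per page; if a single field is missing on one page, all later columns shift out of alignment. A quick, purely diagnostic sanity check on data_empty:

# each element of data_empty should hold 10 reports x 8 fields = 80 strings
table(sapply(data_empty, length))
stopifnot(all(sapply(data_empty, length) == 80))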

The actual transformation into a 'tidy' data frame:

df <- df.untidy
df$bribe <- sub("Paid INR ", "", df.untidy$bribe) # removes the leading text
df$bribe <- gsub(",", "", df$bribe)               # removes thousands separators
df$bribe <- str_trim(df$bribe)                    # removes white space
df$bribe <- as.numeric(df$bribe)                  # character -> numeric
df$title <- str_trim(df.untidy$title)             # trims the title vector
df$views <- gsub(" .*", "", df.untidy$views)      # keeps only the count, dropping the word "views"
df$views <- as.numeric(df$views)                  # character -> numeric
df$location <- str_split(df$location, ",")        # splits "city, region" into a list
df$city <- unlist(lapply(df$location, function(x) x[1]))
df$region <- unlist(lapply(df$location, function(x) x[2]))

df[1:10, ]
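At this point the date column is still a character vector, although the introduction promised a date variable. Since lubridate is already loaded, the parsing is a one-liner; this sketch assumes the site prints dates in a month-day-year format such as "November 9, 2015" (swap mdy for dmy or similar if the scraped format differs):

df$date <- str_trim(df$date)  # drop stray whitespace from the scrape
df$date <- mdy(df$date)       # e.g. "November 9, 2015" -> Date "2015-11-09"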

For this assignment we first look into the relationship between bribes and departments. Afterwards we assess the relation between GDP in different regions of India and try to analyse whether it correlates with bribes.

Bribes in different departments:


detach("package:plyr", unload=TRUE) 

library("dplyr")
df.department = df %>% 
  group_by(department) %>% 
  summarise(
    number = n()
  )  %>%
  arrange(desc(number), desc(department))
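Detaching plyr works, but the masking problem can also be avoided without touching the search path by calling the verbs through the dplyr namespace; an equivalent formulation that is robust to the package load order:

df.department <- df %>%
  dplyr::group_by(department) %>%
  dplyr::summarise(number = dplyr::n()) %>%
  dplyr::arrange(dplyr::desc(number), dplyr::desc(department))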

p1 <- ggplot(na.omit(df.department), aes(x = reorder(department, number), y = number))
p1 <- p1 + geom_bar(stat = "identity", alpha = .2, fill = "red") + coord_flip()
p1 <- p1 + labs(title = "Bribes per department", x = "Department", y = "Number")
p1 + theme_minimal()

Citizens are confronted with retail corruption when they need a service from the public sector. The figure above shows how the latest 1000 bribes reported on the webpage are distributed across the public departments in India.

The figure above shows that it is "Municipal Service" that accepts the most bribes; they accept almost 100 bribes more than number two, "Food, Civil Suppliers and Consumer Affairs". Numbers 2, 3 and 4 (Police, Transport, and Food, Civil Suppliers and Consumer Affairs) have accepted almost the same number of bribes. The figure also shows that "Water and sewage", "Policy works department", "Labour" and "Airports" only accepted one bribe each.

Even though Municipal Service receives the most bribes, it is not necessarily here that the highest amounts are paid. To analyse how the amounts are distributed across departments, the figure below shows a boxplot of the amount paid for every department.

df.boxplot <- df
# Convert the department variable from a string to a factor
df.boxplot$department <- as.factor(df.boxplot$department)
# Make a boxplot (the box shows the median); the mean is added as a point and outliers are drawn as stars
p <- ggplot(df.boxplot, aes(x = department, y = bribe)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8) +
  stat_summary(fun.y = mean, geom = "point")

p
# Rotate the boxplot and use a log scale
p + coord_flip() + scale_y_continuous(trans = "log10")

The boxplot above shows that, in general, the amounts paid are relatively equally distributed across departments, and that the ranges of the amounts paid are also quite similar. The red marks are outliers. Looking at "Municipal Service", its box is the tallest, which indicates that bribes paid to Municipal Service vary a lot in size compared to other departments. In addition, Municipal Service has the highest outlier, i.e. it accepted the highest bribe among the observations. The boxplots for "Police" and "Transport" look very much alike: they are relatively short, which indicates that the amount paid does not differ much from bribe to bribe. "Food, Civil Suppliers and Consumer Affairs", the department with the second most bribes, has one of the shortest ranges, meaning that most bribes paid to this department are of approximately the same amount.
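These visual impressions can be backed up numerically; a short sketch computing the count, median and interquartile range of the bribe amount per department:

df %>%
  group_by(department) %>%
  summarise(
    n            = n(),
    median.bribe = median(bribe, na.rm = TRUE),
    iqr.bribe    = IQR(bribe, na.rm = TRUE)
  ) %>%
  arrange(desc(n))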

GDP vs. Bribe

Now the relationship between GDP in different regions of India and the bribe amount is analysed. First, GDP data for India is scraped and tidied up.

css.selector = "#table_id .name+ .data , #table_id .name" 
link = "http://www.statisticstimes.com/economy/gdp-of-indian-states.php"

indiagdp = read_html(link) %>% 
  html_nodes(css = css.selector) %>% 
  html_text()
indiagdp


dat <- as.data.frame(indiagdp)
dat.vector <- as.vector(as.matrix(dat)) # flatten to a vector, then rebuild a data frame as before

# Creating a matrix, then a data frame from the vector above; each row is a state with its GSDP.
scrape.matrix <- matrix(dat.vector, ncol = 2, byrow = TRUE)
df.untidy <- data.frame(scrape.matrix)
names(df.untidy) <- c("State", "GSDP")
df.untidy[1:40, ]

# Removing empty cells
df.indiagsdp <- df.untidy[!df.untidy$GSDP == "-" & !df.untidy$GSDP == "", ]
# The scraped GSDP figures contain thousands separators, so they are converted to numeric before plotting
df.indiagsdp$GSDP <- as.numeric(gsub(",", "", df.indiagsdp$GSDP))

# Merging the two datasets
df.factor <- df
df.factor$region <- substring(df.factor$region, 2) # drops the leading space left over from the "city, region" split
df.factor$region <- as.factor(df.factor$region)
df.merged <- merge(df.factor, df.indiagsdp, by.x = "region", by.y = "State")
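merge() silently drops every report whose region spelling has no exact match in the State column from statisticstimes.com, so it is worth checking how much data survives the merge; a purely diagnostic sketch:

# regions in the bribe data without a GSDP match
setdiff(levels(df.factor$region), as.character(df.indiagsdp$State))
# number of reports lost in the merge
nrow(df.factor) - nrow(df.merged)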

# Plotting (columns are referenced directly inside aes(), not via df.merged$)
p <- ggplot(df.merged, aes(x = GSDP, y = bribe))
p <- p + labs(title = "GDP for India vs. amount and number of bribes", x = "GDP", y = "Bribe Amount")
p <- p + geom_point() + scale_y_log10() +
  facet_wrap(~ region, scales = "free")
p

The plot hints at an exponential pattern, but it is hard to see. Given the setup of our GDP data, it is clear that having only one GDP measure for each region gives a murky interpretation. What can be read from this plot is the number (as well as the amount) of bribes: regions such as Maharashtra, Uttar Pradesh and Madhya Pradesh deal with high bribe amounts. This has to be held in contrast with the outliers highlighted in the previous boxplot.

Due to the mixed results, a different approach is taken:

# Plot for all regions together
ggplot(df.merged, aes(x = GSDP, y = bribe)) +
  geom_point() +
  theme_minimal() +
  stat_smooth(method = lm) +
  stat_smooth(method = loess) +
  geom_text(aes(label = region), size = 3.0) +
  labs(x = "GDP per capita", y = "Total Bribes", title = "Correlation between wealth and bribery")

# Filter data in order to estimate a possible correlation between bribes and GDP
df.merged1 <- df.merged %>%
  filter(bribe < 1000000) %>%
  filter(bribe > 10000)

ggplot(df.merged1, aes(x = GSDP, y = bribe)) +
  geom_point() +
  theme_minimal() +
  stat_smooth(method = lm) +
  stat_smooth(method = loess) +
  geom_text(aes(label = region), size = 3.0) +
  labs(x = "GDP per capita", y = "Total Bribes", title = "Correlation between wealth and bribery, sampled")

The potential (and sought-after) correlation between GDP and bribe amount is still not clear from the plots above. Even when filtering out potential outliers and very low bribe amounts, it is hard to see a positive or negative correlation.
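Rather than judging the relationship by eye, it can be quantified; a minimal check using Spearman's rank correlation, which is robust to the heavy skew in the bribe amounts:

# rank correlation between state GDP and bribe size (ties make the p-value approximate)
cor.test(df.merged$GSDP, df.merged$bribe, method = "spearman")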

Conclusion:

Several points can be made from the above estimates, but one caveat overshadows the other sub-conclusions: it is difficult to draw general conclusions from self-reported data, which is what we are working with here. If the webpage were used equally in all regions, the results would be more robust; moreover, some urban areas may have better access to IT, which also affects the results. Scraping more observations, and perhaps at different times during the year, might help validate the results further.

sebastianbarfort commented 8 years ago

Nice assignment!

I especially like the use of the boxplot and the discussion of the GDP/corruption correlation.

The R code is also clean and generally good.

APPROVED