sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 3 : Assignment 2 #41

Closed RolfCarlsen closed 8 years ago

RolfCarlsen commented 8 years ago

title: "test ass 2"
author: "Rolf Carlsen"
date: "9. nov. 2015"
output: html_document

Using a CSS selector tool for Google Chrome, we scrape data from the webpage www.ipaidabribe.com and collect the title, amount paid, class of transaction, number of views, and the city in which the bribe took place for the latest 1000 reports on the website. The 1000 reports were submitted in the three weeks from October 12, 2015 to November 2, 2015, which means that there are approx. 47 reported bribes every day (1000 reports over 21 days). For our analysis we also scrape a table from Wikipedia containing the population size and region of the 200 largest Indian cities. After cleaning and preparing both datasets, we merge them by city, which leaves us with a dataset of 768 observations. However, more than half of the reports (423) come from the city of Bangalore, and only seven cities have more than 14 reported bribes.

We are interested in whether the characteristics of the bribes differ across the seven cities with the most bribes. Because the Wikipedia data contains the number of inhabitants in each city, we can convert the data from our initial "ipaidabribe" scrape to per capita terms, which is important since the cities vary greatly in size.

We first consider the number of bribes per capita in each of the seven cities. We find that Bangalore has the highest number of reported bribes per capita among these cities. We then proceed to investigate whether there is a difference in the average amount of the reported bribes.

When analyzing self-submitted data, one has to consider that there might be important "self-selection" issues at play. It could be that many large bribes are not reported because both parties to such a bribe might in fact be better off. It could also be that reporting a very large bribe carries a higher risk of being caught than reporting a small one. Another possible reason why the inhabitants of Bangalore seem more corrupt is that they are simply more honest than people in other parts of India. Thus, the higher number of bribes per capita might reflect a higher degree of honesty rather than an actual higher level of corruption. It is therefore very difficult to say anything about differences in the level of corruption across the cities. However, it is fair to conclude that bribery and corruption are very common in large parts of Indian society.

#Loading packages

library('rvest')
library('plyr')
library('dplyr')
library('stringr')
library('ggplot2')

# First we create a list of webpages to be scraped

link=list()
for (i in 1:100){
  link[i]<-paste("http://www.ipaidabribe.com/reports/paid?page=",(i-1)*10,"#gsc.tab=0",sep="")
}
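
The loop above can also be written as a single vectorised call. This is a minimal sketch in base R, assuming the same URL pattern as the loop; `offsets` and `links` are illustrative names that are not used elsewhere in the script.

```r
# Page offsets 0, 10, ..., 990: one offset per page of ten reports
offsets <- (0:99) * 10

# paste0() is vectorised, so this builds all 100 URLs in one call
links <- paste0("http://www.ipaidabribe.com/reports/paid?page=", offsets, "#gsc.tab=0")

length(links)  # 100
```

The result is a character vector rather than a list, which can be looped over in the same way.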

# Here we load the css selectors

css.selector.title=".heading-3 a"
css.selector.amount=".paid-amount span"
css.selector.namedep=".name a"
css.selector.detail=".transaction a"
css.selector.views=".overview .views"
css.selector.city=".location"
css.selector.date=".date"

# Here we define a function which scrapes the data and outputs them in columns

bribe<- function(link){
  liste=read_html(link)

  link.title=liste %>%
    html_nodes(css=css.selector.title) %>%
    html_text()

  link.amount=liste %>%
    html_nodes(css=css.selector.amount) %>%
    html_text()

  link.namedep=liste %>%
    html_nodes(css=css.selector.namedep) %>%
    html_text()

  link.detail=liste %>%
    html_nodes(css=css.selector.detail) %>%
    html_text()

  link.views=liste %>%
    html_nodes(css=css.selector.views) %>%
    html_text()

  link.city=liste %>%
    html_nodes(css=css.selector.city) %>%
    html_text()

  link.date=liste %>%
    html_nodes(css=css.selector.date) %>%
    html_text()

  return(cbind(link.title,link.amount,link.namedep,link.detail,link.views,link.city,link.date))
}
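
One thing to keep in mind with the function above is that `cbind()` recycles shorter vectors, so a page where one selector returns fewer nodes than the others can yield misaligned rows, sometimes without a warning. The sketch below shows one possible guard; `check_lengths()` is a hypothetical helper, not part of the original script, and the vectors are made-up stand-ins for scraped fields.

```r
# Hypothetical helper: stop early if the columns differ in length,
# instead of letting cbind() recycle the shorter ones
check_lengths <- function(...) {
  cols <- list(...)
  n <- lengths(cols)
  if (length(unique(n)) > 1) {
    stop("columns have different lengths: ", paste(n, collapse = ", "))
  }
  do.call(cbind, cols)
}

# Toy vectors standing in for scraped fields (made-up values)
titles  <- c("a", "b", "c", "d")
amounts <- c("100", "200")

recycled <- cbind(titles, amounts)                       # recycles silently to 4 rows
guarded  <- try(check_lengths(titles, amounts), silent = TRUE)  # captures an error instead
```

The guard only detects the mismatch; it does not work out which report is missing a field.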

#Here we scrape the data using the previously defined function

bribe.list<- list()
for( i in link[1:100]){
  print(paste("Processing ",i,sep=""))
  bribe.list[[i]] <- bribe(i)
  Sys.sleep(1)
  cat("done !\n")
}

# Here we transform it into a dataframe

df.bribe <- ldply(bribe.list)

# Here we gather population data for Indian cities
# We do this to analyse bribes in per capita terms

df.india <- read_html("https://es.wikipedia.org/wiki/Anexo:Ciudades_de_la_India_por_poblaci%C3%B3n") %>%
          html_node(".wikitable") %>% 
          html_table()

#Here we clean the bribe data and split the location variable
#into city and region

df.bribe$link.amount <- gsub("Paid INR ","",df.bribe$link.amount)
df.bribe$link.amount <- as.numeric(gsub(",","",df.bribe$link.amount))
df.bribe$link.views <- as.numeric(gsub(" views","",df.bribe$link.views))
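# [A-Za-z] is used instead of [A-z], which also matches the characters [ \ ] ^ _ `
df.bribe$city <- str_extract(df.bribe$link.city,"[A-Za-z]+.[A-Za-z]+")
# str_trim() drops the leading space left behind after the comma is removed
df.bribe$region <- str_trim(gsub(",","",str_extract(df.bribe$link.city,", [A-Za-z]+.[A-Za-z]+.[A-Za-z]+")))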
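
To see what the two extraction patterns do, the sketch below applies them to a few made-up location strings. It assumes the location field has the form "City, Region", which is an assumption about the website's format. It uses base R only, so `extract_first()` is a hypothetical stand-in for `str_extract()`, and the character class is written out as `[A-Za-z]`.

```r
# Hypothetical helper: first regex match per element, NA when nothing matches
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# Made-up location strings in the assumed "City, Region" format
loc <- c("Bangalore, Karnataka", "New Delhi, Delhi", "Panaji, Goa")

# City: letters, one arbitrary character, letters
city <- extract_first(loc, "[A-Za-z]+.[A-Za-z]+")

# Region: the part after ", ", with the comma and surrounding space removed
region <- trimws(gsub(",", "", extract_first(loc, ", [A-Za-z]+.[A-Za-z]+.[A-Za-z]+")))

city    # "Bangalore" "New Delhi" "Panaji"
region  # "Karnataka" "Delhi"     NA
```

The region pattern needs at least five characters after the comma and space, so a short region name such as "Goa" is not matched and comes back as `NA`.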

# Here we clean the population data in df.india dataset
# keeping only city and population

keeps<-c(2,3)
df.india.new<-df.india[keeps]

# Assign the column names directly instead of shadowing names() with a variable
names(df.india.new) <- c("city","population")

df.india.new$population<-gsub(" ","",df.india.new$population)
df.india.new$population<- as.numeric(str_extract(df.india.new$population,"[0-9]+"))

# Here we rename Bombay and Delhi so they fit with the bribe dataset

df.india.new$city <- gsub("Bombay","Mumbai",df.india.new$city)
df.india.new$city <- gsub("Delhi","New Delhi",df.india.new$city)

# Here we merge the population and bribe dataset by city

df<-join(df.bribe,df.india.new,by="city",type="left",match="first")

# From now on we only consider observations with population data

df.bribe <- df %>%  filter(!is.na(population))

# Here we summarise by city,
# creating variables for the number of bribes per capita, the mean bribe
# and the total amount of bribes per capita

df.corrupt <- df.bribe %>% 
  group_by(city) %>%
  summarise(bribes=n(),amount=sum(link.amount),population=mean(population)) %>% 
  mutate(bribe.capita=bribes/population,amount.capita=amount/population,mean.bribe=amount/bribes)
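
The three derived variables can be checked by hand on a small example. The sketch below reproduces the same calculation in base R on made-up data; the cities "A" and "B", their populations, and the `toy` and `toy.summary` objects are purely illustrative.

```r
# Made-up reports: one row per bribe, with the city's population attached
toy <- data.frame(
  city        = c("A", "A", "A", "B"),
  link.amount = c(100, 200, 300, 1000),
  population  = c(1000, 1000, 1000, 500)
)

# Number of bribes, total amount, and population per city
bribes     <- tapply(toy$link.amount, toy$city, length)
amount     <- tapply(toy$link.amount, toy$city, sum)
population <- tapply(toy$population, toy$city, mean)

toy.summary <- data.frame(
  city       = names(bribes),
  bribes     = as.vector(bribes),
  amount     = as.vector(amount),
  population = as.vector(population)
)

# The same three ratios as in the mutate() call above
toy.summary$bribe.capita  <- toy.summary$bribes / toy.summary$population
toy.summary$amount.capita <- toy.summary$amount / toy.summary$population
toy.summary$mean.bribe    <- toy.summary$amount / toy.summary$bribes

toy.summary
```

In this toy data city A has more bribes per capita (0.003 against 0.002), while city B has the larger amount per capita (2 against 0.6) and the larger mean bribe (1000 against 200), so the three measures need not rank cities in the same order.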

# From now on we only consider cities with more than 14 observations/bribes
# this gives a total of 7 cities

df.corrupt.filter<- df.corrupt %>% 
          filter(bribes>14)

# This is a plot of the number of bribes per capita

p_1<- ggplot(data=df.corrupt.filter, aes(x=city, y=bribe.capita))
p_1 <- p_1 + geom_bar(stat="identity")
p_1 <- p_1 + scale_y_continuous("Number of bribes per capita")
p_1 <- p_1 + theme_minimal()+ggtitle("Number of bribes per capita")
p_1

# This is a plot of the total value of bribes per capita

p_2 <- ggplot(data=df.corrupt.filter, aes(x=city, y=amount.capita))
p_2 <- p_2 + geom_bar(stat="identity")
# The y-axis shows amount.capita, so it is labelled as an amount (INR)
p_2 <- p_2 + scale_y_continuous("Amount of bribes per capita (INR)")
p_2 <- p_2 + theme_minimal()+ggtitle("Amount of bribes per capita")
p_2

# This is a plot of the mean bribe in the cities

p_3 <- ggplot(data=df.corrupt.filter, aes(x=city, y=mean.bribe))
p_3 <- p_3 + geom_bar(stat="identity")
# The y-axis shows mean.bribe, so it is labelled as a mean amount (INR)
p_3 <- p_3 + scale_y_continuous("Mean bribe (INR)")
p_3 <- p_3 + theme_minimal()+ggtitle("Mean bribe")
p_3