sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/
12 stars 17 forks source link

Group 27 - assignment 2 #42

Closed StineWesselhoff closed 8 years ago

StineWesselhoff commented 9 years ago

title: "Assignment 2" author: Valdemar Stentoft-Hansen, Esben Østergaard, Robert Mark Ziegler, Stine Lund Wesselhoff date: "November 9, 2015"

output: html_document

Assignment 2

Introduction

In this paper we will look into whether it is possible to make a benchmark on the price for bribes within different transaction types. Furhtermore, we will compare prices with the population size ofcities in India.

Data quality

We scraped our data from the website http://www.ipaidabribe.com. The dataset consists of 14,180 observations, in order to make the benchmark as accurate as possible. The observations is reported in the period from the 17th of Febuary 2015 until the 4th of November 2015. The reported amount for bribes is reported in Indian rupees. We chose to report the amount in DKK with an average exchange rate at 10.42 INR for 100 DKK over the past 12 months.

In order to get a representative dataset we tried to sort out all bribes where the amount was only 1 INR. 1 INR is 0,1042 DKK and therfore, we argue, that it is a to small amount for a bribe in consideration that some of the bribes are way larger. It is also considerable whether people will go online to report such a small amount for a bribe. Another consideration is that some people will have incentive to report a lot of bribes to make it look like the total number of bribes is bigger in some cities. When we sorted out the observation with only 1 INR the picture didn't change and therefore we kept the observations with a bribe at 1 INR.

Next, the data scraping,


#Data scraping
bribe.link = 'http://www.ipaidabribe.com/reports/paid?page='

pagenumber = c(0,1:4999*10)

links = paste(bribe.link, pagenumber, sep = "")

bribe.data = list()

for(i in links[1:5000]){
  link = read_html(i, encoding = "UTF-8")
  title = link %>% html_nodes(".heading-3 a") %>% html_text() %>% str_trim()
  amount = link %>% html_nodes(".paid-amount span") %>% html_text() %>% str_trim() %>% 
    extract_numeric()
  dept = link %>% html_nodes(".name a") %>% html_text() %>% str_trim()
  trans = link %>% html_nodes(".transaction a") %>% html_text() %>% str_trim()
  views = link %>% html_nodes(".overview .views") %>% html_text() %>% str_trim() %>% 
    extract_numeric()
  city = link %>% html_nodes(".location") %>% html_text() %>% str_trim()
  city = gsub(city, pattern = ", [A-z0-9 ]+", replacement = "") %>% str_trim()
  date = link %>% html_nodes(".date") %>% html_text() %>% str_trim()
  bribe.data[[i]] = cbind(title, amount, dept, trans , views, city, date)

  Sys.sleep(1)
  cat(" done!\n")
}

df.bribe = ldply(bribe.data)
df.bribe$.id = NULL
df.bribe$date = parse_date_time(as.character.Date(df.bribe$date), "BdY")
head(df.bribe, 15)¨
library("plyr")
library("dplyr")
library("readr")
library("rvest")
library("stringr")
library("tidyr")
library("lubridate")
library("knitr")
library("ggplot2")

People selfreport a bribe to the webpage and we should be aware that people can report too large or small bribes in order to get it to look like there is more corruption than there really is. Furthermore, it is considerable whether there is a bias in the amnount of reports from different cities.

Furthermore, we extract a dataset from https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population that consists of all the cities in India and the number of inhabitants. The reason for scraping this dataset is to compare whether there is a correlation between the size of the cities with the cost on bribes.

library("readr")
df.bribe = read_csv("https://raw.githubusercontent.com/vstentoft/SDS-Group-27-Assigment-2/master/df.bribe.csv")
pop.link = "https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population"

pop.data = list()

link = read_html(pop.link, encoding = "UTF-8")
city = link %>% html_nodes("td:nth-child(2)") %>% html_text() %>% str_trim()
city = city[1:269]
city[city == "Hyderabad[4]"] = "Hyderabad"
city[city == "Chennai[N 1]"] = "Chennai"
city[city == "Visakhapatnam[5]"] = "Visakhapatnam"
city[city == "Coimbatore[N 2]"] = "Coimbatore"
city[city == "Madurai[N 3]"] = "Madurai"
rank = link %>% html_nodes("td:nth-child(1)") %>% html_text() %>% str_trim() %>% 
  extract_numeric()
rank = rank[1:269]
pop = link %>% html_nodes("td:nth-child(3)") %>% html_text() %>% str_trim() %>% 
  extract_numeric()

df.pop = data_frame(city, rank, pop)

df = left_join(df.bribe, df.pop)

df$trans = as.character(df$trans)
df$amount = as.numeric(df$amount)
df$date = as.Date(df$date)

Which transaction are the most common?

First we made a table to show the most reported transactions. With 1379 observations the C Form document of the Bureau of Immigration of India is the most mentioned transaction on the bribe webpage. This document is necessary for sales persons who want to get a tax refund of the government after they bought products of another sales person. The second and third most mentioned transactions are "Police Verification for Passport" and "Customs Check and Clearance".

df.Q1 = df %>%
 group_by(trans) %>% 
  count(trans) %>% 
  top_n(10, n) %>% 
  arrange(-n) 
names(df.Q1) [names(df.Q1) == "trans"] <- "Transaction"
names(df.Q1) [names(df.Q1) == "n"] <- "Number of bribes"
kable(head(df.Q1,10), caption = "Number of bribes for different transactions")

Which department is the most corrupt?

In total there is 32 departments. In table 2 shows the 11 departments with the highest reported bribes or the "most corrupt" departments. The rest of the deparments have below 35 reported bribes. The police departments are leading with 3621 reported bribes. We thought that this could also be not completely reliable as there might be a small biasses. People are probably more likely to report a bribe paid to the police than a bribe paid to other departments. People might be more disappointed by a not working police system.

df.Q2 = df %>%
  count(dept) %>% 
  top_n(11, n) %>% 
  arrange(-n)
names(df.Q2) [names(df.Q2) == "dept"] <- "Department"
names(df.Q2) [names(df.Q2) == "n"] <- "Number of bribes"
kable(head(df.Q2,11), caption = "Number of reported bribes within departments", 
      #format = "pandoc", col.names = colnames(df.Q2$dept, do.NULL = TRUE, prefix = #"Department")
      )

In‬ which city, do the average inhabitant spend the most on reported bribes?

In order to get a better feeling for prize of the bribe we converted Indian Rupees to Danish kroner according to the exchange rate of november 3rd, 2015. We scrape the population in the 270 biggest cities in India. We use this data to calculate the average spending on bribes per person. The 10 cities where a person spend the most on bribes are presented in the table below. We see a big difference in the amount of spending per person within top 10, but also a big difference in the number of reported bribes. In the #1 city the average spending is 350 DKK, while the spending is only 0,4 DKK in city ranked 10.

df.trans = df %>% 
filter(!is.na(pop)) %>%
count(trans) %>% 
top_n(10, n) %>% 
arrange(-n)
#Value‬ into DKK
df.dkk = df %>% 
mutate(dkk.amount = amount * 1/10.42)

#In‬ which city, do the average inhabitant spend the most on reported bribes?
df.spending.bribe = df.dkk %>% 
select(city, dkk.amount) %>% 
group_by(city) %>% 
summarize(s.amount = sum(dkk.amount))
df.spending.bribe = inner_join(df.spending.bribe, df.pop, by = "city") %>%
mutate(bribe.per.person = s.amount/pop) %>%
arrange(-bribe.per.person)
df.bribe.count = df.dkk %>% 
select(city) %>% 
count(city)
df.spending.bribe = inner_join(df.spending.bribe, df.bribe.count, by ="city" )
df.spending.bribe = head(df.spending.bribe, 10)
names(df.spending.bribe) [names(df.spending.bribe) == "city"] <- "City"
names(df.spending.bribe) [names(df.spending.bribe) == "s.amount"] <- "Sum of amount"
names(df.spending.bribe) [names(df.spending.bribe) == "pop"] <- "Population"
names(df.spending.bribe) [names(df.spending.bribe) == "bribe.per.person"] <- "Bribe per person"
names(df.spending.bribe) [names(df.spending.bribe) == "n"] <- "Number of bribes per city"
kable(head(select(df.spending.bribe, -rank),10), digits = 2, caption = "The most expensive cities")

Boxplot

In the following you see a boxplot of the most reported bribe transactions. The first thing that comes to mind is how similar the prizes across transaction types are. They all cost around 50 DKK, which we think can be explained by " street" bribes turning in to an established market, with a clearing prize, that does not depend very much on the type of the bribe but rather on the fact that it is a bribe.

df.box = inner_join(df, df.trans, by = "trans")
df.box = df.box %>% 
mutate(dkk.amount = amount * 1/10.42) %>%
filter(!is.na(amount)) %>% 
filter(amount < 100000) %>%
arrange(-n)
p = ggplot(df.box, aes(y = dkk.amount, x = trans))
p = p + geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim = c(0, 300)) +
  xlab("Transaction type") + ylab("Amount in DKK") +
  theme(axis.text.x = element_text(size=8, angle = 90),
        axis.title.x  = element_text(size=12, angle = 0))
p

Regressions

We check if there is correlation between the mean price of a transaction type and the size of the city where the bribe was reported. We plot the ten most common transactions, with city size on the x axis and average bribe cost in DKK on the y axis. Our thesis is, that there is a positive correlation between city size and mean cost of a bribe, since the price index is usually higher in bigger cities. The plot did not return any specific trends or significant results. So the city size can is not explanatory on the average bribe cost.

#Why‬ is there a difference in price? Can we find a correlation with city size?
df.city = df.dkk %>% 
select(trans, city, dkk.amount, pop) %>% 
filter(!is.na(pop)) %>%
group_by(trans, city, pop) %>% 
summarize(m.amount = mean(dkk.amount)) %>% 
arrange(desc(trans))
df.mean = inner_join(df.city, df.trans, by = "trans")
df.mean = df.mean %>% 
filter(m.amount < 50000)

p = ggplot(df.mean, aes(x = pop, y = m.amount))
p = p + geom_point(alpha = .3) + scale_y_log10() + scale_x_log10() +
geom_smooth() +
facet_wrap(~ trans, scales = "free") +
xlab("Log population, city") + ylab("Log mean cost of bribe") + 
ggtitle("Correlation between city population and cost of bribe in DKK")
p
sebastianbarfort commented 8 years ago

Good assignment!

Good use of graphs and generally good efficient code. You perhaps could have discussed a little bit the issue of self-selection and selection bias, but generally the assignment is good.

APPROVED