sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 3 : Assignment 2 #48

Closed RolfCarlsen closed 8 years ago

RolfCarlsen commented 8 years ago

title: "Assignment 2"
author: "Group 3 : Rolf Carlsen, Emil Brodersen and Peter Haurum"
date: "9. nov. 2015"
output: html_document

#Loading packages

library('rvest')
library('plyr')
library('dplyr')
library('stringr')
library('ggplot2')

# First we create a list of webpages to be scraped

link=list()
for (i in 1:100){
link[i]<-paste("http://www.ipaidabribe.com/reports/paid?page=",(i-1)*10,"#gsc.tab=0",sep="")
}
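
Because `paste0()` is vectorised, the same URLs can also be built without an explicit loop. This is a minimal sketch that assumes the URL pattern used above; the names `base.url`, `offsets` and `link.vec` are illustrative only.

```r
# Page offsets 0, 10, ..., 990: one per results page, as in the loop above
offsets <- (seq_len(100) - 1) * 10

# paste0() is vectorised, so this builds all 100 URLs in one call
base.url <- "http://www.ipaidabribe.com/reports/paid?page="
link.vec <- paste0(base.url, offsets, "#gsc.tab=0")

head(link.vec, 2)
```

The result is a character vector rather than a list, but a `for` loop can iterate over it in the same way.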

# Here we load the css selectors

css.selector.title=".heading-3 a"
css.selector.amount=".paid-amount span"
css.selector.namedep=".name a"
css.selector.detail=".transaction a"
css.selector.views=".overview .views"
css.selector.city=".location"
css.selector.date=".date"

# Here we define a function which scrapes the data and outputs them in columns

bribe<- function(link){
liste=read_html(link)

link.title=liste %>%  
html_nodes(css=css.selector.title) %>%
html_text() 

link.amount=liste %>% 
html_nodes(css=css.selector.amount) %>% 
html_text()

link.namedep=liste %>% 
  html_nodes(css=css.selector.namedep) %>% 
  html_text()

link.detail=liste %>% 
  html_nodes(css=css.selector.detail) %>% 
  html_text()

link.views=liste %>% 
  html_nodes(css=css.selector.views) %>% 
  html_text()

link.city=liste %>% 
  html_nodes(css=css.selector.city) %>% 
  html_text()

link.date=liste %>% 
  html_nodes(css=css.selector.date) %>% 
  html_text()

return(cbind(link.title,link.amount,link.namedep,link.detail,link.views,link.city,link.date))  
}
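
One thing to keep in mind with `cbind()` is that it recycles shorter vectors, so if one selector matches fewer nodes than the others on a page, values can end up on the wrong rows. The sketch below shows one possible guard. The helper `bind.fields()` and the toy input are hypothetical and not part of the assignment code.

```r
# Hypothetical helper: combine scraped fields into a data frame, padding
# shorter fields with NA instead of letting cbind() recycle their values.
bind.fields <- function(fields) {
  n <- max(lengths(fields))
  padded <- lapply(fields, function(x) {
    length(x) <- n   # extending a vector this way fills the tail with NA
    x
  })
  as.data.frame(padded, stringsAsFactors = FALSE)
}

# Toy input (made up): the second field is missing one value
page <- bind.fields(list(title  = c("a", "b", "c"),
                         amount = c("Paid INR 100", "Paid INR 2,000")))
page
```

Padding only keeps the row count consistent. It cannot tell which report lacked the value, so pages with mismatched lengths would still need a closer look.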

# Here we scrape the data using the previously defined function

bribe.list<- list()
for( i in link[1:100]){
print(paste("Processing ",i,sep=""))
bribe.list[[i]] <- bribe(i)
Sys.sleep(1)
cat("done !\n")
}
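
A single failed request stops a loop like this one. As a hedged sketch, the loop could be wrapped in `tryCatch()` so that a failing page is skipped and logged. The names `scrape.all()` and `fake.fetch()` are hypothetical; `fake.fetch()` stands in for `bribe()` so the example runs without network access.

```r
# Hypothetical wrapper: apply `fetch` to every URL, keep going on errors,
# and pause between requests to be polite to the server.
scrape.all <- function(urls, fetch, pause = 1) {
  out <- list()
  for (u in urls) {
    out[[u]] <- tryCatch(
      fetch(u),
      error = function(e) {
        message("Failed: ", u, " (", conditionMessage(e), ")")
        NULL   # assigning NULL leaves the failed page out of `out`
      }
    )
    Sys.sleep(pause)
  }
  out
}

# Toy stand-in for bribe(): fails on one URL, succeeds on the others
fake.fetch <- function(u) {
  if (u == "bad") stop("HTTP error")
  data.frame(url = u, stringsAsFactors = FALSE)
}

res <- scrape.all(c("page1", "bad", "page2"), fake.fetch, pause = 0)
names(res)
```

In the real pipeline, `fetch` would be `bribe` and `urls` the list of page links.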

# Here we transform it into a dataframe

df.bribe <- ldply(bribe.list)

# Here we gather population data for Indian cities
# We do this to analyse bribes in per capita terms

df.india <- read_html("https://es.wikipedia.org/wiki/Anexo:Ciudades_de_la_India_por_poblaci%C3%B3n") %>%
          html_node(".wikitable") %>% 
          html_table()

#Here we clean the bribe data and split the location variable
#into city and region

df.bribe$link.amount <- gsub("Paid INR ","",df.bribe$link.amount)
df.bribe$link.amount <- as.numeric(gsub(",","",df.bribe$link.amount))
df.bribe$link.views <- as.numeric(gsub(" views","",df.bribe$link.views))
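# [A-Za-z] matches letters only; the unescaped "." lets multi-word names such as "New Delhi" match
df.bribe$city <- str_extract(df.bribe$link.city,"[A-Za-z]+.[A-Za-z]+")
df.bribe$region <- gsub(",","",str_extract(df.bribe$link.city,", [A-Za-z]+.[A-Za-z]+.[A-Za-z]+")) 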
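
The cleaning steps can be checked on a few sample strings. The sketch below uses made-up inputs in the format implied by the code ("Paid INR" followed by an amount). It also illustrates a common pitfall when extracting names with a regular expression: the range `[A-z]` covers more than letters.

```r
# Amount strings look like "Paid INR 2,000": strip the prefix and the
# thousands separators before converting to numeric
raw.amount <- c("Paid INR 500", "Paid INR 2,000")
amount <- as.numeric(gsub(",", "", gsub("Paid INR ", "", raw.amount)))
amount

# "[A-z]" spans the ASCII range from "A" to "z", which also contains
# punctuation such as "_" and "^"; "[A-Za-z]" matches letters only
grepl("[A-z]", "_", perl = TRUE)
grepl("[A-Za-z]", "_", perl = TRUE)
```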

# Here we clean the population data in df.india dataset
# keeping only city and population

keeps<-c(2,3)
df.india.new<-df.india[keeps]

names(df.india.new) <- c("city","population")

df.india.new$population<-gsub(" ","",df.india.new$population)
df.india.new$population<- as.numeric(str_extract(df.india.new$population,"[0-9]+"))

# Here we rename Bombay and Delhi so they fit with the bribe dataset

df.india.new$city <- gsub("Bombay","Mumbai",df.india.new$city)
df.india.new$city <- gsub("Delhi","New Delhi",df.india.new$city)

# Here we merge the population and bribe dataset by city

df<-join(df.bribe,df.india.new,by="city",type="left",match="first")
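
To illustrate what a left join that keeps only the first match does, here is a small base R sketch using `match()`. The data frames and all values in them are made up and only stand in for `df.bribe` and `df.india.new`.

```r
# Toy data (made-up values), standing in for df.bribe and df.india.new
bribes <- data.frame(city = c("Mumbai", "Agra", "Smalltown"),
                     amount = c(1000, 200, 50),
                     stringsAsFactors = FALSE)
pop <- data.frame(city = c("Mumbai", "Agra"),
                  population = c(12000000, 1500000),
                  stringsAsFactors = FALSE)

# match() returns the position of the first matching city (NA if none),
# which reproduces a left join that keeps the first match only
bribes$population <- pop$population[match(bribes$city, pop$city)]
bribes
```

Rows without a matching city keep `NA` as their population, which is why such rows need to be filtered out before computing per capita figures.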

# From now on we only consider observations with population data

df.bribe <- df %>%  filter(!is.na(population))

# Here we summarise by city,
# creating variables for the number of bribes per capita, the mean bribe
# and the total amount of bribes per capita

df.corrupt <- df.bribe %>% 
  group_by(city) %>%
  summarise(bribes=n(),amount=sum(link.amount),population=mean(population)) %>% 
  mutate(bribe.capita=bribes/population,amount.capita=amount/population,mean.bribe=amount/bribes)
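
The three measures can rank cities differently. The following base R sketch computes them on made-up toy data, using the same column names as the summary step; it is an illustration only, not the assignment's dplyr code.

```r
# Toy data (made-up values): one row per reported bribe
toy <- data.frame(city = c("A", "A", "A", "B"),
                  link.amount = c(100, 200, 300, 5000),
                  population = c(1000, 1000, 1000, 2000),
                  stringsAsFactors = FALSE)

bribes     <- tapply(toy$link.amount, toy$city, length)  # reports per city
amount     <- tapply(toy$link.amount, toy$city, sum)     # total INR per city
population <- tapply(toy$population,  toy$city, mean)

summary.toy <- data.frame(city = names(bribes),
                          bribe.capita  = as.numeric(bribes / population),
                          amount.capita = as.numeric(amount / population),
                          mean.bribe    = as.numeric(amount / bribes))
summary.toy
```

In this toy example city A has more bribes per capita, while city B has the higher amount per capita and the higher mean bribe.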

# From now on we only consider cities with more than 14 observations/bribes
# this gives a total of 7 cities

df.corrupt.filter<- df.corrupt %>% 
          filter(bribes>14)

# This is a plot of the number of bribes per capita

p_1<- ggplot(data=df.corrupt.filter, aes(x=city, y=bribe.capita))
p_1 <- p_1 + geom_bar(stat="identity")
p_1 <- p_1 + scale_y_continuous("Number of bribes per capita")
p_1 <- p_1 + theme_minimal()+ggtitle("Number of bribes per capita")

# This is a plot of the total value of bribes per capita

p_2 <- ggplot(data=df.corrupt.filter, aes(x=city, y=amount.capita))
p_2 <- p_2 + geom_bar(stat="identity")
p_2 <- p_2 + scale_y_continuous("Amount of bribes per capita (INR)")
p_2 <- p_2 + theme_minimal()+ggtitle("Amount of bribes per capita")

# This is a plot of the mean bribe in the cities

p_3 <- ggplot(data=df.corrupt.filter, aes(x=city, y=mean.bribe))
p_3 <- p_3 + geom_bar(stat="identity")
p_3 <- p_3 + scale_y_continuous("Mean bribe (INR)")
p_3 <- p_3 + theme_minimal()+ggtitle("Mean bribe")

The data

Using CSS selectors identified in Google Chrome, we scrape data from the website www.ipaidabribe.com and collect the title, amount paid, class of transaction, number of views and city for the latest 1000 reports on the site. These 1000 reports were submitted in the three weeks from October 12, 2015 to November 2, 2015, which corresponds to roughly 47 reported bribes per day. For our analysis we also scrape a table from Wikipedia containing the population size and region of the 200 largest Indian cities. After cleaning and preparing both datasets, we merge them by city, which leaves us with a dataset of 768 observations. However, more than half of these reports (423) come from the city of Bangalore, and only seven cities have more than 14 reported bribes.
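
The figures quoted in this paragraph can be reproduced with a few lines of arithmetic. This is a small sketch that uses only the numbers stated in the text.

```r
# Length of the reporting window stated in the text, in days
days <- as.numeric(as.Date("2015-11-02") - as.Date("2015-10-12"))
days

1000 / days   # reports per day over the window
423 / 768     # Bangalore's share of the merged dataset
```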

Is there any difference between cities?

We are interested in whether there are any notable differences in the characteristics of bribes across the seven cities with the most reported bribes. The Wikipedia data gives the number of inhabitants in each city, which lets us convert the figures from our initial "ipaidabribe" scrape to per capita terms. This is important since the cities vary greatly in size.

Bribes per capita

We first consider the number of bribes per capita in each of the seven cities. We find that Bangalore has the highest number of reported bribes per capita among these cities. There is also a lot of variation between the cities. There are relatively few bribes per capita in the largest cities, Mumbai and New Delhi.

p_1

Of course, not all bribes are reported on the site, so the absolute numbers may not be very informative. But as long as the share of bribes reported is the same in all cities (which may very well not be the case), we can still compare between cities.

Amount bribed per capita

We then investigate whether the total amount of reported bribes per capita differs between the cities.

p_2

Here we see that, even though Bangalore has many bribes per capita, the amounts involved are not very large, whereas Agra has the highest amount bribed per capita. The amount bribed in Mumbai is also relatively large. In general, the number of bribes per capita and the amount bribed per capita give two very different descriptions.

The mean bribe

Lastly, we investigate the mean bribe. This confirms that bribes in Mumbai are relatively large compared to the other cities. It suggests that even though citizens of Mumbai do not bribe very often, when they do, they bribe a lot.

p_3

In the data we gathered, some bribes are very large compared to the others, and these may be outliers (e.g. there is one bribe from Mumbai of 2 million Rs.). However, we have not removed them.
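
To see how much a single extreme report can move the mean, consider the sketch below. The amounts are made up for illustration, and the 99th-percentile cut-off is an arbitrary choice, not something applied in our analysis.

```r
# Made-up amounts in INR: five ordinary bribes and one extreme report
amounts <- c(200, 500, 500, 1000, 1500, 2000000)

mean(amounts)    # dominated by the single extreme value
median(amounts)  # barely affected by it

# One simple option: the mean after dropping values above the
# 99th percentile (an arbitrary, illustrative cut-off)
cutoff <- quantile(amounts, 0.99)
mean(amounts[amounts <= cutoff])
```

Comparing the mean with the median, or with a trimmed mean, is one way to check whether a result such as the large mean bribe in Mumbai is driven by a handful of reports.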

Conclusion

When analyzing self-submitted data, one has to consider that there might be important "self-selection" issues at play. It could be that many large bribes are not reported because both parties to such a bribe might in fact be better off. It could also be that reporting a very large bribe carries a higher risk of being caught than reporting a small one. Another possible reason why the inhabitants of Bangalore seem more corrupt is that they are simply more honest than people in other parts of India. Thus, the higher number of reported bribes per capita might reflect a higher degree of honesty rather than an actually higher level of corruption. It is therefore very difficult to say anything about differences in the level of corruption across the cities. However, it is fair to conclude that bribery and corruption are very common in large parts of Indian society.

sebastianbarfort commented 8 years ago

Good thoughts on self-selection. Maybe could have considered looking at spatial distribution, but otherwise ok.

APPROVED