sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 18: Assignment 2 #50

Closed jensnielsen1988 closed 8 years ago

jensnielsen1988 commented 8 years ago

title: "Assignment 2 - graphs"
author: "Group 18"
date: "9. nov. 2015"
output: html_document

# Firstly, the packages expected to be used are loaded.
library(rvest); library(ggplot2); library(plyr); library(stringr); library(lubridate); library(zoo); library(maps); library(raster); library(rgdal); library(rgeos); library(sp); library(RColorBrewer); library(ggmap); library(maptools); library(dplyr)

Data

Web Scraping

To solve Homework 2, we scrape data from www.ipaidabribe.com and extract 1) title 2) amount 3) name of department 4) transaction detail 5) number of views 6) city.

Given the structure of the webpage, we judge CSS selectors to be more efficient than XPath. A CSS selector is therefore defined below and used to extract the relevant information from the webpage. To obtain a vector containing all links (not only the 10 most recent reports), we make use of the webpage's structure: the page URLs follow a regular pattern, so we can loop over them.

#http://www.ipaidabribe.com/reports/paid?page=#gsc.tab=0
#http://www.ipaidabribe.com/reports/paid?page=10#gsc.tab=0
#http://www.ipaidabribe.com/reports/paid?page=20#gsc.tab=0
#http://www.ipaidabribe.com/reports/paid?page=30#gsc.tab=0

#http://www.ipaidabribe.com/reports/paid?page=[i]0#gsc.tab=0

# Creating "link"-vector to loop over:
link <- rep(NA, 100)

for (i in 1:100){
  link[i] <- paste("http://www.ipaidabribe.com/reports/paid?page=",i-1, "0#gsc.tab=0", sep="")
}
print(link[1:5])
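The same vector can also be built without a loop. The following is a minimal sketch using base R's vectorised `paste0()`; the name `link_vec` is introduced here for illustration only. Note that it writes the first page as `page=0` rather than `page=00`, on the assumption (not verified here) that the site treats the two the same.

```r
# Offsets 0, 10, ..., 990: one per results page (10 reports per page)
offsets <- (0:99) * 10

# paste0() is vectorised, so all 100 URLs are built in one call
link_vec <- paste0("http://www.ipaidabribe.com/reports/paid?page=",
                   offsets, "#gsc.tab=0")
print(link_vec[1:3])
```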

The following piece of code loops over the link vector and scrapes the 10 reports appearing on each page. Each page yields 80 values (8 pieces of information * 10 reports), which are stored in a list, as a list turned out to be easier to work with.

css.selector <- ".name a , .transaction a , .overview .views , .paid-amount span , .unique-reference , .location , .date , .heading-3 a"
data_empty <- list()

for (i in link[1:100]){
  #print(paste("Processing", i, sep=" "))
  data_empty[[i]] <- read_html(i) %>%
    html_nodes(css = css.selector) %>%
    html_text()
  #cat("Done!\n")
  Sys.sleep(1) 
}
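As written, the loop stops as soon as a single page fails to load. One possible safeguard, sketched below, is to wrap the scraping step in `tryCatch()` and skip failed pages. The helper `safe_scrape` and the stand-in `fake_scrape` are hypothetical names; `fake_scrape` only imitates the `read_html() %>% html_nodes() %>% html_text()` pipeline so that the sketch runs without network access, and the example.com URLs are placeholders.

```r
# Hypothetical helper: returns NULL instead of stopping when a page fails
safe_scrape <- function(url, scrape_fun) {
  tryCatch(
    scrape_fun(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for the real scraping pipeline (assumption: 80 values per page)
fake_scrape <- function(url) {
  if (grepl("page=10", url, fixed = TRUE)) stop("simulated timeout")
  rep("text", 80)
}

urls <- c("http://example.com/?page=0", "http://example.com/?page=10")
results <- lapply(urls, safe_scrape, scrape_fun = fake_scrape)
names(results) <- urls

# Drop failed pages before reshaping the data
results <- Filter(Negate(is.null), results)
```

Dropping failed pages also keeps every list element at the same length, which the later conversion with `as.data.frame()` relies on.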

Data Manipulation

The scraped data are now converted into a data frame. The code first turns the list into an untidy data frame and then reshapes it so that each row is one report and each column one variable, i.e. the tidy data structure.

dat <- as.data.frame(data_empty)
dat.vector <- as.vector(as.matrix(dat)) # creating vector and then a new dataframe in order to secure that the df has the structure of 'tidy data'.

# Creating a matrix, then a dataframe from the vector above. 
scrape.matrix <- matrix(dat.vector, ncol = 8,  byrow=TRUE)
df.untidy <- data.frame(scrape.matrix)
names(df.untidy) <- c("views", "title", "department", "section", "bribe", "date", "location", "reference.no")
#df.untidy[1:40,]
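The reshaping step relies on the scraped values arriving in a fixed order, eight fields per report. A small sketch with made-up values (the `_1` and `_2` suffixes are placeholders, not real data) shows how `matrix(..., byrow = TRUE)` turns such a flat vector into one row per report:

```r
fields <- c("views", "title", "department", "section",
            "bribe", "date", "location", "reference.no")

# Two made-up reports, flattened field by field
flat <- c(paste0(fields, "_1"), paste0(fields, "_2"))

# The reshape only makes sense if the length is a multiple of 8
stopifnot(length(flat) %% length(fields) == 0)

# byrow = TRUE fills the matrix row by row, so each row is one report
m <- matrix(flat, ncol = length(fields), byrow = TRUE)
toy <- data.frame(m, stringsAsFactors = FALSE)
names(toy) <- fields
print(toy[, c("views", "title", "bribe")])
```

If a report on the site lacks one of the eight fields, every later value shifts by one column, so a length check like the one above is a cheap first safeguard.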

The following piece of code performs the actual data manipulation.

df <- df.untidy
df$bribe <- sub("Paid INR ", "", df.untidy$bribe) # removes the "Paid INR" prefix from what is to be a numeric vector
df$bribe <- gsub(",", "", df$bribe) # removes the thousands separators
df$bribe <- str_trim(df$bribe) # removes white space
df$bribe <- as.numeric(df$bribe) # transforms character vector to numeric vector
df$title <- str_trim(df.untidy$title) # trims the title vector for white space
df$views <- gsub(" .*", "", df.untidy$views) # removes "views" from the numeric count vector
df$views <- as.numeric(df$views) # transforming views from character to numeric variable
df$location <- str_split(df$location, ",")
df$city <- unlist(lapply(df$location, function(x) x[1]))
df$region <- unlist(lapply(df$location, function(x) x[2]))

#df[1:10,]
# View(df)
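To make the cleaning steps concrete, here is a small base R sketch applied to a few sample strings. The raw formats ("Paid INR 1,500", "1234 views", "City, Region") are assumptions inferred from the regular expressions above, not values copied from the site, and `trimws()` is used in place of `str_trim()` so that the sketch needs no extra packages.

```r
# Assumed raw formats, inferred from the cleaning code above
raw_bribe    <- c("  Paid INR 1,500 ", "Paid INR 200")
raw_views    <- c("1234 views", "56 views")
raw_location <- c("Bangalore, Karnataka", "Mumbai, Maharashtra")

# Strip the prefix and the thousands separators, then convert to numeric
bribe <- as.numeric(trimws(gsub("Paid INR |,", "", raw_bribe)))

# Keep only what comes before the first space
views <- as.numeric(gsub(" .*", "", raw_views))

# Split "City, Region" and trim the leftover whitespace
parts  <- strsplit(raw_location, ",")
city   <- trimws(vapply(parts, function(x) x[1], character(1)))
region <- trimws(vapply(parts, function(x) x[2], character(1)))
```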

Data Analysis

The following analysis is based on data collected on the 8th of November 2015 at 9.30 pm, covering the period from the 12th of October to the 8th of November 2015. Differences in numbers can occur, because new bribes have been reported since this date.

The data analysis will consist of three parts: 1) We will look at the relationship between bribes and state departments. We will examine the number of bribes per department, the total amount of bribe-money per department and the average bribe per department. 2) We will look at the relationship between bribes and type of bribe. We will examine the number of bribes per type of bribe, the total amount of bribe-money per type of bribe and the average bribe per type of bribe. 3) We will look at the relationship between bribes and location. We will examine the number of bribes per region and will list the cities with the largest number of bribes.

We have decided to remove bribes of 1.000.000 INR or more from the dataset, as some of them appear unrealistically high. Some of them might be correct, but because we are not allowed to upload more than one rmarkdown file, and because of problems with uploading a dataset to github, we cannot save the data from this date and be selective about which observations to exclude.


df$bribe[df$bribe >= 1000000] <- NA 
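Before overwriting values, it can be useful to count how many observations the cut-off affects. The following sketch uses a made-up vector in place of `df$bribe`; `which()` is used so that values that are already missing do not enter the index.

```r
# Made-up amounts standing in for df$bribe
bribe <- c(500, 2500000, NA, 1000000, 12000)

# Number of observations at or above the cut-off
n_flagged <- sum(bribe >= 1000000, na.rm = TRUE)

# which() returns only the positions where the comparison is TRUE
bribe[which(bribe >= 1000000)] <- NA
```

Here two values are flagged, and the vector ends up with three missing values in total: the two that were set to NA and the one that was missing to begin with.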

1 The relationship between bribes and state departments.

1a We will first look at the number of bribes per department.

# The following code creates a dataframe and lists the number of bribes per department

df.department = df %>% 
  group_by(department) %>% 
  summarise(
    number = n()
  )  %>%
  arrange(desc(number), desc(department)) 
# The following code creates a visualization of the number of bribes per department

p1 <- ggplot(na.omit(df.department), aes(x = reorder(department, number), y = number))
p1 <- p1 + geom_bar(stat="identity", alpha= 1, fill=I("red")) + coord_flip()
p1 <- p1 + labs(title = "Bribes per department", x = "Department", y = "Number")
p1 + theme_minimal()

It becomes clear that 'Municipal Services' is the department where most bribes are made. Among the 1000 bribes analyzed, 268 were made at 'Municipal Services', 132 at 'Food, Civil Supplies and Consumer Affairs' and 124 at 'Police'.

Among the departments with the fewest bribes are 'Labour', 'Public Works Department' and 'Water and Sewage'. Two bribes were reported without any information about 'department'.
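The number, total and average per department can also be computed in one pass. Below is a minimal base R sketch on made-up data; the amounts are invented and the name `per_dept` is illustrative, and it is not the code used for the figures in this assignment.

```r
# Made-up reports; one bribe amount is missing
toy <- data.frame(
  department = c("Police", "Police", "Municipal Services",
                 "Municipal Services", "Municipal Services"),
  bribe = c(2000, NA, 500, 100, 300),
  stringsAsFactors = FALSE
)

# One vector of amounts per department
groups <- split(toy$bribe, toy$department)

per_dept <- data.frame(
  department = names(groups),
  number  = vapply(groups, length, integer(1)),
  amount  = vapply(groups, sum, numeric(1), na.rm = TRUE),
  average = vapply(groups, mean, numeric(1), na.rm = TRUE),
  row.names = NULL
)

# Most frequently reported department first
per_dept <- per_dept[order(-per_dept$number), ]
print(per_dept)
```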

1b We will now look at the amount of bribe-money per department.

# Data frame of the total amount of bribe-money per department

 df.department_money = df %>% 
  group_by(department) %>% 
  summarise(
    amount = sum(bribe, na.rm = TRUE)
  )  %>%
  arrange(desc(amount), desc(department)) 
# Visualization of the amount of bribe-money per department

p2 <- ggplot(na.omit(df.department_money), aes(x = reorder(department, amount), y = amount))
p2 <- p2 + geom_bar(stat="identity", alpha=1, fill="red") + coord_flip()
p2 <- p2 + labs(title = "Amount of bribe-money per department", x = "Department", y = "Amount")
p2 + theme_minimal()

The amount of bribe-money is highest in 'Municipal Services', where the number of bribes is also highest. The 'Police' receives the second highest amount of bribe-money but roughly half as many bribes as 'Municipal Services', which implies that the average bribe for the 'Police' is higher than for 'Municipal Services'. 'Food, Civil Supplies and Consumer Affairs', which had the second highest number of bribes, only has the 8th highest amount of bribe-money.

1c We will now look at the average amount of bribe-money per department.

# Data frame of the average amount of bribe-money per department
 df.department_money_average = df %>% 
  group_by(department) %>% 
  summarise(
    ave = sum(bribe, na.rm = TRUE)/n()
  )  %>%
  arrange(desc(ave), desc(department)) 
# Visualization of average bribe-money per department
p3 <- ggplot(na.omit(df.department_money_average), aes(x = reorder(department, ave), y = ave))
p3 <- p3 + geom_bar(stat="identity", alpha=1, fill="red") + coord_flip()
p3 <- p3 + labs(title = "Average of bribe-money per department", x = "Department", y = "Average Amount of Bribe")
p3 + theme_minimal()

'Revenue' has the highest average bribe, 25.000 INR, but this is based on only two bribes. 'Commercial Tax, Sales Tax, VAT' has the second highest average bribe, 18.801 INR. 'Public Works Department' has the lowest average bribe, 500 INR (excluding the two observations with no information about department).
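One detail of the calculation is worth keeping in mind when reading these averages. They are computed as `sum(bribe, na.rm = TRUE)/n()`, and `n()` counts every row in a group, including rows whose bribe is missing (for example because it was set to NA by the cut-off above). The result can therefore be lower than the mean of the non-missing bribes. A small sketch with made-up numbers:

```r
# Made-up group of four reports, one with a missing amount
bribe <- c(500, 1500, NA, 1000)

# What sum(bribe, na.rm = TRUE)/n() computes: divides by all four rows
avg_all_rows <- sum(bribe, na.rm = TRUE) / length(bribe)

# Mean over the three non-missing amounts only
avg_non_missing <- mean(bribe, na.rm = TRUE)

print(c(all_rows = avg_all_rows, non_missing = avg_non_missing))
```

Which of the two is preferable depends on whether a missing amount should count as a report with zero payment or be left out of the average altogether.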

From the three plots above, it is furthermore clear that the 'bribe culture' is concentrated within specific departments: Municipal Services; Police; Commercial Tax, Sales Tax, VAT; Stamps and Registration; and Income Tax. These departments cover services like birth certificates, registration of property/land, marriage certificates, and the C form (necessary in order to buy and sell goods). This suggests that market dynamics apply to the "market for bribes" as well: the highest average bribes are paid for services which the citizen cannot be without. Thus, 1) the provider has a monopoly on a good (the service), as no other institution than this public institution can provide the service in the same area / region, and 2) the citizen is forced to obtain the good. Such a setting creates an opportunity structure within which the provider can charge very high bribes. By contrast, for departments like Passport or Transport the service is not essential and the monopoly power weakens. (For many Indians, a passport is not necessary for everyday life.)

Such results can help shed light on possible solutions to corruption in Indian public institutions. Knowing that market dynamics apply, the Indian state could weaken the monopoly situation by setting up several public institutions providing the same service in the same area. This would change the incentive structure for public employees. As the public employee is interested in the payoff from the bribe, the employee still has an incentive to obtain it. However, the employee might only get the bribe if the amount demanded is lower than that demanded by the other institution providing the same service. In this way, the total amount of bribes paid might decrease, though the bribe culture remains.

2 The relationship between bribes and the type of bribe

2a We will first and foremost look at the number of bribes per type of bribe

# The following code creates a dataframe and lists the number of bribes per type of bribes

 df.section = df %>% 
  group_by(section) %>% 
  summarise(
    number = n()
  )  %>%
  arrange(desc(number), desc(section)) 
# The following code creates a visualization of the number of bribes per type of bribe

p4 <- ggplot(na.omit(df.section[which(df.section$number>4),]), aes(x = reorder(section, number), y = number))
p4 <- p4 + geom_bar(stat="identity", alpha=1, fill="red") + coord_flip()
p4 <- p4 + labs(title = "Bribes per section", x = "Section", y = "Number")
p4 + theme_minimal()

The most common bribe is for 'Birth Certificate', with 263 bribes out of 1.000. The second most common bribe is for 'Issue of Ration Card', with 131 bribes.

2b We will look at the amount of bribe-money per section

# Data frame of the total amount of bribe-money per section

 df.section_money = df %>% 
  group_by(section) %>% 
  summarise(
    amount = sum(bribe, na.rm = TRUE)
  )  %>%
  arrange(desc(amount), desc(section)) 
# Visualization of the amount of bribe-money per section

p5 <- ggplot(na.omit(df.section_money[which(df.section_money$amount>5000),]), aes(x = reorder(section, amount), y = amount))
p5 <- p5 + geom_bar(stat="identity", alpha=1, fill="red") + coord_flip()
p5 <- p5 + labs(title = "Amount of bribe-money per section", x = "Section", y = "Amount")
p5 + theme_minimal()

'Birth Certificate' also has the highest amount of bribe-money, 2.986.308 INR in total. 'C form' has the second highest amount of bribe-money, at less than half of the amount paid for 'Birth Certificate'. 'Issue of Ration Card', which had the second highest number of bribes, only has the 8th highest amount.

2c We will look at the average amount of bribe-money per section

# Data frame of the average amount of bribe-money per section
 df.section_money_average = df %>% 
  group_by(section) %>% 
  summarise(
    ave = sum(bribe, na.rm = TRUE)/n()
  )  %>%
  arrange(desc(ave), desc(section)) 
# Visualization of the average of bribe-money per section
p6 <- ggplot(na.omit(df.section_money_average[which(df.section_money_average$ave>1),]), aes(x = reorder(section, ave), y = ave))
p6 <- p6 + geom_bar(stat="identity", alpha=1, fill="red") + coord_flip()
p6 <- p6 + labs(title = "Average of bribe-money per section", x = "Section", y = "Mean")
p6 + theme_minimal()

We see that 'Registration of land' and 'Police harassment' have the highest averages, with 55.000 INR and 54.312 INR respectively. Apparently, false allegations can be settled with a bribe as low as 200 INR, which is about 20 DKK.

3 We will now look at the relationship between bribes and location.

In this section, we explore the geographical distribution of bribery in India.

3a We will first look at the number of bribes per region

# The following piece of code creates a map of India showing the number of bribes per region.
#Loading shapefile
india.regions <- getData("GADM", country = "India", level = 1)

#Inspecting shapefile
#plot(india.regions)
#names(india.regions)

#Converting shapefile into data frame
region.data <- fortify(india.regions, region = c("ID_1"))
region.merge <- merge(region.data, india.regions, by.x = "id", by.y = "ID_1")
region.merge$region <- region.merge$NAME_1

#Setting up data for map: Number of bribes per region
agg.bribes = df %>%
  filter(!is.na(region)) %>%
  group_by(region) %>%
  summarise(agg_bribe = n()) %>%
  arrange(desc(agg_bribe))
agg.bribes$region <- gsub("^\\s+|\\s+$", "", agg.bribes$region)

#Merging data and shapefile
region.map = left_join(region.merge, agg.bribes, by = "region")

#Mapping using ggplot2 and ggmap
map.india <- ggmap(get_map(location = "india", maptype = "toner-background", zoom = 5))
map.india + geom_polygon(aes(x = long, y = lat, fill = agg_bribe, group = group), 
                         data = region.map, alpha = 0.6, color = "black", 
                         size = 0.2, na.rm = T) + 
  scale_fill_gradient("Total Bribes", low = "springgreen3", high = "red") + 
  theme_minimal() + labs(x = NULL, y = NULL, title = "Bribes per region")

The map shows that 1) 'Karnataka' is clearly the region with the most bribes, around 300, 2) all other regions have fewer than 150 bribes and 3) some regions have had zero reported bribes in the period from the 12th of October to the 8th of November 2015.
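A region can end up without a value on the map for two different reasons: no bribes were reported there, or its name in the scraped data does not match the spelling in the shapefile, so the left join finds no partner. A quick way to tell the two apart is to compare the two sets of names before merging. The names below are placeholders; in the analysis they would come from `agg.bribes$region` and the shapefile's `NAME_1` column.

```r
# Placeholder names standing in for the scraped and shapefile regions
scraped_regions   <- c("Karnataka", "Maharashtra", "Some Other Spelling")
shapefile_regions <- c("Karnataka", "Maharashtra", "Kerala")

# Scraped names with no partner in the shapefile: dropped by the join
unmatched <- setdiff(scraped_regions, shapefile_regions)

# Shapefile regions with no scraped reports: shown without a value
no_reports <- setdiff(shapefile_regions, scraped_regions)

print(unmatched)
print(no_reports)
```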

3b We will look at the amount of bribe-money per region

#Setting up data for map: Total amount of bribe-money per. region
total.bribes = df %>%
  filter(!is.na(region)) %>%
  group_by(region) %>%
  summarise(total_bribe = sum(bribe, na.rm = TRUE)) %>%
  arrange(desc(total_bribe))
total.bribes$region <- gsub("^\\s+|\\s+$", "", total.bribes$region)

#Merging data and shapefile
region.map = left_join(region.merge, total.bribes, by = "region")

#Mapping using ggplot2 and ggmap
map.india <- ggmap(get_map(location = "india", maptype = "toner-background", zoom = 5))
map.india + geom_polygon(aes(x = long, y = lat, fill = total_bribe, group = group), 
                         data = region.map, alpha = 0.6, color = "black", 
                         size = 0.2, na.rm = T) + 
  scale_fill_gradient("Total amount", low = "springgreen3", high = "red") + 
  theme_minimal() + labs(x = NULL, y = NULL, title = "Amount of bribe-money per region")

Although the previous map showed that 'Karnataka' was the region with the most bribes, this map shows that 'Maharashtra' is the region with the highest total amount of bribe-money, with 'Uttar Pradesh' having the second highest total amount. One reason for this difference might be that Mumbai is the capital of Maharashtra. Besides being the wealthiest city in India, with the highest GDP of any city in South, West, or Central Asia, Mumbai also has the highest number of billionaires and millionaires among all cities in India. In addition, Mumbai is the financial, commercial and entertainment capital of India. It is also one of the world's top ten centres of commerce in terms of global financial flow, generating 6.16% of India's GDP and accounting for 25% of industrial output, 70% of maritime trade in India (Mumbai Port Trust and JNPT), and 70% of capital transactions to India's economy (source: Wikipedia). 'Uttar Pradesh' is the most populous region, with over 200 million inhabitants, which could be the reason why it has the second highest total amount of bribe-money.

3c We will look at the average amount of bribe-money per region

#Setting up data for map: Average amount of bribe-money per region
ave.bribes = df %>%
  filter(!is.na(region)) %>%
  group_by(region) %>%
  summarise(ave_bribe = sum(bribe, na.rm = TRUE)/n()) %>%
  arrange(desc(ave_bribe))
ave.bribes$region <- gsub("^\\s+|\\s+$", "", ave.bribes$region)

#Merging data and shapefile
region.map = left_join(region.merge, ave.bribes, by = "region")

#Mapping using ggplot2 and ggmap
map.india <- ggmap(get_map(location = "india", maptype = "toner-background", zoom = 5))
map.india + geom_polygon(aes(x = long, y = lat, fill = ave_bribe, group = group), 
                         data = region.map, alpha = 0.6, color = "black", 
                         size = 0.2, na.rm = T) + 
  scale_fill_gradient("Average amount", low = "springgreen3", high = "red") + 
  theme_minimal() + labs(x = NULL, y = NULL, title = "Average amount of bribe-money per region")

This map shows that 'Maharashtra' and 'West Bengal' have the highest average amount of bribe-money per region. The reason for 'West Bengal' could be the same as for 'Maharashtra', namely the presence of a very large city: Mumbai in 'Maharashtra' and Kolkata/Calcutta in 'West Bengal'. Kolkata is the principal commercial, cultural, and educational centre of East India, while the Port of Kolkata is India's oldest operating port and its sole major riverine port. It is also the third most populous metropolitan area, behind Delhi and Mumbai, and the city with the third highest GDP, behind Mumbai and Delhi.

3d We will now look at a map of the top 10 cities

#MAP OF CITIES WITH THE MOST BRIBES

#Locating 10 cities with the most bribes
most.bribes = df %>%
  filter(!is.na(city)) %>%
  group_by(city) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
most.bribes$city <- gsub("^\\s+|\\s+$", "", most.bribes$city)

#Vectorizing city variable 
city.list = most.bribes$city %>%
  as.matrix() %>%
  as.vector()

#Setting up city coordinate function 
city.coordinates = function(city){
  india.city = geocode(city, source = "google", output = "more")
  return(cbind(india.city))
}

#Retrieving long/lat coordinates for top 10 cities
city.list2 = list()
for (i in city.list[1:10]){
  #print(paste("processing", i, sep = " "))
  city.list2[[i]] = city.coordinates(i)
}  
rm(i)

#Creating data frame
city.data = ldply(city.list2)
city.data$city = city.data$.id

#Merging 
df.merge = right_join(city.data, most.bribes, by = "city")
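Because this is a right join, every city in `most.bribes` is kept, including those outside the top 10, which have no coordinates and therefore cannot be drawn. The sketch below illustrates the difference with base R's `merge()` on made-up data; the city names and coordinates are placeholders, and `all.y = TRUE` plays the role of the right join.

```r
# Placeholder data: only two of the three cities were geocoded
city_toy <- data.frame(city = c("CityA", "CityB"),
                       lon = c(77.6, 78.0), lat = c(13.0, 27.2),
                       stringsAsFactors = FALSE)
bribes_toy <- data.frame(city = c("CityA", "CityB", "CityC"),
                         n = c(300, 50, 5),
                         stringsAsFactors = FALSE)

# Right-join behaviour: keeps CityC with missing lon/lat
all_cities <- merge(city_toy, bribes_toy, by = "city", all.y = TRUE)

# Inner join: keeps only the cities that have coordinates
geocoded_only <- merge(city_toy, bribes_toy, by = "city")
```

With an inner join, only the geocoded cities reach the plot, which avoids rows with missing coordinates being passed to the point and text layers.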

#Plotting the data
map.india <- ggmap(get_map(location = "india", maptype = "toner-background", zoom = 5))
map.india + geom_point(aes(x = lon, y = lat, size = (n)), color = I("red"),
                       data = df.merge) + 
  scale_size_area(breaks = (c(10, 50, 100, 150, 200, 300)), 
                  labels = c(10, 50, 100, 150, 200, 300), 
                  name = "Number of bribes") + theme_minimal() +
  labs(x = NULL, y = NULL, title = "Top 10 Bribe Cities") + 
  geom_text(aes(label = city, hjust=1, vjust=1, angle=45), 
            data = df.merge, size = 3, color = I("red")) 

This map shows the top ten cities with the most reported bribes. Bangalore is clearly the city with the highest number of bribes, around 300. Bangalore is the third most populous city in India behind Mumbai and Delhi, but apparently neither Mumbai nor Delhi has a high number of reported bribes, which contradicts the aforementioned possible reasons for 'Maharashtra' and 'Uttar Pradesh'. Perhaps the high total amounts of bribe-money in these regions are not driven by Mumbai and Delhi, with the bribes instead spread out over the entire region, but they could also be due to a few very high bribes in those cities. Agra may have so many reported bribes because it is the 19th most populous city in India, but this could also be linked to the Taj Mahal, which is located in Agra. Achhnera is a municipal board in the Agra district, which may be why it is among the top ten cities with the most reported bribes.

sebastianbarfort commented 8 years ago

Beautiful assignment. Great code, visualizations and explanations.

Keep up the very good work!

APPROVED