sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/
12 stars 17 forks source link

Group 14: Assignment 2 #56

Closed andreaslangholz closed 8 years ago

andreaslangholz commented 8 years ago

title: "Assignment 2" author: "Gruppe 14" date: "8. nov. 2015"

output: html_document

Bribes in India

In this assignment we wish to understand the different dynamics in Indian corruption, using data scraped from the site www.IPaidABribe.com. The assignement follows in four parts, first we describe the data we are interested in and the process of obtaining such data using the scraping tools from the 'rvest' package. Next we clean and manipulate the data into the variables we need for the analytical description. Finally we visualize the data, highlighting different aspects of corruption in India segmented into departments, weekdays and geography.

A general comment on datatype and quality

Data is collected through the website “http://www.ipaidabribe.com” where people have the possibility of reporting three kinds of bribe reports “I paid bribe”, “I did not pay a bribe” and “I met a honest Officer”.

In this analysis the focus will be on the report type “I paid bribe”. This report type can be done anonymous or public in the sense, that the reporting persons contract information will be shared with one or more of the following options “Senior officials of the department”, “State Vigilance Officer” or “Media - Newspapers & Television”. When collecting the data it is difficult distinguish what is anonymous or public (Note: It is possible to get the information, but it demands are more advanced web scraper and because of time pressure, this will not be developed.). This raises the issue about selection bias in the data. It can be imagine, that people who are using the website, is the people whom are the most frustrated about corruption and since it is free to report anonymous, they may let of stream and may exaggerate the reported bribes. Another selection problem is that not all people in India have Internet access, which can make bribes seem more widespread in some areas, only because more people have access to the Internet.

The data quality is reasonably high, which comes form a clever report template. When making a report there are predetermined lists for cities, date, department and type of bribe, and the bribe amount can only be numeric and in INR. With this setup the user minimize the chance of accidently reporting wrong information.

Getting the data

#Libraries used in the assignment:
x <- c ("rvest", "plyr", "ggthemes", "dplyr", "stringr", "ggplot2", "lubridate", "raster", "maptools", "sp", "magrittr", "RColorBrewer", "rgdal", "rgeos")
lapply(x, require, character.only = TRUE)

First we want to scrape the desired data from www.ipaidabribe.com. We do this in 3 steps:

1) Finding the data

We define the URL we want to scrape from, and then apply CSS.selector to find the selectors of the different items in the HTML code we want to scrape.

link = "http://www.ipaidabribe.com/reports/paid?page="

css.selector.title = ".heading-3 a"
css.selector.location = ".location"
css.selector.department = ".name a"
css.selector.date = ".date"
css.selector.detail = ".transaction a"
css.selector.ID = ".unique-reference"
css.selector.amount = ".paid-amount span"
css.selector.views = ".overview .views"
css.selector.all = ".ref-module-paid-bribe"

2) Defining the extraction function

We then define a function able to extract the desired information from the provided URL

#Creating a function that extracts the correct data from a given URL input
extraction_function = function(link) {
  site = html(link)
  title = site %>% html_nodes(css = css.selector.title) %>% html_text()
  department = site %>% html_nodes(css = css.selector.department) %>% html_text()
  location = site  %>% html_nodes(css = css.selector.location) %>% html_text ()
  date = site  %>% html_nodes(css = css.selector.date) %>% html_text ()
  amount = site  %>% html_nodes(css = css.selector.amount) %>% html_text ()
  detail = site  %>% html_nodes(css = css.selector.detail) %>% html_text ()
  views = site  %>% html_nodes(css = css.selector.views) %>%  html_text ()
  ID = site  %>% html_nodes(css = css.selector.ID) %>% html_text () 
  return(cbind(title, department, location, date, amount, detail, views, ID))
}

3) Looping the function

Noticing that the original URL links to the pages in increments of 10, we define a list of links that we want the function to loop through, and construct a loop to assemple all 1000 observations in a single dataframe

#creating a vector with the pagenumbers ending the URL for each page
page <- c(seq(0,1000,10))

#creating a vector consisting of URL links to the first 100 pages
listoflinks = paste0(link, page)
data = list()

for (i in seq_along(listoflinks)) {
  data[[i]] = extraction_function(listoflinks[i])
  Sys.sleep(1)
}
#Converting to dataframe:
bribe.data = ldply(data)

Cleaning the Data

We now extract the variables we need for the visual analysis

# Numerical number of views pr. observation
bribe.data$views = as.numeric(str_extract(bribe.data$views, "[0-9]+,[0-9]+,[0-9]+|[0-9]+,[0-9]+|[0-9]+"))

# Extracting the size of the bribed amount
bribe.data$size = str_extract(bribe.data$amount, "[0-9]+,[0-9]+,[0-9]+|[0-9]+,[0-9]+|[0-9]+")
bribe.data$size = as.numeric(gsub(",", "", bribe.data$size))

## We do a bit of manipulation to come up with a usable date variable:
# First extracting the relevant information:
bribe.data$year  = str_extract(bribe.data$date, ", [0-9]*")
bribe.data$year  = as.numeric(gsub(", ", "", bribe.data$year))
bribe.data$month = str_extract(bribe.data$date, "[A-z]*")
bribe.data$day   = as.numeric(str_extract(bribe.data$date, "([0-9])+"))

# Function that converts month from character to numeric.
numMonth <- function(x) {
  months <- list(january  =1,  february =2,  march     =3, 
                 april    =4,  may      =5,  june      =6, 
                 july     =7,  august   =8,  september =9, 
                 october  =10, november =11, december  =12)

  x <- tolower(x)
  sapply(x,function(x) months[[x]])
}
bribe.data$month_n = as.numeric(numMonth(bribe.data$month))

# Making a date.
bribe.data$day= ifelse(bribe.data$day<10, gsub(" ","", paste("0", bribe.data$day)),bribe.data$day)
bribe.data$date.c = as.Date(paste(bribe.data$year, bribe.data$month_n, bribe.data$day, sep="-"))

# Extracting state and city
bribe.data$city  = str_extract(bribe.data$location, "[A-z]*")
bribe.data$state = sub('^[^,]+,', '', bribe.data$location)

Visualization

Density Plot

amountmean <- mean(log(bribe.data$size))
amount_sd <- sd(log(bribe.data$size))

density <- ggplot(bribe.data, aes(x=log(bribe.data$size)))
density + geom_density(size = 1) + stat_function(fun = dnorm, args = list(mean = amountmean, sd = amount_sd), color = "red", size = 1) + labs(list(title="Destribution of bribes", x = "log(amount)")) + theme_minimal() 

The density plot shows the distribution of log bribe amount together with a normal distribution. Taking the log reveals the destribution, as it would otherwisebe invisible due to the large data range We se that the distribution of bribes is approximately normal, but with more bribes centered around the mean

Weeks and department

We now want to investigate whether we can see different patterns of corruption related to the different departments bribed and the day of occurence

We start by observing the different levels of bribing in different departments

# Variable transforming date to weekday
bribe.data$weekday <- weekdays(bribe.data$date.c)
bribe.data$weekday <- factor(bribe.data$weekday, levels= c("mandag", "tirsdag", "onsdag", "torsdag", "fredag", "lørdag", "søndag"), labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
bribe.data[order(bribe.data$weekday), ]

# Generating variable with department averages and weekday averages
department_average <- bribe.data %>% 
  group_by(department) %>% 
  summarise(count= n(), avg = mean(size, na.rm = TRUE))
department_average_weekday <- bribe.data %>%
  group_by(department, weekday) %>% 
  summarise(count= n(), avg = mean(size, na.rm = TRUE))

#antal bestikkelser fordelt på afdeling
department_bribes = bribe.data %>% 
  group_by(department) %>% 
  summarise(Number_bribes = n())

#antal bestikkelser fordelt på afdeling og ugedag 
department_bribes_weekday = bribe.data %>% group_by(department, weekday) %>% summarise(Number_bribes = n())
#gns sum fordelt på afdeling
r = ggplot(department_average, aes(x = reorder(department, avg), y = avg))
r + geom_bar(stat="identity") + coord_flip() + theme(axis.text.x = element_blank()) + labs(list(title = "Average size of bribe in different departments", x = "Department", y = "Average size of bribe")) + scale_fill_discrete(name="Experimental\nCondition") + theme_minimal()

#antal observationer fordelt på afdeling
q = ggplot(department_bribes, aes(x = reorder(department, Number_bribes), y = Number_bribes))
q + geom_bar(stat="identity") + coord_flip() + theme(axis.text.x = element_blank()) + labs(list(title = "Numbers of bribes in different departments", x = "Department", y = "Number of bribes")) + scale_fill_discrete(name="Experimental\nCondition") + theme_minimal()

The department with the highest average is "Income Tax" (63321 INR), where there are only 39 observations. This means that a simple observation can impact the average a lot.In the department where bribe is most common, "Municipal Services" (255), it's way cheaper to bribe with an average size of the bribes of 12630 INR. "Municipal Services" thus rank 7th out of the 17 different department with respect to average size of the bribes

FACET WRAP

Next, we use facet wrap to create bar plots of the different departments, showing total sum of bribe, avg. size of bribe and number of bribes on different weekdays. For each figure we have choosen the most interesting departments. For the sake of simplicity and analytical relevance, we have chosen out the departments with the most interesting patterns, such as fact is that the number of bribes are highest on mondays for almost every department.

#totale bestikkelsessum fordelt på ugedag og afdeling
n = ggplot(bribe.data, aes(x=weekday, y=size))
n + geom_bar(stat = "identity") + facet_wrap(~department, scales = "free")
#enkelte afdelinger udvælges
facets <- c("Commercial Tax, Sales Tax, VAT", "Customs, Excise and Service Tax", "Municipal Services", "Food, Civil Supplies and Consumers Affairs", "Stamps and Registration", "Education", "Income Tax")

n.sub = ggplot(bribe.data[bribe.data$department %in% facets,], aes(x=weekday, y=size))
n.sub + geom_bar(stat = "identity") + facet_wrap(~department, scales = "free") + theme(axis.text.x = element_blank()) + labs(list(title = "Total sum of bribes on different days of the week in different departments", x = "", y = "")) + scale_fill_discrete(name="Experimental\nCondition") + theme_minimal()

#gns. sum fordelt på afdeling

k = ggplot(department_average_weekday, aes(x=weekday, y=avg))
k + geom_bar(stat = "identity") + facet_wrap(~department, scales = "free")
#enkelte afdelinger udvælges

facets.k <- c("Commercial Tax, Sales Tax, VAT", "Customs, Excise and Service Tax", "Municipal Services", "Food, Civil Supplies and Consumers Affairs", "Stamps and Registration", "Education", "Income Tax", "Transport", "Railways", "Passport")

k.sub = ggplot(department_average_weekday[department_average_weekday$department %in% facets.k,], aes(x=weekday, y=avg))

k.sub + geom_bar(stat = "identity") + facet_wrap(~department, scales = "free") + theme(axis.text.x = element_blank()) + labs(list(title = "Avg. size of bribe on different days of the week in different departments", x = "", y = "")) + scale_fill_discrete(name="Experimental\nCondition") + theme_minimal()
#antal observationer fordelt på ugedag og afdeling
f = ggplot(department_bribes_weekday, aes(x=weekday, y=Number_bribes))
f + geom_bar(stat = "identity") + facet_wrap(~department, scales = "free") + theme(axis.text.x = element_blank()) + labs(list(title = "Number of bribes on different days of the week in different departments", x = "", y = "")) + scale_fill_discrete(name="Experimental\nCondition") + theme_minimal()
#enkelte afdeling udvælges
facets.f <- c("Commercial Tax, Sales Tax, VAT", "Customs, Excise and Service Tax", "Education", "Municipal Services", "Food, Civil Supplies and Consumers Affairs", "Stamps and Registration", "Income Tax", "Police", "Railway", "Passport", "Transport")

f.sub = ggplot(department_bribes_weekday[department_bribes_weekday$department %in% facets.f,], aes(x=weekday, y=Number_bribes))

f.sub + geom_bar(stat = "identity") + facet_wrap(~department, scales = "free") + theme(axis.text.x = element_blank()) + labs(list(title = "Number of bribes on different days of the week in different departments", x = "", y = "")) + scale_fill_discrete(name="Experimental\nCondition") + theme_minimal()

Map

Lastly we want to provide a map of Indian provinces, showing the level of corruption as a function of the depth of the colorscale.

#=========Forsøg på at komme udenom pipen der ikke virker=========

state_avg2 <- group_by(bribe.data, state)
state_avg2 <- summarise(state_avg2, mean_size = mean(size))

state_avg2$logsize = log(state_avg2$mean_size)
state_avg2$region <- sub(" ","",state_avg2$state)

state_avg2 <- state_avg2[!(state_avg2$region == ""),]
state_avg2$region = tolower(state_avg2$region)

#================== Preparing data for map =======================

#Map of India
ind1 <- getData('GADM', country = 'IND', level = 1)

# Fortifying and creating "merger" variable "region#

ind1$region = tolower(ind1$NAME_1)
ind.f = fortify(ind1, region = "region")
ind.f$region = tolower(ind.f$id)

df.map <- merge(ind.f, state_avg2, by="region", all.x = TRUE)
df.plot <- df.map[order(df.map$order), ]

#Plotting with color=========================================================================
#min(state_avg2$logsize)
#max(state_avg2$logsize)

cnames <- aggregate(cbind(long, lat) ~ id, data=df.plot, FUN=function(x) mean(range(x)))
ggplot() +
  geom_polygon(data = df.plot,
               aes(x = long, y = lat, group = group, fill = logsize),
               color = "grey", size = 0.6) +
  coord_map() +
  scale_fill_gradient(name="Log-Average Bribe", limits=c(4, 13), low="white", high="blue") +
  theme_minimal() +
  labs(title="Level of corruption in India", x = "Longtitude", y = "Latitude") + 
  geom_text(data=cnames, aes(long, lat, label = id), size=2.5)

Summing up

We have now touch upon a visual analysis of te different patterns of corruption in India, based on both the different departments of the state, the distribution on weekdays and the geographical dispersion. Further analysis could look more into the city level of corruption, as well as the seasonality and general long-term time trends.

Full R-Code:

#Libraries used in the assignment:
x <- c ("rvest", "plyr", "ggthemes", "dplyr", "stringr", "ggplot2", "lubridate", "raster", "maptools", "sp", "magrittr", "RColorBrewer", "rgdal", "rgeos")
lapply(x, require, character.only = TRUE)

link = "http://www.ipaidabribe.com/reports/paid?page="

css.selector.title = ".heading-3 a"
css.selector.location = ".location"
css.selector.department = ".name a"
css.selector.date = ".date"
css.selector.detail = ".transaction a"
css.selector.ID = ".unique-reference"
css.selector.amount = ".paid-amount span"
css.selector.views = ".overview .views"
css.selector.all = ".ref-module-paid-bribe"

#Creating a function that extracts the correct data from a given URL input
extraction_function = function(link) {
  site = html(link)
  title = site %>% html_nodes(css = css.selector.title) %>% html_text()
  department = site %>% html_nodes(css = css.selector.department) %>% html_text()
  location = site  %>% html_nodes(css = css.selector.location) %>% html_text ()
  date = site  %>% html_nodes(css = css.selector.date) %>% html_text ()
  amount = site  %>% html_nodes(css = css.selector.amount) %>% html_text ()
  detail = site  %>% html_nodes(css = css.selector.detail) %>% html_text ()
  views = site  %>% html_nodes(css = css.selector.views) %>%  html_text ()
  ID = site  %>% html_nodes(css = css.selector.ID) %>% html_text () 
  return(cbind(title, department, location, date, amount, detail, views, ID))
}

#creating a vector with the pagenumbers ending the URL for each page
page <- c(seq(0,1000,10))

#creating a vector consisting of URL links to the first 100 pages
listoflinks = paste0(link, page)
data = list()

for (i in seq_along(listoflinks)) {
  data[[i]] = extraction_function(listoflinks[i])
  Sys.sleep(1)
}
#Converting to dataframe:
bribe.data = ldply(data)

# Numerical number of views pr. observation
bribe.data$views = as.numeric(str_extract(bribe.data$views, "[0-9]+,[0-9]+,[0-9]+|[0-9]+,[0-9]+|[0-9]+"))

# Extracting the size of the bribed amount
bribe.data$size = str_extract(bribe.data$amount, "[0-9]+,[0-9]+,[0-9]+|[0-9]+,[0-9]+|[0-9]+")
bribe.data$size = as.numeric(gsub(",", "", bribe.data$size))

## We do a bit of manipulation to come up with a usable date variable:
# First extracting the relevant information:
bribe.data$year  = str_extract(bribe.data$date, ", [0-9]*")
bribe.data$year  = as.numeric(gsub(", ", "", bribe.data$year))
bribe.data$month = str_extract(bribe.data$date, "[A-z]*")
bribe.data$day   = as.numeric(str_extract(bribe.data$date, "([0-9])+"))

# Function that converts month from character to numeric.
numMonth <- function(x) {
  months <- list(january  =1,  february =2,  march     =3, 
                 april    =4,  may      =5,  june      =6, 
                 july     =7,  august   =8,  september =9, 
                 october  =10, november =11, december  =12)

  x <- tolower(x)
  sapply(x,function(x) months[[x]])
}
bribe.data$month_n = as.numeric(numMonth(bribe.data$month))

# Making a date.
bribe.data$day= ifelse(bribe.data$day<10, gsub(" ","", paste("0", bribe.data$day)),bribe.data$day)
bribe.data$date.c = as.Date(paste(bribe.data$year, bribe.data$month_n, bribe.data$day, sep="-"))

# Extracting state and city
bribe.data$city  = str_extract(bribe.data$location, "[A-z]*")
bribe.data$state = sub('^[^,]+,', '', bribe.data$location)

#Visualization
amountmean <- mean(log(bribe.data$size))
amount_sd <- sd(log(bribe.data$size))

density <- ggplot(bribe.data, aes(x=log(bribe.data$size)))
density + geom_density(size = 1) + stat_function(fun = dnorm, args = list(mean = amountmean, sd = amount_sd), color = "red", size = 1) + labs(list(title="Destribution of bribes", x = "log(amount)")) + theme_minimal() 

# Variable transforming date to weekday
bribe.data$weekday <- weekdays(bribe.data$date.c)
bribe.data$weekday <- factor(bribe.data$weekday, levels= c("mandag", "tirsdag", "onsdag", "torsdag", "fredag", "lørdag", "søndag"), labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
bribe.data[order(bribe.data$weekday), ]

# Generating variable with department averages and weekday averages
department_average <- bribe.data %>% 
  group_by(department) %>% 
  summarise(count= n(), avg = mean(size, na.rm = TRUE))
department_average_weekday <- bribe.data %>%
  group_by(department, weekday) %>% 
  summarise(count= n(), avg = mean(size, na.rm = TRUE))

#antal bestikkelser fordelt på afdeling
department_bribes = bribe.data %>% 
  group_by(department) %>% 
  summarise(Number_bribes = n())

#antal bestikkelser fordelt på afdeling og ugedag 
department_bribes_weekday = bribe.data %>% group_by(department, weekday) %>% summarise(Number_bribes = n())

#gns sum fordelt på afdeling
r = ggplot(department_average, aes(x = reorder(department, avg), y = avg))
r + geom_bar(stat="identity") + coord_flip() + theme(axis.text.x = element_blank()) + labs(list(title = "Average size of bribe in different departments", x = "Department", y = "Average size of bribe")) + scale_fill_discrete(name="Experimental\nCondition") + theme_minimal()

#antal observationer fordelt på afdeling
q = ggplot(department_bribes, aes(x = reorder(department, Number_bribes), y = Number_bribes))
q + geom_bar(stat="identity") + coord_flip() + theme(axis.text.x = element_blank()) + labs(list(title = "Numbers of bribes in different departments", x = "Department", y = "Number of bribes")) + scale_fill_discrete(name="Experimental\nCondition") + theme_minimal()

#totale bestikkelsessum fordelt på ugedag og afdeling
n = ggplot(bribe.data, aes(x=weekday, y=size))
n + geom_bar(stat = "identity") + facet_wrap(~department, scales = "free")

#enkelte afdelinger udvælges
facets <- c("Commercial Tax, Sales Tax, VAT", "Customs, Excise and Service Tax", "Municipal Services", "Food, Civil Supplies and Consumers Affairs", "Stamps and Registration", "Education", "Income Tax")

n.sub = ggplot(bribe.data[bribe.data$department %in% facets,], aes(x=weekday, y=size))
n.sub + geom_bar(stat = "identity") + facet_wrap(~department, scales = "free") + theme(axis.text.x = element_blank()) + labs(list(title = "Total sum of bribes on different days of the week in different departments", x = "", y = "")) + scale_fill_discrete(name="Experimental\nCondition") + theme_minimal()

#gns. sum fordelt på afdeling

k = ggplot(department_average_weekday, aes(x=weekday, y=avg))
k + geom_bar(stat = "identity") + facet_wrap(~department, scales = "free")

#enkelte afdelinger udvælges

facets.k <- c("Commercial Tax, Sales Tax, VAT", "Customs, Excise and Service Tax", "Municipal Services", "Food, Civil Supplies and Consumers Affairs", "Stamps and Registration", "Education", "Income Tax", "Transport", "Railways", "Passport")

k.sub = ggplot(department_average_weekday[department_average_weekday$department %in% facets.k,], aes(x=weekday, y=avg))

k.sub + geom_bar(stat = "identity") + facet_wrap(~department, scales = "free") + theme(axis.text.x = element_blank()) + labs(list(title = "Avg. size of bribe on different days of the week in different departments", x = "", y = "")) + scale_fill_discrete(name="Experimental\nCondition") + theme_minimal()

#antal observationer fordelt på ugedag og afdeling
f = ggplot(department_bribes_weekday, aes(x=weekday, y=Number_bribes))
f + geom_bar(stat = "identity") + facet_wrap(~department, scales = "free") + theme(axis.text.x = element_blank()) + labs(list(title = "Number of bribes on different days of the week in different departments", x = "", y = "")) + scale_fill_discrete(name="Experimental\nCondition") + theme_minimal()

#enkelte afdeling udvælges
facets.f <- c("Commercial Tax, Sales Tax, VAT", "Customs, Excise and Service Tax", "Education", "Municipal Services", "Food, Civil Supplies and Consumers Affairs", "Stamps and Registration", "Income Tax", "Police", "Railway", "Passport", "Transport")

f.sub = ggplot(department_bribes_weekday[department_bribes_weekday$department %in% facets.f,], aes(x=weekday, y=Number_bribes))

f.sub + geom_bar(stat = "identity") + facet_wrap(~department, scales = "free") + theme(axis.text.x = element_blank()) + labs(list(title = "Number of bribes on different days of the week in different departments", x = "", y = "")) + scale_fill_discrete(name="Experimental\nCondition") + theme_minimal()

#=========Forsøg på at komme udenom pipen der ikke virker=========

state_avg2 <- group_by(bribe.data, state)
state_avg2 <- summarise(state_avg2, mean_size = mean(size))

state_avg2$logsize = log(state_avg2$mean_size)
state_avg2$region <- sub(" ","",state_avg2$state)

state_avg2 <- state_avg2[!(state_avg2$region == ""),]
state_avg2$region = tolower(state_avg2$region)

#================== Preparing data for map =======================

#Map of India
ind1 <- getData('GADM', country = 'IND', level = 1)

# Fortifying and creating "merger" variable "region#

ind1$region = tolower(ind1$NAME_1)
ind.f = fortify(ind1, region = "region")
ind.f$region = tolower(ind.f$id)

df.map <- merge(ind.f, state_avg2, by="region", all.x = TRUE)
df.plot <- df.map[order(df.map$order), ]

#Plotting with color=========================================================================
#min(state_avg2$logsize)
#max(state_avg2$logsize)

cnames <- aggregate(cbind(long, lat) ~ id, data=df.plot, FUN=function(x) mean(range(x)))
ggplot() +
  geom_polygon(data = df.plot,
               aes(x = long, y = lat, group = group, fill = logsize),
               color = "grey", size = 0.6) +
  coord_map() +
  scale_fill_gradient(name="Log-Average Bribe", limits=c(4, 13), low="white", high="blue") +
  theme_minimal() +
  labs(title="Level of corruption in India", x = "Longtitude", y = "Latitude") + 
  geom_text(data=cnames, aes(long, lat, label = id), size=2.5)
sebastianbarfort commented 8 years ago

Great assignment.

Very clear R code and nice use of maps.

APPROVED.