sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/
12 stars 17 forks source link

Assignment 2 - Group 26 #32

Closed clausvestergaard closed 8 years ago

clausvestergaard commented 8 years ago

title: "Assignment 2" author: "by Group 26" date: "8. Nov. 2015"

output: html_document


### Libraries -----------
library(rvest)
library(dplyr)
library(stringr)
library(ggplot2)
library(readr)
library(lubridate)
library(maps)
library(knitr)
library(scales)

# ### Loading data (Has been commented out (Highligt lines -> MAC: Cmd+Shift+C, Win: Ctrl+Shift+C )) -----------
# css.selector = ".unique-reference , .overview .views , .location , .date , .transaction a , .paid-amount span , .name a , .heading-3 a"
# 
# bribe.data <- data.frame() #No need to specify variables here
#  
#
# for (i in 0:99) {
#   
#   link = paste("http://www.ipaidabribe.com/reports/paid?page=",i,"0#gsc.tab=0", sep="")
#   
#   print(paste("processing", i, sep = " "))
#   
#   bribe.page = read_html(link) %>%
#     html_nodes(css = css.selector) %>% 
#     html_text()
#   
#   a <- matrix(bribe.page, nrow=length(bribe.page)/8, ncol=8, byrow=TRUE)
#   b <- data.frame(a)
#   
#   bribe.data <- rbind(bribe.data,b)
#   
#   Sys.sleep(1)
#   cat(" done!\n")
#   
# }

#Scraping is done on 5. nov. 2015 at 10:40!!

#### Load data from github (Active instead of web-scraping) -----
bribe.data <- read_csv("https://raw.githubusercontent.com/clausvestergaard/SDS_Group26/master/bribedata.csv")

### Rename headers ------
names(bribe.data) <- c("OBS","Views","Title","Department","Details","Amount","Date","Place","ID")

bribe.data$OBS=NULL

### Take care of the date format -----
bribe.data$Date=as.character(bribe.data$Date)

bribe.data$Date=bribe.data$Date %>% 
  str_replace(",", "") %>%
  str_replace("November","11") %>%
  str_replace("October", "10") %>%
  str_replace("September", "09") %>%
  str_replace("August", "08") %>%
  str_replace("July", "07") %>%
  str_replace("June", "06")

bribe.data$Date=bribe.data$Date %>%
  str_replace_all("2015", "")

  #Only keep what's needed
bribe.data$Date=str_sub(bribe.data$Date, start=0, end=5)

  #add year
bribe.data$Date=paste("2015",bribe.data$Date, collapse=NULL)

  #format / maybe just remove all whitespace
bribe.data$Date=str_replace_all(bribe.data$Date," ", "-")
bribe.data$Date <- str_trim(bribe.data$Date)

 #Date class
bribe.data$Date=as.Date(bribe.data$Date)

  #Now add the day of the week:
bribe.data$Weekday=weekdays(bribe.data$Date, abbreviate=FALSE)
bribe.data$WeekdayNum=wday(bribe.data$Date) # Created another variable with the weekdays
# ranging from 0 (Sun) to 7 (Sat) - hopefully makes it easier to do weekday graphs then

### Convert views and amount to numeric -----------
bribe.data$Views <- gsub("[^0-9]","",bribe.data$Views) %>%  #Substitute all non numeric characters with nothing
  as.numeric(bribe.data$Views)                      #Convert the vector from class=factor to class=numeric

bribe.data$Amount <- gsub("[^0-9]","",bribe.data$Amount) %>% 
  as.numeric(bribe.data$Amount)

#### Character class for remaining variables -----
bribe.data$Department=as.character(bribe.data$Department)
bribe.data$Title=as.character(bribe.data$Title)
bribe.data$Details=as.character(bribe.data$Details)
bribe.data$Place=as.character(bribe.data$Place)

#### Divide region and city -----
bribe.data$city = str_extract(bribe.data$Place, "[A-z].*,?")
bribe.data$region = gsub(pattern = "[A-z].*,", replacement = "", bribe.data$Place)

#Delete original place variable
bribe.data$Place <- NULL

Introduction

The data behind for the graphs and tables in this assignment has been scraped from the website www.ipaidabribe.com on the 5th of November 2015. All data is crowd sourced in primarily India and the scraped data set contains 1.000 observations and 7 variables which among other things include the size of the bribe, the count of views per bribe report, in which city the bribe was paid, the date and so on.

Bribes per department

The following graph shows the number of bribes paid grouped by department and reordered by the number of paid bribes:

#Count of bribes per department
g <- ggplot(data=data.frame(table(bribe.data$Department)), aes(x=reorder(Var1, Freq), y=Freq)) + #Var1 and Freq is created by the table function
  geom_bar(stat="identity") + coord_flip() +
  xlab("Department") + ylab("Count of bribes paid") +
  theme_minimal() +
  ggtitle("Count of bribes paid per department") 
g

We see that there is 16 different departments (Police, Municipal Services, Transport and so on), one department without a name and a department called ‘others’. We see that the departments has a wide spread of reports – “Unnamed department”, Labour, Public Works Department and Water and Sewage have only one report while Municipal Services have 237 reports.

Moreover, what can also be interesting as a starting point for further investigation is to see how the count of bribes per department (or region, as we shall see later) relates to the total amount of bribes paid to the respective department or in the respective region. We use this to determine the efficiency of a department, or, put differently, the level of extortion a department exerts. This can also be used to differentiate between retail corruption and wholesale corruption.

Below, we use the aggregated data, filter out one single outlier with an amount paid of 8 billion rupees (an equivalent of more than DKK 800 mn.) and those departments which have less than 10 bribes reported.

#start to omit the 'outlier'
bd.omit <- bribe.data %>% 
  filter(Amount<6000000000)

#create new dataframes

bd.sum = bd.omit %>% 
  group_by(Department) %>% 
  summarise(Count=n()) %>% 
  arrange(Department)

bd.amount = bd.omit %>%
  group_by(Department) %>%
  summarise(TotalAmount=sum(Amount)) %>%
  arrange(Department)

bd.sa=merge(bd.sum, bd.amount, by="Department")

bd.sa$AmountPerBribe=bd.sa$TotalAmount/bd.sa$Count
bd.sa$AmountPerBribe=round(bd.sa$AmountPerBribe, digits=2)
bd.sa=bd.sa[-c(1),] #gets rid of the department with no entry

#Notice that there are some states with a very low bribe count; filter out those with 10 or less counts
bd.sa=filter(bd.sa, Count > 10)

g=ggplot(bd.sa, aes(x=reorder(Department, TotalAmount), 
                                                    y=TotalAmount, fill=AmountPerBribe)) +
  geom_bar(stat="identity") + coord_flip() + theme_minimal()  +
  xlab("Department") + ylab("Total amount paid, in INR") + ggtitle("Total amount and mean amount per bribe by departments") +
  theme_minimal()
g

This adequately illustrates both India's vast and intransparent layers of government as well as its tidy share of properly registered employees: Even though the mean bribe for income tax departments vastly exceeds all others, the total amount of bribes is still higher in municipal services. The latter can rather be located in the retail end of the spectrum.

Bribes per State

Focussing on Indian states, we exclude those states where the number of bribes reported is below 10 and show the ten most efficient states:


#start to omit the 'outlier'
bd.omit <- bribe.data %>% 
  filter(Amount<6000000000)

#create new dataframes

bd.state=bd.omit %>% 
  group_by(region) %>% 
  summarise(Count=n()) %>% 
  arrange(region)

bd.amount= bd.omit %>%
  group_by(region) %>%
  summarise(TotalAmount=sum(Amount)) %>%
  arrange(region)

bd.ra=merge(bd.state, bd.amount, by="region")

bd.ra$AmountPerBribe=bd.ra$TotalAmount/bd.ra$Count
bd.ra$AmountPerBribe=round(bd.ra$AmountPerBribe, digits=2)

#Notice that there are some states with a very low bribe count; filter out those with 10 or less counts
bd.ra=filter(bd.ra, Count > 10)

g=ggplot(filter(bd.ra, AmountPerBribe > 10000), aes(x=reorder(region, TotalAmount), 
                                                    y=TotalAmount, fill=AmountPerBribe)) +
  geom_bar(stat="identity") + coord_flip() + theme_minimal()  +
  xlab("State") + ylab("Total amount paid, in INR") + ggtitle("Total amount and mean amount per bribe in ten states")  
g

With some background knowledge it can be noticed that the shown ten states roughly match those states in India with the highest GDP. For instance, with Maharashtra, Uttar Pradesh and Tamil Nadu having the highest total amount of bribes reported, those three are also, in this order, the economic powerhouses of India. Moreover, while Maharashtra has both the highest total volume of bribes reported and the highest mean, Uttar Pradesh has its bribe mean at only about one fourth of Maharashtra's mean. Officials in Maharashtra thus seem to be more extortionate than those in Uttar Pradesh - yet again, this matches roughly with per capita GDP figures for the two states (INR 117,091 for Maharashtra, INR 36,250 for Uttar Pradesh, being the second-poorest state in India. See http://statisticstimes.com/economy/gdp-capita-of-indian-states.php.)

Views per post

In the following we will investigate any patterns that could explain why some posts have higher view count than others. This can be investigated in many different ways. We will look for patterns on the number of views the report have and under which department the report is listed.

Every report listed on www.ipaidabribe.com has a view count. Each report gets a ‘view’ added to the view count every time a person clicks on the link of the report. When a person clicks on a link it reveals that the person wants to know more about this specific bribe.

In our examination of patterns regarding a reports view count and the department where the report is listed, we want to have certain minimum of number of bribes per department. Therefore, we drop departments that have less than 10 bribes. In other words we keep “Commercial Tax, Sales Tax, VAT”, “Customs, Excise and Service Tax”, “Food, Civil Supplies and Consumer Affairs”, “Income Tax”, “Municipal Services”, “Others”, “Passport”, “Police”, “Stamps and Registration” and “Transport” in our examination of patterns.

The examination is done making a density graph of views per department and a table with the mean and the standard deviation of the number of views per department. We use the density graph because it tells us where the largest mass of the distribution is, i.e. how the number of reports per department is distributed over the number of views. A spike in the distribution will tell us that there is bigger (depending on the height of the spike) concentration of reports with a given view count, while a flat distribution will tell us that the reports is more equally distributed over the view count.


condition = c("Commercial Tax, Sales Tax, VAT", "Customs, Excise and Service Tax",
              " Food, Civil Supplies and Consumer Affairs", "Income Tax", "Municipal Services", 
              "Others", "Passport", "Police", "Stamps and Registration", "Transport") #we make a vector of conditions.

bribe.data.dep  = filter(bribe.data, Department %in% condition)

bribe.data.mean = bribe.data.dep %>% 
  group_by(Department) %>%
  summarise(Mean=mean(Views), Std=sd(Views))

kable(bribe.data.mean, digits = 1)

bribe.data.amean = bribe.data.dep %>% 
  group_by(Department) %>%
  summarise(Mean=mean(Amount), Std=sd(Amount))

kable(bribe.data.amean, digits=1)

p = ggplot(bribe.data.dep, aes(x = Views , colour = Department))
p = p + geom_density() + scale_x_log10()
p = p + theme_minimal()
p

From the graph we see that the highest spike with the highest number of views is from the department ‘others’ with a mean of 153 views per report, while the second department with the highest concentration of reports with a high view count is ‘income tax’ with a mean of 130 views per report. The third highest concentration of reports with the highest view count is from the ‘Municipal Services” department with a mean of 115 views per report. However, the concentration of reports from ‘Municipal Services’ is more flatly distributed. This could suggest that reports from the ‘others’ and ‘income tax’ department are of more interest for the users of ipaidabribe.com.

Taking a closer look on the characteristics of the two departments we see that the ‘others’ and the ‘income tax’ department details of transactions is the same for every report, namely ‘others’ for the ‘others’ department and ‘New PAN Card’ for the ‘income tax’ department. The relatively low information on the details of the bribe (especially reports under ‘others’ department) could be an explanation as to why users click on the link of the report to obtain more information about the bribe. This could furthermore explain why some departments have a lower concentration of reports with a high view count (e.g. Police, Transport) because it’s maybe more well known to bribe police officers or transport services. This could be further investigated by looking at the amount of the bribe, or if institutions of government have an effect on the view count of a report.

Illustrating this with the departmens for income tax and municipal services only, there is not a visible correlation between the actual details and the amount paid either:


munin=c("Municipal Services","Income Tax")

q=ggplot(filter(bd.omit, Department %in% munin), aes(x=Date, y=Amount, size=Views, colour=Details))+
  geom_jitter() + scale_y_log10() + theme_minimal() + ylab("Log scale of amount paid") +
  ggtitle("Context and amount of bribes for \n Municipal Services and Income Tax over time")
q

Timeline of aggregated bribes paid and aggegrated amount paid

The following graph shows the total amount paid per day and the number of bribes paid per day.

#### Graph 3 ----
bd.omit <- bribe.data %>% 
  filter(Amount<6000000000)

d <- data.frame(table(bd.omit$Date))
names(d) <- c("Date","Bribes_paid")
d$Date <- as.Date(d$Date)

b <- data.frame(aggregate(bd.omit$Amount, by=list(bd.omit$Date), FUN=sum))
names(b) <- c("Date","Amount")
c <- merge(x=d, y=b, by="Date", all=TRUE)

i <- ggplot(c, aes(x=Date, y=Amount, alpha=Bribes_paid)) + 
      geom_bar(stat="identity",position="dodge", colour="black")+
      scale_y_continuous(labels = comma)+
      xlab("Date") + ylab("Total amount paid") +
      ggtitle("No. of bribes paid + total amount paid per day")+
      geom_errorbar(stat = "hline", yintercept = "mean",
                width=1, colour="red",aes(ymax=..y..,ymin=..y..))
i

From the graph it is shown that the 12th of October is particular extraordinary regarding the number of bribes paid (very black filling). On this day at least 611 bribes were paid. The 12th of October is the first date in our data set hence it's quite likely that more bribes were paid but not included in the scraping of the data. The 611 bribes on the 12th of October sums up to a total amount of 2.000.000. The graph is a bit misguiding since the second highest number of paid bribes are 48 (13th of October). The red line shows the mean amount paid which is 484.752.

sebastianbarfort commented 8 years ago

Great assignment.

Beautiful plots and good descriptions. You could also have considered plotting the data geographically, but great job overall.

APPROVED