sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 24 Assignment 2 #45

Closed emsvenss closed 8 years ago

emsvenss commented 8 years ago

---
title: "Assignment 2"
author: "Group 24: Sophie Burgard, Linxin Chen, Guillermo Edgardo Sepúlveda Witt, Emily Mae Svensson"
date: "November 8, 2015"
output: pdf_document
---

Introduction

Corruption exists everywhere in the world; no country, developed or developing, is free of it. Most studies show, however, that the highest levels of corruption occur in developing countries such as the subject of this report: India. The website Ipaidabribe.com collects crowd-sourced, self-reported attempts to measure corruption in India: every time someone pays a bribe, they can report the details on the website, including the amount, location, and type of bribe. We (group 24) scraped the website for the latest 1000 bribe reports. In this paper we analyze the amounts, types of bribes, departments, views per post, dates, and locations of these 1000 reports.

Data Gathering and Cleaning

Scraping the Data from the Web

Before we can begin our analysis, we must obtain the data for the most recent 1000 reports from Ipaidabribe.com. We build a web scraper to collect it automatically. First we define the CSS selectors that pick out the required data points on each page: Views, Amount, Title, Location, Transaction Type, Department, and Date.

library("rvest")
css.selector = c(".views", ".paid-amount", ".heading-3", ".location", ".transaction", ".name", ".date")
var_names = c("Views", "Amount", "Title", "Location", "Transaction", "Department", "Date")

Next, we loop through the report pages. The page counter runs in steps of 10 observations up to 1000. Using the selectors and the counter, the loop extracts the desired fields from the first 1000 reports and stores them in a data frame for the analysis.

counter = seq(from = 1, to = 1000, by = 10)

for (p in 1:length(counter)) {
  link = paste("http://www.ipaidabribe.com/reports/paid?page=", counter[p], sep = "")
  page = read_html(link)  # read each results page once
  for (i in 1:length(css.selector)) {
    if (i == 1) {
      tmp = page %>%
        html_nodes(css = css.selector[i]) %>%
        html_text()
      tmp = tmp[2:11]  # keep only the 10 report entries
    } else {
      tmp2 = page %>%
        html_nodes(css = css.selector[i]) %>%
        html_text()
      tmp = cbind(tmp, tmp2)
    }
  }
  if (p == 1) {
    df = as.data.frame(tmp)
  } else {
    df = rbind(df, as.data.frame(tmp))
  }
}

names(df) = var_names

Cleaning the Data

After the loop finishes we have 1000 observations. These must be cleaned before we can use the data: the following code fixes problems with spacing, phrasing, and columns, and splits the location and date fields. We use the resulting data frame for the rest of the analysis.

library("readr")
df = read_csv2("https://raw.githubusercontent.com/emsvenss/SDSGroup24/master/Datafinal")
library("stringr")
df$Views = gsub("views", "", df$Views)
df$Amount = gsub("Paid INR", "", df$Amount)
df$Amount = gsub(",", "", df$Amount)
df$Province <- sub("\\w.*\\,", "", df$Location)
df$City <- sub("\\,.*\\w", "", df$Location)

df$month <- sub("(\\w+)\\s(\\d+)\\,\\s(\\d+)", "\\1", df$Date)
df$day <- sub("(\\w+)\\s(\\d+)\\,\\s(\\d+)", "\\2", df$Date)
df$year <- sub("(\\w+)\\s(\\d+)\\,\\s(\\d+)", "\\3", df$Date)
df$month <- str_sub(df$month, 1, 3)     # three-letter month abbreviation
df$month <- match(df$month, month.abb)  # convert abbreviation to month number
df$date <- paste(df$year, df$month, df$day, sep = "/")
df$date <- as.Date(df$date, format = "%Y/%m/%d")
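As a self-contained sanity check, the date parsing above can be tried on a single sample string (the input string here is illustrative, not taken from the scraped data):

```r
library(stringr)

# Illustrative input in the site's date format
d <- "October 12, 2015"
pat <- "(\\w+)\\s(\\d+)\\,\\s(\\d+)"
mon <- match(str_sub(sub(pat, "\\1", d), 1, 3), month.abb)  # "Oct" -> 10
day <- sub(pat, "\\2", d)
yr  <- sub(pat, "\\3", d)
parsed <- as.Date(paste(yr, mon, day, sep = "/"), format = "%Y/%m/%d")
parsed  # 2015-10-12
```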

Data Analysis

Analyzing the Different Departments

Graphing the number of reports per department lets us see which department is reported as the most corrupt.

df$Amount <- as.numeric(as.character(df$Amount))
df$Views <- as.numeric(as.character(df$Views))
library(ggplot2)
p = ggplot(data = df, aes(x = Department))
p + geom_bar(aes(color = Department, fill = Department)) +
  labs(title = "Which Department is Most Corrupted", x = "Department", y = "Number of Cases") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

From the graph we can conclude that Municipal Services is the most reported department, with over 200 cases, or roughly 20% of the total reports. While we suspected that Municipal Services would be among the most corrupt departments, the Education and Labour departments are less corrupt than we expected, each with fewer than 50 reports.
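The share of reports per department can be checked with a one-line proportion table. The sketch below uses hypothetical toy counts; with the scraped data, `df$Department` takes the place of `dept`:

```r
# Toy department labels standing in for df$Department
dept <- c(rep("Municipal Services", 4), rep("Police", 3), rep("Education", 2), "Labour")
shares <- sort(prop.table(table(dept)), decreasing = TRUE)
round(shares, 2)  # the leading department's share of all reports
```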

Although Municipal Services has the most reports, the count alone says little about the amounts of money involved. Plotting the bribe amounts by department lets us see which departments receive the largest bribes.

summary(df$Amount)
p2 = ggplot(data = df, aes(x = Department, y = Amount))
p2 + geom_jitter(aes(color = Department)) +
  labs(title = "Which Department Receives the Most Money", x = "Department", y = "Amount (log scale)") +
  scale_y_log10() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  geom_hline(yintercept = mean(df$Amount, na.rm = TRUE), color = "#17272B")

In the graph, the dark line represents the mean bribe amount across all departments. Most departments have many bribes above this average, but we cannot conclude that any single department receives systematically higher amounts.

Total Views and Bribery

Next we analyze the relationship between the number of views on each post and the reported bribe amount. One might expect posts involving larger sums of money to attract more interest, and thus more views. To check whether this holds for the bribery reports, we regress the number of views per post on the amount of the bribe paid.

p3= ggplot(data=df, aes(x= Amount, y=Views))
p3+geom_point(aes(colour=Department))+
  geom_smooth(data=df[df$Views>0 & df$Amount > 0,], na.rm = TRUE, method="lm", se=FALSE, color="black")+
  scale_x_log10()+scale_y_log10()+theme_minimal()+ #+facet_wrap(~Department)
  labs(title="Relationship between Amount of Money and Viewers", x= "Amount Paid (log scale)", y= "Total Views (log scale)")

From the graph we can clearly see that the regression line slopes downward: the two variables (amount paid and number of views) are negatively related, so posts reporting larger bribes tend to receive fewer views.
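The slope behind the smoothed line can also be read directly from a fitted model. The sketch below uses a small hypothetical data frame with a negative relationship; with the real data, `df[df$Views > 0 & df$Amount > 0, ]` would replace `toy`:

```r
# Toy data exhibiting the negative log-log relationship described above
toy <- data.frame(Amount = c(10, 100, 1000, 10000),
                  Views  = c(500, 200, 80, 30))
fit <- lm(log10(Views) ~ log10(Amount), data = toy)
coef(fit)[["log10(Amount)"]]  # negative: views fall as the amount paid rises
```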

Analyzing Corruption by Transaction Type

Graphing the total sum of the reported bribes for each type of transaction will allow us to determine the most common types of bribes. We can observe the differences between the transaction types by plotting the total number of bribes per transaction type in a histogram. First, we must remove the outliers from the data. Then, we can create data frames using the newly cleaned data in order to determine the count, mean, and total amount for each transaction type.

# Remove outliers
df.clean <- subset(df, df[, 4] != 8120303241)
df.clean <- subset(df.clean, df.clean[, 4] != 1)

# Transaction type data frame
df2 = data.frame(table(df.clean$Transaction, df.clean$day))
names(df2) = c("Transaction", "Day", "Count")

# Count per transaction type
library("plyr")
library("dplyr")
df3 = df.clean %>% 
  filter(Transaction != "") %>% 
  group_by(Transaction) %>% 
  summarise(count = n()) %>% 
  arrange(-count)
head(df3)

# Sum the amount for each transaction type. Mean must be computed before
# Amount is overwritten by the sum: in plyr's summarise, later expressions
# see earlier results.
df4 <- ddply(df.clean, "Transaction", plyr::summarise,
             Count = length(Amount), Mean = mean(Amount),
             Views = sum(Views), Amount = sum(Amount))
df4 <- df4[order(df4$Amount), ]

# Find the count and mean of the amount for each transaction type
df5 <- ddply(df.clean, "Transaction", plyr::summarise,
             Count = length(Amount), Mean = mean(Amount))
df5 <- df5[order(df5$Mean), ]

Next, we can use these data frames to graph our data. From the graph below, we can clearly see that birth certificates are the most commonly reported bribe, with over 200 instances among the most recent 1000 reports. The next highest is the issue of a ration card. Birth certificates are necessary for most important activities in India, including opening a bank account and obtaining a social security number, insurance, passports, and any kind of official license. According to our research, obtaining a birth certificate in India is simple and cheap if it is done within 21 days of the child's birth; otherwise it becomes an expensive and lengthy process. Perhaps birth certificate bribes are so prevalent because many new parents were unable to obtain the certificate within the first 21 days, making it cheaper and easier to pay a bribe than to jump through bureaucratic hoops for a new certificate.

library("ggplot2")
p3 = ggplot(data = df3, aes(x = Transaction, y = count))
p3 = p3 + geom_bar(aes(color = Transaction, fill = Transaction), stat = "identity") +
  theme(panel.background = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Transaction", y = "Number of Reported Bribes", title = "Number of Reported Bribes")
p3

We can see the differences more clearly using a bubble graph, as shown below. As the Count summary shows, the mean count is only 24.85, but the maximum is 213. This implies that most transaction categories have only a small number of bribes associated with them, which is corroborated by the bubble graph, where most of the bubbles are small.

library("ggplot2")
summary(df4$Count)
p4 = ggplot(data = df4, aes(x = Count, y = Transaction))
p4 = p4 + geom_point(aes(size = sqrt(Count / pi), color = Count, fill = Transaction),
                     pch = 21, show_guide = FALSE) +
  geom_text(aes(label = Transaction), size = 4) +
  scale_size_continuous(range = c(1, 90)) +
  scale_x_continuous(limits = c(0, 250)) +
  theme(panel.background = element_blank(), axis.text.y = element_blank(),
        axis.ticks = element_blank(), legend.position = "none") +
  labs(x = "Number of Reported Bribes Per Transaction Type", y = "Transaction",
       title = "Number of Reported Bribes")
p4

We can also compare the different types of bribery by their costs. The graph below displays the total monetary amount of bribes paid for each transaction type, in proportion to one another. It shows an even more extreme gap between birth certificates and most other categories. Surprisingly, the total for ration card issuance is far smaller than for the top transaction types, even though it was the second most common category by count; this implies that ration card bribes are not expensive. Conversely, new passport applications and new PAN cards had low counts but huge total costs. Passport applications show the most extreme difference between count and total cost, implying either that they are extremely expensive or that the data is skewed by an outlier.

library("ggplot2")
summary(df4$Amount)
p5 = ggplot(data = df4, aes(x = Amount, y = Transaction))
p5 = p5 + geom_point(aes(size = sqrt(Amount / pi), color = Amount, fill = Transaction),
                     pch = 21, show_guide = FALSE) +
  geom_text(aes(label = Transaction), size = 4) +
  scale_size_continuous(range = c(1, 80)) +
  theme(panel.background = element_blank(), axis.text.y = element_blank(),
        axis.ticks = element_blank(), legend.position = "none") +
  labs(x = "Total Amount Per Transaction Type", y = "Transaction Type",
       title = "Total Monetary Amount Per Transaction Type")

p5

Looking at the total cost for each transaction type does not give an accurate idea of the real cost per bribe. To fix this, we can take the average amount for each category and compare the results. The graph below displays the average cost of each type of bribe. The average cost of a passport application is significantly higher than any other category. Considering it had one of the lowest counts, this suggests the reported information is skewed: perhaps a single report with a much higher cost inflates the average for passport applications. Interestingly, although birth certificates had the highest count, their cost per bribe is low. Perhaps this is why birth certificate bribes are so common: the average person can more easily afford the bribe than the official process.

p6 = ggplot(data = df5, aes(x = Transaction, y = Mean))
p6 = p6 + geom_bar(aes(color = Transaction, fill = Transaction), stat = "identity") +
  theme(panel.background = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Transaction", y = "Average Cost", title = "Average Cost of Reported Bribes")

p6

The following graph compares the mean amount and the count for each transaction type: the bubbles are scaled by count and placed along the x-axis by mean. As the graph shows, new passport applications are highly unusual compared to the other categories, with the highest mean but one of the lowest counts. Most other categories fall in a similar range of mean price but differ widely in count. The categories with higher counts, such as birth certificates and ration card issuance, have low average prices, which may be part of why they are more common.

p8 = ggplot(data = df5, aes(x = Mean, y = Transaction))
p8 = p8 + geom_point(aes(size = sqrt(Count / pi), color = Mean, fill = Transaction),
                     pch = 21, show_guide = FALSE) +
  geom_text(aes(label = Transaction), size = 4) +
  theme(panel.background = element_blank(), axis.text.y = element_blank(),
        axis.ticks = element_blank(), legend.position = "none") +
  scale_size_continuous(range = c(1, 80)) +
  scale_x_log10() +
  labs(x = "Average Amount per Transaction Type", y = "Transaction Type",
       title = "Average Monetary Amount vs Count Per Transaction Type")

p8
require(stringr)
require(ggplot2)
require(plyr)
# Remove the same outliers as above
dfpd <- subset(df, df[, 4] != 8120303241)
dfpd <- subset(dfpd, dfpd[, 4] != 1)
dfpd$date <- paste(dfpd$month, dfpd$day, sep = "/")
# Count, sum, and mean of the amounts per date...
dfd <- ddply(dfpd, "date", plyr::summarise, N = length(Amount), sum = sum(Amount), mean = mean(Amount))
dfd <- dfd[order(dfd$N), ]
# ...and per province and date
dfpd <- ddply(dfpd, c("Province", "date"), plyr::summarise, N = length(Amount), sum = sum(Amount), mean = mean(Amount))
dfpd <- dfpd[order(dfpd$N), ]

Analyzing corruption levels over time

Analyzing the frequency of the reports, or the total amount of the bribes, by date can reveal behavioural patterns over time. Keep in mind that the data covers only about one month and one thousand observations, so any pattern we identify is not a robust result.

ggplot(na.omit(dfd), aes(x = date, y = N)) +
  geom_bar(stat = "identity", fill = "#006633", color = "#006633") +
  labs(title = "Bribes registered by date", x = "Date", y = "Number of Bribes") +
  theme(axis.text.x = element_text(angle = 60, hjust = 0.7, vjust = 0.5))
ggplot(na.omit(dfd), aes(x = date, y = sum, fill = mean)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Amount of bribes by date", x = "Date", y = "Amount") +
  theme(axis.text.x = element_text(angle = 60, hjust = 0.7, vjust = 0.5))

The left graph shows that 663 of the bribes, 66% of the total, were reported on Monday, October 12. Apart from this peak there are no significant patterns; the trend is considerably stable.

The right graph shows that the largest daily sums occurred on November 2 (2,628,143 INR), October 12 (2,103,827 INR), and October 16 (2,009,205 INR). This is unexpected, since the numbers of bribes reported on those dates were 16, 663, and 6 respectively. The result is driven by two bribes of 2,000,000 INR each, reported on November 2 and October 16, which raise those days' means to 334,867.5 INR and 164,258.9 INR respectively, compared to a mean of 3,173.2 INR on October 12. Even though these observations are outliers, they do not seem improbable: each amount is equivalent to approximately USD 30,000, and they were paid to obtain a passport and a PAN card respectively.
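The effect of a single 2,000,000 INR report on a day's mean is easy to reproduce with toy numbers (the figures below are illustrative, not the actual reports):

```r
# One day with 16 reports: 15 ordinary bribes and one huge outlier
day_amounts <- c(rep(1000, 15), 2000000)
mean(day_amounts)    # dominated by the single outlier
median(day_amounts)  # stays at the typical level of 1000
```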

Analyzing corruption levels over time grouped by Provinces

An interesting result was obtained by analyzing the information grouped by province and date, as can be seen in the following graph.

dfpd$Province <- gsub("\\,", "NA", dfpd$Province)
ggplot(na.omit(dfpd), aes(x = date, y = Province, size = N)) +
  geom_point(color = "blue4") +
  labs(title = "Bribes reported in each Province by date", x = "Date", y = "Province", size = "Number of Bribes") +
  theme(axis.text.x = element_text(angle = 60, hjust = 0.7, vjust = 0.5))

Of all bribes, 396 were reported in Karnataka on Monday, October 12. This is interesting because Mahalaya Amavasya, an important festival and a public legal holiday in the provinces of Karnataka, Odisha, Tripura, and West Bengal, was celebrated that day. In other words, about 40% of the bribes analyzed were reported on a public holiday.

Analyzing corruption across the different Provinces of India

Since India is a huge country with widely differing regions, it is worth looking at how the reported bribes are distributed geographically. Our 1000 observations contain reports from 27 of the 37 states and territories of India. The number of reports varies widely and shows a right-skewed distribution with some extreme values.

# Reading Data, loading packages
library(maps)
library(ggplot2)
library(sp)
library(RColorBrewer)
library(maptools)
library(dplyr)
library(stringr)
library(knitr)

# Delete an obvious outlier / false value in the "Amount" variable
df = df[-which.max(df$Amount), ]

# Set variable types
df$Province = as.factor(df$Province)
df$Amount = as.numeric(df$Amount)

##### Plot for Frequency

# Frequency of bribes by province
prov_freq = df %>% group_by(Province) %>% summarise(freq = n())
names(prov_freq) = c("province", "freq")
prov_freq = prov_freq[-1, ]                                  # drop the empty province
prov_freq$province = tolower(as.character(prov_freq$province))
prov_freq$province = str_sub(prov_freq$province, start = 2)  # strip the leading space
# Frequency of reports by province
summary(prov_freq$freq)

The distribution of the mean paid bribe per state looks exactly the same.

To see how these values are distributed over the area of India, we plot them on a map, using a log scale because of the highly skewed distribution. The maps show a clear pattern: more bribes are reported in the middle of India, decaying towards the borders. The province of Karnataka has by far the most reports, which is not surprising since the city of Bangalore, an economic centre, lies in that area; Bangalore alone accounts for 422 of the state's 434 reports. Surprisingly, Karnataka is only mid-range in the mean paid amount. The mean amount shows less of a pattern on the map, indicating that the high means of some observations might be driven by outliers.

#load map shapefile ( download from http://biogeo.ucdavis.edu/data/gadm2.8/rds/IND_adm1.rds )

# Normally you have to download the above file to run this code, 
#but we could not upload the file to github. We have included the code 
#we used to run the file so you can see how we created the maps, 
#but we have commented it out because it cannot run without the file. 
#We will include the .jpeg files from github so you can see the output 
#without needing to download the file. If you download the file and 
#uncomment the code, it will run and produce the maps shown below.

  #ind1 = readRDS("IND_adm1.rds")  
#extract area information
  #areas = as.data.frame(ind1$NAME_1)
  #names(areas) = c("province")
  #areas$province = as.character(tolower(areas$province))

#merge area information and frequency by are
  #mrg =left_join(x =areas, y = prov_freq, by = "province")
  #mrg$log_f = log(mrg$freq)

#plot the map
#define data to plot and add to map-information
  #ind1$data = as.numeric(mrg$log_f)
#define breakpoints
  #col_no = as.factor(as.numeric(cut(ind1$data, breaks=5)))
  #levels(col_no) = c("< 4", "< 12", "< 40","< 120", "= 434")
  #ind1$col_no = col_no
#chose color palette
  #myPalette = brewer.pal(5,"Greens")
#plot
  #map1 = spplot(ind1, "col_no", col=grey(.5), col.regions=myPalette, main="Absolute Number of Reported Bribes by State")

####plot for mean

#subset for mean by province and some string corrections
  #prov_mean = df %>% group_by(Province) %>%   summarise(mean= mean(Amount))
  #names(prov_mean) = c("province", "mean")
  #prov_mean = prov_mean[-1,]
  #prov_mean$province = as.character(tolower(prov_mean$province))
  #prov_mean$province = str_sub(prov_mean$province, start = 2)
  #prov_mean$mean = as.numeric(prov_mean$mean)

#merge the information
  #mrg3 = left_join(x =areas, y = prov_mean, by = c("province"))
  #mrg3$log_m = log(mrg3$mean)

#plot the map
#define data to plot and add to map-information
  #ind1$data3 = as.numeric(mrg3$log_m)
#define breakpoints
  #col_no = as.factor(as.numeric(cut(ind1$data3, breaks=5)))
  #levels(col_no) = c("< 320 Rs.", "< 1300 Rs.", "< 5300 Rs.","< 22 000 Rs.", "< 90 0000 Rs.")
  #ind1$col_no = col_no
#chose color palette
  #myPalette = brewer.pal(5,"Greens")
#plot
  #map2 = spplot(ind1, "col_no", col=grey(.5), col.regions=myPalette, main="Mean Amount of Reported Bribes by State")
Map 1:

[map 1: Absolute Number of Reported Bribes by State]

Map 2:

[map 2: Mean Amount of Reported Bribes by State]

Conclusion

Although we can draw some conclusions from the analysis above about the corruption levels of different departments, regions, and transactions, we must do so with a healthy amount of skepticism. The crowd-sourced data is self-reported, and we know very little about the people who report bribes. Many bribes may go unreported, and those that are reported are not necessarily accurate: several outliers in the data, with unrealistically low or high amounts compared to the rest of their transaction category, suggest lies or exaggerations. Also, not everyone in India has internet access or knows about the website, so the reporters are likely wealthier than average, which could skew information about the types and locations of bribes. This introduces strong bias in the data and makes any analysis unreliable.

sebastianbarfort commented 8 years ago

Hi all, I get this error message?

[screenshot of the error message, 2015-12-09]
emsvenss commented 8 years ago

Hi Sebastian, We found the source of the problem, some of our data frames had the wrong name. We have updated the markdown file and it should run independently now. Please let us know if there are any more problems, thank you!

sebastianbarfort commented 8 years ago

Great, thank you!

APPROVED