sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/
12 stars 17 forks source link

Team 1 - Assignment 2 #36

Closed Nynnehelt closed 8 years ago

Nynnehelt commented 8 years ago

title: "SDS - Assignment 2 - Team 1" author: "Stine Hindberg Andersen, Kasper Hjalager & Nynne Almdal" header-includes:

The first chunk is the code which scrapes the latest 1000 bribe reports from "www.ipaidabribe.com" and creates a dataframe with all the information. We will use the data frame to analyse

  1. What can explain the amount of views the report gets
  2. What can explain the size of the bribe payments
  3. The correlation between population and posts
#Load required packages
library("maptools")
library("rvest")
library("stringr")
library("sp")
library("readr")
library("knitr")
library("tidyr")
library("xml2")
library("plyr")
library("ggplot2")
library("dplyr")
#Creates an vector for the 100 websites containing informations about the reports
bar = seq(0, 990, by=10)
page=c("a")
page_final<-c("http://www.ipaidabribe.com/reports/paid?page=#gsc.tab=0")
for(i in 2:length(bar)) 
{
  page[i]=paste("http://www.ipaidabribe.com/reports/paid?page=", bar[i], sep="")
page_final[i]=paste(page[i], "#gsc.tab=0", sep="")
}
#Creates an function which scrapes the information needed from the websites
scrape_ipaid = function(link){
  my.link = read_html(link, encoding = "UTF-8")
  title = my.link %>% 
    html_nodes(".heading-3 a") %>% html_text()
  amount_INR = my.link %>% 
    html_nodes(".paid-amount span") %>% html_text()
  name = my.link %>% 
    html_nodes(".name a") %>% html_text()
  detail = my.link %>% 
    html_nodes(".transaction a") %>% html_text()
  views = my.link %>% 
    html_nodes(".overview .views") %>% html_text()
  city = my.link %>% 
    html_nodes(".location") %>% html_text()
  date = my.link %>% 
    html_nodes(".date") %>% html_text()
  return(cbind(title, amount_INR, name, detail, views, city, date))
}
#Creates a loop which uses the scrape-function to gather information from the 100 websites and 1000 reports
  first.link = page_final[1]
  data = scrape_ipaid(first.link)
  for (i in page_final[2:100]){
    data = rbind(data,scrape_ipaid(i))
  }
  #We convert the matrix to a data frame
  df=as.data.frame.matrix(data)

  ###DATA CLEANING###

  #Transforms the date variable into a valid date format
  lct <- Sys.getlocale("LC_TIME"); Sys.setlocale("LC_TIME", "C")
  #Somehow the above code is required to transform all the dates
  df$date=gsub(",","",df$date)
  df$date=as.Date(df$date, "%B %d %Y")
  df$date=as.Date(df$date,"%d/%m/%Y")

  #Creates a weekday variable
  df$day <- weekdays(df$date)

  #Transforms the amount variable into a numeric format
  df$amount_INR=gsub("Paid INR ","",df$amount_INR)
  df$amount_INR=gsub(",","",df$amount_INR)
  df$amount_INR=as.numeric(df$amount_INR)

  #Transforms the views variable into a numeric format
  df$views=gsub(" views","",df$views)
  df$views=as.numeric(df$views)

  # Separates the location variable into city and region
  df = df %>% 
    separate(col=city, c("city", "region"), sep=" , ")

  #Deletes outliers
  df= df %>% filter(amount_INR<=8000000000)

What can explain the amount of views a report gets?

When you look at the different bribe reports there is a big difference in the number of views and it makes sense that some reports are more useful than others. In this section we will analyse, using descriptive statistics, what affects the number of views a report gets.

Weekdays

At first, we will look if the number of views depends on, which particular day the report is uploaded. It is possible to imagine that people have more time to look at the website in the weekend and if they look more at the latest reports, we will see a higher average of views on reports uploaded on Saturday or Sunday.

  df_day= df %>% group_by(day) %>%
    mutate(mean = mean(views)) %>% distinct(day) %>%
    select(day, mean)

  p = ggplot(df_day, aes(x = reorder(day, -mean), y = mean))
  p = p + geom_bar(stat="identity",width=.5) + labs(x="Weekday", y="Views", title="The average views on different weekdays") + theme_minimal()
plot(p)

The plot above shows that there is a significant difference between the weekdays. According to the graph, people should upload their report on Thursday and not Monday if they want more people to see their report. There is 4 times more report views on a Thursday compared to a Monday. Saturday and Sunday is in the top as we expected.

Detail

It seems obvious that the reason for the bribe could be conclusive. People probably tend to look at reports if the detail/reason is in their interest. The code below creates two graphs with the highest and lowest views for different reasons.

df_detail= df %>% filter(detail!="") %>% group_by(detail) %>% 
    mutate(mean = mean(views)) %>% distinct(detail) %>%
    select(detail, mean) %>% ungroup() %>% arrange(-mean)

  p = ggplot(df_detail[1:4,], aes(x = reorder(detail, -mean), y = mean))
  p + geom_bar(stat="identity",width=.5) + labs(x="Reason", y="Views", title="The highest average number of views") + theme_minimal() + theme(axis.text.x = element_text(angle=-10, vjust=1))

  q = ggplot(tail(df_detail,4), aes(x = reorder(detail, -mean), y = mean))
  q +  geom_bar(stat="identity",width=.5) + labs(x="Reason", y="Views", title="The lowest average number of views") + theme_minimal() + theme(axis.text.x = element_text(angle=-10, vjust=1))

The graphs show that there is a big difference between the highest and lowest amount. Reports about "Land Registration"" have about 500 average views compared to "Duplicate Registration Certificate" with only about 50 views.This confirms the hypothesis that the reason of the report probably matters when people view a report. You could also think there is a relationship between the amounts of reports in each category and the amount of views. But the categories with most reports, which is shown below, are not in the top four of the reports with most views.

count = count(df, detail) %>% arrange(-n)
  count[1:5,]

Region

Another variable that could affect the amount of views is the region. It seems obvious that people would tend to look more at reports from the region they live in. Just like before there is a graph with the highest and lowest amount of views for the different regions.

df_region= df %>% filter(region!="") %>% group_by(region) %>% 
    mutate(mean = mean(views)) %>% distinct(region) %>%
    select(region, mean) %>%  ungroup() %>% arrange(-mean)

  p = ggplot(df_region[1:4,], aes(x = reorder(region, -mean), y = mean))
  p + geom_bar(stat="identity",width=.5) + labs(x="Region", y="Views", title="The highest average number of views") + theme_minimal() + theme(axis.text.x = element_text(angle=-10, vjust=1))

  q = ggplot(tail(df_region,4), aes(x = reorder(region, -mean), y = mean))
  q + geom_bar(stat="identity",width=.5) + labs(x="Region", y="Views", title="The lowest average number of views") + theme_minimal() + theme(axis.text.x = element_text(angle=-10, vjust=1))

The graphs show that "Chandigarh" has the highest average with about 200 views and "Jammu and Kashmir" has the lowest with around 30 views. That is a difference of 6-7 times between the lowest and highest. The amount of views seems therefore to depend of the location, but of course one major reason for this is the difference in the amount of people living in the region. The more people living in the region, the greater chance of some one would view the report.

The amount paid

The last thing we want to analyse is the relationship between number of views and the amount paid for the bribe. You could think that people would look more at reports with a relatively huge amount paid, because the information are more interesting.

df_INR = df %>% filter(amount_INR<=5000)

  ggplot(df_INR, aes(x=amount_INR, y=views)) +
    geom_point() + geom_smooth() + labs(x="INR", y="Views", title="The relationship between the amount paid and number of views") + theme_minimal()

It is quite difficult to recognize any clear trend between the variables when you only look at the datapoints. The non-linear fitting line shows that there is a negative relationship between 0 and 500 INR and afterwards the trend is a bit positive but close to constant. An explanation for the negative relationship at low amounts is probably that people are also interested in relatively low amounts. If someone has just paid 1000 for a Birth Certificate, then they it would draw their attention if another has just paid the half or even less.

What can explain the size of the bribe payments?

We try to set up reasons for the bribe by looking at the variable 'amount_INR' in different ways .

Does the amount paid depend on which department and/or transaction detail? The answer is most likely 'yes'. But in which situation is a few coins enough and when do you have to pay big money?


#group data by detail and take mean of amounts paid
 df_detail_paid= df %>% filter(detail!="") %>% group_by(detail) %>% 
    mutate(mean = mean(amount_INR)) %>% distinct(detail) %>%
    select(detail, mean) %>% ungroup() %>% arrange(-mean)

#plot the highest bribes by detail 
  p = ggplot(df_detail_paid[1:5,], aes(x = reorder(detail, -mean), y = mean))
  p + geom_bar(stat="identity",width=.5) + labs(x="Reason", y="INR", title="The Highest Average Bribes by Detail") + theme_minimal() + theme(axis.text.x = element_text(angle=-10, vjust=1))

#plot the lowest bribes by detail  
  q = ggplot(tail(df_detail_paid,12), aes(x = reorder(detail, -mean), y = mean))
  q +  geom_bar(stat="identity",width=.5) + labs(x="Reason", y="INR", title="The Lowest Average Bribes by Detail") + theme_minimal() + theme(axis.text.x = element_text(angle=-25, vjust=1))

Looking at the two plots from above it comes clear that the bribes are of very different amounts. Of the bottom 12 detail groups are services like driving licenses, passport and gas. The cheapest groups are false allegation and police clearance certificate. Of the top5 highest paid bribes is noticeably Birth Certificates. Looking in the dataset this is also the highest outliers. Getting a (new) birth certificate in India is a large fraction of all the bribes together -

a = sum(df$amount_INR, na.rm = FALSE) #11793006945 INR
b = df %>% filter(df$detail ==  "Birth Certificate" )
c = sum(b$amount_INR, na.rm = FALSE) #11698974342 INR
fraction = c / a #0.99

Calculating the fraction of payments to birth certificate of all bribes paid in the data gives a clear picture of which transaction detail is the most expensive. Removing Birth Certificate will give another picture of how the highest paid bribes are distributed, perhaps a more homogeneous picture of the remaining transaction details in the high end of the bribes -


# highest bribe without birth certificate   
  df_detail_paid_nobirth = df %>% filter(detail!="") %>% group_by(detail) %>% 
    filter(detail!="Birth Certificate") %>%
    mutate(mean = mean(amount_INR)) %>% distinct(detail) %>%
    select(detail, mean) %>% ungroup() %>% arrange(-mean)
p = ggplot(df_detail_paid_nobirth[1:5,], aes(x = reorder(detail, -mean), y = mean))
  p + geom_bar(stat="identity",width=.5) + labs(x="Reason", y="INR", title="The Highest Average Bribes by Detail (When No Birth Cert.)") + theme_minimal() + theme(axis.text.x = element_text(angle=-10, vjust=1))

Taking the Birth Certificates out of the picture the highest average bribes paid goes to New PAN Cards (Permanent Account Number). Looking at the plot from above code shows that this detail group is also considerably large relative to the other detail groups.

Bribe Paid at Region

Is there any region that bribes more than other? One could think that the highest paid bribes reported is in the larger cities.


df_region_paid = df %>% filter(region!="") %>% group_by(region) %>% 
    mutate(mean = mean(amount_INR)) %>% distinct(region) %>%
    select(region, mean) %>%  ungroup() %>% arrange(-mean)

  p = ggplot(df_region_paid[1:4,], aes(x = reorder(region, -mean), y = mean))
  p + geom_bar(stat="identity",width=.5) + labs(x="Region", y="INR", title="The Highest Average Bribes by Region") + theme_minimal() + theme(axis.text.x = element_text(angle=-10, vjust=1))

  q = ggplot(tail(df_region_paid,5), aes(x = reorder(region, -mean), y = mean))
  q + geom_bar(stat="identity",width=.5) + labs(x="Region", y="INR", title="The Lowest Average Bribes by Region") + theme_minimal() + theme(axis.text.x = element_text(angle=-10, vjust=1))

Of the plot with the lowest 5 regions are the amounts paid as bribery of very different size, 'Jammu and Kashmir' has an average bribe of almost 1000 INR (1000DKK) while the region 'Pondechemy' is scrapping the bottom being the cheapest city to bribe in. Of the plot with the highest bribes one could think that regions including cities like New Dehli and Mumbai would be in the top but they are only at respectively at a 3rd and 4th place. The top 1 is the region Bihar. Looking closer to the bribing details in Bihar -


d = df %>% filter(df$region ==  "Bihar" )
#plot the highest bribes by detail in Bihar
df_detail_paid_Bihar = df %>% filter(detail!="") %>% group_by(detail) %>% 
    filter(region == "Bihar") %>%
    mutate(mean = mean(amount_INR)) %>% distinct(detail) %>%
    select(detail, mean) %>% ungroup() %>% arrange(-mean)
p = ggplot(df_detail_paid_Bihar[1:3,], aes(x = reorder(detail, -mean), y = mean))
  p + geom_bar(stat="identity",width=.5) + labs(x="Reason", y="INR", title="The Highest Average Bribes by Detail in Bihar") + theme_minimal() + theme(axis.text.x = element_text(angle=-10, vjust=1))

In the plot above it sets a clear picture of why Bihar is the region with the highest bribery amount. This is a city with the highest paid bribery group 'Birth Certificate' as earlier shown (which is a bit funny due to the fact that Bihar is next to Nepal and Bangladesh).

Bribes at Weekday

It could be interesting to see the amount paid diverged into weekdays. Which day in the week is the higest and lowest amounts paid? Are there perhaps any 'weekend discount' in the bribes?


df_day_paid = df %>% group_by(day) %>%
    mutate(mean = mean(amount_INR)) %>% distinct(day) %>%
    select(day, mean) 

  p = ggplot(df_day_paid, aes(x = reorder(day, -mean), y = mean))
  p = p + geom_bar(stat="identity",width=.5) + labs(x="Weekday", y="INR", title="The Average Bribes by Weekday") + theme_minimal()
plot(p)

Having all 1000 observations the above plot shows that Sunday is the day with highest average bribe paid. Removing the highest paid detail 'Birth Certificate' could give another picture -


df_day_paid_nobirth = df %>% group_by(day) %>%
  filter(detail!="Birth Certificate") %>%  
  mutate(mean = mean(amount_INR)) %>% distinct(day) %>%
  select(day, mean) 

  p = ggplot(df_day_paid_nobirth, aes(x = reorder(day, -mean), y = mean))
  p = p + geom_bar(stat="identity",width=.5) + labs(x="Weekday", y="INR", title="The Average Bribes by Weekday") + theme_minimal()
plot(p)

Setting the data up like this the plot shows a different distribution of bribe on the weekdays. Without birth certificates the day with highest average bribes paid is Thursday - which also was the day with most views as earlier shown.

The correlation between population and posts

We now wish to analyse if the amount of post on i paid a bribe is representative of the population in each state, i.e. if the amount of bribes payed is equal to the population size. The code imports information about the population, and cleans the data in order to make it possible to join a map of India with the gathered information. This will which allow us to map the to situations and compare them.

#loading the population, cleaning and adding the fraction of the population 
  #   in each state
pop <- read_html("http://www.indiaonlinepages.com/population/state-wise-population-of-india.html") %>%         html_node(".fest-dates") %>% html_table()
df_pop <- pop %>% select(X2,X3) %>% slice(3:37) %>% 
        dplyr::rename(id=X2,population=X3)
df_pop$population <- gsub(",","",df_pop$population) %>% 
      as.numeric(df_pop$population)
df_pop <- df_pop %>% mutate(pct=population/sum(population))
#getting the post freq by state and making sure data is trimed 
df_ipop<- as.data.frame(table(df$region)) %>% rename(id=Var1) 
df_ipop$id <- str_trim(as.character(df_ipop$id))

#Loading India into r
con <- url("http://biogeo.ucdavis.edu/data/gadm2/R/IND_adm1.RData")
load(con)
india <- fortify(gadm, region = "NAME_1")

#joining the map with the disired freq and per by state
map_ipb <- left_join(india, df_ipop, by = "id") 
map_pop <- left_join(india, df_pop, by = "id") 

Though it is not possible to distinguish the two, if a state is an outlier one could do a further analysis in order obtain further detail on the exact state.

#plotting the two different maps
p = ggplot(data=map_ipb, aes(x = long, y = lat, group = group, fill = Freq))
p + geom_polygon() + labs(x="", y="", title="Fraction of post in each state")
p = ggplot(data=map_pop, aes(x = long, y = lat, group = group, fill = pct))
p + geom_polygon() + labs(x="", y="", title="Fraction of pop in each state") 

By comparing the maps we see that about 30% of the posts on i paid a bribe are posted from Karnataka. It seems that corruption is a fairly big problem in this state, even though it is one of the smaller states in India, when looking at the population fraction in Karnataka.

The site could be more widely spread in Karnataka than in the rest of India, in which case the data from Karnataka gives a more realistic picture of the amount of bribery in India than the other states. But even if the site is more widely used in this state, it seems unlikely that the amount of bribery isn't significantly higher in Karnataka than in other states, when looking at the data.

sebastianbarfort commented 8 years ago

Really good assignment!

APPROVED