sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/
12 stars 17 forks source link

Group 009: Assignment 2 #44

Closed RasmusGars closed 8 years ago

RasmusGars commented 8 years ago

title: "Assignment2 - Group9" author: "Mads Harslund, Lukas Hidan, Rasmus Holm & Hjörtur Hjartar" date: "9 November 2015"

output: html_document

By looking at the crowdsourced data, there are a number of issues that should be considered. We can get a snap-shot of the reported bribe,but more rigourous data collecting is needed in order to make sound statstical inference on the subject. The data is extracted from the 1000 latest entries of data. In this data there are reported entires for the dates November 5-8th 2015. Another sample period might have given different results, as interactions with public authorities may be have difference frequencies a cross time, i.e. an end-of-month tax collection. This can also vary geographically. However, there may also be a lag in time from the incident of bribe to reporting it.

There is an altruistic incentive to frequently report bribes in order to combat wide-spread corruption in India. However, it is not possible to penalize false reporters. Civilians may report certain classes of incidents with a larger propensity than others.

We start by initializing the relevant packages for scraping the webpage www.ipaidabribe.com

library("rvest")
#<<<<<<< Updated upstream
library(stringr)
#=======
library("stringr")
#>>>>>>> Stashed changes

We want the lates 1.000 observations and since the webpage is designed such that 10 observations are on each page, we need to scrape 100 pages. The first page looks like this:

link = "http://www.ipaidabribe.com/reports/paid?page="

We let R loop from 00 to 1000 with steps of 10 and define the variables of interest:

The scraped data is first stored in a list and then converted to a dataframe using the plyr package.

#<<<<<<< Updated upstream
# Loop for pages with step size 10 to get the correct links
bribe.links=list()
bribe.links=link
for(i in seq(10,1000,10)){
  bribe.links[[i]]=(paste(link, i, sep = ""))
#>>>>>>> Stashed changes
}

# Removing NA's
bribe.links <- bribe.links[!is.na(bribe.links)] 
head(bribe.links,3)

#func
scrape_bribe = function(link){
  my.link = read_html(link, encoding = "UTF-8")
  my.link.title = my.link %>% 
    html_nodes(".heading-3 a") %>% 
    html_text()
  my.link.amount_text = my.link %>% 
    html_nodes(".paid-amount span") %>% 
    html_text()
  my.link.name = my.link %>% 
    html_nodes(".name a") %>% 
    html_text()
  my.link.transaction = my.link %>% 
    html_nodes(".transaction a") %>% 
    html_text()
  my.link.views = my.link %>% 
    html_nodes(".overview .views") %>% 
    html_text()
  my.link.city = my.link %>% 
    html_nodes(".location") %>% 
    html_text()
  my.link.date = my.link %>% 
    html_nodes(".date") %>% 
    html_text()
  return(cbind( my.link.title, my.link.amount_text, my.link.name, my.link.transaction, my.link.views, my.link.city, my.link.date ))
}

#Loop data to a list
my.bribe.data = list() # initialize empty list
for (i in bribe.links[1:100]){
  my.bribe.data[[i]] = scrape_bribe(i)
  # waiting one second between hits
  Sys.sleep(1)
}

#converting into a data frame
library("plyr")
df.bribe = ldply(my.bribe.data)

We do some data cleaning to isolate the numeric part in the amount and view variables using _strexstract.

df.bribe$my.link.amount_text = str_extract(df.bribe$my.link.amount_text,"[0,0-9,9]+")
df.bribe$my.link.views = str_extract(df.bribe$my.link.views,"[0,0-9,9]+")

df.bribe=data.frame(page=df.bribe$.id, title=df.bribe$my.link.title, amount=df.bribe$my.link.amount_text, name=df.bribe$my.link.name, transaction=df.bribe$my.link.transaction, views=df.bribe$my.link.views, city=df.bribe$my.link.city, date=df.bribe$my.link.date)

library("dplyr")
library("lubridate")

df.bribe$amount = as.numeric(str_replace_all(df.bribe$amount, pattern = "," , replacement = ""))
#filter out outliers
df.bribe=filter(df.bribe,df.bribe$amount<3000000 ) 
df.bribe$views = as.numeric(df.bribe$views)

More data cleaning is done when we extract the city and the region from the combined variable we got from scraping.

df.bribe$city1 = str_extract(df.bribe$city,"[A-z]*")

df.bribe$reg = str_extract(df.bribe$city,", [A-z]+ [A-z]+")
df.bribe$reg1 = str_extract(df.bribe$city,", [A-z]+")
df.bribe$reg=gsub(",","", df.bribe$reg)
df.bribe$reg1=gsub(",","", df.bribe$reg1)

df.bribe$reg2[is.na(df.bribe$reg)]=df.bribe$reg1[is.na(df.bribe$reg)]
df.bribe$reg2[is.na(df.bribe$reg2)]=df.bribe$reg[is.na(df.bribe$reg2)]
df.bribe$reg=df.bribe$reg2

df.bribe$antal=1

We then do some data summarization with mean of amount and views on city, name and transaction.

# stat data for cities
df.bribe.reg = df.bribe %>%
  group_by(reg) %>%
  summarise(antal = sum(antal), median_amount = median(amount), mean_views = mean(views) ) %>%
  arrange(desc(median_amount)) %>%
  data.frame

# stat data for name
df.bribe.name = df.bribe %>%
  group_by(name) %>%
  summarise(antal = sum(antal), median_amount = median(amount), mean_views = mean(views) ) %>%
  arrange(desc(median_amount)) %>%
  data.frame

# stat data for transaction
df.bribe.transaction = df.bribe %>%
  group_by(transaction) %>%
  summarise(antal = sum(antal), median_amount = median(amount), mean_views = mean(views) ) %>%
  arrange(desc(median_amount)) %>%
  data.frame

We use ggplot2 package to do graphs.

library("ggplot2")

Our first graph shows the median amount paid in bribe across regions. Some regions might have a higher reported bribe-amount which may come from either more corruption or more people who like to report in these specific areas.

p = ggplot(df.bribe.reg, aes(x = reorder(reg, median_amount), 
                             y = median_amount))
p = p + geom_bar(stat = "identity") + coord_flip() + labs(title="Graph 1: Median amount paid to bribe across regions", x="Regions", y="Median amount payed in RS.")
plot(p)

The next graph shows the median amount paid in bribe for each department. Again some departments might be associated with higher corruption levels.

#plot 2
p = ggplot(df.bribe.name, aes(x = reorder(name, median_amount), 
                              y = median_amount))
p = p + geom_bar(stat = "identity") + coord_flip() + labs(title="Graph 2: Median amount payed in bribe", x="Department", y="Median amount payed in RS.")
plot(p)

Here we have instead the number of bribes for each department.

p = ggplot(df.bribe.name, aes(x = reorder(name, antal), 
                              y = antal))
p = p + geom_bar(stat = "identity") + coord_flip() + labs(title="Graph 3: Amount of bribes", x="Department", y="Amount of bribes")
plot(p)

Wondering if there is any correlation between the frequency of each transaction type we plot these together.

p = ggplot(df.bribe.transaction, aes(x = reorder(transaction, antal), 
                                     y = antal))
p = p + geom_bar(stat = "identity") + coord_flip() + labs(title="Graph 4: Amount of bribes", x="Transaction", y="Amount of bribes")
plot(p)

Our last graph shows the median amount paid in bribe across different transactions.

p = ggplot(df.bribe.transaction, aes(x = reorder(transaction, median_amount), 
                                     y = median_amount))
p = p + geom_bar(stat = "identity") + coord_flip() + labs(title="Graph 5: Median amount payed to bribe", x="Transaction", y="Median amount of bribes in RS.")
plot(p)

We see that the most frequently reported bribe type concern birth certificates. The taker of bribes regarding birth certificates are typically municipal services, making them the most represented authority to take bribes. One reason for this might be low standards of governance or monitoring at lower governmental levels. The low standards are probably widespread across public services and institutions, but we can see that municpalities are most frequent. This may be due to the low size of the birth certificate bribes, meaning that they are cheap and therefore accesible to the general public. On the other hand we have more expensive bribe types being more infrequent but larger in size. These mainly concern property and are given to he department of Commercial Tax, Sales Tax, VAT. The giver of these bribes might be wealthy individuals or business that own property, argueably to avoid tax issues.

sebastianbarfort commented 8 years ago

Ok assignment.

The coding is generally good (although you load a package two times, etc), but I would have preferred a slightly more thorough data analysis (such as geographic distribution, etc).

APPROVED