sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 22: Assignment 2 #34

Closed kgronpug closed 8 years ago

kgronpug commented 8 years ago

title: "Assignment 2 - Group 22"
author: "Kaspar Pugesgaard, Line Rasmussen, Kasper Wetterslev, Louise Poulsen"
date: "November 9, 2015"
output: html_document

This assignment works with data from ipaidabribe.com, which is a website containing unofficial data on corruption in different countries. In this assignment we've decided to focus on India.

We have searched the website for terms and conditions and have not found any restrictions governing the use of the data.

Below we scrape data from ipaidabribe.com for India and compute facts and statistics from the scraped data.

#Relevant libraries.
library("rvest")
library("plyr")
library("readr")
library("dplyr")
library("ggplot2")
library("maptools")
library("RColorBrewer")
library("sp")
library("sqldf")
library("mapproj")
library("stringr") 

Scraping data

We have created a function to scrape data from the website. Using selectorgadget, the function extracts the relevant variables from the provided website. The function returns a matrix of title, amount paid, name of department, transaction detail, number of views, city and region, and date of report.
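
One caveat when building a matrix with `cbind` from independently extracted vectors: if one field is missing for a report, the vectors differ in length and `cbind` recycles the shorter ones. The following is a minimal sketch of a length check, not part of the original assignment. The helper name `bind_fields` and the example values are made up for illustration.

```r
# Hypothetical helper: combine equally long vectors into a matrix and
# stop with an error if any field has a different number of entries.
bind_fields <- function(...) {
  fields <- list(...)
  lengths_found <- vapply(fields, length, integer(1))
  if (length(unique(lengths_found)) > 1) {
    stop("Fields differ in length: ", paste(lengths_found, collapse = ", "))
  }
  do.call(cbind, fields)
}

# Made-up example values, not scraped data.
title  <- c("Report A", "Report B")
amount <- c("100", "2,500")
ok <- bind_fields(title = title, amount = amount)

# A missing amount is caught instead of being silently recycled.
bad <- tryCatch(bind_fields(title = title, amount = "100"),
                error = function(e) "length mismatch")
```

A check like this could be used in place of the bare `cbind` call if pages with incomplete reports turn out to be a problem.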

#Scrape function
#Reads the page once and extracts each variable with its CSS selector.
scrape_link = function(link){
  page = read_html(link, encoding = "UTF-8")
  #Helper: returns the text of all nodes matching a CSS selector.
  get_text = function(css) page %>% html_nodes(css) %>% html_text()
  title  = get_text(".heading-3 a")
  amount = get_text(".paid-amount span")
  dept   = get_text(".name a")
  trans  = get_text(".transaction a")
  views  = get_text(".overview .views")
  city   = get_text(".location")
  date   = get_text(".date")
  #Note: cbind recycles shorter vectors, so every selector must match the same number of nodes.
  return(cbind(title, amount, dept, trans, views, city, date))
}

# Saves the URLs of ipaidabribe in a string. These are used in a loop with the above function.
link <- "http://www.ipaidabribe.com/reports/paid"
links <- "http://www.ipaidabribe.com/reports/all?page="

# Creates an empty list to store information from the website. 
bribes <- list()

# Runs the loop 100 times to extract up to 1,000 observations.
# The first iteration scrapes /reports/paid; each later iteration scrapes the
# /reports/all page whose offset was set at the end of the previous iteration.
for (i in 1:100){
  print(paste("Running iteration", i, "of 100"))
  bribes[[i]] <- scrape_link(link)
  link <- paste0(links, i*10)
  print("Done")
  Sys.sleep(1)
}

df.bribe <- ldply(bribes, data.frame)
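
To make it explicit which pages the loop visits, here is a small sketch that builds the same sequence of URLs up front. It assumes, as the loop does, that the `page` parameter is an offset that grows in steps of 10. The variable names `first_page`, `page_base` and `page_urls` are illustrative only.

```r
# The two URL stems used in the assignment.
first_page <- "http://www.ipaidabribe.com/reports/paid"
page_base  <- "http://www.ipaidabribe.com/reports/all?page="

# 100 pages in total: the first page plus offsets 10, 20, ..., 990.
n_pages <- 100
offsets <- seq(10, by = 10, length.out = n_pages - 1)
page_urls <- c(first_page, paste0(page_base, offsets))

length(page_urls)
tail(page_urls, 1)
```

Note that the first URL points to `/reports/paid` while the remaining ones point to `/reports/all`, exactly as in the loop above.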

We then create a data frame with our scraped data and save it locally, so we do not have to rerun the scrape function every time we want to work with the code. The code saving the file is commented out, because the file path only exists on one of our computers and would cause errors elsewhere.

# Stores the data in a dataframe (and saves the scraped data as a csv file). 
#write.csv(file="C:/Users/Kaspar/Dropbox/Social data science/Assignment 2/incredibleindia.csv",          x=df.bribe, row.names=FALSE)
#df.bribe <- read_csv("C:/Users/Kaspar/Dropbox/Social data science/Assignment 2/incredibleindia_081115.csv")
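
A possible way to avoid machine-specific paths is to build the path with `file.path`. The sketch below uses base R only and writes to the session's temporary directory. The data frame `df_example` and the file name `bribes_example.csv` are made up for illustration and are not the files used in the assignment.

```r
# Made-up stand-in for the scraped data frame.
df_example <- data.frame(title  = c("Report A", "Report B"),
                         amount = c("100", "2,500"),
                         stringsAsFactors = FALSE)

# Hypothetical location: a file in the session's temporary directory.
csv_path <- file.path(tempdir(), "bribes_example.csv")

# Save the data, then read it back without converting text to factors.
write.csv(df_example, file = csv_path, row.names = FALSE)
df_reloaded <- read.csv(csv_path, stringsAsFactors = FALSE)
```

In a real project the temporary directory would be replaced by a folder relative to the project, so that the same code runs on every group member's computer.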

Cleaning and preparing the data

We now clean the extracted data in preparation for our analysis.

#Cleans the data - removes letters and all commas from the paid amount, leaving only the number.
df.bribe$amount <- gsub("[a-z]|[A-Z]", "", df.bribe$amount)
df.bribe$amount <- gsub(",", "", df.bribe$amount, fixed = TRUE)

#Adds date variable of the day scraped. 
df.bribe$date_scraped <- date()

#Finds the day of the week the report of bribery has happened.
df.bribe$weekday <- weekdays(as.Date(df.bribe$date,'%B %d, %Y'))

#Extracts the number of views (removes the text).
df.bribe$views <- gsub("[a-z]|[A-Z]", "", df.bribe$views)

#Extracts the region
df.bribe$region <- sub(".*[,]","", df.bribe$city)

#Changes amount and views to numeric variables.
df.bribe$amount_paid <- as.integer(df.bribe$amount)
df.bribe$nr_views <- as.integer(df.bribe$views)

#Removes rows with NA's and saves these in a new dataframe.
df.bribe_clean <- na.omit(df.bribe) %>%
  filter(region != "")

#Modifies region variable to match gadm data which is used later, more modifications might be necessary with a different dataset.
df.bribe_clean$region <- gsub("Telangana", "Andhra Pradesh", df.bribe_clean$region)
df.bribe_clean$region <- gsub("Uttarakhand", "Uttaranchal", df.bribe_clean$region)
df.bribe_clean$region <- gsub("Andaman and Nicobar Islands", "Andaman and Nicobar", df.bribe_clean$region)
df.bribe_clean$region <- gsub("Pondicherry", "Puducherry", df.bribe_clean$region)

We now have a clean dataframe containing the variables we need.
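
The cleaning steps can be tried out on a few strings before running them on the full data. The sketch below uses made-up inputs. The exact text format on the website (for example the `Paid INR` prefix) is an assumption.

```r
# Made-up raw values; the real format on the website may differ.
raw_amount <- c("Paid INR 500", "Paid INR 8,000,000")
raw_city   <- c("Bangalore, Karnataka", "Mumbai, Maharashtra")

# Remove letters and commas, then convert to integer.
amount <- gsub("[a-z]|[A-Z]", "", raw_amount)
amount <- gsub(",", "", amount, fixed = TRUE)
amount_paid <- as.integer(trimws(amount))

# Keep everything after the last comma and trim the leading space.
region <- trimws(sub(".*[,]", "", raw_city))
```

Here `trimws` removes the white space left over after stripping the letters and the city name.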

Summary statistics

To describe the data, we perform some simple summary statistics. Our dataset contains 748 observations and 14 variables.

# Summary statistics of number of views
summary(df.bribe_clean$nr_views)

The reports have been viewed between 3 and 379 times. The median is 22 and the mean is 38.87, which indicates that the distribution of views is right skewed: a few reports with many views pull the mean above the median. 25 percent of the reports have been viewed at most 13 times and 75 percent of the reports have been viewed at most 40 times.

# Summary statistics of amount paid.
summary(df.bribe_clean$amount_paid)

The maximum amount paid as a bribe is 8,000,000 Rs. and the minimum amount paid is 1 Rs. The median amount paid is 200 Rs. and the mean amount paid is 68,160 Rs. This indicates that most of the bribes are relatively small and that the distribution is skewed to the right, with a few very large bribes pulling the mean far above the median. 75 percent of the bribes paid are at most 5,000 Rs. and 25 percent of the bribes paid are at most 100 Rs.
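
How the mean, the median and the quartiles relate to skewness can be illustrated with a small made-up vector. The numbers below are not from our data.

```r
# Made-up amounts with one very large value.
amounts <- c(1, 50, 100, 200, 200, 500, 5000, 100000)

mean_amount   <- mean(amounts)
median_amount <- median(amounts)

# A mean far above the median points to a long right tail.
mean_amount > median_amount

# First and third quartile: 25 and 75 percent of the values lie at or below these.
quantile(amounts, c(0.25, 0.75))
```

The single large value raises the mean while leaving the median untouched, which is the same pattern as in the summary statistics above.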

# Date of the first report. The date is stored as text, so we parse it before sorting.
df.bribe_clean <- arrange(df.bribe_clean, as.Date(date, '%B %d, %Y'))
head(df.bribe_clean$date, 1)

The first report was created on November 1st, 2015.

# Date of the last report. Again the text date is parsed before sorting.
df.bribe_clean <- arrange(df.bribe_clean, desc(as.Date(date, '%B %d, %Y')))
head(df.bribe_clean$date, 1)

The latest report was created on November 8th, 2015.
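
One thing to keep in mind when ordering by report date: the date is scraped as text such as "November 8, 2015", and text sorts alphabetically rather than chronologically. The sketch below shows the difference on made-up dates. It sets the time locale to `C` so that English month names are parsed, which is an assumption about how the dates are written.

```r
# Use English month names when parsing, regardless of system language.
invisible(Sys.setlocale("LC_TIME", "C"))

# Made-up report dates in the format used on the website.
dates_text <- c("November 8, 2015", "October 31, 2015", "November 1, 2015")

# Alphabetical order puts a November date first.
first_alphabetical <- sort(dates_text)[1]

# Chronological order after parsing puts the October date first.
dates_parsed <- as.Date(dates_text, format = "%B %d, %Y")
earliest <- min(dates_parsed)
```

If all reports fall within the same month and before the 10th, both orderings agree, but this is not guaranteed in general.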

To further describe the data, we create two plots. We create one plot containing average payment per region and one plot containing number of views per region.

Plots

To create the plots, we group our dataset by region and calculate the sum and average of the number of views and of the size of the bribe per region.

# Groups the data by region and computes the sum and average of views and bribes
bribe_group <- df.bribe_clean %>%
  group_by(region) %>%
  summarise(sum_views=sum(nr_views), avg_views=mean(nr_views), sum_bribe=sum(amount_paid),  avg_bribe=mean(amount_paid))
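
For readers without `dplyr`, the same kind of grouped summary can be sketched in base R with `aggregate`. The data frame `reports` below is made up, and only the averages are computed.

```r
# Made-up reports, not scraped data.
reports <- data.frame(region      = c("Karnataka", "Karnataka", "Maharashtra"),
                      nr_views    = c(10L, 30L, 50L),
                      amount_paid = c(100L, 300L, 1000L))

# Average views and average bribe per region.
by_region <- aggregate(cbind(nr_views, amount_paid) ~ region,
                       data = reports, FUN = mean)
by_region
```

The result has one row per region, just like `bribe_group`, but with the original column names holding the averages.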

The plot below shows the region and the average bribe payment of each region.

p = ggplot(bribe_group, aes(x=region, y=avg_bribe))
p = p + geom_bar(stat="identity")
p = p + labs(title="Average bribe payment by region", x="Region", y="Average bribe")
p = p + theme_minimal()
p = p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p

From the plot of the average bribe payment by region, we can see that the region with the highest average bribe is Gujarat and that the average bribe differs considerably between regions.

The plot below shows the number of views of reports per region.

# Plot of the region and the number of views
p = ggplot(bribe_group, aes(x=region, y=sum_views))
p = p + geom_bar(stat="identity")
p = p + labs(title="Sum of views by regions", x="Region", y="Sum of views")
p = p + theme_minimal()
p = p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p

This plot shows that the reports of bribes in Maharashtra have the most views, with more than 4,000 views in total. Some regions have very few views, most likely because they have few reports of bribes.

We now examine the correlation between the size of the bribe and the number of views of each report in a scatter plot.

# Draws scatter plot
p = ggplot(df.bribe_clean, aes(x=nr_views, y=amount_paid))
p = p + geom_point(shape=3)
p = p + geom_smooth(method=lm, se=FALSE)
p = p + labs(title="Correlation between number of views and size of bribe payment", x="Views", y="Payment")
p = p + theme_minimal()
p

From this plot we cannot conclude anything about the correlation because of the outliers, so we remove the largest payments (8,000,000 Rs. or more) and draw the graph again.

# Removes outliers
df.bribe_clean_2 <- df.bribe_clean %>%
  filter(amount_paid < 8000000)

# Draws scatter plot
p = ggplot(df.bribe_clean_2, aes(x=nr_views, y=amount_paid))
p = p + geom_point(shape=3)
p = p + geom_smooth(method=lm, se=FALSE)
p = p + labs(title="Correlation between number of views and size of bribe payment", x="Views", y="Payment")
p = p + theme_minimal()  
p

We still cannot conclude anything about the correlation between the number of views and the size of the bribe payment, but there might be a slight positive correlation. The lack of a clear picture might be due to the bribe payments being relatively low for most observations. It may also be due to our relatively small sample size.
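
A numerical complement to the scatter plots would be a correlation coefficient. The sketch below, on made-up numbers, compares the Pearson correlation with the rank-based Spearman correlation, which is less sensitive to a single extreme payment. This is an illustration only and was not part of our analysis.

```r
# Made-up views and payments with one extreme payment.
views  <- c(5, 12, 20, 40, 80, 300)
amount <- c(100, 50, 500, 200, 8000000, 1000)

# Pearson uses the raw values, Spearman uses their ranks.
pearson  <- cor(views, amount, method = "pearson")
spearman <- cor(views, amount, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

Because Spearman only looks at the ordering of the values, removing outliers by hand is not needed before computing it.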

Map

We now draw a map of the average bribes by regions to see which regions have the highest average bribe.


# Loads data on India from the internet
load(url("http://biogeo.ucdavis.edu/data/gadm2/R/IND_adm1.RData"))
ind <- gadm

#SPPlot of average bribe by region
#To color the states, we first need a variable with 35 values, one per state, in the same order as the Large SpatialPolygonsDataFrame ind, which we use to draw the map.

#We draw out the correct state names
states <- as.data.frame(ind$NAME_1)

#Names the column "region" so it matches the column in bribe_group
colnames(states) = "region"

#Changes the region to a character variable and removes white space, preparing the data for merging bribes to the correct states.
states$region <- as.character(states$region)
bribe_group$region <- str_trim(bribe_group$region)

#Left joins data on bribes on states
merge <- dplyr::left_join(states, bribe_group, by = "region")

#Logtransforming the avg bribes and replaces NAs with 0 to run with spplot
merge$log_avg <- log10(merge$avg_bribe)
merge$log_avg <- ifelse(is.na(merge$log_avg), 0, merge$log_avg)

# Creating a variable to color the states in the map. The values should be between 0 and 1, so we normalise to fit this criteria.
merge$max <- max(merge$log_avg)
norm_bribe <- merge$log_avg/merge$max
ind$avgbribe <- norm_bribe

#Changes region variable (Name_1) from India dataset to a factor
ind$NAME_1 = as.factor(ind$NAME_1)

#Drawing the map with spplot. Color by avg. bribe.
spplot(ind, "NAME_1",
       col.regions=rgb(ifelse(ind$avgbribe==0, 1, 0),
                       ifelse(ind$avgbribe==0, 0, 1-ind$avgbribe),
                       0),
       colorkey=TRUE, main="Indian States by Average Bribe Size")

The above map should be interpreted as follows. Red states have no bribe observations. Among the remaining states, the darker the green, the higher the average bribe payment. We see that some states have a higher average bribe payment than others. Furthermore, we note that the states with no observations all belong to the easternmost part of India. This can either mean that there is no corruption in this part of the country or, more likely, that it is not reported, either because corruption is generally accepted here or because ipaidabribe, or the internet in general, is not used here. Beyond this, there are no geographical trends to notice.
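
The colour rule behind the map can be sketched on a few made-up averages. The vector `avg_bribe` below is illustrative only. The steps mirror the log transformation, the normalisation and the `rgb` call used for the map.

```r
# Made-up average bribes; NA stands for a state without observations.
avg_bribe <- c(NA, 10, 1000, 100000)

# Log-transform, set missing states to 0 and scale to the range 0 to 1.
log_avg <- log10(avg_bribe)
log_avg[is.na(log_avg)] <- 0
norm_bribe <- log_avg / max(log_avg)

# Red for states with value 0, otherwise green that darkens with the value.
# Caveat: an average of exactly 1 Rs. also has log10 equal to 0 and would be red.
state_cols <- rgb(ifelse(norm_bribe == 0, 1, 0),
                  ifelse(norm_bribe == 0, 0, 1 - norm_bribe),
                  0)
state_cols
```

The state with the highest average gets a green component of 0 and therefore appears black, which is the darkest end of the scale.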

There are, however, many possible reasons why some states might see more bribes than others. These could include population size, general crime rate, unemployment or other structural differences between the states.

sebastianbarfort commented 8 years ago

Good assignment.

APPROVED