This assignment is based on data taken from www.ipaidabribe.com. The purpose was to use R to "scrape", or download, data from the webpage, and then describe or analyse corruption in India. The premise of the webpage is simple: Every time you pay a bribe, you upload the amount, the details of the transaction, and the location.
Web scraping: www.ipaidabribe.com
The advantages of web scraping over traditional data collection are the wide availability of user-generated data on different platforms, such as Twitter, newspapers, etc. These databases and websites can store valuable information on consumer (user) consumption patterns, interests, and so on.
In this assignment, we use the anonymous postings on ipaidabribe.com to provide statistics on corruption - something that is often difficult to quantify. We have scraped the latest 1000 posts from the website and included the following variables: title, amount, name of the department (to which the bribe was paid), transaction details, number of views, and city. To do so, we used several R packages, including "rvest", which enables R to scrape webpages. A function reads the HTML, identifies the nodes of interest, and then creates data variables based on these nodes.
We are left with a dataframe containing eight variables (as well as an id variable), after a few data manipulation operations to transform the variables into the right categories and formats. To create a full data set containing 1000 observations we had to loop our scraping function over 100 sequential webpages, because the website only displays 10 reports per page. This produced the final dataframe used in the analysis.
Data cleaning
There are some outliers in the data involving huge sums of money paid as bribes. While it is possible that these observations accurately reflect real transactions, since this is crowd-sourced data it is also quite possible that there has been some input error. In any case, these massive bribes skew the data and are of limited interest - we have therefore discarded observations in excess of 30 million rupees (approx. 3 million kr.).
library("stringr")
library("rvest")
library("plyr")
library("dplyr")
library("ggplot2")
library("stargazer")
# This function pulls out the variables of interest on each page, cleans them, and puts them together in a data frame. It takes a url, reads the html, identifies the nodes, creates variables based on these nodes, and then uses regular expressions to tidy them up.
extract.bribes <- function(url) {
  cat("url:", url)
  html.data <- read_html(url)
  # Extract the text of each node of interest
  id          <- html_text(html_nodes(html.data, ".unique-reference"))
  titles      <- html_text(html_nodes(html.data, "h3 a"))
  department  <- html_text(html_nodes(html.data, ".name a"))
  amount      <- html_text(html_nodes(html.data, ".paid-amount span"))
  transaction <- html_text(html_nodes(html.data, ".transaction a"))
  date        <- html_text(html_nodes(html.data, ".date"))
  city        <- html_text(html_nodes(html.data, ".location"))
  views       <- html_text(html_nodes(html.data, ".views"))
  # Strip spaces, commas and letters from the amounts, leaving digits only
  amounts <- gsub(" |,|[[:alpha:]]", "", amount)
  date <- as.Date(date, format = "%B %d,%Y")
  # Drop .location entries that are actually dates (e.g. "June 1, 2016")
  city <- city[is.na(str_extract(city, "\\w* \\d+, \\d+"))]
  # The first .views node is a page header, not an observation
  views <- views[-1]
  views <- gsub(" |[[:alpha:]]", "", views)
  return(data.frame(id = as.numeric(id),
                    views = as.numeric(views),
                    titles = as.character(titles),
                    date = date,
                    department = as.character(department),
                    city = as.character(city),
                    amounts = as.numeric(amounts),
                    transaction = as.character(transaction)))
}
## Since we want 1000 observations, and the webpage only shows 10 per page, we have to loop the previous function through 100 sequential webpages. Here, we establish the urls:
n.obs.req <- 1000
extent <- n.obs.req-10
urls <- paste0("http://www.ipaidabribe.com/reports/paid?page=", seq(0,extent,10))
## And then this code applies the function to each url:
my.list <- vector("list", length(urls))
for (i in seq_along(urls)) {
  # extracting information
  my.list[[i]] <- extract.bribes(urls[i])
  # waiting one second between hits, to be polite to the server
  Sys.sleep(1)
  cat(" done!\n")
}
# Then we pull together the results from every url into a single, large dataframe with 1000 observations.
data <- do.call(rbind, my.list)
# As discussed above, we discard outlier bribes in excess of 30 million rupees, which are likely input errors and skew the data.
data.clean <- subset(data,amounts < 30000000)
#It may be interesting to observe the relationship between location and bribe size. For every location, we want to know the size of the population in the given province. This data is scraped from wikipedia:
a <- read_html("https://en.wikipedia.org/wiki/List_of_states_and_union_territories_of_India_by_population") %>%
  html_node(".wikitable") %>%
  html_table()
# Keep only the numeric part of the population column
b1 <- strsplit(a[, 3], split = c("â™"))
b1 <- c(unlist(sapply(b1, FUN = function(x) return(x[-1]))[1:36]), b1[[37]][1])
b1 <- strsplit(b1, split = c("\n"))
b1 <- do.call(rbind, b1)[, 1]
b1 <- as.numeric(gsub(",", "", b1))
provinces <- cbind(a[2], b1)
colnames(provinces) <- c("province", "population")
# Fix one mis-parsed province name
provinces[24, 1] <- "Manipur"
# The province is the part of the city string after the comma
data.clean$province <- do.call(rbind, strsplit(as.character(data.clean$city), split = c(", ")))[, 2]
# And we merge population size into the data.
data.clean <- left_join(data.clean,provinces)
# We also create a variable that contains the weekday, should this be interesting for analysis.
data.clean$day <- weekdays(data.clean$date)
departments <- group_by(data.clean, department)
average <- summarise(departments,
count = n(),
avg = mean(amounts, na.rm = TRUE))
transactions <- group_by(data.clean, transaction)
average2 <- summarise(transactions,
count = n(),
avg = mean(amounts, na.rm = TRUE))
Data analysis
First, we are interested in which bribes receive the most attention on the website. Is the number of views, for example, related to the size of the bribe, such that individuals browsing the website are more interested in reading about large bribes?
ggplot(data.clean, aes(x = amounts, y = views)) +
geom_point() + scale_x_log10() + geom_smooth(method=lm, se=FALSE) + xlab("Size of bribe, rupees (log)") +
ylab("Number of views") +
ggtitle("Correlation between bribe size and number of views")
The graph shows a slightly negative relationship between bribe size and number of views, undermining our hypothesis: the size of the bribe does not appear to be a key driver of views. Note that the x-axis is on a log scale, so even an exponential increase in bribe size is associated with a slight decrease in the number of views.
We are also interested in investigating the relationship between bribe size and the recipient of bribes. Is it the case that some departments, to whom the bribe is paid, wield more power such that they can command higher bribes and/or "sell" more expensive services on average?
ggplot(average, aes(department, avg)) +
geom_point(aes(size = count), alpha = 1/2) +
scale_size_area() + scale_x_discrete(label=abbreviate) + xlab("Department (recipient of bribe)") +
ylab("Average bribe, rupees") +
ggtitle("Frequency and average size of bribes, by department") +
scale_size_continuous(name = "Frequency\nof observations", range = c(2.5,15))
For a list of abbreviations corresponding to department names, see the appendix. We can see that average bribe size varies wildly by department. It is notable that bribes paid to "Municipal services" are low on average but very numerous, as represented by the size of the dot. Paying for a birth certificate is a common "Municipal services" bribe.
Is the average bribe amount related to region and population size? We might imagine differences in the propensity/necessity of bribes in city vs. rural life, for example. To investigate this question, we use the population size variable acquired from Wikipedia, which lets us distinguish between large and small provinces.
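One way to operationalise this is to split provinces at a population threshold and regress bribe size on the resulting indicator. The following is a sketch only: the 20-million cutoff and the variable name `big.province` are our own assumptions, and it relies on the `data.clean` dataframe built above (with the "stargazer" package, loaded earlier, used to print the regression table).

```r
# Flag provinces above an assumed population cutoff of 20 million
data.clean$big.province <- data.clean$population > 2e7

# Regress logged bribe size on the province-size indicator;
# log(amounts + 1) avoids dropping zero-rupee reports
fit <- lm(log(amounts + 1) ~ big.province, data = data.clean)

# Print a readable regression table
stargazer(fit, type = "text")
```

An insignificant coefficient on `big.province` would indicate no detectable association between province size and bribe size.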
We can see that the coefficient corresponding to province size is not significant, so province population size does not appear to be related to bribe size.
Appendix table of abbreviations
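The appendix table can be reproduced from the scraped data itself. This is a sketch using base R's `abbreviate()`, the same function supplied to `scale_x_discrete()` in the plot above, so the abbreviations match the axis labels; the object name `abbrev.table` is our own.

```r
# Map each abbreviated axis label back to the full department name
dept.names <- sort(unique(as.character(data.clean$department)))
abbrev.table <- data.frame(abbreviation = abbreviate(dept.names),
                           department = dept.names,
                           row.names = NULL)
abbrev.table
```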