title: "Group 21"
author: "Søren, Charlotte, Christian and Hallur"
date: "9. nov. 2015"
output: html_document
Assignment 2
Introduction
Assignment 2 is about scraping data from the webpage ipaidabribe.com and analysing it.
I Paid a Bribe is an initiative to fight corruption in India. Its primary aim is to uncover the market price of corruption. Anybody who has had to bribe someone in order to get a service, which could be anything from a birth certificate to a driver's licence, can report it on the webpage for everyone to see. The reports include the amount, the location etc.
Approach
We were asked to scrape the latest 1000 observations from the webpage and analyze the data.
We scraped the data with rvest, using CSS selectors found with SelectorGadget as we have done in the lectures.
The challenge was that there are only 10 reports on every page, which means that we had to scrape the data from $1000 / 10 = 100$ pages. That meant that we had to create a loop inside a loop, where the outer loop runs over the 100 pages and the inner loop runs over the 10 reports on each page.
Here's the code we used:
library("rvest"); library("stringr"); library("plyr"); library("dplyr") # Load all packages we need later on (plyr before dplyr, so that dplyr's functions are not masked)
## Create a vector with the numbers from 0 to 990 in steps of 10
x = seq(0, 990, by = 10)
## We found out that every page shows 10 bribes. In total we need 1000 reports. This means we need to go through each of the pages and extract the information.
## The good thing is that they all have the same structure: http://www.ipaidabribe.com/reports/paid?page= with 0, 10, 20, 30 etc. at the end.
## Therefore, we put together the vector we created above and the following "y"-vector
y = "http://www.ipaidabribe.com/reports/paid?page="
long.links = paste(y, x, sep = "") # Put x and y together
df = NULL
## Now we start with the "real" loop:
for(i in 1:100)
{
  econ.link = long.links[i] # the overview page, built from the "x"- and "y"-vectors
  links = read_html(econ.link, encoding = "UTF-8") %>%
    html_nodes(".heading-3 a") %>%
    html_attr(name = 'href') # the links to the individual reports on this page
  econ.data = list()
  for(url in links[1:10]){ # there are just 10 reports on each page, so we visit these 10 links
    link = read_html(url, encoding = "UTF-8")
    date = link %>% html_nodes(".date") %>% html_text()
    Date = date[1] # scrape the date
    location = link %>% html_nodes(".location") %>% html_text()
    Location = location[1] %>% str_replace_all(pattern = "\\r\\n" , replacement = " ") %>%
      str_trim() # scrape the full location (city and region)
    City = link %>% html_nodes(".location") %>% html_text()
    City = City[1] %>% str_replace_all(pattern = "\\r\\n" , replacement = " ") %>%
      str_extract("[a-zA-Z]+[:space:]+[:punct:]*[a-zA-Z]*[:punct:]*[a-zA-Z]*[:punct:]*") %>%
      str_replace_all(pattern = "," , replacement = "") %>%
      str_trim() # extract the city (the part before the comma)
    Region = link %>% html_nodes(".location") %>% html_text()
    Region = Region[1] %>% str_replace_all(pattern = "\\r\\n" , replacement = " ") %>%
      str_extract("[:punct:]+[:space:]+[a-zA-Z]+[:space:]*[a-zA-Z]*[:space:]*[a-zA-Z]*") %>%
      str_replace_all(pattern = "," , replacement = "") %>%
      str_trim() # extract the region (the part after the comma)
    title = link %>% html_nodes(".heading-3 a") %>% html_text()
    Title = title[1] %>%
      str_replace_all(pattern = "\\n" , replacement = " ") %>%
      str_trim() # scrape the title of the report
    amount = link %>% html_nodes(".paid-amount span") %>% html_text()
    Amount = amount[1] %>%
      str_replace_all(pattern = "\\r\\n" , replacement = " ") %>%
      str_replace_all(pattern = "Paid INR " , replacement = "") %>%
      str_replace_all(pattern = "," , replacement = "") # scrape the amount and strip the text and the thousands separators
    department = link %>% html_nodes(".name a") %>% html_text()
    Department = department[1] # scrape the department
    views = link %>% html_nodes(".overview .views") %>% html_text()
    Views = views[1] %>%
      str_replace_all(pattern = " views" , replacement = "") # scrape the number of views
    trans_details = link %>% html_nodes(".details .transaction a") %>% html_text()
    Trans_details = trans_details[1] # scrape the transaction type
    econ.data[[url]] = cbind(Date, Location, City, Region, Title, Amount, Department, Views, Trans_details) # Put all this information together
  }
  df.econ = ldply(econ.data)
  df.econ$Amount = as.numeric(as.character(df.econ$Amount)) # Make the amount and views numeric, so that we can do some analysis on them
  df.econ$Views = as.numeric(as.character(df.econ$Views))
  df.econ$.id = NULL # drop the id column that ldply creates from the list names
  df = rbind(df, df.econ) # Finally combine each set of 10 reports with all the others
}
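The loop above stops at the first request that fails, which matters on a slow connection. The following is a minimal sketch of a retry wrapper and was not part of our original code: `read_with_retry`, `reader`, `retries` and `wait` are hypothetical names, and `reader` stands in for a call to `read_html`.

```r
# Hypothetical helper (not in the original assignment): call `reader(url)`
# up to `retries` times, pausing `wait` seconds between failed attempts.
read_with_retry = function(url, reader, retries = 3, wait = 1) {
  for (attempt in seq_len(retries)) {
    result = tryCatch(reader(url), error = function(e) NULL)
    if (!is.null(result)) return(result)   # success: stop retrying
    if (attempt < retries) Sys.sleep(wait) # failure: wait, then try again
  }
  NULL # all attempts failed; the caller decides whether to skip the page
}
```

Inside the loop it could be called as `read_with_retry(econ.link, function(u) read_html(u, encoding = "UTF-8"))`, skipping the page when the result is `NULL`.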
Problem with scraping the data at start-up
It was basically impossible for us to scrape the data when we were on the university campus because of the slow internet connection, so we had to do it at home. Our solution was to store the data as a csv file and upload it to a repository on GitHub. Afterwards, we only have to read the csv file from GitHub instead of scraping the site again. We turned the data into a csv file using this code:
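The chunk that wrote the file is not reproduced here. A minimal sketch of that step could look as follows, assuming the scraped data frame has the columns built in the loop above. The toy rows, the name `df.toy` and the temporary path are placeholders, `row.names = FALSE` is our choice for this sketch, and the file name is taken from the GitHub URL that the analysis reads from.

```r
# Toy stand-in for the scraped data frame (values are made up)
df.toy = data.frame(Date = c("2015-11-05", "2015-11-06"),
                    City = c("Mumbai", "Pune"),
                    Amount = c(500, 1200),
                    stringsAsFactors = FALSE)

# Write the data to a csv file without row names, so that reading it back
# does not create an extra index column
csv.path = file.path(tempdir(), "I paid a bribe.csv")
write.csv(df.toy, csv.path, row.names = FALSE)

# Reading the file back gives the same columns and values
df.check = read.csv(csv.path, stringsAsFactors = FALSE)
```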
Data analysis
Map illustration
Introduction
We felt inspired by assignment 1 and wanted to illustrate the scale of the bribes by showing it on a map of India.
We have chosen to create a map showing the average bribe amount per inhabitant in India during the last three weeks (1000 reports). The map is divided up into states and union territories. (See link)
The colours (from purple to yellow) represent the log-transformed amount of bribes per inhabitant.
The average amount of bribes paid per inhabitant is 0.007 rupees. At first glance the amount seems pretty small, but we have to keep in mind that this is "per inhabitant" and that we only use the latest 1000 reports: the data covers only a short time period, and not every citizen uses the website.
Regions that are printed grey had no observations in the analysed time period.
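To make the "per inhabitant" figure and the log scale concrete, here is a small sketch with made-up numbers. The state names, totals and populations are hypothetical and not taken from our data.

```r
# Hypothetical totals of reported bribes (INR) and populations
toy = data.frame(State = c("A", "B", "C"),
                 Bribe = c(700000, 35000, 1500),
                 Population = c(100e6, 50e6, 30e6))

# Bribe amount per inhabitant, computed the same way as for the map
toy$PerInhabitant = toy$Bribe / toy$Population

# The values span several orders of magnitude, so a linear colour scale
# would show almost every state in the same colour; the log spreads them out
toy$LogPerInhabitant = log(toy$PerInhabitant)
toy
```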
library("readr")
library("rvest")
library("stringr")
library("plyr") # load plyr before dplyr, so that dplyr's functions are not masked
library("dplyr")
library("grid")
df.bribe = read.csv("https://raw.githubusercontent.com/hallurp/Group21/master/I%20paid%20a%20bribe.csv")
df.bribe = df.bribe[-c(46),] # Remove the row with an Amount of 8120303241 INR (an implausible outlier of several billion rupees)
df.bribe = df.bribe[-c(916),] # Remove the row with only one filled field (the rest is NA); the index refers to the data after the removal above
## Inhabitants per region / bribes per inhabitant ##
df.bribe.state = aggregate(df.bribe$Amount, by = list(Date=df.bribe$Region), FUN = sum)
df.st = data.frame(cbind(c("Goa", "Himachal Pradesh","Meghalaya","Nagaland", "Odisha", "Sikkim", "Andaman and Nicobar Islands", "Dadra and Nagar Haveli", "Daman and Diu", "Lakshadweep", "Manipur")),c(0,0,0,0,14429,0,0,0,0,0,150))
colnames(df.st)<-c("Date","x")
df.st$Date = as.character(df.st$Date)
df.bribe.state = rbind(df.bribe.state, df.st)
df.states = read_html("https://en.wikipedia.org/wiki/List_of_states_in_India_by_past_population") %>%
html_node(".wikitable") %>% # extract first node with class wikitable
html_table() # then convert the HTML table into a data frame
df.states = df.states[,-c(1,3,4,5,6,7,8)]
df.states[24,1] = "Manipur"
df.states = inner_join(df.bribe.state, df.states, by = c("Date" = "State or union territory"))
colnames(df.states) <- c("State", "Bribe", "Population")
df.states$Population = str_replace_all(df.states$Population, pattern = "," , replacement = "")
df.states$x = as.numeric(df.states$Bribe)/as.numeric(df.states$Population)
# Rename three states (by row position) to the spellings used in the GADM map data
df.states[33,1] = "Andaman and Nicobar"
df.states[31,1] = "Orissa"
df.states[25,1] = "Uttaranchal"
### Visualization: map ###
library(raster)
library(rgdal)
library(rgeos)
library(ggplot2)
library(dplyr)
library(grid)
df.bribe.amount = aggregate(df.bribe$Amount, by=list(Region=df.bribe$Region), FUN=sum)
# df.bribe$Region[5] == df.bribe$Region[6]
# df.bribe$Region = as.character(df.bribe$Region)
###!!! Remember to put "grid" in the library !! ###
### Get data
india <- getData("GADM", country = "India", level = 1)
map <- fortify(india)
map$id <- as.integer(map$id)
dat <- data.frame(id = 1:(length(india@data$NAME_1)), state = india@data$NAME_1)
map_df <- inner_join(map, dat, by = "id")
centers <- data.frame(gCentroid(india, byid = TRUE))
centers$state <- dat$state
map_df = inner_join(map_df, df.states, by = c("state" = "State"))
### This is hrbrmstr's own function
theme_map <- function (base_size = 12, base_family = "") {
theme_gray(base_size = base_size, base_family = base_family) %+replace%
theme(
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.ticks.length=unit(0.3, "lines"),
axis.ticks.margin=unit(0.5, "lines"),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
legend.background=element_rect(fill="white", colour=NA),
legend.key=element_rect(colour="white"),
legend.key.size=unit(1.5, "lines"),
legend.position="right",
legend.text=element_text(size=rel(1.2)),
legend.title=element_text(size=rel(1.4), face="bold", hjust=0),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
panel.margin=unit(0, "lines"),
plot.background=element_blank(),
plot.margin=unit(c(1, 1, 0.5, 0.5), "lines"),
plot.title=element_text(size=rel(1.8), face="bold", hjust=0.5),
strip.background=element_rect(fill="grey90", colour="grey50"),
strip.text.x=element_text(size=rel(0.8)),
strip.text.y=element_text(size=rel(0.8), angle=-90)
)
}
library("viridis")
colnames(map_df)[colnames(map_df)=="x"] <- "Amount"
ggplot(map_df) +
geom_map(data = map_df, map = map_df,
aes(map_id = id, x = long, y = lat, group = group, fill = Amount), color = "black", size = 0.15) +
geom_text(data = centers, aes(label = state, x = x, y = y), size = 2.5, colour = "black") +
coord_map() +
labs(x = "", y = "", title = "Bribes per inhabitant in India") +
theme_map() +
scale_fill_viridis(trans = "log", breaks = c(3.547531e-04, 6.547531e-03),
labels = c("low", "high"), option = "D", name = "Amount\n(log transformed)")
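The data preparation above renames states by row position (for example `df.states[31,1] = "Orissa"`), which silently breaks if the row order changes. Below is a sketch of the same renaming done by name instead. `rename_states` and `gadm.names` are hypothetical names, and the original spellings on the left-hand side are assumptions about what the rows contained before the renaming.

```r
# Mapping from the (assumed) names in our data to the names in the map data
gadm.names = c("Andaman and Nicobar Islands" = "Andaman and Nicobar",
               "Odisha" = "Orissa",
               "Uttarakhand" = "Uttaranchal")

# Replace every state that appears in the lookup; leave all others unchanged
rename_states = function(states, lookup) {
  states = as.character(states)
  hit = states %in% names(lookup)
  states[hit] = unname(lookup[states[hit]])
  states
}

renamed = rename_states(c("Odisha", "Goa", "Uttarakhand"), gadm.names)
```

With this helper the three assignments could be replaced by `df.states$State = rename_states(df.states$State, gadm.names)`.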
Internet access
The number of reports per region unsurprisingly depends on internet access: as we found out, only a small percentage (24%) of people in India use the internet. (See link) This, of course, influences the results: the regions Maharashtra and Delhi, home to India's two biggest and most important cities, Mumbai and Delhi, have more internet access than the rest of India. A possible reason for this is their better infrastructure. This assumption is also based on a map we found here (link), showing the internet penetration by region.
This map shows that the regions with the highest internet penetration rate are also the ones where we find the highest amount paid in bribes per inhabitant.
Karnataka
The third highest amount per inhabitant that we notice in the map is in a region called Karnataka. This doesn't fit with our theory above regarding internet access in India, but there could be another explanation: the company that runs the website ipaidabribe.com is located here. This could raise awareness of the site among the people living in the region and make them more likely to file a report. In the first phase after the launch of the website, the company probably focused its marketing on its home base.
We can conclude that the map, and the website, doesn't really show where most bribes in India are paid. The result depends very much on internet access and on awareness of the site.
Transaction Details / What do people pay a bribe for - and how much?
# Count the number of reports per transaction type
df.bribe.trans = group_by(df.bribe, Trans_details)
df.bribe.trans = dplyr::summarise(df.bribe.trans, count = n())
# Sum the amounts per transaction type (the grouping column is called "Date" here, the sum "x")
df.bribe.trans2 = aggregate(df.bribe$Amount, by = list(Date=df.bribe$Trans_details), FUN = sum, na.rm=TRUE)
df.bribe.trans3 = inner_join(df.bribe.trans2, df.bribe.trans, by = c("Date" = "Trans_details"))
# Average amount per report = total amount / number of reports
df.bribe.trans3$prep = round(as.numeric(df.bribe.trans3$x)/as.numeric(df.bribe.trans3$count))
# Keep only some selected services (the other rows are removed by position)
df.bribe.trans3 = df.bribe.trans3[-c(1, 3, 4, 5, 7, 8, 9, 10, 12, 13, 15, 17, 18, 21, 22, 24, 25, 26, 28, 30, 31, 32, 33, 34, 35, 39, 40, 41),]
# Horizontal bar chart, ordered by the average amount
p = ggplot(df.bribe.trans3, aes(x = reorder(Date, prep),
y = prep))
p + geom_bar(fill = "#fd482f", color = "black", stat = "identity") + coord_flip() + labs(title = "Average bribe-amount for:", x = "", y = "Amount in Indian Rupee" ) +
theme(plot.title = element_text(lineheight=.5, face="bold"))
Conclusion
We plotted the average bribe amount (the total amount spent on each transaction type, divided by its number of reports) for some selected services.
The differences between them are surprisingly large. For a new PAN card (a code that identifies Indians, especially those who pay income tax) you need to pay, on average, 70773 Indian rupees (around 7,500 DKK), whereas a duplicate of the driving licence is very cheap at 500 Indian rupees (around 52 DKK).
One funny note: in order to get a scholarship in India, it seems that you sometimes first need to pay money before you get some back.
But we need to be careful with these results: for some of the services we can be fairly sure about the average amount because there are many reports. Others, like "Land Registration", had only a few reports.
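One way to act on this caveat is to keep the number of reports next to each average and to drop the categories with too few reports. Below is a sketch in base R with made-up rows. The threshold `min.reports` is a hypothetical choice, and only the column names `Trans_details` and `Amount` come from our data.

```r
# Made-up reports; the transaction labels are only illustrative
toy = data.frame(
  Trans_details = c("Birth Certificate", "Birth Certificate", "Birth Certificate",
                    "Land Registration", "New PAN Card", "New PAN Card"),
  Amount = c(200, 400, 300, 50000, 1000, 3000),
  stringsAsFactors = FALSE)

# Mean amount and number of reports per transaction type
avg = aggregate(Amount ~ Trans_details, data = toy, FUN = mean)
cnt = aggregate(Amount ~ Trans_details, data = toy, FUN = length)
names(avg)[2] = "Mean"
names(cnt)[2] = "Reports"
summary.trans = merge(avg, cnt, by = "Trans_details")

# Keep only the types with at least `min.reports` reports
min.reports = 2
reliable = summary.trans[summary.trans$Reports >= min.reports, ]
reliable
```

In this toy data "Land Registration" has a single report and is dropped, while the other two types remain together with their report counts.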
How expensive is a birth certificate in different cities?
Conclusion
After doing the first graph on the average bribe amount for each service, we asked ourselves: is there a difference between the amounts paid for the same service in, for example, Delhi and Mumbai? Now we can say: yes, there are huge differences!
We used the bribes for a "birth certificate" as an example, because there are 219 reports on that during the last three weeks, which is the highest number.
We didn't really expect these big differences, and it seems that the results aren't really reliable. However, we can neither confirm nor dismiss them, because we have more than just one or two reports for each city. (Pune: 4; Agra: 44; Bangalore: 33; New Delhi: 4; Mumbai: 10)
It may well be the case that bribes are less common (and therefore more expensive) in the two most important cities, Delhi and Mumbai, because they receive more national and international attention.
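The code for this comparison is not shown above. Below is a minimal sketch of how such a comparison can be computed, assuming the columns `City`, `Trans_details` and `Amount` from the scraped data. The label "Birth Certificate" is an assumption about how the site names this transaction type, and all rows are made up.

```r
# Made-up reports for illustration only
toy = data.frame(
  City = c("Agra", "Agra", "Mumbai", "Mumbai", "Pune", "Agra"),
  Trans_details = c("Birth Certificate", "Birth Certificate", "Birth Certificate",
                    "Birth Certificate", "Birth Certificate", "Driving Licence"),
  Amount = c(100, 300, 2000, 4000, 500, 900),
  stringsAsFactors = FALSE)

# Keep only the birth certificate reports
birth = toy[toy$Trans_details == "Birth Certificate", ]

# tapply returns one value per city, sorted alphabetically by city name
birth.mean = tapply(birth$Amount, birth$City, mean)
birth.n = tapply(birth$Amount, birth$City, length)

# Average amount and number of reports per city
birth.city = data.frame(City = names(birth.mean),
                        Mean = as.numeric(birth.mean),
                        Reports = as.integer(birth.n),
                        stringsAsFactors = FALSE)
birth.city
```

Keeping the `Reports` column next to the mean makes it visible which city averages rest on only a handful of reports.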
Some general notes
All of our conclusions are rather tentative. As we said, the analysed period of time is very short and the observations are too few to draw reliable conclusions.
But we think that our main messages could nonetheless be true.