sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Assignment 2 - Group 21 #39

Closed hallurp closed 8 years ago

hallurp commented 9 years ago

title: "Group 21"
author: "Søren, Charlotte, Christian and Hallur"
date: "9. nov. 2015"
output: html_document

Assignment 2

Introduction

Assignment 2 is about scraping data from the webpage ipaidabribe.com and analysing the data. I Paid a Bribe is an initiative to fight corruption in India. Its primary aim is to uncover the market price of corruption. Anybody who has to bribe someone to get what they want, which could be anything from a birth certificate to a driver's licence, reports it on the webpage for everyone to see. They report the amount, the location etc.

Approach

We were asked to scrape the latest 1000 observations from the webpage and analyze the data. We found the CSS selectors with Selector Gadget, as we have done in the lectures. The challenge was that there are only 10 reports on every page, which means that we had to scrape the data from $1000 / 10 = 100$ pages. That meant that we had to create a loop inside a loop, where the inner loop runs over the 10 reports on a page and the outer loop runs over the 100 pages.

Here's the code we used:

library("rvest"); library("stringr"); library("plyr"); library("dplyr") # Load all packages we need later on (plyr before dplyr, so that plyr does not mask dplyr functions)

## Create a vector with numbers from 0 to 990 in steps of 10

x = seq(0, 990, by = 10) # same result as multiplying every number from 0 to 99 by 10

## We found out that every page shows 10 bribes. In total we need 1000 reports. This means we need to go through each of the pages and extract the information.
## The good thing is that they all have the same structure: http://www.ipaidabribe.com/reports/paid?page= with 10, 20, 30 etc. at the end.
## Therefore, we combine the vector we created above with the following "y"-vector

y = "http://www.ipaidabribe.com/reports/paid?page="

long.links = paste(y, x, sep = "") # Put x and y together

df = NULL

## Now we start with the "real" loop/function:

for(i in 1:100)
{
  econ.link = long.links[i] # use the links we created from the "x"- and "y"-vectors

links = read_html(econ.link, encoding = "UTF-8") %>%
  html_nodes(".heading-3 a")%>%
  html_attr(name = 'href')
links

econ.data = list()
for(i in links[1:10]){ # each page has only 10 reports, so this loop extracts those 10 (here i is a report URL and shadows the outer page index)
link = read_html(i, encoding = "UTF-8")

date = link %>% html_nodes(".date") %>% html_text()
Date = date[1] # scrape the date

location = link %>% html_nodes(".location") %>% html_text()
Location = location[1] %>% str_replace_all(pattern = "\\r\\n" , replacement = " ")%>%
  str_trim() # scrape the full location (city and region)

City = link %>% html_nodes(".location") %>% html_text()
City = City[1] %>% str_replace_all(pattern = "\\r\\n" , replacement = " ") %>% str_extract("[a-zA-Z]+[:space:]+[:punct:]*[a-zA-Z]*[:punct:]*[a-zA-Z]*[:punct:]*") %>% 
    str_replace_all(pattern = "," , replacement = "") %>% 
    str_trim() # scrape the city

Region = link %>% html_nodes(".location") %>% html_text()
Region = Region[1] %>% str_replace_all(pattern = "\\r\\n" , replacement = " ") %>%str_extract("[:punct:]+[:space:]+[a-zA-Z]+[:space:]*[a-zA-Z]*[:space:]*[a-zA-Z]*") %>% 
  str_replace_all(pattern = "," , replacement = "") %>% 
  str_trim() # scrape the region

title = link %>% html_nodes(".heading-3 a") %>% html_text()
Title = title[1] %>% 
  str_replace_all(pattern = "\\n" , replacement = " ") %>%
  str_trim()

amount = link %>% html_nodes(".paid-amount span") %>% html_text()
Amount = amount[1] %>% 
  str_replace_all(pattern = "\\r\\n" , replacement = " ") %>% 
  str_replace_all(pattern = "Paid INR " , replacement = "") %>%
  str_replace_all(pattern = "," , replacement = "")

department = link %>% html_nodes(".name a") %>% html_text()
Department = department[1]

views = link %>% html_nodes(".overview .views") %>% html_text()
Views = views[1]%>% 
  str_replace_all(pattern = " views" , replacement = "")

trans_details = link %>% html_nodes(".details .transaction a") %>% html_text()
Trans_details = trans_details[1]

econ.data[[i]] = cbind(Date, Location, City, Region, Title, Amount, Department, Views, Trans_details) # Combine all of this information into one row
}

df.econ = ldply(econ.data)
df.econ$Amount = as.numeric(as.character(df.econ$Amount)) # Make the amount and views numeric, so that we can do some analysis on it
df.econ$Views = as.numeric(as.character(df.econ$Views))
df.econ$.id = NULL

df = rbind(df, df.econ) # Finally, append these 10 reports to all the others

}
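The string cleaning inside the loop can also be pulled out into small functions that are easy to check without touching the website. The following is only a sketch in base R: the function names `clean_amount` and `split_location` are hypothetical, and the example strings are assumptions about what the scraped text looks like, not values taken from the page.

```r
# Hypothetical helpers, not part of the original assignment code.

# Turn a raw amount string such as "Paid INR 1,500\r\n" into a number.
clean_amount <- function(raw) {
  digits <- gsub("[^0-9.]", "", raw)  # keep only digits and the decimal point
  as.numeric(digits)                  # an empty string becomes NA
}

# Split a raw location such as "Bangalore\r\n , Karnataka" into city and region.
split_location <- function(raw) {
  flat  <- gsub("[\r\n]+", " ", raw)                       # flatten line breaks
  parts <- trimws(strsplit(flat, ",", fixed = TRUE)[[1]])  # split on the comma
  parts <- parts[nzchar(parts)]                            # drop empty pieces
  region <- if (length(parts) >= 2) parts[length(parts)] else NA_character_
  list(city = parts[1], region = region)
}

amount   <- clean_amount("Paid INR 1,500\r\n")
location <- split_location("Bangalore\r\n , Karnataka")
print(amount)
print(location)
```

Splitting on the comma is a simpler alternative to the regular expressions used above; it assumes the location always has the form "city, region".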

Problems scraping the data at start-up

It was basically impossible for us to scrape the data on the University campus because of the slow internet connection, so we had to do it at home. We then came up with the solution of storing the data as a csv file and uploading it to a repository on GitHub. Afterwards, we would only have to use the readr library to read the data. We turned the data into a csv file using this code:

write.csv(df, file = "I_paid_a_bribe.csv", row.names = FALSE) # write the full data set (df); df.econ only holds the last page
MyData = read.csv(file="I_paid_a_bribe.csv", header=TRUE, sep=",")
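To check that nothing is lost when the data goes to disk and back, a round trip can be tried on a small data frame first. This is a minimal sketch: `df_all` is a made-up stand-in for the scraped data, and the temporary file path is an assumption used only for the example.

```r
# Toy stand-in for the full scraped data frame (values are invented).
df_all <- data.frame(
  City   = c("Mumbai", "Agra"),
  Amount = c(500, 1200),
  stringsAsFactors = FALSE
)

# Write to a temporary file so the example leaves nothing behind.
csv_path <- tempfile(fileext = ".csv")
write.csv(df_all, file = csv_path, row.names = FALSE)

# Read the file back and compare the shape with the original.
reloaded <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
print(identical(dim(reloaded), dim(df_all)))
```

Comparing the number of rows before and after is a quick way to notice that only part of the data was written.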

Data analysis

Map illustration

Introduction

We felt inspired by assignment 1 and wanted to illustrate the scale of the bribes by showing them on a map of India. We chose to create a map showing the average bribes per inhabitant in India during the last three weeks (1000 reports). The map is divided into States and Union territories. (See link) The colours (from purple to yellow) represent the log-transformed amount of bribes per inhabitant. The average amount paid in bribes per inhabitant is 0.007 rupees. At first glance the amount seems pretty small, but we have to keep in mind that this is "per inhabitant" and that we only use the latest 1000 reports: the data cover only a short time period, and not every citizen uses the website. Regions printed in grey had no observations in the analysed time period.

library("readr")
library("dplyr")
library("rvest")
library("stringr")
library("plyr")
library("grid")

df.bribe = read.csv("https://raw.githubusercontent.com/hallurp/Group21/master/I%20paid%20a%20bribe.csv")
df.bribe = df.bribe[-c(46),] # Remove the row with an Amount of 8120303241 (which would be several billion rupees...)
df.bribe = df.bribe[-c(916),] # Remove the row with only one piece of information in it (the rest is NA)

## Inhabitants per Region / bribes per inhabitant ##
df.bribe.state = aggregate(df.bribe$Amount, by = list(Date=df.bribe$Region), FUN = sum)
df.st = data.frame(cbind(c("Goa", "Himachal Pradesh","Meghalaya","Nagaland", "Odisha", "Sikkim", "Andaman and Nicobar Islands", "Dadra and Nagar Haveli", "Daman and Diu", "Lakshadweep", "Manipur")),c(0,0,0,0,14429,0,0,0,0,0,150))
colnames(df.st)<-c("Date","x")
df.st$Date = as.character(df.st$Date)
df.bribe.state = rbind(df.bribe.state, df.st)

df.states = read_html("https://en.wikipedia.org/wiki/List_of_states_in_India_by_past_population") %>%
  html_node(".wikitable") %>% # extract first node with class wikitable
  html_table() # then convert the HTML table into a data frame

df.states = df.states[,-c(1,3,4,5,6,7,8)]
df.states[24,1] = "Manipur"

df.states = inner_join(df.bribe.state, df.states, by = c("Date" = "State or union territory"))
colnames(df.states) <- c("State", "Bribe", "Population")
df.states$Population = str_replace_all(df.states$Population, pattern = "," , replacement = "")
df.states$x = as.numeric(df.states$Bribe)/as.numeric(df.states$Population)
df.states[33,1] = "Andaman and Nicobar"
df.states[31,1] = "Orissa"
df.states[25,1] = "Uttaranchal"

### Visualization: map ###

library(raster)
library(rgdal)
library(rgeos)
library(ggplot2)
library(dplyr)
library(grid)

df.bribe.amount = aggregate(df.bribe$Amount, by=list(Region=df.bribe$Region), FUN=sum)

# df.bribe$Region[5] == df.bribe$Region[6]
# df.bribe$Region = as.character(df.bribe$Region)

###!!! Remember to put "grid" in the library !! ###

### Get data
india <- getData("GADM", country = "India", level = 1)

map <- fortify(india)
map$id <- as.integer(map$id)

dat <- data.frame(id = 1:(length(india@data$NAME_1)), state = india@data$NAME_1)
map_df <- inner_join(map, dat, by = "id")

centers <- data.frame(gCentroid(india, byid = TRUE))
centers$state <- dat$state

map_df = inner_join(map_df, df.states, by = c("state" = "State"))

### This is hrbrmstr's own function
theme_map <- function (base_size = 12, base_family = "") {
  theme_gray(base_size = base_size, base_family = base_family) %+replace% 
    theme(
      axis.line=element_blank(),
      axis.text.x=element_blank(),
      axis.text.y=element_blank(),
      axis.ticks=element_blank(),
      axis.ticks.length=unit(0.3, "lines"),
      axis.ticks.margin=unit(0.5, "lines"),
      axis.title.x=element_blank(),
      axis.title.y=element_blank(),
      legend.background=element_rect(fill="white", colour=NA),
      legend.key=element_rect(colour="white"),
      legend.key.size=unit(1.5, "lines"),
      legend.position="right",
      legend.text=element_text(size=rel(1.2)),
      legend.title=element_text(size=rel(1.4), face="bold", hjust=0),
      panel.background=element_blank(),
      panel.border=element_blank(),
      panel.grid.major=element_blank(),
      panel.grid.minor=element_blank(),
      panel.margin=unit(0, "lines"),
      plot.background=element_blank(),
      plot.margin=unit(c(1, 1, 0.5, 0.5), "lines"),
      plot.title=element_text(size=rel(1.8), face="bold", hjust=0.5),
      strip.background=element_rect(fill="grey90", colour="grey50"),
      strip.text.x=element_text(size=rel(0.8)),
      strip.text.y=element_text(size=rel(0.8), angle=-90) 
    )   
}

library("viridis")

colnames(map_df)[colnames(map_df)=="x"] <- "Amount"

ggplot(map_df) +
  geom_map(data = map_df, map = map_df,
           aes(map_id = id, x = long, y = lat, group = group, fill = Amount), color = "black", size = 0.15) +
  geom_text(data = centers, aes(label = state, x = x, y = y), size = 2.5, colour = "black") +
  coord_map() +
  labs(x = "", y = "", title = "Bribes per inhabitant in India") +
  theme_map() +
  scale_fill_viridis(trans = "log", breaks = c(3.547531e-04, 6.547531e-03),
                     labels = c("low", "high"), option = "D", name = "Amount\n(log transformed)")
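The per-inhabitant figure behind the map is the total amount reported in a state divided by that state's population, and the colour scale uses its logarithm. The sketch below reproduces that calculation in base R on invented numbers. The state names, amounts, populations and the `max_plausible` cut-off are all assumptions for illustration, and the outlier is dropped by a condition rather than by row index, which is a swapped-in alternative to the index-based removal used above.

```r
# Toy numbers, invented for illustration only.
bribes <- data.frame(
  State  = c("A", "A", "B", "C"),
  Amount = c(1000, 3000, 500, 8120303241),
  stringsAsFactors = FALSE
)
population <- data.frame(
  State      = c("A", "B", "C"),
  Population = c("2,000,000", "500,000", "1,000,000"),  # text with thousands separators
  stringsAsFactors = FALSE
)

# Drop implausible or missing amounts by condition instead of by row number.
max_plausible <- 1e7  # hypothetical cut-off
bribes <- bribes[!is.na(bribes$Amount) & bribes$Amount <= max_plausible, ]

# Total amount per state, joined with the population figures.
totals     <- aggregate(Amount ~ State, data = bribes, FUN = sum)
per_capita <- merge(totals, population, by = "State")

# Remove the separators, then compute amount per inhabitant and its log.
per_capita$Population       <- as.numeric(gsub(",", "", per_capita$Population))
per_capita$PerInhabitant    <- per_capita$Amount / per_capita$Population
per_capita$LogPerInhabitant <- log(per_capita$PerInhabitant)
print(per_capita)
```

Filtering by condition keeps working if the rows arrive in a different order, for example after scraping again.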

Internet access

Not surprisingly, the number of reports per region depends on internet access: as we found out, only a small percentage (24%) of people in India use the internet. (See link) This, of course, influences the results of the reports, where we see that the regions Maharashtra and Delhi, home to India's two biggest and most important cities, Mumbai and Delhi, have more internet access than the rest of India. A possible reason for this is the better infrastructure. This assumption is also based on a map we found here (link), showing the internet penetration by region. This map shows that the regions with the highest internet penetration rate were also the ones where we find the highest amount paid in bribes per inhabitant.

Karnataka

The third-highest amount per inhabitant on the map is in a region called Karnataka. This doesn't fit our theory above regarding internet access in India, but there could be another explanation: the company that runs the website ipaidabribe.com is located there. This could raise awareness among the people living in the region and make them more likely to file a report. In the first phase after the launch of the website, the company probably focused its marketing on its home base.

We can conclude that the map (and the website) doesn't really show where the most bribes in India are paid. It depends very much on internet access and awareness of the site.

Transaction Details / For what do people pay a bribe - and how much?


df.bribe.trans = group_by(df.bribe, Trans_details)

df.bribe.trans = dplyr::summarise(df.bribe.trans, count = n())
df.bribe.trans2 = aggregate(df.bribe$Amount, by = list(Date=df.bribe$Trans_details), FUN = sum, na.rm=TRUE)
df.bribe.trans3 = inner_join(df.bribe.trans2, df.bribe.trans, by = c("Date" = "Trans_details"))
df.bribe.trans3$prep = round(as.numeric(df.bribe.trans3$x)/as.numeric(df.bribe.trans3$count))

df.bribe.trans3 = df.bribe.trans3[-c(1, 3, 4, 5, 7, 8, 9, 10, 12, 13, 15, 17, 18, 21, 22, 24, 25, 26, 28, 30, 31, 32, 33, 34, 35, 39, 40, 41),]

p = ggplot(df.bribe.trans3, aes(x = reorder(Date, prep), 
                       y = prep)) 
p + geom_bar(fill = "#fd482f", color = "black", stat = "identity") + coord_flip() + labs(title = "Average bribe-amount for:", x = "", y = "Amount in Indian Rupee" ) + 
  theme(plot.title = element_text(lineheight=.5, face="bold"))

Conclusion

We plotted the average bribe amount (total amount spent on each transaction detail, divided by its number of reports) for some selected services. Surprisingly, there is a huge difference between them. For a new PAN card (a code that identifies Indians, especially those who pay income tax) you need to pay, on average, 70773 Indian Rupee (around 7,500 DKK), whereas a duplicate driving license is very cheap at 500 Indian Rupee (around 52 DKK). One funny note: in order to get a scholarship in India, it seems that you sometimes first need to pay money before you get some back.

But we need to be careful with these results: for some of the services we can be fairly sure about the average amount because there are many reports. Others, like "Land Registration", had only a few reports.
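Because the average is the total amount divided by the number of reports, it helps to keep the report count next to each average, so that thinly supported averages stand out. The following base R sketch does this on invented data; the amounts and the `min_reports` threshold are assumptions, not values from the scraped data set.

```r
# Toy reports, invented for illustration only.
reports <- data.frame(
  Trans_details = c("Birth Certificate", "Birth Certificate",
                    "Birth Certificate", "Land Registration"),
  Amount = c(200, 400, 600, 50000),
  stringsAsFactors = FALSE
)

# Average amount and number of reports per transaction type.
avg   <- tapply(reports$Amount, reports$Trans_details, mean)
count <- tapply(reports$Amount, reports$Trans_details, length)

summary_by_type <- data.frame(
  Trans_details = names(avg),
  Average       = as.numeric(avg),
  Reports       = as.integer(count),
  stringsAsFactors = FALSE
)

# Flag averages that rest on very few reports.
min_reports <- 3  # hypothetical threshold
summary_by_type$Reliable <- summary_by_type$Reports >= min_reports
print(summary_by_type)
```

Here the single "Land Registration" report produces a large average that the flag marks as unreliable.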

How expensive is a birth certificate in different cities?


## About transaction details ##

Mumbai = df.bribe[!(df.bribe$City != "Mumbai"),]
Delhi = df.bribe[!(df.bribe$City != "New Delhi"),]
Bangalore = df.bribe[!(df.bribe$City != "Bangalore"),]
Agra = df.bribe[!(df.bribe$City != "Agra"),]
Pune = df.bribe[!(df.bribe$City != "Pune"),]

Mumbai = Mumbai[!(Mumbai$Trans_details != "Birth Certificate"),]
Delhi = Delhi[!(Delhi$Trans_details != "Birth Certificate"),]
Bangalore = Bangalore[!(Bangalore$Trans_details != "Birth Certificate"),]
Agra = Agra[!(Agra$Trans_details != "Birth Certificate"),]
Pune = Pune[!(Pune$Trans_details != "Birth Certificate"),]

Mumbai = na.omit(Mumbai)
Delhi = na.omit(Delhi)
Bangalore = na.omit(Bangalore)
Agra = na.omit(Agra)
Pune = na.omit(Pune)

Delhi. = group_by(Delhi, Trans_details)

Delhi. = dplyr::summarise(Delhi., count = n())
Delhi.2 = aggregate(Delhi$Amount, by = list(Date=Delhi$Trans_details), FUN = sum, na.rm=TRUE)
Delhi.3 = inner_join(Delhi.2, Delhi., by = c("Date" = "Trans_details"))
Delhi.3$prep = round(as.numeric(Delhi.3$x)/as.numeric(Delhi.3$count))
Delhi.3$city = "New Delhi"

Mumbai. = group_by(Mumbai, Trans_details)

Mumbai. = dplyr::summarise(Mumbai., count = n())
Mumbai.2 = aggregate(Mumbai$Amount, by = list(Date=Mumbai$Trans_details), FUN = sum, na.rm=TRUE)
Mumbai.3 = inner_join(Mumbai.2, Mumbai., by = c("Date" = "Trans_details"))
Mumbai.3$prep = round(as.numeric(Mumbai.3$x)/as.numeric(Mumbai.3$count))
Mumbai.3$city = "Mumbai"

Bangalore. = group_by(Bangalore, Trans_details)

Bangalore. = dplyr::summarise(Bangalore., count = n())
Bangalore.2 = aggregate(Bangalore$Amount, by = list(Date=Bangalore$Trans_details), FUN = sum, na.rm=TRUE)
Bangalore.3 = inner_join(Bangalore.2, Bangalore., by = c("Date" = "Trans_details"))
Bangalore.3$prep = round(as.numeric(Bangalore.3$x)/as.numeric(Bangalore.3$count))
Bangalore.3$city = "Bangalore"

Agra. = group_by(Agra, Trans_details)

Agra. = dplyr::summarise(Agra., count = n())
Agra.2 = aggregate(Agra$Amount, by = list(Date=Agra$Trans_details), FUN = sum, na.rm=TRUE)
Agra.3 = inner_join(Agra.2, Agra., by = c("Date" = "Trans_details"))
Agra.3$prep = round(as.numeric(Agra.3$x)/as.numeric(Agra.3$count))
Agra.3$city = "Agra"

Pune. = group_by(Pune, Trans_details)

Pune. = dplyr::summarise(Pune., count = n())
Pune.2 = aggregate(Pune$Amount, by = list(Date=Pune$Trans_details), FUN = sum, na.rm=TRUE)
Pune.3 = inner_join(Pune.2, Pune., by = c("Date" = "Trans_details"))
Pune.3$prep = round(as.numeric(Pune.3$x)/as.numeric(Pune.3$count))
Pune.3$city = "Pune"

Cities = rbind(Pune.3, Agra.3, Bangalore.3, Delhi.3, Mumbai.3)

q = ggplot(Cities, aes(x = reorder(city, prep), 
                                y = prep)) 
q + geom_bar(fill = "#fd482f", color = "black", stat = "identity") + coord_flip() + labs(title = "Average bribe-amount for a birth certificate in:", x = "City", y = "Amount in Indian Rupee" ) + 
  theme(plot.title = element_text(lineheight=.5, face="bold")) + geom_text(aes(label=prep), size = 3.5, hjust=1.1, vjust=0)
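The five near-identical blocks above all compute the same thing for one city at a time. As a sketch of a more compact alternative, the same table can be built in one pass by filtering once and grouping by city. The example uses base R and invented data; the column names `prep` and `count` follow the code above, and only rows with a missing amount are dropped here, whereas the original `na.omit` drops rows with any missing value.

```r
# Toy reports, invented for illustration only.
reports <- data.frame(
  City = c("Mumbai", "Mumbai", "Agra", "Agra", "Pune", "Chennai"),
  Trans_details = c("Birth Certificate", "Birth Certificate", "Birth Certificate",
                    "Passport", "Birth Certificate", "Birth Certificate"),
  Amount = c(1000, 3000, 200, 900, NA, 700),
  stringsAsFactors = FALSE
)

cities <- c("Mumbai", "New Delhi", "Bangalore", "Agra", "Pune")

# Keep birth-certificate reports from the selected cities with a known amount.
keep <- reports$City %in% cities &
  reports$Trans_details == "Birth Certificate" &
  !is.na(reports$Amount)
selected <- reports[keep, ]

# Rounded average and number of reports per city.
city_avg <- aggregate(Amount ~ City, data = selected,
                      FUN = function(a) round(mean(a)))
city_n   <- aggregate(Amount ~ City, data = selected, FUN = length)
names(city_avg)[2] <- "prep"
names(city_n)[2]   <- "count"

Cities <- merge(city_avg, city_n, by = "City")
print(Cities)
```

Adding another city then only means extending the `cities` vector.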

Conclusion

After making the first graph of the average bribe amount for each service, we asked ourselves: is there a difference between the amount paid for a service in, for example, Delhi and Mumbai? Now we can say: yes, there are huge differences! We used the bribes for a "birth certificate" as an example, because there are 219 reports on that during the last three weeks, which is the highest number. We didn't really expect such big differences, and it seems that the results aren't really reliable. But we can neither confirm nor dismiss them, because we have more than just one or two reports for each city (Pune: 4; Agra: 44; Bangalore: 33; New Delhi: 4; Mumbai: 10). However, it could really be the case that bribes are less common (and therefore more expensive) in the two most important cities, Delhi and Mumbai, because they receive more national and international attention.


Some general notes

All of our conclusions are somewhat vague. As we said, the analysed period of time is very short and the observations are too few to draw reliable conclusions. But we think that our main messages could nonetheless be true.

sebastianbarfort commented 8 years ago

Good assignment!

Great code, nice maps and I like your overall thoughts (which could perhaps be described in a little more detail).

Keep up the good work!

APPROVED