sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 17: Assignment 2 #51

Closed CarolineMarkMortensen closed 8 years ago

CarolineMarkMortensen commented 8 years ago

---
title: "Assignment 2"
author: "Group 17: Anna Møller, Tamara Møller-Hastrup, Michal Mróz and Caroline Mortensen"
date: "9. nov. 2015"
output: html_document
---


### Necessary libraries and function ------

library("rvest")
library("knitr")
library("dplyr")
library("stringr")
library("zoo")
library("lubridate")
library("sp")
library("RColorBrewer")
library("timeDate")
library("ggplot2")
library("ggthemes")
library("car")

trim.leading <- function (x)  sub("^\\s+", "", x)
### Loading data -----------
css.selector = ".unique-reference , .overview .views , .location , .date , .transaction a , .paid-amount span , .name a , .heading-3 a"

bribe.data <- data.frame(X1=character(),
                         X2=character(), 
                         X3=character(), 
                         X4=character(),
                         X5=character(),
                         stringsAsFactors=FALSE) 

for (i in 1:100) {  # 100 pages x 10 reports per page = 1,000 observations

  link = paste("http://www.ipaidabribe.com/reports/paid?page=",i,"0#gsc.tab=0", sep="")

  print(paste("processing", i, sep = " "))

  bribe.page = read_html(link) %>% 
    html_nodes(css = css.selector) %>% 
    html_text()

  a <- matrix(bribe.page, nrow=length(bribe.page)/8, ncol=8, byrow=TRUE)
  b <- data.frame(a)

  bribe.data <- rbind(bribe.data,b)

  Sys.sleep(1)
  cat(" done!\n")

}
# Load the group's saved copy of the data; this replaces the data frame built by the loop above
bribe.data <- read.csv("https://github.com/michalmroz/SDS/raw/master/Bribesr.csv")
#Extracting Indian cities and states
bribe.data$City <- gsub(",.*","",bribe.data$Place)
bribe.data$State <- gsub(".*,","",bribe.data$Place)

#Converting string date into a date format and adding a weekday variable
bribe.data$Date <- gsub(",","",bribe.data$Date)
Sys.setlocale("LC_TIME", "C")  # "C" yields English month names on any platform ("uk" is Windows-specific)
bribe.data$Date <- as.Date(bribe.data$Date, format="%B %d %Y")
bribe.data$Date <- as.timeDate(bribe.data$Date)
bribe.data$Day <- dayOfWeek(bribe.data$Date)
bribe.data$Date <- as.Date(bribe.data$Date)

# Clearing the "Views" variable
bribe.data$Views <- gsub(" views","",bribe.data$Views)
bribe.data$Views <- as.numeric(bribe.data$Views)

# Clearing the "Amount" variable
bribe.data$Amount <- gsub("Paid INR ","",bribe.data$Amount)
bribe.data$Amount <- gsub(",","",bribe.data$Amount)
bribe.data$Amount <- as.numeric(bribe.data$Amount)
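
The cleaning steps above can be tried on a couple of rows before running them on the full data. The following is a minimal sketch in base R; the example strings are hypothetical and only assume that the scraped fields look like "October 12, 2015", "1234 views" and "Paid INR 1,500".

```r
# Hypothetical raw strings in the shape the scraper is assumed to return.
raw <- data.frame(
  Date   = c("October 12, 2015", "November 3, 2015"),
  Views  = c("1234 views", "56 views"),
  Amount = c("Paid INR 1,500", "Paid INR 200"),
  stringsAsFactors = FALSE
)

# %B needs English month names; the "C" locale provides them.
invisible(Sys.setlocale("LC_TIME", "C"))

clean <- data.frame(
  Date   = as.Date(raw$Date, format = "%B %d, %Y"),
  Views  = as.numeric(sub(" views", "", raw$Views, fixed = TRUE)),
  Amount = as.numeric(gsub(",", "", sub("Paid INR ", "", raw$Amount, fixed = TRUE),
                           fixed = TRUE))
)

# ISO weekday number (1 = Monday), which does not depend on the locale.
clean$WeekdayNumber <- as.integer(format(clean$Date, "%u"))
print(clean)
```

Any value that fails to parse becomes `NA`, so counting `NA`s per column after cleaning is a quick sanity check.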

We have scraped data from the website www.ipaidabribe.com and collected 1,000 observations. As the website shows only 10 reports per page, we had to collect our data by scraping the 100 most recent pages.
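
The page arithmetic can be sketched without touching the network. This assumes, based on the URL pattern used in the scraping loop, that the `page` query parameter is a record offset advancing in steps of 10; the names `reports.per.page`, `n.pages`, `offsets` and `links` are illustrative only.

```r
# Assumption: `page` is an offset that grows by 10 per page of results.
reports.per.page <- 10
n.pages <- 100

offsets <- seq(from = reports.per.page, by = reports.per.page, length.out = n.pages)
links <- paste0("http://www.ipaidabribe.com/reports/paid?page=", offsets, "#gsc.tab=0")

# Number of reports expected if every page is full.
expected.reports <- length(links) * reports.per.page
print(expected.reports)
print(head(links, 3))
```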

### Analyzing data
summary(bribe.data)

bribes.by.state <- bribe.data %>% 
  group_by(State) %>%
  summarise(NumberOfBribes=n()) %>%
  arrange(-NumberOfBribes)

names(bribes.by.state) <- c("NAME_1","NumberOfBribes")
bribes.by.state$NAME_1 <- as.character(bribes.by.state$NAME_1)
bribes.by.state$NAME_1 <- str_replace_all(bribes.by.state$NAME_1, "[\r\n]" , "")
bribes.by.state$NAME_1 <- trim.leading(bribes.by.state$NAME_1)

bribe.data$City <- str_replace_all(bribe.data$City, "[\r\n]" , "")
bribe.data$City <- trim.leading(bribe.data$City)

## Summarizing by department
bribe.data.copy <- bribe.data
bribe.data.copy$Department <- recode(bribe.data.copy$Department, "c('Public Works Department','Revenue','Airports', 'Education','Labour','Stamps and Registration')='Others'", as.factor.result=FALSE)

department.data <- bribe.data.copy %>% 
  group_by(Department, Date) %>%
  filter(Department!="" & Department!=" ") %>%
  filter(Amount<=1000000 & Amount>=500) %>%
#  filter(Date!="2015-10-12") %>%
  summarise(NumberOfBribes=n(), SumOfBribes=sum(Amount)) %>%
  mutate(SumNumber=cumsum(NumberOfBribes), SumBribes=cumsum(SumOfBribes)) %>%
  arrange(Date)

The figure shows which departments receive the most bribes in the given period. Public Works Department, Revenue, Airports, Education, Labour, and Stamps and Registration each received a small number of bribes, so we categorized these departments as "Others". In this period the cumulative number of bribes paid to "Municipal Services" and "Others" increased markedly more than for the other departments.
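
The cumulative count per department that the figure is built on can be illustrated on a few rows. This is a sketch in base R rather than the dplyr pipeline used in the assignment, and the `reports` data frame is hypothetical.

```r
# Hypothetical reports: one row per reported bribe.
reports <- data.frame(
  Department = c("Police", "Police", "Others", "Police", "Others"),
  Date = as.Date(c("2015-10-01", "2015-10-01", "2015-10-02",
                   "2015-10-03", "2015-10-03")),
  stringsAsFactors = FALSE
)
reports$NumberOfBribes <- 1

# Count reports per department and day.
daily <- aggregate(NumberOfBribes ~ Department + Date, data = reports, FUN = sum)

# Sort by date within department, then accumulate within each department.
daily <- daily[order(daily$Department, daily$Date), ]
daily$SumNumber <- ave(daily$NumberOfBribes, daily$Department, FUN = cumsum)
print(daily)
```

The sort before `cumsum` matters: the running total is only meaningful if the rows of each department are in date order.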

s = ggplot(department.data, aes(x=Date, y=SumNumber, colour=Department))
s + geom_line(size=1) +
  theme_minimal()
## Cities with the most bribes per state -------
a <- bribe.data %>% 
  group_by(State, City) %>%
  summarise(NumberOfBribes=n()) %>%
  mutate(rank=rank(NumberOfBribes)) %>%
  arrange(rank) %>%
  filter(rank==max(rank)) %>%
  select(State,City,NumberOfBribes)

## Number of bribes per date

bribes.by.date <- bribe.data %>%
    group_by(Date) %>%
    summarise(Bribes=n()) %>%
    arrange(Date)

The plot below shows how many bribes were reported on each date. There is no apparent pattern in the distribution of reported bribes over time, and no apparent fluctuation by day of the week.

## plotting (without the outlier - the 12th of October)
bribes.by.date %>%
  filter(Date!="2015-10-12") %>%
  plot()

The table below shows on which day of the week the most reports are posted, together with how much the daily count varies.

## average and variance per day of the week ------
bribe.data %>%
  filter(Date!="2015-10-12") %>%
  group_by(Date) %>%
  summarise(Bribes=n()) %>%
  mutate(Day = as.character(dayOfWeek(as.timeDate(Date)))) %>%
  group_by(Day) %>%
  summarise(Mean=mean(Bribes), SD=sd(Bribes), Volatility=SD/Mean) %>%
  arrange(-Volatility)
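
The "Volatility" column above is the standard deviation divided by the mean (the coefficient of variation). A minimal base R sketch on hypothetical daily counts shows the calculation; `counts` and its values are made up for illustration, and weekdays are coded as ISO numbers so the result does not depend on the locale.

```r
# Hypothetical daily report counts for two full weeks, starting on a Monday.
counts <- data.frame(
  Date = as.Date("2015-10-05") + 0:13,
  Bribes = c(4, 6, 5, 7, 3, 9, 2, 6, 6, 5, 9, 5, 3, 2)
)
counts$Day <- format(counts$Date, "%u")  # ISO weekday number, 1 = Monday

# Mean, standard deviation and their ratio for each weekday.
by.day <- do.call(rbind, lapply(split(counts$Bribes, counts$Day), function(x) {
  data.frame(Mean = mean(x), SD = sd(x), Volatility = sd(x) / mean(x))
}))

# Most volatile weekday first.
by.day.sorted <- by.day[order(-by.day$Volatility), ]
print(by.day.sorted)
```

With only a handful of observations per weekday, as here, the standard deviation is a rough estimate, so small differences in volatility should not be over-interpreted.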
## checking for outliers with bribe value -------
bribe.data %>%
  arrange(Amount) %>%
  select(Amount, Date) %>%
  head(10)

bribe.data %>%
  arrange(Amount) %>%
  select(Amount, Date, Details) %>%
  tail(10)

### The amount of money paid in each state ------
amount.by.state <- bribe.data %>%
  filter(Amount<=1000000) %>% # removing the outliers
  group_by(State) %>%
  summarise(Sum=as.numeric(sum(Amount))) %>%
  arrange(-Sum)

names(amount.by.state) <- c("NAME_1", "Sum")
# Clean the state names of this table itself (bribes.by.state is sorted differently)
amount.by.state$NAME_1 <- as.character(amount.by.state$NAME_1)
amount.by.state$NAME_1 <- str_replace_all(amount.by.state$NAME_1, "[\r\n]" , "")
amount.by.state$NAME_1 <- trim.leading(amount.by.state$NAME_1)

## Number of views per day -----
bribe.data <- bribe.data %>%
  mutate(ViewsPerDay=Views/as.numeric((Sys.Date()-Date)))

cor(bribe.data$Amount, bribe.data$ViewsPerDay)

d <- bribe.data %>%
  filter(Amount<=1000000 & Amount>=500)

The plot below shows the relationship between the size of the bribes (on a log scale) and the views per day. Observations below 500 and above 1,000,000 rupees are removed from the data.


plot(d$ViewsPerDay, log(d$Amount))
abline(lm(log(d$Amount)~d$ViewsPerDay),col="green",lwd=1.5)

The correlation is calculated to be:

cor(bribe.data$Amount, bribe.data$ViewsPerDay)

We therefore conclude that there is no significant correlation between the size of a bribe and its views per day.
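
A correlation coefficient on its own does not say whether it is statistically significant. One way to check this is `cor.test()` from base R's stats package, which reports a p-value and a confidence interval. The sketch below runs on simulated data; `d.example` is hypothetical and merely stands in for the filtered data frame `d`.

```r
# Simulated data standing in for the filtered observations.
set.seed(1)
d.example <- data.frame(
  Amount = exp(rnorm(200, mean = 8, sd = 1.5)),
  ViewsPerDay = runif(200, min = 0, max = 5)
)

# Pearson correlation with a test of the null hypothesis of zero correlation.
test <- cor.test(log(d.example$Amount), d.example$ViewsPerDay)

print(test$estimate)  # sample correlation
print(test$p.value)   # p-value under the null hypothesis
print(test$conf.int)  # 95% confidence interval for the correlation
```

A p-value above the chosen significance level (commonly 0.05) would support the statement that the correlation is not significant.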

This bar plot shows the mean number of views per day for bribes given in each state.

## Views per state ------
views.per.state <- bribe.data %>%
  filter(State!="") %>%
  filter(State!=" ") %>%
  group_by(State) %>%
  summarise(Mean=mean(ViewsPerDay)) %>%
  arrange(-Mean)

p = ggplot(views.per.state, aes(x = reorder(State,Mean), y = Mean, fill=State))
p = p + geom_bar(stat="identity")+coord_flip()
p = p + labs(title="Mean number of views for bribes given in each state",
             y="Mean number of views per day", x="State" )
p = p + theme_minimal() + theme(legend.position="none")
p

People who use the website are more likely to look at posts about Arunachal Pradesh than at posts about the other states.

The next section shows different maps of India. The first map depicts the number of bribes paid in each state, the second map shows the total size of all bribes paid in each state and the last map shows the average value of a bribe per state in India.

### Plotting a map  ------

## collect data --> to be downloaded from http://gadm.org/country
##setwd("/Users/caroline/Desktop")
##india <- readRDS("IND_adm1.rds")
library("raster")
india <- getData("GADM", country="IND", level=1)
detach("package:raster") 
## Bribes per state

## merge with the dataset created above
india@data <- left_join(india@data, bribes.by.state, by="NAME_1")
india@data$ln <- log(india@data$NumberOfBribes)

spplot(india, "NumberOfBribes", col.regions = colorRampPalette(brewer.pal(9, "Greys"))(16), 
       col = "#2B2B2B", main = "Number of bribes per state in India")
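
A join on state names leaves `NA` for every state whose name does not match exactly, and such states are drawn without a value on the map. A quick check for unmatched names might look like the sketch below; `map.names` and `scraped.names` are hypothetical stand-ins for `india@data$NAME_1` and the scraped `State` column.

```r
# Hypothetical state names from the map and from the scraped data.
map.names <- c("Karnataka", "Uttar Pradesh", "Assam")
scraped.names <- c(" Karnataka", "Uttar Pradesh\r\n", "Asam")

# Apply the same cleaning as in the analysis: drop line breaks, trim whitespace.
cleaned <- trimws(gsub("[\r\n]", "", scraped.names))

# Names that would not find a partner in the join.
unmatched <- setdiff(cleaned, map.names)
print(unmatched)
```

An empty result means every scraped state will be placed on the map; anything listed needs a spelling fix before the join.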

As the map shows, the highest number of bribes was reported in Karnataka. Note that this state reports about three times as many bribes as any other state.

bribe.data %>%
  group_by(State) %>%
  summarise(a=n()) %>%
  arrange(-a)

The map below shows the total size of all bribes in each state. It shows that there are large differences between states in the size of bribes.

c <- india@data

## Amount per state
india@data <- left_join(india@data, amount.by.state, by="NAME_1")
spplot(india, "Sum", col.regions = colorRampPalette(brewer.pal(9, "Greys"))(16), 
       col = "#2B2B2B", main = "Size of bribes paid per state in India")
head(amount.by.state,5)

We see in the table above that Karnataka reported the highest total amount of bribes, which makes sense, as Karnataka also had the most reported bribes. More interestingly, even though Uttar Pradesh reported only a third as many bribes as Karnataka, the total value of its reported bribes is only 235,271 rupees smaller.

The figure below shows the average size of the bribes paid in each state, and the table lists the states where the average bribe is largest.

Because it is hard to tell the different states of India apart on the map, we construct a table to get a better look at the results. This table shows that the average size of a bribe is largest in Assam.

## Average bribe by state
india@data$Average <- india@data$Sum/india@data$NumberOfBribes
spplot(india, "Average", col.regions = colorRampPalette(brewer.pal(9, "Greys"))(16), 
       col = "#2B2B2B", main = "The average value of a bribe per state in India")

india@data %>%
  select(NAME_1, Average) %>%
  na.omit() %>%
  arrange(-Average) %>%
  head(5)

All in all, we found that most posts on www.ipaidabribe.com are published on Saturdays and that there is no correlation between the size of a bribe and its views. We also saw that visitors to the website are more interested in corruption in the state of Arunachal Pradesh than in the other states. Furthermore, we found that most reported bribes in the period were paid to Municipal Services and "Others", while the other departments remained very stable over the period. In the last part of our assignment we made three maps showing, respectively, the number of bribes, the total value of bribes and the average size of bribes. In the first map only one state stands out, namely Karnataka. In the second map five states stand out clearly: Karnataka, Uttar Pradesh, Madhya Pradesh, Maharashtra and Assam. In the last map the state with the largest average bribe is Assam.

sebastianbarfort commented 8 years ago

Hi Caroline and co,

Good assignment. You write clean and effective R code, that's very nice to see. I would recommend using the ggplot2 package to do all the plotting (then you only need to learn one syntax) but that's ultimately up to you.

APPROVED