sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/
12 stars 17 forks source link

Group 16 - Assignment 2 #54

Closed bjarkedahl closed 9 years ago

bjarkedahl commented 9 years ago

title: "Assignment 2 - group 16" author: "Bjarke Dahl Mogensen, Mikkel Mertz og Benjamin Wedel Mathiasen" date: "9. nov. 2015"

output: html_document

library ("plyr")
library ("dplyr")
library ("rvest")
library ("readr")
library ("knitr")
library ("stringr")
library ("xml2")
library ("ggplot2")
library("mapproj")
link_ipaidabribe = paste("http://www.ipaidabribe.com/reports/paid?page=", 0:99, 0, sep = "")

scrape_ipaidabribe = function (link_ipaidabribe) {
  my.link = read_html(link_ipaidabribe, encoding= "UTF-8")
  location = my.link %>%
    html_nodes(".location " ) %>%
    html_text()
  date = my.link %>%
    html_nodes(".date " ) %>%
    html_text()
  transactiondetail = my.link %>%
    html_nodes(".transaction a" ) %>%
    html_text()
  depname = my.link %>%
    html_nodes(".name a" ) %>%
    html_text()
  title = my.link %>%
    html_nodes(".heading-3 a" ) %>%
    html_text()
  views = my.link %>%
    html_nodes(".overview .views" ) %>%
    html_text()
  payment = my.link %>%
    html_nodes(".paid-amount span" ) %>%
    html_text()
  refnumber = my.link %>%
    html_nodes(".unique-reference" ) %>%
    html_text()
  return (cbind(title, payment, depname, transactiondetail, views, location, date, refnumber))  
}

my.ipaidabribe.data = list() 
for (i in link_ipaidabribe){
  print(paste("processing", i, sep = " "))
  my.ipaidabribe.data[[i]] = scrape_ipaidabribe(i)
    Sys.sleep(0.25)
  cat(" done!\n")
}

df=ldply(my.ipaidabribe.data, data.frame)
df.backup = df
df$payment = gsub(",","",df$payment)
df$payment = as.numeric(word(df$payment, +3)) # generating variable with only the price

Estimation of bribes in India

This analysis seeks to illuminate the structure of bribes in India. We do this using data from the website ipaidabribe.com. This is a website where people can write if they bribed someone. Peoble are able to inform of various aspects of their bribe such as: how much they paid, where they paid it and to whom. Since the data relies on people to report their bribes themselves we are not observing every bribe, but as long as there isn't a bias in the data left out this is not going to affect our main conclusions. You might think of it like this: if the older generation does not report any bribes and this subsection of the pupulation bribes peoble for different reasons this would bias our results. However we do not think this is a problem, so we keep to the information availible to us and even if it is only describing a subsection of the population it is still relevant.

Size of the bribes

When considering the structure of bribes one interesting part to analyze could be the size of the bribes and how much they vary. The mean, median, maximum and minimum of the bribe payments are shown in table 1. as we see there is a very big gab between the highest and lowest payment, but its also worth to notice the very big gab between the mean and median.

_Table 1 - Summary statistics_

summary(df$payment)

Because of the large variation in the size of the amounts being payed we consider the distribution of payments in figure 1. In figure 1 the density is shown to the log of payments to center it. Even within this transformation the distribuion seems to have a very long tail to the right indicating that the majority of the payments are in the smaller area, but there is a few very large bribes.

df$India = "India"
df$logpayments = log(df$payment)
boxplot(logpayments~India, data = df, main = "Distribution of payments",
        ylab = "Log(Payment)")

p = ggplot(data = df, aes(x = payment))
p = p + geom_density() + labs(title = "Figure 1: Distribution of bribe payments",
                              x = "Log(Payment)",
                              y = "Density")
p = p + scale_x_log10()
p = p + theme(panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(),
              panel.background = element_blank(),
              axis.line = element_line(colour = "black"))
plot(p)

Number of bribes

As shown above the size of bribes vary a lot, but who primarily receives the bribe and for what? This is what we are going to consider in the next section. In this section we only consider the number of bribes and leaves out the size of the bribe for a moment. Figure 2 shows the number of bribes reported to each department. Among the departments reported are the police, transportation, municipalities etc.

df.dep = df %>%
  filter(!is.na(depname)) %>% #Removing payments with missing department
  group_by(depname)

p = ggplot(data = df.dep, aes(x = depname))
p = p + geom_histogram() + coord_flip() + labs(title = "Figure 2: Number of bribes to each department", 
                                               x = "Department", 
                                               y = "Number of bribes")
p = p + theme(panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(),
              panel.background = element_blank(),
              axis.line = element_line(colour = "black"))
plot(p)

As figure 2 shows the majority of the bribes goes to municipal service, while "Food, Civil Suppliers and Consumer Affairs" and the police gets second and third most bribes. Since the majority of the bribes goes to municipality serveces it is interesting to see what kind of services the municipalities gets bribed for. This is shown in figure 3.

df.transaction = subset(df, depname=="Municipal Services") %>%
  filter(!is.na(transactiondetail)) %>% #Removing payments with missing transaction detail
  group_by(transactiondetail)

p = ggplot(data = df.transaction, aes(x = transactiondetail))
p = p + geom_histogram() + coord_flip() + labs(title = "Figure 3: What do the municipalities get bribed for", 
                                               x = "Action", 
                                               y = "Number of bribes")
p = p + theme(panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(),
              panel.background = element_blank(),
              axis.line = element_line(colour = "black"))
plot(p)

As indicated by figure 3 the vast majority of the bribes goes to the category "Birth Certificates" and only a very small fraction of the bribes goes to the other categories with the largest being registration of land.

Investigation of the differences between states

So far we have only considered the overall bribes and size of bribes in entire India. In this section however we dive a little bit more into the different states of India. We want to consider the average sum of payments per 10.000 inhabitants in the different states. This is illustrated in figure 4 where we plot the logarithmic transformation of the sum of payements per 10.000 individual in each state. As figure 4 shows there is a great difference in the average payment.

#Scraping populationdata for India from wikipedia:
India_pop =  read_html("https://en.wikipedia.org/wiki/Demographics_of_India") %>%
  html_nodes(xpath= '//*[@id="mw-content-text"]/table[4]') %>%
  html_table( )
df_India_pop=ldply(India_pop, data.frame)
df_India_pop$state_2 = df_India_pop$State...Union.Territory

#creating a state-variable in df
df$State = df$location %>% 
  str_replace_all(pattern = "[A-Z][a-z]*\r\n" , replacement= " ") %>%
  str_trim()
df$state_1 = gsub(",", "", df$State)

#Creating a new dataframe with total payment per state
df_state = df %>% 
  filter(!is.na(state_1) & !(df$state_1 =="")) %>% 
  group_by(state_1) %>% 
  summarise(total.payments = sum(payment, na.rm = TRUE), n_bribes = n() )
df_state$state_2 = gsub("^\\s+", "", df_state$state_1)

?summarise
#By Comparing the data from wiki and the data from ipaidabribe we see that there are some of the state-names that differ. We make sure that at least the states with the biggest total payment are similar in the two dataframes. We find that there is a problem with New Delhi and fix that:
df_state[29,"state_2"]="Delhi"

#Joining the data from wiki with df_state
df_state = inner_join(df_India_pop, df_state, by="state_2")
df_state$pop=gsub(",", "",df_state$Population.24.)
df_state$pop1 = gsub("^\\s+", "", df_state$pop)

#Construction a new variable 'relativepayment'
as.numeric(df_state$pop1)
as.numeric(df_state$total.payments)
df_state$relativepayment = as.numeric(df_state$total.payments) / as.numeric(df_state$pop1) * 100000
arrange(df_state, -relativepayment)

#Plotting the data to se the states with the highest relative bribe payments
p = ggplot(data = df_state, aes(x=state_2, y=relativepayment))
p = p + geom_bar(stat = "identity") + coord_flip() + labs(title = "Figure 4: Payment pr. 100,000 inhabitants", 
                                               x = "State", 
                                               y = "Log of payment pr. 100.000 inhabitants")
p = p + scale_y_log10()
plot(p)

From figure 4 we see, that there is huge variation in the number of bribes in each state. Theoretically there could be a lots of causes of this difference. Factors such as culture, education, demographics, economic situation in the state and urbanization might influence both the number and the size of bribes in a givens state. Since we only have a small part of the available data on ipaidabribe.com and only limited writingspace a adequate analysis of these factors are not possible. Therefore we focus on one of the above mentioned factors - Urbanization.

Urbanization is a relevant issue in India in nearlye every matter regarding economics, corruption, living conditions etc. According to the World Bank 90 mio. people migrated from farms to cities in the last decade.The prospect of higher wages and better living standards is expected to draw 250 million more by 2030.

The question is wether this huge movement of people have any influence on the level of corruption. You could for instance expect, that a side effect of the urbanization would be more employment in job functions that require a higher degree of education than for instance work in the agriculture. Such a development will (over time) lead to a generally higher level of education and since there is kind of consensus around the fact, that higher education often leads to better functioning democratic systems one could assume that the level of corruption would be lower in urban areas than in rural areas.

The purpose of figure 5 is to inveatigate this theory. If the theory is correct, we should se a equalization between the states. We see a litle equalization, but it is difficult to see whether it is a significant change.


#Constructioon new variables for the relative payment/bribes pr. 10.000 urban inhabitants
df_state$urbanpop=gsub(",", "",df_state$Urban.Population.24.)
df_state$urbanpop1 = gsub("^\\s+", "", df_state$urbanpop)
as.numeric(df_state$urbanpop1)
df_state$urban_relativepayment = as.numeric(df_state$total.payments)/
  as.numeric(df_state$urbanpop1)*100000
df_state$share_urban_pop = as.numeric(df_state$urbanpop1)/as.numeric(df_state$pop1)
df_state$Relativebribes = as.numeric(df_state$n_bribes)/as.numeric(df_state$pop1)
df_state$Relativebribes_urban = as.numeric(df_state$n_bribes)/as.numeric(df_state$urbanpop)

#Plotting the data to see the states with the highest relative bribe relative to urban population
p = ggplot(data = df_state, aes(x=state_2, y=urban_relativepayment))
p = p + geom_bar(stat="identity") + coord_flip() + labs(title = "Figure 5: Relative bribe-pay pr. 100,000 urban inhabitants", 
                                               x = "State", 
                                               y = "Log of payment pr. 100,000 urban inhabitants)")
p = p + scale_y_log10()
plot(p)

p = ggplot(data = df_state, aes(x=relativepayment, y=share_urban_pop))
p = p + geom_point() + coord_flip() + labs(title = "Figure 6: Relative payments compared to share of total population living in urban areas", 
                                               x = "Log of payments pr. 100.000 inhabitants", 
                                               y = "Share of population living in urban areas")
p = p + scale_x_log10()
plot(p)

In figure 6 the picture is also pretty blurred. Again we must keep in mind, that this is a really small part of the total data available. If we increased the number of observation we might get another and more robust result. From figure 6 we can not say much, but it is interesting, that the states with the most corruption typically have a urbanization-rate between 25 % and 50 %. Whether this pattern is due to underreporting in rural areas or there is a actual jump in corruption-rates at this interval we leave unanswered.

sebastianbarfort commented 9 years ago

Hi Bjarke and co,

Great assignment. You write clear code and produce nice and intuitive graphs. You could perhaps add a smother to the plot of payments relative to population size, but what you're doing is great.

APPROVED.