sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 6: Assignment 2 #37

Closed BCEgerod closed 8 years ago

BCEgerod commented 8 years ago

title: 'Group 6: Indian Bribes Assignment'
author: 'Ulrik Torp Elberg & Benjamin C.K. Egerod'
date: November 8, 2015

output: html_document

Writing the function to scrape from ipaidabribe.com

In what follows, we investigate how bribery in India varies across departments and states. To do that, we scrape data on local bribery from ipaidabribe.com.

First, we load the relevant R packages for data manipulation, plotting and scraping.

library(rvest); library(reshape); library(reshape2)
library(plyr); library(dplyr); library(ggplot2); library(stringr)

We then use SelectorGadget to define the CSS selector string we are interested in.

# css.selector.
css.selector <- ".date , .paid-amount span , .location , .overview .views , .transaction a , .name a , .heading-3 a"

And paste together a vector of links to loop over when scraping

# link to first page
link <- "http://www.ipaidabribe.com/reports/paid#gsc.tab=0"

# string common to all links
link.part.1 <- rep("http://www.ipaidabribe.com/reports/paid?", times = 100)
link.part.2 <- "#gsc.tab=0" # second part of string common to all links

#paste together with page number to get link for each page of bribe reports
pagenumber <- paste(link.part.1,"page=", seq(from = 10, 1000, by = 10),link.part.2, sep ="")

#put in link to first page
pagenumber[101] <- "http://www.ipaidabribe.com/reports/paid#gsc.tab=0" #this actually gives 1010 bribery reports
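
The same vector of links can also be built in one vectorised call. The sketch below assumes, as the code above does, that the site pages through reports with a `page` offset in steps of 10. `build.links` is a hypothetical helper name, not part of the original script, and it puts the first page first rather than last.

```r
# Hypothetical helper: links for the first page plus n.pages offset pages.
# Assumes the paging scheme used above (page = 10, 20, ..., in steps of 10).
build.links <- function(n.pages = 100, step = 10) {
  base    <- "http://www.ipaidabribe.com/reports/paid"
  suffix  <- "#gsc.tab=0"
  offsets <- seq(from = step, by = step, length.out = n.pages)
  c(paste0(base, suffix),                     # first page has no page offset
    paste0(base, "?page=", offsets, suffix))  # remaining pages
}

links <- build.links()
length(links)  # 101 links
```

Building the offsets with `length.out` keeps the number of pages and the step size in one place, so changing how many pages to scrape does not require editing the index of the first-page link by hand.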

We then write a function that loops over the links and scrapes down the bribery reports and stores them in a list. It then loops over the list, reshapes and cleans the data and returns a single data frame consisting of 1010 bribery reports and the relevant variables.

# define function to scrape 'I Paid a Bribe'
scrape.bribe <- function(links, selector){

    list.of.data <- list() # make empty list to store scraped data

cat("Now scraping data ... \n")    

for(i in 1:length(links)){ # loop over the links that are inputted
  #print(paste("processing", i, sep = " ")) leave out in rmarkdown version

list.of.data[[i]] <- read_html(links[i], encoding = "UTF-8") %>% #open element i from the vector of links
  html_nodes(css = selector) %>% # extract elements of interest from the webpage
  html_text() #convert
  Sys.sleep(1) # wait for a second
  #cat("Done! \n")
}

cat("Finished scraping \n")
cat("Now reshaping data ... \n")

bribe.data3 <- list() # create empty list to store cleaned data

for(j in 1:length(list.of.data)){ # loop over scraped elements

  #print(paste("processing", j, sep = " ")) leave out in rmarkdown version

  dat <- as.data.frame(list.of.data[j]) # convert to df to manipulate easier
  names(dat) <- "bribe.data2"
  dat$var.type <- rep(c(1:7), times = 10) # put in identifier for variable of interest
  # unique identifier for report. Each variable is classified to belong to a report
  dat$report <- rep(c(1:10), each = 7) 

  # currently the variables are saved in a single vector
  # divide that vector up into multiples to consitute a matrix and save as a df
  dat2 <- dcast(dat, report ~ var.type, value.var = "bribe.data2")
  #name the variables
  names(dat2) <- c("report", "nViews", "Title", "Department", "Subject", "Amount", "Date", "City.and.Province")

  # save each of these df's into a list 
  bribe.data3[[j]] <- dat2
  rm(dat); rm(dat2)

  #cat("Done! \n")

}

# bind together all df's in the list to create single df of all pages
final.data <- do.call("rbind", bribe.data3)

  return(final.data)

}

# run the function
data <- scrape.bribe(links = pagenumber, selector = css.selector)
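
The reshaping step inside the function relies on every page returning exactly 70 text elements (10 reports with 7 fields each) in a fixed order. A minimal sketch of the same idea in base R is shown below, run on made-up placeholder values rather than scraped text. `reshape.page` is a hypothetical helper, and it assumes the fields arrive in the order used for the column names above.

```r
# Hypothetical helper: turn one page's flat character vector into a data frame.
# Assumes 7 fields per report, in the order of the column names used above.
reshape.page <- function(page.text) {
  field.names <- c("nViews", "Title", "Department", "Subject",
                   "Amount", "Date", "City.and.Province")
  if (length(page.text) %% length(field.names) != 0) {
    stop("page does not contain a whole number of reports")
  }
  # each row of the matrix is one report, each column one field
  m <- matrix(page.text, ncol = length(field.names), byrow = TRUE)
  out <- as.data.frame(m, stringsAsFactors = FALSE)
  names(out) <- field.names
  out
}

# made-up placeholder values: two reports with 7 fields each
fake.page <- paste0("field", rep(1:7, times = 2), ".report", rep(1:2, each = 7))
reshape.page(fake.page)
```

The explicit length check means that a page with a missing field stops with an error instead of silently shifting values into the wrong columns.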

Finally, to work with the data, we need to convert the amount paid in bribes to numerical values. We also need to convert the date variable to a date format.

# convert amount paid to a numeric variable
data$amountNum <- str_extract(string = data$Amount, "[0-9]{0,3}[,]{0,1}[0-9]{1,3}[,]{0,1}[0-9]{1,3}")

data$amountNum <- as.numeric(gsub(data$amountNum, pattern = ",", replacement = "")) # strip away commas and convert to numeric

# first set locale to "C"
lct <- Sys.getlocale("LC_TIME"); Sys.setlocale("LC_TIME", "C")

# Now convert dates
data$Date <- as.Date(data$Date, format= "%B %d, %Y")
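
As written, the pattern above needs at least two digits to match, so a single-digit amount would come out as missing. The sketch below shows an alternative in base R that takes the first run of digits and commas, and also restores the time locale after parsing the date. `parse.amount` is a hypothetical helper, and the example strings are made up, so the real `Amount` field may be formatted differently.

```r
# Hypothetical helper: pull the first run of digits and commas out of each string
parse.amount <- function(x) {
  m <- regexpr("[0-9][0-9,]*", x)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m > 0        # positions where a number was found
  out[hit] <- regmatches(x, m)    # regmatches returns the matches only
  as.numeric(gsub(",", "", out))  # strip commas, convert to numeric
}

# made-up example strings, not scraped values
parse.amount(c("Paid INR 500", "Paid INR 1,25,000", "no amount given"))

# parse an English month name regardless of the session locale, then restore it
old.locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
as.Date("November 8, 2015", format = "%B %d, %Y")
Sys.setlocale("LC_TIME", old.locale)
```

Strings without any digits come back as `NA`, so unparsed amounts remain visible in the data instead of being dropped.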

Investigating data

In this short data analysis, we focus on two questions: 1) whether there are systematic differences across departments in the amount of bribes paid to them, and 2) whether there are differences across the Indian states.

Bribes across departments of regional government

To do this, we first look at the total amount of bribes paid by department. This is shown in the barplot below.

total.bribe.dep <- data %>%
  group_by(Department) %>%
  summarise(sum(amountNum, na.rm = T))
names(total.bribe.dep)[2] <- "Total.Bribes"

ggplot(total.bribe.dep, aes(x = reorder(Department, Total.Bribes), y = Total.Bribes)) + 
  geom_bar(stat = "identity") + 
  theme_minimal() + 
  coord_flip() +
  labs(x = "Department", y = "Total bribes", title = "Total amount of bribes paid in the period \n by department in descending order")

Five departments stand out: Municipal Services, Income Tax, the Police, Stamps and Registration and, finally, Commercial Tax, Sales Tax, VAT. It should be noted that, among other things, Municipal Services administers the hiring and pay grades of local public employees. The Stamps and Registration department administers, for example, the registration of property sales and ownership among citizens. It seems that the major receivers of bribes, in this time period at least, are the departments which hold some direct sway over people's economic welfare or, like the police, can use violence legitimately.

Also, the services these departments provide are compulsory: as an ordinary citizen, you probably can't get around paying your taxes, and you need to have the property you own registered, but you might be able to bribe your way to a lower tax or fee.

It should also be noted that the literature on the political economy of corruption emphasizes the opacity of regulation and the autonomy of public officials as determinants of bribery levels. Many developing countries have, more or less intentionally, put in place extremely opaque systems of taxation, which no ordinary citizen would understand. This means that bribery can constitute a potentially large part of the salary of public employees in these departments: paying a bribe may "grease the wheels", so that your request is treated faster and the decision turns out in your favour. It essentially becomes necessary for people who don't understand the opaque bureaucracy to pay bribes in order to get things done. Taking these things into consideration, it shouldn't come as a surprise that these departments are the biggest bribe receivers.

However, seeing as the period we consider is relatively short, any of the above departments may by chance have received a single extremely big bribe that sets it apart from the seemingly less corrupt departments. The code below plots the cumulative development in bribes paid to the respective departments throughout the time period. We restrict our attention to the five biggest bribe receivers among the departments.

# now cumulative sums of bribes by department
daily.bribe.dep.cum <- data %>%
  group_by(Department, Date) %>%
  summarise(amountNum = sum(amountNum, na.rm = T)) %>%
  mutate(cumulative = cumsum(amountNum))
names(daily.bribe.dep.cum)[3] <- "Total.Bribes"

daily.bribe.dep.cum$Department[1] <- NA # set white space to missing obs

# Set all departments except the five biggest bribe receivers as missing.
big.five <- c("Municipal Services", "Income Tax", "Police",
              "Stamps and Registration", "Commercial Tax, Sales Tax, VAT")
daily.bribe.dep.cum$Big.bribe.dep <- ifelse(daily.bribe.dep.cum$Department %in% big.five,
                                            daily.bribe.dep.cum$Department, NA)

# Plot cumulative development with NA's removed
ggplot(na.omit(daily.bribe.dep.cum), aes(x = Date, y = cumulative, colour = Big.bribe.dep)) + 
  geom_line() + 
  theme_minimal() + 
  labs(x = NULL, y = "Cumulative bribes", 
       title = "Cumulative amount of bribes paid each day \n among the five biggest bribe receiving departments")
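
A running total per department only follows time if the rows are ordered by date within each department. The sketch below illustrates this in base R on made-up numbers, which are not the scraped data, using `ave` with `cumsum` as a stand-in for the grouped `mutate` above.

```r
# made-up toy data: two departments, three days each, deliberately unsorted
toy <- data.frame(
  Department = c("A", "B", "A", "B", "A", "B"),
  Date = as.Date(c("2015-11-03", "2015-11-01", "2015-11-01",
                   "2015-11-02", "2015-11-02", "2015-11-03")),
  amountNum = c(30, 5, 10, 15, 20, 25),
  stringsAsFactors = FALSE
)

# sort by department and date so the running total follows time
toy <- toy[order(toy$Department, toy$Date), ]

# running total within each department
toy$cumulative <- ave(toy$amountNum, toy$Department, FUN = cumsum)
toy
```

For department A the running total is 10, 30, 60, and for department B it is 5, 20, 45. A department whose curve jumps on a single day is the pattern discussed in the text.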

From this plot, it becomes clear that not all of the biggest bribe-receiving departments continuously receive big amounts during the entire period. Whereas Municipal Services receives bribes more or less steadily throughout the whole period, the other four of the Big Five have only one or two days with big bribes in the sampled period, and these days account for their placement among the top five bribe receivers. This indicates that the pattern may be an artefact of the time period under consideration.

Geographical patterns of bribery

First we plot the total amount of bribes received in each of the states on a map of India.

#we start by loading in some handy packages for working with GIS data
require("rgdal"); library(rgeos); library(maptools); library(mapproj); library(raster)

# we then split the variable containing city and state of the bribe report
geo <- str_split(data$City.and.Province, pattern  = ",")
geo <- do.call(rbind.data.frame, geo)
geo[,3] <- NULL
data <- bind_cols(data, geo)
names(data)[10:11] <- c("City", "Province") 

# there's white space, which makes merging with the GIS-data difficult
# This is an inefficient solution but gets the job done
data$Province2 <- gsub(x = data$Province, pattern = " ", replacement = "")

data$Province2 <- ifelse(data$Province2 == "AndamanandNicobar", "Andaman and Nicobar", 
                         ifelse(data$Province2 == "AndhraPradesh", "Andhra Pradesh", 
                                ifelse(data$Province2 == "ArunachalPradesh", "Arunachal Pradesh", 
                                       ifelse(data$Province2 == "DadraandNagarHaveli", "Dadra and Nagar Haveli", 
                                              ifelse(data$Province2 ==  "DamanandDiu", "Daman and Diu", 
                                                     ifelse(data$Province2 == "HimachalPradesh", "Himachal Pradesh", 
                                                            ifelse(data$Province2 == "JammuandKashmir", "Jammu and Kashmir", 
                                                                   ifelse(data$Province2 == "MadhyaPradesh", "Madhya Pradesh", 
                                                                          ifelse(data$Province2 == "TamilNadu", "Tamil Nadu", 
                                                                                 ifelse(data$Province2 == "UttarPradesh", "Uttar Pradesh", 
                                                                                        ifelse(data$Province2 == "WestBengal", "West Bengal", data$Province2)))))))))))
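
If the white space only sits at the start or end of the province name (an assumption, since splitting on "," would leave the space that follows the comma), trimming it keeps the spaces inside multi-word names and avoids the lookup chain. A minimal sketch on made-up example strings:

```r
# made-up examples of what the split can leave behind
province.raw <- c(" Tamil Nadu", " West Bengal ", "Maharashtra")

# strip leading and trailing white space only; interior spaces are kept
province.clean <- trimws(province.raw)
province.clean
```

Unlike the nested `ifelse` calls, this also handles multi-word names that are not in the hand-written list.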

# download India administrative shapefiles 
india.regions <- getData("GADM", country = "India", level = 1)

# break down shapefile for use in ggplot 
indiaF <- fortify(india.regions, region = c("ID_1"))
indiaF2 <- merge(indiaF, india.regions, by.x = "id", by.y = "ID_1")
names(indiaF2)[12] <- "Province2" 

# calculate total bribes by Indian state
data.for.plot <- data %>%
  group_by(Province2) %>%
  summarise(sum(amountNum, na.rm = T))
#data.for.plot[1] <- NULL
names(data.for.plot)[2] <- "Total.Bribes"

# join together shapefile and bribery data by Indian state
indiaF3 <- left_join(indiaF2, data.for.plot, by = "Province2")

# defines the name of each state as a point by long and lat
cnames <- aggregate(cbind(long, lat) ~ Province2, data=indiaF3, 
                    FUN=function(x)mean(range(x)))

# plots the data on a map
ggplot(indiaF3, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = Total.Bribes)) +
  theme_minimal() +
  labs(x = NULL, y = NULL, title = "Bribes paid in Indian States") +
  geom_text(data=cnames, aes(long, lat, label = Province2), size=3, colour = "red") + # red colour to better distinguish names of the states
  coord_map()
# if we had time, we would've plotted the cities according to bribes received. We couldn't make it work in time for the deadline.

It is interesting to see that Maharashtra, which is known as one of the powerhouses of the Indian economy, is also the state where by far the highest amount of bribes is paid. As runner-up, we find Uttar Pradesh, which in absolute terms is India's third largest economy. If this pattern holds and the biggest economies among the Indian states are generally more corrupt, it would run counter to the literature, which has generally found a negative correlation between corruption and both growth and levels of GDP. Corruption may affect the wealth of a state because, if bribery is necessary to get licenses, proof of ownership etc., this would decrease the incentives to make new investments. Conversely, high GDP would probably increase the salaries of local officials, which would decrease the incentive to supplement their legal income with bribes.

No matter the causal mechanism, it would be puzzling if we found a positive relation. To test it more formally, we scrape GDP per capita of the Indian states and merge it with our bribery data on the state level. We scrape the data from statisticstimes.com, which we don't know much about, so we are not sure about the quality of the data. But its face validity at least seems fine.

#Scrape GDP pr. capita for Indian States
link.gdp <- "http://statisticstimes.com/economy/gdp-capita-of-indian-states.php"
table.id <- "#table_id"

indian.gdp <- read_html(link.gdp) %>% 
  html_nodes(table.id) %>%
  html_table(fill = T)

indian.gdp <- as.data.frame(indian.gdp)
indian.gdp <- indian.gdp[-1,]
names(indian.gdp)[4] <- "GDPcap"
names(indian.gdp)[2] <- "Province2"

gdpState <- left_join(indian.gdp, data.for.plot, by = "Province2")
gdpState$GDPcap <-as.numeric( gsub(gdpState$GDPcap, pattern = ",", replacement = ""))
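
Because the join matches on the exact spelling of the state name, it can be worth checking which names fail to match before plotting. The sketch below uses two made-up toy tables standing in for `indian.gdp` and `data.for.plot`. The names and numbers in them are invented for illustration only.

```r
# made-up toy tables standing in for indian.gdp and data.for.plot
gdp.toy <- data.frame(Province2 = c("Kerala", "Goa", "Delhi"),
                      GDPcap = c(1, 2, 3),
                      stringsAsFactors = FALSE)
bribe.toy <- data.frame(Province2 = c("Kerala", "Goa", "NCT of Delhi"),
                        Total.Bribes = c(10, 20, 30),
                        stringsAsFactors = FALSE)

# states in the GDP table with no counterpart in the bribery data
setdiff(gdp.toy$Province2, bribe.toy$Province2)

# states in the bribery data with no counterpart in the GDP table
setdiff(bribe.toy$Province2, gdp.toy$Province2)
```

Any name returned by either call would end up with a missing value after `left_join`, or be dropped from the merged data altogether.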

Below, we plot the relationship between the total amount of bribes paid in a state and its GDP per capita. Looking at both the linear and the loess fitted lines, we find no relation between wealth and corruption.

ggplot(gdpState, aes(x = GDPcap, y = Total.Bribes)) +
  geom_point() +
  theme_minimal() +
  stat_smooth(method = lm) +
  stat_smooth(method = loess) +
  geom_text(aes(label = Province2), size = 3.2) +
  labs(x="GDP pr. Capita", y = "Total Bribes", title = "Correlation between wealth and bribery")

Below we drop Maharashtra and plot the same figure. After doing this, we find a somewhat negative relationship, but it is still statistically and substantively indistinguishable from zero.

# without Maharashtra
gdpState2 <- gdpState[!(gdpState$Province2 %in% "Maharashtra"), ] # drop by name rather than by row position

ggplot(gdpState2, aes(x = GDPcap, y = Total.Bribes)) +
  geom_point() +
  theme_minimal() +
  stat_smooth(method = lm) +
  stat_smooth(method = loess) +
  geom_text(aes(label = Province2), size = 3.2)+
  labs(x="GDP pr. Capita", y = "Total Bribes", title = "Correlation between wealth and bribery \n Maharashtra excluded")
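
The plots show the fitted lines but not the estimate itself. One way to put a number on the slope is a simple linear model, as sketched below. `slope.summary` is a hypothetical helper, and the toy data frame holds made-up numbers that only demonstrate the call. It is not the scraped data and says nothing about the actual result.

```r
# Hypothetical helper: slope and p-value of a linear fit of bribes on GDP per capita
slope.summary <- function(df) {
  fit <- lm(Total.Bribes ~ GDPcap, data = df)  # rows with missing values are dropped by default
  summary(fit)$coefficients["GDPcap", c("Estimate", "Pr(>|t|)")]
}

# made-up numbers, only to show the call
toy.gdp <- data.frame(GDPcap = c(1, 2, 3, 4, 5),
                      Total.Bribes = c(2, 1, 4, 3, 5))
slope.summary(toy.gdp)
```

Applied to `gdpState` and `gdpState2`, the same call would return the slope and its p-value with and without Maharashtra.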

It should be noted that this bribery data is extremely reliant on internet access. Seeing as internet access is strongly correlated with economic development, it is reasonable to assume that wealthier areas are more connected. This would, all else being equal, make it easier to report bribery through the I Paid a Bribe website in rich states than in poorer ones. It would ultimately lead to non-random measurement error that in all likelihood overestimates corruption in the richer states relative to the poorer ones. This would bias any regression estimate of the association upward. This could be what is driving the close-to-zero association between GDP per capita and amounts paid in bribes that we find.

sebastianbarfort commented 8 years ago

Very good assignment!

APPROVED