sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 13, assignment 2. #40

Closed BobKruithof closed 8 years ago

BobKruithof commented 8 years ago

---
title: "Assignment 2"
output: html_document
---

In this assignment, we start by scraping 1,000 observations from http://www.ipaidabribe.com, a website that aims to measure corruption in India by letting citizens report the bribes they have paid. The data include title, amount, name of department, number of views, city, location, date, and weekday. After scraping the data, we perform a data analysis based on a couple of graphs and tables and one map.

Getting the data from the website:

# Code to scrape the data
library("stringr")
library("readr")
library("lubridate")
library("rvest")
library("XML")
library("tidyr")
library("dplyr")
library("maps")
library("ggplot2")
library("ggmap")
library("grid")
library("gridExtra")
css.selector = c(".views", ".paid-amount", ".heading-3", ".location", ".transaction", ".name", ".date")
var_names = c("Views", "Amount", "Title", "Location", "Transaction", "Department", "Date")

# The site shows 10 reports per page, so the pages are offset by 10: 1, 11, ..., 991
counter = seq(from = 1, to = 1000, by = 10)
for (p in 1:length(counter)) {
  # build the page URL (the "page=" query parameter offsets the report list)
  link = paste("http://www.ipaidabribe.com/reports/paid?page=", counter[p], sep = "")
  page = read_html(link)  # read each page once, then extract all selectors from it
  for (i in 1:length(css.selector)) {
    tmp2 = page %>%
      html_nodes(css = css.selector[i]) %>%
      html_text()
    if (i == 1) {
      tmp = tmp2[2:11]  # the first .views match is not a report, so keep entries 2-11
    } else {
      tmp = cbind(tmp, tmp2)
    }
  }
  if (p == 1) {
    df = as.data.frame(tmp)
  } else {
    df = rbind(df, as.data.frame(tmp))
  }
}
names(df) = var_names
write.csv2(df, file = "dataframe.csv")
df = read.csv2("dataframe.csv")

If we had to scrape the data every time we wanted to work on our assignment, we would run into a couple of issues. First, loading the data takes quite a while, especially on the university WiFi. Second, the observations could differ when someone else runs the code later, so the analysis might no longer fit the data and our conclusions would not be reproducible. We decided that the best way to handle this is to store the data in a dataframe once and load it from file every time we need it.
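A minimal sketch of this caching pattern, assuming the scraping loop above is wrapped in a function we call scrape_bribes() (our name, not part of the original code):

# Scrape once, load from disk thereafter
if (!file.exists("dataframe.csv")) {
  df = scrape_bribes()
  write.csv2(df, file = "dataframe.csv")
} else {
  df = read.csv2("dataframe.csv")
}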

Reading the dataframe:


library("stringr")
library("readr")
library("lubridate")
library("rvest")
library("XML")
library("tidyr")
library("dplyr")
library("maps")
library("ggplot2")
library("ggmap")
library("grid")
library("gridExtra")

#Loads dataframe
df = read.csv2("https://raw.githubusercontent.com/Bob30/Groupwork/master/dataframe.csv")

Continuing to work with the data:


# Making df ready for the analysis
df$Amount = gsub(",", "", df$Amount)  # strip the thousands separators
df$Amount = as.numeric(df$Amount)

# Extracting the city (the first word of Location)
df$City = str_extract(df$Location, "[A-Za-z]+")

# Extracting the area: one-word and two-word patterns for the part after the comma
df$Area1 = str_extract(df$Location, ", [A-Za-z]+")
df$Area2 = str_extract(df$Location, ", [A-Za-z]+ [A-Za-z]+")
df$Area1 = gsub(",", "", df$Area1)
df$Area1 = gsub(" ", "", df$Area1)
df$Area2 = gsub(",", "", df$Area2)
df$Area2 = str_extract(df$Area2, "[A-Za-z]+ [A-Za-z]+")

df$Area = df$Area2

# Use the one-word area name where no two-word name was found
for (i in 1:1000) {
  if (is.na(df$Area[i])) {
    df$Area[i] = df$Area1[i]
  }
}

df$Area[is.na(df$Area)] = "Missing"
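Since tidyr is already loaded, a more compact alternative to the regex steps above is a single separate() call (a sketch, assuming Location always has the form "City, Area"):

# Split "City, Area" in one step; extra = "merge" keeps multi-word area names intact
df = separate(df, Location, into = c("City", "Area"), sep = ", ",
              extra = "merge", fill = "right", remove = FALSE)
df$Area[is.na(df$Area)] = "Missing"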

# Getting lon and lat for cities in India
css.selector="td:nth-child(1)"
Location = read_html("http://www.latlong.net/category/cities-102-15.html") %>% 
  html_nodes(css = css.selector) %>% 
  html_text()

css.selector="td:nth-child(2)"
Lat = read_html("http://www.latlong.net/category/cities-102-15.html") %>% 
  html_nodes(css = css.selector) %>% 
  html_text()

css.selector="td~ td+ td"
Lon = read_html("http://www.latlong.net/category/cities-102-15.html") %>% 
  html_nodes(css = css.selector) %>% 
  html_text()

# Putting it into a dataframe
India = data.frame(Location, Lat, Lon, stringsAsFactors = FALSE)

India$Lat = as.numeric(India$Lat)
India$Lon = as.numeric(India$Lon)

# Extracting the city (the first word of Location)
India$city = str_extract(India$Location, "[A-Za-z]+")

# Extracting the area (one- and two-word patterns, as for df above)
India$Area1 = str_extract(India$Location, ", [A-Za-z]+,")
India$Area1 = gsub(",", "", India$Area1)
India$Area1 = gsub(" ", "", India$Area1)

India$Area2 = str_extract(India$Location, ", [A-Za-z]+ [A-Za-z]*,")
India$Area2 = gsub(",", "", India$Area2)
India$Area2 = str_extract(India$Area2, "[A-Za-z]+ [A-Za-z]+")

India$Area = India$Area2

# Use the one-word area name where no two-word name was found
for (i in 1:50) {
  if (is.na(India$Area[i])) {
    India$Area[i] = India$Area1[i]
  }
}

# Grouping the dataframe by Area and computing summary variables
df2 = df %>%
  group_by(Area) %>%
  summarise(n.Area = n(), m.bribe = mean(Amount, na.rm = TRUE), sum.B = sum(Amount, na.rm = TRUE))

# Merging the variables of interest with lon and lat for India
df3 = left_join(df2, India, by = "Area")
# Finding lon and lat for the areas missing from the latlong.net list
# ("Jammu and" matches the two-word truncation produced by the Area regex above)
Area = c("Arunachal Pradesh", "Chandigarh", "Chhattisgarh", "Delhi", "Jammu and", "Manipur", "Mizoram", "Orissa", "Tripura", "Uttarakhand")
Lat = c(28.21788, 30.733315, 21.278657, 28.613939, 32.719418, 24.663717, 23.164543, 20.951666, 23.940848, 30.066753)
Lon = c(94.72775, 76.779418, 81.866144, 77.209021, 74.733707, 93.906269, 92.937574, 85.098524, 91.988153, 79.019300)

India2=data_frame(Area, Lat, Lon)

df5=left_join(df3, India2, by="Area")

# Filling the missing coordinates from the manual list (df5 has 57 rows)
for (i in 1:57) {
  if (is.na(df5$Lat.x[i])) {
    df5$Lat.x[i] = df5$Lat.y[i]
  }
}

for (i in 1:57) {
  if (is.na(df5$Lon.x[i])) {
    df5$Lon.x[i] = df5$Lon.y[i]
  }
}
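The same NA-filling can also be done without loops, using vectorized indexing (a sketch equivalent to the two loops above):

# Vectorized equivalent of the loops above
df5$Lat.x[is.na(df5$Lat.x)] = df5$Lat.y[is.na(df5$Lat.x)]
df5$Lon.x[is.na(df5$Lon.x)] = df5$Lon.y[is.na(df5$Lon.x)]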

# The joins create several rows per Area (one per matched city), so we collapse
# back to one row per Area; first() keeps the per-Area summaries computed
# earlier instead of re-counting the duplicated rows
df4 = df5 %>%
  group_by(Area) %>%
  summarise(n.Area = first(n.Area), m.bribe = first(m.bribe), sum.B = first(sum.B),
            Lat = mean(Lat.x), Lon = mean(Lon.x))

#Gets the map of India
map=get_map(location="India", zoom = 5)

# Getting population data from Wikipedia
css.selector="td:nth-child(2)"
Area = read_html("https://en.wikipedia.org/wiki/List_of_states_and_union_territories_of_India_by_population") %>% 
  html_nodes(css = css.selector) %>% 
  html_text()

css.selector="td:nth-child(3)"
Population = read_html("https://en.wikipedia.org/wiki/List_of_states_and_union_territories_of_India_by_population") %>% 
  html_nodes(css = css.selector) %>% 
  html_text()

pop = data_frame(Area, Population)

# The figures use comma grouping; try the longer pattern first, then
# fall back to the shorter one where the longer pattern did not match
pop$Population2 = str_extract(pop$Population, "[0-9]+,[0-9]+,[0-9]+")
pop$Population3 = str_extract(pop$Population, "[0-9]+,[0-9]+")

for (i in 1:37) {
  if (is.na(pop$Population2[i])) {
    pop$Population2[i] = pop$Population3[i]
  }
}

pop$Population2 = gsub(",", "", pop$Population2)
pop$Population2 = as.numeric(pop$Population2)

df6=left_join(df4,pop, by="Area")
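As a side note, the two regex patterns and the fallback loop above could be replaced by a single call to readr's parse_number() (a sketch; in recent versions of readr, parse_number strips the grouping commas and should also cope with trailing footnote markers in the Wikipedia cells):

# Parse the comma-grouped population figures in one step
pop$Population2 = parse_number(pop$Population)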

Density of bribes per weekday

We want to investigate whether there is a particular weekday on which the amount of the reported bribes is higher than on the other weekdays.

We do this by plotting the density of the amount per weekday:

# The weekday variable is derived from the scraped date with lubridate
# (assuming the site's "Month dd, yyyy" date format; adjust the parser if it differs)
df$weekday = wday(mdy(df$Date), label = TRUE)

p = ggplot(df, aes(x = Amount, colour = weekday))
p = p + geom_density() + scale_x_log10() + ggtitle("Bribes per weekday") + labs(x = "Amount", y = "Density")
p

The highest density is for Mondays, where both the number of bribes reported and the median are much higher than on the other days: 689 reports with a median amount of 963. We checked the data on the website to make sure this is not a scraping error. It appeared that on one particular Monday (12-10-2015) an enormous number of bribes was reported, which confirms that it is not a scraping error. This can be seen in the table below:

| Weekday   | Number of bribes | Median |
|-----------|-----------------:|-------:|
| Monday    |              689 |    963 |
| Tuesday   |               93 |    500 |
| Wednesday |               52 |    400 |
| Thursday  |               29 |    500 |
| Friday    |               37 |    800 |
| Saturday  |               59 |    400 |
| Sunday    |               41 |    250 |

The number of bribes reported is lowest on Thursdays, while the median is lowest on Sundays. The number of reports does not seem to follow any clear pattern across the remaining days; it looks rather random.
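For reference, a sketch of how the table above can be reproduced with dplyr (using the weekday variable derived before the density plot; the column names are ours):

# Number of reports and median amount per weekday
df %>%
  group_by(weekday) %>%
  summarise(n.bribes = n(), median.amount = median(Amount, na.rm = TRUE))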

Number of bribes per department

We also want to see whether the bribes are concentrated in one or more departments. We do that by summarising the bribes by department and then plotting the number of bribes paid to each department:

# Counting the bribes per department
dfDep = df %>%
  group_by(Department) %>%
  summarise(numberofbribes = n())

p = ggplot(dfDep, aes(x = reorder(Department, numberofbribes),
                      y = numberofbribes))
# after coord_flip() the departments end up on the vertical axis, so the
# x label belongs to Department and the y label to the count
p + geom_bar(stat = "identity") + coord_flip() + ggtitle("Number of bribes per department") + labs(x = "Department", y = "Number of bribes")

The figure shows that most of the bribes are paid to Municipal Services. The number of bribes paid to Food, Civil Supplies and Consumer Affairs, the Police, and Transport is also quite high, while bribes to Revenue, Airports, Water and Sewage, Public Works Departments, and Labour are reported more rarely.

Number of bribes per area

We also want to know in which areas the bribes are paid. The maps below show how many registered bribes are made per region and, for comparison, the population of each region.

# Plotting the number of bribes per area
p = qmplot(Lon, Lat, data = df4, zoom = 6, maptype = "watercolor") + geom_point(aes(size = n.Area), data = df4, alpha = 0.4)

# Plotting the population per area for comparison
p2 = qmplot(Lon, Lat, data = df6, zoom = 6, maptype = "watercolor") + geom_point(aes(size = Population2), data = df6, alpha = 0.6)

grid.arrange(p, p2)

There are two regions, Madhya Pradesh and Uttar Pradesh, where the number of bribes is much higher than in the other areas, but only one of them is relatively more populated than the other regions. That means that some regions are more corrupt than others, or perhaps that some regions simply register more bribes than others. This could be a result of differences in internet access across regions: wealthier regions may have better access to the internet, enabling people there to report more, while in poorer regions people might not be able to report bribes at all due to a lack of internet access.
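One way to make this comparison explicit is to normalise the number of reports by population. A sketch using the df6 dataframe built above (bribes.per.million is our own name, not part of the original analysis):

# Reported bribes per million inhabitants, by area
df6 %>%
  mutate(bribes.per.million = n.Area / (Population2 / 1e6)) %>%
  select(Area, n.Area, Population2, bribes.per.million) %>%
  arrange(desc(bribes.per.million))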

Conclusion

There are a lot of different conclusions one can draw from the data we scraped. One big downside of using this self-reported data is that, for some reason, there was one date with an extreme number of reported bribes. Whether this was an error on the site itself or had some other external cause, we do not know. Things like that make the data less convenient to use.

sebastianbarfort commented 8 years ago

Hello, I get this error when I run your code:

[screenshot: error output, 2015-12-10]
BobKruithof commented 8 years ago

Hello,

Could it be that the error was caused by running the scraping part of the code? We didn't expect anyone to run that part (scraping takes some time, and scraping the data now could lead to different conclusions than the data we used), so we hadn't added any library() calls before it! I have now added those to the code. Or does the error occur in the part after the scraping?

sebastianbarfort commented 8 years ago

It was caused by the scraping part.

Good assignment.

APPROVED