Step 1:
First we make a list of generic pages, one for each page shift.
Afterwards we scrape the links to the individual posts from each of these pages.
library(plyr)  # ldply()
library(rvest) # read_html(), html_nodes(), html_attr(), %>%
link = "" ## Defines the base website (address left blank in the source) ##
loop <- list() ## Generates the list of pages (should be 100)
for(i in seq(from = 10, to = 1050, by = 10)){ ## Loops, taking only every 10th value
  loop[[i/10]] = print(paste("",i-10, sep="")) ## Must start at 0, which is page 1
}
g.sider = ldply(loop) ## From list to data frame
names(g.sider) = c("links") ## Names the variable
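The same page list can also be built without an explicit loop. The sketch below is a minimal base-R version under stated assumptions: `base_url` is a hypothetical placeholder (the real address is not given here), and we assume 100 pages with offsets 0, 10, ..., 990.

```r
# Hypothetical base URL; the real one is not stated in this document.
base_url <- "https://example.org/reports?page="

# Offsets 0, 10, ..., 990: one per results page (offset 0 is page 1).
offsets <- seq(from = 0, by = 10, length.out = 100)

# paste0() is vectorised, so no loop is needed.
g_pages <- data.frame(links = paste0(base_url, offsets),
                      stringsAsFactors = FALSE)

head(g_pages$links, 3)
nrow(g_pages)
```

Using `length.out` makes the intended number of pages explicit, which is easier to check than an upper bound in `seq()`.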
## SCRAPE FUNCTION: scrapes every link on a given page that carries the "read-more" marker ##
link.scraper = function(link) {
  page = read_html(link, encoding = "UTF-8") ## Read the page, then (the names page and hrefs are placeholders)
  hrefs = page %>%
    html_nodes("") %>% ## Select every "Read more" node (selector left blank in the source), then
    html_attr('href') ## Extract the link from its href attribute
  return(cbind(hrefs)) ## Return it and force it (them) into a column
}
## Creates a list and applies the function in a loop over the generic bribe pages. The function's output is stored in the list ##
start = Sys.time() ## Run together with the loop below. Documents the time at which the 1000 most recent posts were found
links.posts = list() # initialize empty list
for (i in g.sider$links[1:nrow(g.sider)]){ ## Loop over the generated list ##
  print(paste("processing", i, sep = " ")) ## Show progress along the way ##
  links.posts[[i]] = link.scraper(i)
  # waiting 10 seconds between hits - cf. their robots.txt
  #Sys.sleep(10) -> Gives duplicates if placed here, so it is only used when scraping the data
  cat(" done!\n")
}
dflinks=ldply(links.posts) ## Converts it to a data frame
## Saves the list of links and the timestamp
save(dflinks, start, file="dflinks.RData")
#### DONE ####
Step 2:
We now extract the information from each of the scraped links.
Finally we rename and mutate the variables and save the data. One problem we encountered was that as.Date uses the computer's locale to read month names, so dates in October returned NA until we replaced "October" with "Oktober".
library(dplyr)   # mutate(), filter()
library(stringr) # str_replace(), str_replace_all()
#1 Transforms the variables
data.frame.endelig = data.frame %>%
  mutate(
    betaling=as.numeric(str_replace_all(betaling,"[^0-9]","")), # Removes all non-numeric characters and converts to a number
    views=as.numeric(str_replace_all(views,"[^0-9]","")),
    date=as.Date(str_replace(date,"October","Oktober"),"%B %d, %Y"), # Converts to a date variable by spelling out the layout of the given date (the column name date is an assumption; it is missing in the source)
    start_scrape=start # Start time of the scrape, for documentation
  ) %>%
  filter(.id != "" ) # Removes observations that are not final
#2 Saves the data
save(data.frame.endelig,file = "data.frame.endelig.RData")
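The locale problem with month names can also be sidestepped without string replacement. The following is a minimal sketch, assuming the dates look like "October 29, 2015"; `parse_english_date` is a hypothetical helper and not part of the original script. It looks the month up in R's built-in `month.name` vector, which is always English, and only then calls as.Date on a purely numeric string.

```r
# Hypothetical helper: parse "Month D, YYYY" with English month names,
# independent of the locale of the machine running the code.
parse_english_date <- function(x) {
  # Split each string into month name, day and year.
  parts <- regmatches(x, regexec("^([A-Za-z]+) ([0-9]{1,2}), ([0-9]{4})$", x))
  iso <- vapply(parts, function(p) {
    if (length(p) != 4) return(NA_character_)   # pattern did not match
    month <- match(p[2], month.name)            # month.name is always English
    if (is.na(month)) return(NA_character_)     # unknown month name
    sprintf("%s-%02d-%02d", p[4], month, as.integer(p[3]))
  }, character(1))
  # A numeric ISO string parses the same way in every locale.
  as.Date(iso, format = "%Y-%m-%d")
}

parse_english_date(c("October 29, 2015", "Oktober 29, 2015"))
```

Strings that do not match the expected layout, or that use a non-English month name, come back as NA instead of raising an error.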
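## Test for duplicates - in total there are 10 duplicates, which is why the final data set has 990 obs.
dubletter = dflinks %>%
  group_by(across(-.id)) %>% ## The grouping column is missing in the source; grouping on every column except .id is an assumption
  filter(n() != 1)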
dubletter2 = data.frame.endelig %>%
group_by(.id) %>%
filter(n() != 1)
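The duplicate check can be illustrated on a small made-up example. This is a sketch in base R, assuming a data frame with a page identifier `.id` and a link column that we call `links` here (the real column name is not shown above).

```r
# Toy data standing in for dflinks: .id is the results page, links the post URL.
toy_links <- data.frame(
  .id   = c("page0", "page0", "page10", "page10"),
  links = c("/post/1", "/post/2", "/post/2", "/post/3"),
  stringsAsFactors = FALSE
)

# Flag every row whose link occurs more than once (both copies are flagged).
is_dup <- duplicated(toy_links$links) |
  duplicated(toy_links$links, fromLast = TRUE)
dup_rows <- toy_links[is_dup, ]

# Keep only the first occurrence of each link.
unique_links <- toy_links[!duplicated(toy_links$links), ]

dup_rows
nrow(unique_links)
```

Here "/post/2" appears on two consecutive pages, so both rows are reported as duplicates and one of them is dropped from the de-duplicated data.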
df = data.frame.endelig
df$city <- (str_split_fixed(df$location, ",", n=2)[,1])
df$region <- (str_split_fixed(df$location, ",", n=2)[,2])
Brief Data Analysis
After running a summary in the console we notice that payments (betaling) have some huge outliers. Furthermore, we notice that views and weekdays might also give us some interesting results.
The outliers in payments are confirmed by a histogram.
The solution is to keep only payments below 1000000 Rp., approx. 100000 kr.
It seems implausible that people would report bribes above this threshold, as it would be very easy for the bribed to identify the briber.
A longer study should include a discussion of the type of data (crowdsourced data).
library(ggplot2) # ggplot(), geom_histogram(), scale_x_log10()
p = ggplot(data = df, aes(x = betaling)) # data & aesthetics
p = p + geom_histogram() #add geom
p + scale_x_log10() #add log-scale
#Shows us that we have some huge outliers.
df2 = filter(df, betaling<1000000)
p = ggplot(data = df2, aes(x = betaling)) # data & aesthetics
p = p + geom_histogram() #add geom
p + scale_x_log10() #add log-scale
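To see why a few extreme payments matter, it helps to compare the mean and the median before and after applying the cut-off. The numbers below are made up for illustration and are not taken from the scraped data.

```r
# Made-up payments, including two extreme values.
payments <- c(100, 500, 1000, 1000, 2500, 5000000, 90000000)

threshold <- 1000000  # same cut-off as in the filter on betaling
kept <- payments[payments < threshold]

# Compare location measures before and after trimming.
c(mean_all = mean(payments), median_all = median(payments))
c(mean_kept = mean(kept), median_kept = median(kept))
```

In this toy example the median is unchanged by the trimming, while the mean falls by several orders of magnitude, which is why the untrimmed histogram is dominated by the outliers.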
To get an overview of the data, which consists of several character variables, we add a short summary.
From the summaries we notice the following:
- E.g. types has Birth Certificates and Issue of Ration Card as the most observed values. This might give us a clue on where to focus our attention when directing policy on the matter.
- Monday is by far the day with the most observations.
We will try to look into the last bullet.
The question is:
Why is Monday the most observed weekday, with around six times more observations than the second most observed day?
One reason could be that ration cards and birth certificates are primarily handed out on Mondays.
Below we see that this is primarily the case with ration cards.
Another reason could be that, if corruption occurs mostly during office hours, there might be a peak after the weekend, for the same reason that you should never call a hotline on a Monday.
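Whether a given type of bribe is concentrated on Mondays can be checked with a simple cross-tabulation. The sketch below uses made-up rows; the column names `date` and `type` are assumptions for illustration.

```r
# Toy reports: 2015-11-02 was a Monday, the two other dates are Tuesday and Wednesday.
toy <- data.frame(
  date = as.Date(c("2015-11-02", "2015-11-02", "2015-11-02",
                   "2015-11-03", "2015-11-04")),
  type = c("Issue of Ration Card", "Issue of Ration Card", "Birth Certificates",
           "Birth Certificates", "Issue of Ration Card"),
  stringsAsFactors = FALSE
)

# "%u" gives the ISO weekday number (1 = Monday), independent of locale.
toy$is_monday <- format(toy$date, "%u") == "1"

# Counts of each type on Mondays versus other days.
tab <- table(toy$type, toy$is_monday)

# Row-wise shares: the fraction of each type that was reported on a Monday.
prop.table(tab, margin = 1)
```

Comparing the row-wise shares across types shows whether the Monday peak is driven by one type or is common to all of them.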
We can draw the density functions split by weekday. This might give us further information.
What we notice from the graph below is that payments on Mondays stand out here as well. They follow a density function which is very narrow around 1000. This could mean several things:
Either someone could be trying to manipulate the data,
or Monday might simply be better represented, catching up from the weekend.
Whatever the reason, one will have to be creative to overcome the problems related to self-reported data.
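library(lubridate) # wday()
df2$weekday = wday(df2$date, label = TRUE) # The column name date is an assumption; it is missing in the source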
p = ggplot(df2, aes(x = betaling, colour = weekday))
p + geom_density() + scale_x_log10()
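How narrow a density is can also be put into numbers. A minimal sketch on made-up data: for each weekday we compute the median payment and the interquartile range of the log10 payments, matching the log scale of the plot.

```r
# Made-up payments for two weekdays.
toy <- data.frame(
  weekday  = c("Mon", "Mon", "Mon", "Tue", "Tue", "Tue"),
  betaling = c(900, 1000, 1100, 100, 1000, 10000)
)

# Median payment per weekday: where the density is centred.
centre <- tapply(toy$betaling, toy$weekday, median)

# IQR of log10(payment) per weekday: a small value means a narrow density.
spread <- tapply(log10(toy$betaling), toy$weekday, IQR)

centre
spread
```

In this toy data both days are centred on 1000, but the Monday spread is far smaller, which is the pattern a narrow density curve corresponds to.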
title: "Assignment 2"
author: "Group 25"
date: "9. nov. 2015"
output: html_document
In this assignment we are asked to scrape data from