Closed by adamingwersen 9 years ago
Really good assignment!
You cover a lot of ground, which is really cool. I like the map and also appreciate that you do the regression-type plot in the end, although I would recommend doing everything in ggplot2 (using stat_geom).
Keep up the good work!
APPROVED
---
title: "Assignment 2 - Social Data Science"
author: "Group 15"
date: "November 9, 2015"
output: html_document
---
Scrape & Analysis of www.ipaidabribe.com
We scrape the website and clean the gathered data primarily using the R packages stringr, plyr, dplyr and rvest. We then move on to a light econometric analysis paired with geo-spatial data analysis using ggplot2, countrycode etc. This assignment can be viewed in html here
Scraping the website: Steps 1-2
The website is constructed such that each page of recorded bribes contains 10 posts. Selector-gadget did not prove useful for identifying the underlying html table. However, each page of 10 posts has only one unique URL element: http://www.ipaidabribe.com/reports/paid?page={**any integer**}. Thus we took the following approach: create a vector of integers from 0 to 1000 in intervals of 10, and insert 0, 10, 20 etc. into the URL element that determines the page number via a loop. Here, looping with plyr was deemed useful, as its output is simple to coerce into a dataframe and it is rather fast.
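The URL construction described above can be sketched as follows; the sequence 0, 10, ..., 990 is an assumption that yields exactly the 100 pages mentioned below:

```r
# Page offsets: 0, 10, 20, ..., 990 -- each URL returns a page of 10 posts
page.numbers <- seq(from = 0, to = 990, by = 10)
urls <- paste0("http://www.ipaidabribe.com/reports/paid?page=", page.numbers)
```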
We have now obtained 100 pages, each containing 10 posts. Directly from these pages, the post information we are interested in is accessible given the appropriate CSS selectors:
Defining a function and creating a loop for iterating through all observations in num.link.li2. We make use of rvest in order to fetch the data.
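A minimal sketch of such a scraping function, assuming num.link.li2 holds the page URLs; the CSS selectors shown here are placeholders, not the ones actually used:

```r
library(rvest)
library(plyr)

# The CSS selectors below are hypothetical -- inspect the live page for the real ones
scrape.page <- function(url) {
  page <- read_html(url)
  data.frame(
    title = page %>% html_nodes(".heading-3 a") %>% html_text(),
    date  = page %>% html_nodes(".date") %>% html_text(),
    stringsAsFactors = FALSE
  )
}

# ldply loops over all URLs and binds the per-page results into one dataframe
posts <- ldply(num.link.li2, scrape.page)
```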
Cleaning gathered data, preparation & external data : Step 3-4
When the 1,000 posts have been obtained, we coerce the list into a dataframe using ldply([data], data.frame). From here we need to do basic data cleaning, primarily utilizing functions from stringr, in order to obtain a tidy, interpretable dataframe.
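The coercion and a couple of typical stringr cleaning steps might look like this; the object and column names are assumptions:

```r
library(plyr)
library(stringr)

# post.list: the scraped list of posts (assumed name)
bribes <- ldply(post.list, data.frame)

# Strip whitespace, and pull the numeric amount out of strings like "Paid INR 5,000"
bribes$city   <- str_trim(bribes$city)
bribes$amount <- as.numeric(str_replace_all(str_extract(bribes$amount, "[0-9,]+"), ",", ""))
```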
It is, however, obvious that this dataframe needs further manipulation/cleaning:
It may prove insightful to add external data to the existing dataframe, such as literacy. Literacy may affect the number of bribes in a given city, and it will certainly affect the number of bribes reported on ipaidabribe, since illiteracy is a barrier to actually adding a post to the page, especially as the webpage is in English. One could therefore imagine that cities with a high literacy rate experience a relatively low number of reported bribes; on the other hand, one could argue that high literacy rates would at the same time coincide with many bribes being reported. Here, data from Wikipedia on literacy in India is used, imported as .csv:
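Importing and merging the literacy data could be sketched as follows; the file name and the merge key are assumptions:

```r
# Literacy rates by Indian state, taken from Wikipedia and saved locally as CSV
literacy <- read.csv("india_literacy.csv", stringsAsFactors = FALSE)

# Attach literacy rates to the bribe data; unmatched rows get NA
bribes <- merge(bribes, literacy, by = "state", all.x = TRUE)
```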
Analysis
Exploring geographical and time-dependent relationships within the generated data-frame
Mapping geo-spatial data: Step 5
In order to visualize discrepancies in post views and bribes paid across cities in India, a map seems ideal. The ggmap package allows for visualizations using Google's Maps service, with overlays plotted on top. Using the maps package we fetch the geo-coordinates needed.
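A sketch of such a map, assuming a dataframe city.data with lon, lat, population and views columns (the dataframe name, the columns and the zoom level are all assumptions):

```r
library(ggmap)
library(ggplot2)

# Fetch a Google Maps base layer centred on India
india <- get_map(location = "India", zoom = 5)

# Orange dots scaled by population, blue dots scaled by post views
ggmap(india) +
  geom_point(data = city.data, aes(x = lon, y = lat, size = population),
             colour = "orange", alpha = 0.4) +
  geom_point(data = city.data, aes(x = lon, y = lat, size = views),
             colour = "blue")
```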
This map illustrates population and views of posts by city:
From the graph, there seems to be a correlation between large cities, i.e. the less transparent orange dots, and the number of post views for the respective cities, the filled blue dots. This can stem from many different sources: larger cities may simply have more posts and hence more aggregated post views, or people may be more interested in reading about bribes in their own city, which would also give larger cities more post views, among other explanations. If there were room and time for deeper visualization, these points would definitely be worth looking into, e.g. plotting the average number of views per city against the population, or against the literacy rate we gathered for our dataframe.
One could imagine that more literate cities would experience fewer bribes than illiterate ones, but as seen from this table, there is no clear relationship between literacy and the number of bribes paid. If we look at the cities with the highest literacy ratio, i.e. the most literate cities, there are only a few bribes in each city. If we instead look at the top 10 cities by number of bribes, the picture is not as clear as in the former table: the cities in the top 10 of bribes are neither the most nor the least literate cities. We therefore have no clear evidence from the collected data that there will be more bribes in a non-literate city, and since the number of views must be closely related to the degree of literacy, there is no clear connection between views and number of bribes either.
Other data visualizations: Step 6
Another interesting plot is to look at the days on which bribes are posted; in other words, we want to see whether there is any time dependency in the usage of the website. We extract weekdays, setting the OS-derived locale to English in order to get English weekday names.
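The locale switch and weekday extraction could be sketched like this; the date format string is an assumption about how the site prints dates:

```r
# Force English weekday names regardless of the OS locale
# ("English" works on Windows; "en_US.UTF-8" on most Unix-like systems)
Sys.setlocale("LC_TIME", "English")

# Parse the post date and extract the weekday name
bribes$weekday <- weekdays(as.Date(bribes$date, format = "%B %d, %Y"))
```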
Apparently Mondays are popular. We will investigate what drives this tendency: geography or department.
Some regions seem to follow the overall trend, while others only post on e.g. Saturdays - perhaps we should pick out the most populous regions, as the differences in population between them are rather large. The larger regions appear to be more representative in terms of this particular tendency.
Picking out all regions with population > 50M and conducting a simple regression with the stated model:
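A sketch of the subsetting and regression; the variable names and the model formula shown here are illustrative assumptions, not the exact specification used:

```r
# Keep only regions with more than 50 million inhabitants (variable name assumed)
large <- subset(bribes, region.population > 50e6)

# Illustrative model: post views regressed on bribe amount
fit <- lm(views ~ amount, data = large)
summary(fit)
```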
The relationship seems dubious, though stronger than when the dataframe is not subset to the 50M+ regions.