BobKruithof closed this issue 8 years ago
Hello, I get this error when I run your code:
Hello,
Could it be that the error was caused by running the code for the scraping? We didn't expect this (running the scraper takes some time, and if you scrape the data now it could lead to different conclusions than with the data we used), so we didn't add any library code before that part! I have now added it to the code. Or does the error occur in the part after the scraping?
It was caused by the scraping part.
Good assignment.
APPROVED
---
title: "Assignment 2"
output: html_document
---
In this assignment, we start by scraping 1,000 observations from http://www.ipaidabribe.com, a website that aims to measure corruption in India. The data include the title, amount, name of the department, number of views, city, location, date, and weekday of each reported bribe. After scraping the data, we perform a data analysis based on a couple of graphs/tables and one map.
Getting the data from the website:
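The scraping chunk itself is not reproduced here, but the parsing step can be sketched in Python (the assignment's own code was in R). The HTML below is a made-up stand-in for the site's markup; the real class names and page structure on http://www.ipaidabribe.com may well differ.

```python
import re

# Made-up stand-in for one listing page; the real markup may differ.
SAMPLE_PAGE = """
<div class="report">
  <h3 class="title">Paid bribe for driving licence</h3>
  <span class="amount">Paid INR 500</span>
  <span class="department">Transport</span>
  <span class="city">Bangalore</span>
</div>
<div class="report">
  <h3 class="title">Bribe for birth certificate</h3>
  <span class="amount">Paid INR 1200</span>
  <span class="department">Municipal Services</span>
  <span class="city">Mumbai</span>
</div>
"""

def parse_reports(html):
    """Return one dict of fields per <div class="report"> block."""
    reports = []
    for block in re.findall(r'<div class="report">(.*?)</div>', html, re.S):
        # Each tagged field becomes a key/value pair: title, amount, ...
        fields = dict(re.findall(r'class="(\w+)">([^<]+)<', block))
        # Keep the amount as an integer number of rupees.
        fields["amount"] = int(re.search(r"(\d+)", fields["amount"]).group(1))
        reports.append(fields)
    return reports
```

In the actual scrape, such a parser would be run over each paginated listing page until 1,000 observations were collected.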
If we had to scrape this data every time we wanted to work on our assignment, we would run into a couple of issues. First, loading the data takes quite a while each time, especially on the university WiFi. Second, the observations used for the analysis could differ whenever someone else runs the code, so for them the analysis might no longer fit the data, which would make our conclusions wrong. We therefore decided that the best way to handle this is to store the data in a dataframe once and load it every time we need it.
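This store-once, load-later workflow can be sketched as follows (a Python illustration; the assignment's own code was in R, and the rows and file name below are placeholders):

```python
import csv
import os
import tempfile

def save_bribes(rows, path):
    """Write the scraped observations to disk once."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

def load_bribes(path):
    """Reload the stored observations instead of re-scraping."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Placeholder rows standing in for the 1,000 scraped observations.
rows = [
    {"title": "Bribe for licence", "amount": "500", "department": "Transport"},
    {"title": "Bribe for permit", "amount": "1200", "department": "Police"},
]
path = os.path.join(tempfile.gettempdir(), "bribes.csv")
save_bribes(rows, path)
loaded = load_bribes(path)
```

The scrape then runs once; every later knit of the document reads the stored file, so everyone works from exactly the same observations.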
Reading the dataframe:
Continuing working on the data:
Density of bribes per weekday
We want to investigate whether there is a particular weekday on which the amounts of the reported bribes are higher than on the other weekdays.
We do this by plotting the density of the amount per weekday:
The highest density is for Mondays, where both the number of bribes reported and the median are much higher than on the other days: the number of bribes is 689 and the median is 963. We checked the data on the website to make sure this is not a scraping error. It appeared that on one particular Monday (12-10-2015) an enormous number of bribes was reported, confirming it is not a scraping error. This can be seen in the table below:
The number of bribes reported is lowest on Thursdays, while the median is lowest on Sundays. The number of reports does not seem to follow any clear pattern; it appears to be rather random.
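The count-and-median comparison above can be sketched like this (Python, with toy numbers; the actual figures of 689 Monday reports and a Monday median of 963 came from the scraped data):

```python
from collections import defaultdict
from statistics import median

# Toy observations as (weekday, amount) pairs; the real data had
# 1,000 scraped reports with many more fields.
bribes = [
    ("Monday", 963), ("Monday", 1200), ("Monday", 500),
    ("Thursday", 100),
    ("Sunday", 40), ("Sunday", 60),
]

# Group the amounts by weekday.
by_day = defaultdict(list)
for day, amount in bribes:
    by_day[day].append(amount)

# Compute the two statistics compared in the text: the number of
# reports and the median amount per weekday.
summary = {day: {"n": len(a), "median": median(a)} for day, a in by_day.items()}
```

The density plot itself would be drawn from the same grouped amounts, one curve per weekday.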
Number of bribes per department
We also want to see whether the bribes are concentrated in one or more departments. We do this by summarising the bribes by department and then plotting the number of bribes paid to each department:
The figure shows that most of the bribes are paid to Municipal Services. The numbers of bribes targeting Food, Civil Supplies and Consumer Affairs, the Police, and Transport are also quite high. Bribes paid to Revenue, Airports, Water and Sewage, Public Works Departments, and Labour were rarer.
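The per-department summary behind such a figure is a plain tally; a Python sketch with toy data (the assignment did the equivalent summarising in R):

```python
from collections import Counter

# Toy department column; the real one came from the scraped reports.
departments = [
    "Municipal Services", "Municipal Services", "Municipal Services",
    "Police", "Police", "Transport", "Revenue",
]

# Count reports per department.
counts = Counter(departments)

# Sort from most to least reported, matching a descending bar chart.
ranked = counts.most_common()
```

Sorting the tally before plotting makes the concentration in a few departments immediately visible in the bar chart.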
Number of bribes per area
We also want to know in which areas the bribes are paid. The map shows how many bribes were registered per region.
There are two regions, Madhya Pradesh and Uttar Pradesh, where the number of bribes is much higher than in the other areas, but only one of the two is relatively more populated than the other regions. That suggests that some regions are more corrupt than others, or perhaps that some regions simply register more bribes than others. This could be a result of differing internet access across regions: wealthier regions could have better access to the internet, enabling people to report more, while in poorer regions people might not be able to report any bribes at all.
Conclusion
There are a lot of different conclusions you can draw from the data we scraped. One big downside of using this self-reported data is that, for some reason, there was one date with an extreme number of reported bribes. Whether this was an error on the site itself or had some other external cause, we do not know. Either way, issues like that make the data less reliable to use.