Closed by Danielsloth 8 years ago
Ok assignment. I like that you log the count variable to make it more comparable in your plots. There are many things you could have considered (geographic distribution, measurement error, selection bias) that I would have liked to see you spend more time on.
APPROVED
title: "Assignment2_group19"
output: html_document
We start our code by loading the needed packages.
After the packages have been loaded, we define our function. The function scrapes the website given by the link, which is its only input. It returns a dataframe that contains the variables of interest.
This section loops over all pages of the website so that we collect every report. In each iteration the loop inserts the desired page number into the link, e.g. for i=10 it places 10 after the ?page= and then appends the string #gsc.tab=0. After the loop we convert our variable 'amount' to numeric and remove outliers by dropping any amount over 10,000,000.
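The paging and cleaning steps can be sketched as follows. This is a minimal Python illustration, not the code used in the assignment: `BASE_URL` is a placeholder, and the helper names `page_url` and `clean_amounts` are hypothetical. It only shows how the page number is spliced into the link and how amounts above the cutoff are dropped after conversion to numbers.

```python
BASE_URL = "https://example.org/reports"  # placeholder, not the real site
CUTOFF = 10_000_000                       # upper limit described in the text


def page_url(page):
    """Build the link for one page: number after ?page=, then #gsc.tab=0."""
    return f"{BASE_URL}?page={page}#gsc.tab=0"


def clean_amounts(raw_amounts, cutoff=CUTOFF):
    """Convert scraped amounts to numbers and drop values above the cutoff."""
    cleaned = []
    for raw in raw_amounts:
        try:
            value = float(str(raw).replace(",", ""))
        except ValueError:
            continue  # skip entries that are not numeric
        if value <= cutoff:  # also discards NaN, since NaN <= cutoff is False
            cleaned.append(value)
    return cleaned


# Hypothetical page numbers, only to show the loop over pages.
urls = [page_url(i) for i in (0, 10, 20)]
```

Keeping the cutoff as a named constant makes it easy to rerun the analysis with a different limit and check how many observations are lost.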
In this section we create plots from the retrieved data.
The first is just a summary table of our data, to find out what the overall mean, max etc. are. The first thing to notice is the minimum bribe, which amounts to 1 rupee. Granted, we have not adjusted for purchasing power, but it is only about 10 øre, which is an awfully small amount. It could be a misreport, but given our limited knowledge of the world of bribery we will allow it. The median bribe is 800, which means that half of the reported bribes are 800 or less; this is very low compared to the average of 31730, so the distribution is strongly right-skewed. The max equals our arbitrary cutoff of 10,000,000, which could indicate that, even though we lose only a few observations by using our upper limit, the limit may be too low. This will not be investigated further.
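To make the gap between the median and the mean concrete, here is a small Python sketch on made-up toy values (not the scraped data). The function name `summarize` is hypothetical. It shows that a single large amount pulls the mean far above the median, and that the median is not the same thing as the most frequent value (the mode).

```python
from statistics import mean, median, multimode


def summarize(amounts):
    """Return basic summary statistics for a list of amounts."""
    return {
        "min": min(amounts),
        "median": median(amounts),
        "mean": mean(amounts),
        "mode": multimode(amounts),  # most frequent value(s)
        "max": max(amounts),
    }


# Toy values chosen only for illustration.
toy_amounts = [1, 500, 500, 800, 1000, 2000, 250000]
summary = summarize(toy_amounts)
```

In this toy example the median is 800 while the most common amount is 500, and the one very large value lifts the mean above 36000.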
Now we look at the dispersion over dates. That is, we investigate when the reports were submitted. There is one clear outlier, namely October 12th with over 500 reports. The reason for this is unclear, since all other dates have fewer than 50.
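Counting reports per date and flagging unusually busy days could be sketched like this in Python. The date labels and counts are toy values, and both the helper `unusual_dates` and the rule "more than `factor` times the median daily count" are assumptions made for illustration, not the method used in the report.

```python
from collections import Counter
from statistics import median


def unusual_dates(counts, factor=10):
    """Return dates whose report count exceeds factor times the median count."""
    typical = median(counts.values())
    return sorted(d for d, n in counts.items() if n > factor * typical)


# Toy data: one day with far more reports than its neighbours.
toy_dates = ["10-11"] * 3 + ["10-12"] * 60 + ["10-13"] * 4
counts = Counter(toy_dates)
flagged = unusual_dates(counts)
```

A flagged date is only a starting point; whether it reflects a campaign, a batch upload, or a data error would still have to be checked by hand.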
The next thing we investigate is the distribution of bribes. We look at log(amount) since there are a few outliers despite our upper limit. The distribution appears roughly bell-shaped, but it is interesting to note that there are what appear to be "steps" in it. That is, if there are a lot of observations at e.g. 4, the count falls a little over the next couple of steps, only to grow again later. This could be due to a sort of bribe pricing mechanism. Maybe it is standard to pay 500 for a driver's license, and then there is a gap until you hit the typical price of a birth certificate, which could be maybe 800. The reason is unclear, but it is an interesting distribution.
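The "steps" are easier to reason about with a small example. The Python sketch below bins the natural log of some toy amounts (not the scraped data); the helper `log_histogram` and the bin width of 0.5 are assumptions for illustration. Because people tend to report round amounts, the counts pile up in a few bins with gaps in between.

```python
import math
from collections import Counter


def log_histogram(amounts, width=0.5):
    """Count amounts per bin of log(amount); keys are the lower bin edges."""
    bins = Counter()
    for amount in amounts:
        if amount <= 0:
            continue  # the log is undefined for non-positive amounts
        edge = math.floor(math.log(amount) / width) * width
        bins[edge] += 1
    return dict(bins)


# Toy values clustered on round amounts.
toy_amounts = [100, 500, 500, 500, 800, 800]
hist = log_histogram(toy_amounts)
```

Here the three payments of 500 land in one bin, the two payments of 800 in the next, and nothing falls between 100 and 500, which mimics the stepped shape described above.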
Next we look at the distribution within a couple of the departments to see whether there are major differences in the amounts paid. The blue one shows the bribes for the municipal services, while the red one shows the police. It is interesting to see that the dispersion is greater in the municipal services, which could be due to the wider range of tasks they handle. Bribes to the police could be paid to avoid a ticket or to prevent police harassment. The municipal services take care of birth certificates (which seems to be a major one), property registration, taxes etc.
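One way to put a number on the difference in dispersion is to compare the standard deviation of log(amount) per department. The Python sketch below uses toy values and a hypothetical helper `log_spread_by_group`; it is an illustration of the comparison, not a result from the data.

```python
import math
from collections import defaultdict
from statistics import stdev


def log_spread_by_group(records):
    """Return the sample standard deviation of log(amount) per department."""
    groups = defaultdict(list)
    for department, amount in records:
        groups[department].append(math.log(amount))
    # stdev needs at least two observations per group
    return {dep: stdev(vals) for dep, vals in groups.items() if len(vals) > 1}


# Toy records as (department, amount) pairs.
toy_records = [
    ("police", 200), ("police", 300), ("police", 500),
    ("municipal", 100), ("municipal", 800), ("municipal", 5000),
]
spread = log_spread_by_group(toy_records)
```

Working on the log scale keeps a handful of very large bribes from dominating the comparison.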