tpemartin / 110-1-r4ds-main


Exercise 5.4 Import the `wdi` data from 4.8 Exercise-5 and obtain `iso2c_nonCountry` from Exercise 4.19 #64

Open tpemartin opened 2 years ago

tpemartin commented 2 years ago

Import the wdi data from 4.8 Exercise-5 and obtain iso2c_nonCountry from Exercise 4.19

  1. The following code removes any non-country entries from the data.

    data_set <- wdi$data
    iso2c_nonCountry <- c('ZH','ZI','1A','S3','B8','V2','Z4','4E','T4','XC','Z7','7E','T7','EU','F1','XE','XD','XF','ZT','XH','XI','XG','V3','ZJ','XJ','T2','XL','XO','XM','XN','ZQ','XQ','T3','XP','XU','XY','OE','S4','S2','V4','V1','S1','8S','T5','ZG','ZF','T6','XT','1W')
    pick_countries <- !(data_set$iso2c %in% iso2c_nonCountry)
    data_set[pick_countries, ]

Use it to create a function remove_nonCountries so that any data frame with an iso2c column, say df_example, can be passed to the function as follows to remove those non-country entries.

    df_example <- remove_nonCountries(data_set=df_example)

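One possible way to wrap the filtering code above into a function is sketched below. It is only an illustration, not the official solution: making `iso2c_nonCountry` a default argument and checking for the `iso2c` column are assumptions, and the toy data frame is made up for the demonstration.

```r
# A minimal sketch of remove_nonCountries (one possible answer, not the official one).
# The default code list is copied from the exercise text.
remove_nonCountries <- function(
  data_set,
  iso2c_nonCountry = c('ZH','ZI','1A','S3','B8','V2','Z4','4E','T4','XC','Z7','7E',
                       'T7','EU','F1','XE','XD','XF','ZT','XH','XI','XG','V3','ZJ',
                       'XJ','T2','XL','XO','XM','XN','ZQ','XQ','T3','XP','XU','XY',
                       'OE','S4','S2','V4','V1','S1','8S','T5','ZG','ZF','T6','XT','1W')
) {
  # Assumption: stop early if the data frame has no iso2c column
  stopifnot("iso2c" %in% names(data_set))
  # TRUE for rows whose iso2c is NOT in the non-country list
  pick_countries <- !(data_set$iso2c %in% iso2c_nonCountry)
  # Keep only the country rows; drop = FALSE keeps the result a data frame
  data_set[pick_countries, , drop = FALSE]
}

# Toy data frame, made up for illustration ("1W" and "EU" are non-country codes)
df_example <- data.frame(
  iso2c = c("TW", "1W", "JP", "EU"),
  value = c(1, 2, 3, 4)
)
df_example <- remove_nonCountries(data_set = df_example)
df_example$iso2c
```

Because `data_set` is the first argument, the piped form `wdi$data |> remove_nonCountries()` used later in the exercise works with the same definition.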
  1. The following code removes non-countries from the data set and narrows it down to the year 2020 data. It then summarises the indicator's mean, median, and range.
    
    wdi$data |> remove_nonCountries() -> data_set

    data_set |> subset(year==2020) -> data_set2020 # it is the same as
    code = "SG.GEN.PARL.ZS"
    {
      data_set2020[[code]] |> range(na.rm=TRUE) -> output_range
      data_set2020[[code]] |> mean(na.rm=TRUE) -> output_mean
      data_set2020[[code]] |> median(na.rm=TRUE) -> output_median
      list(
        mean=output_mean,
        median=output_median,
        range=list(output_range)
      ) |> list2DF()
    }


Construct a function `summarise_numerical` that produces a summary data frame of the mean, median, and range for any given data set (input argument `data_set`) and numerical feature column name (input argument `feature`). In other words, with the help of the `summarise_numerical` function, the above code chunk can be replaced with:
```{r}
wdi$data |> remove_nonCountries() -> data_set

data_set |> subset(year==2020) -> data_set2020 # it is the same as
code = "SG.GEN.PARL.ZS"
summarise_numerical(data_set=data_set2020, feature=code)
```

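The braced block from the previous step can be turned into a function almost line by line. The sketch below is one possible version, not the official solution; the `is.numeric` check and the toy data frame are assumptions added for illustration.

```r
# A minimal sketch of summarise_numerical (one possible answer).
summarise_numerical <- function(data_set, feature) {
  values <- data_set[[feature]]
  # Assumption: fail early if the column is missing or not numeric
  stopifnot(is.numeric(values))
  output_range <- range(values, na.rm = TRUE)
  output_mean <- mean(values, na.rm = TRUE)
  output_median <- median(values, na.rm = TRUE)
  # range has two values, so it is wrapped in a list to fit into one row
  list(
    mean = output_mean,
    median = output_median,
    range = list(output_range)
  ) |> list2DF()
}

# Toy data, made up for illustration
toy <- data.frame(x = c(1, 2, NA, 9))
summary_toy <- summarise_numerical(data_set = toy, feature = "x")
summary_toy$mean        # 4
summary_toy$median      # 2
summary_toy$range[[1]]  # 1 9
```

The result is a one-row data frame whose `range` column is a list column, matching the shape produced by the original code chunk.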
  1. Gender inequality is an important issue in social science. One possible indicator to compare this inequality across countries is:

    • Proportion of seats held by women in national parliaments (%) (code name is "SG.GEN.PARL.ZS").

    What is the year range in the data set? For each year, compute the mean of this indicator across countries. Does the mean trend upward over time?

  2. Create a function get_meanTrendOverYears such that the following function call returns a vector of the mean of the given code feature across all countries for each year, with the years as element names. (That is, if the means are 2, 3, and 8 for the years 2010, 2011, and 2012, the returned vector should be the named numeric vector c("2010"=2, "2011"=3, "2012"=8).)

    get_meanTrendOverYears(data_set=data_set, code="SG.GEN.PARL.ZS")

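As a rough sketch of how the last two tasks might be approached (not the official solution), the per-year mean can be computed with base R's `tapply`, which groups one vector by another. The toy data frame below is made up and mirrors the 2, 3, 8 example from the task description; it assumes the data set has a `year` column, as in the `subset(year==2020)` call earlier.

```r
# A minimal sketch of get_meanTrendOverYears (one possible answer).
get_meanTrendOverYears <- function(data_set, code) {
  # Group the feature values by year and take the mean of each group.
  # A year whose values are all NA yields NaN.
  tapply(data_set[[code]], data_set$year, mean, na.rm = TRUE) -> trend
  # tapply returns a 1-d array; convert it to a plain named numeric vector
  setNames(as.vector(trend), names(trend))
}

# Toy data, made up for illustration
toy <- data.frame(
  year = c(2010, 2010, 2011, 2011, 2012),
  SG.GEN.PARL.ZS = c(1, 3, 3, NA, 8)
)

year_range <- range(toy$year, na.rm = TRUE)   # 2010 2012
trend <- get_meanTrendOverYears(data_set = toy, code = "SG.GEN.PARL.ZS")
trend                                         # c("2010"=2, "2011"=3, "2012"=8)
is_increasing <- all(diff(trend) > 0)         # TRUE for this toy data
```

Checking `diff(trend) > 0` is a strict year-over-year test; with real data that contains dips or NaN years, a plot or a looser criterion may be more informative.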
raychiu135 commented 2 years ago

https://github.com/raychiu135/110-1-r4ds-main/blob/9c240a19d1deccb086da743dee205b590d300bcb/exercise_5.4.rmd#L2

raychiu135 commented 2 years ago

https://github.com/raychiu135/110-1-r4ds-main/blob/a5d0b783a34a34c8ded64d53207080df9412a1fc/exercise_5.4.teacher.rmd#L2