kevinwang09 / learningtower

Easily accessible PISA data
https://kevinwang09.github.io/learningtower/

potential duplicates in pisa data? #3

Closed njtierney closed 3 years ago

njtierney commented 4 years ago

It looks like there might be some duplicates. For example, student_id is duplicated in the output below.

## Note re duplicated `pisa` data:
# From what I can see, it looks like `student_id` is duplicated

library(tsibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:tsibble':
#> 
#>     id
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(learningtower)

student %>%
  filter(country == "AUS") %>%
  duplicates(key = c(country,
                     school_id,
                     student_id),
             index = year) %>%
  arrange(year, student_id)
#> # A tibble: 28,803 x 16
#>     year country school_id student_id mother_educ father_educ gender computer
#>    <int> <fct>   <chr>     <chr>            <int>       <int>  <int>    <int>
#>  1  2015 AUS     3.6e+06   3.601e+06            2          NA     NA        1
#>  2  2015 AUS     3.6e+06   3.601e+06            2           1      1       NA
#>  3  2015 AUS     3.6e+06   3.601e+06            2           1      1        2
#>  4  2015 AUS     3.6e+06   3.601e+06            1           1      1        1
#>  5  2015 AUS     3.6e+06   3.601e+06            1           1      1        1
#>  6  2015 AUS     3.6e+06   3.601e+06            2           1      1        1
#>  7  2015 AUS     3.6e+06   3.601e+06            2           1      1        1
#>  8  2015 AUS     3.6e+06   3.601e+06            2           1      1        2
#>  9  2015 AUS     3.6e+06   3.601e+06            2           1      1        1
#> 10  2015 AUS     3.6e+06   3.601e+06            1           4      4        1
#> # … with 28,793 more rows, and 8 more variables: internet <int>, math <dbl>,
#> #   science <dbl>, read <dbl>, stu_wgt <dbl>, country_iso3c <fct>,
#> #   country.name <chr>, un.name.en <chr>

Created on 2019-12-18 by the reprex package (v0.3.0)

sarahromanes commented 4 years ago

Interesting! If I add the following to the end of the pipeline:

pull(year) %>% 
unique()

the duplicates are restricted to 2015 and 2018. Should those datasets be revisited?

gvdr commented 4 years ago

This may be due to the wrong encoding of student_id. It is stored as a character string that reads exactly as shown in the table above (e.g. "3.601e+06").

I'll go back to see whether I introduced this error when binding the columns, or whether it comes from a wrong encoding in some of the underlying datasets (in which case you will have to act on your side).
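
To illustrate the kind of encoding problem suspected here, below is a minimal sketch in base R. The IDs are made up for illustration and are not taken from the PISA files, and using sprintf() with "%.4g" is only an assumed stand-in for whatever step actually produced the strings in the table. It shows how distinct numeric IDs turn into identical strings once they are written with too few significant digits.

```r
# Hypothetical student IDs, invented for illustration only.
ids <- c(3601234, 3601299, 3601377)

# Writing the numbers with four significant digits collapses all three
# IDs into the same scientific-notation string.
lossy <- sprintf("%.4g", ids)
unique(lossy)
#> [1] "3.601e+06"
n_lossy <- length(unique(lossy))

# Writing them as fixed-point whole numbers keeps every digit,
# so the three IDs stay distinct.
exact <- sprintf("%.0f", ids)
n_exact <- length(unique(exact))

c(lossy = n_lossy, exact = n_exact)
```

If something like this happened upstream, many different students would share one student_id string, which would look like duplicates in any key-based check.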

gvdr commented 4 years ago

So, tsibble::duplicates() breaks my laptop for some reason, but I think I solved this. PR in the afternoon.

gvdr commented 4 years ago

@njtierney @dicook could you take a look and, if it looks good, merge the PR?

kimnewzealand commented 4 years ago

Following up on this issue: can someone please check when they have a second, or assign it to themselves as a to-do?

kevinwang09 commented 4 years ago

I had a similar problem working with this data a few days ago, but I can't remember the exact commit(s) that solved it. I think it happened because a previous version stored school_id as integer/numeric, so the IDs were internally truncated or something similar. When I checked the AUS schools, there were only two schools, which is obviously incorrect. The latest version of the check, using janitor::get_dupes, did not return any duplicated rows:

https://github.com/ropenscilabs/learningtower/blob/0e0d048132df0d202c7d4edd5d94b2c671bacf0a/vignettes/visualise_distribution.Rmd#L29-L31
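
For readers unfamiliar with janitor::get_dupes, here is a small sketch of how such a check behaves. The data frame is a toy example with invented values, not the student data, and this is not necessarily the exact call used in the linked vignette.

```r
library(janitor)

# Toy data with invented values: the first two rows share the same key.
toy <- data.frame(
  year       = c(2015L, 2015L, 2018L),
  country    = c("AUS", "AUS", "AUS"),
  school_id  = c("3600001", "3600001", "3600002"),
  student_id = c("3601234", "3601234", "3601299")
)

# get_dupes() returns only the rows whose key combination occurs more
# than once, and adds a dupe_count column.
dupes <- get_dupes(toy, year, country, school_id, student_id)
nrow(dupes)
```

A result with zero rows means no key combination is repeated, which is what the latest check reported.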