ropensci / essurvey

Download data from the European Social Survey
https://docs.ropensci.org/essurvey

Unexpected behavior about missing values #35

Closed ccolonescu closed 5 years ago

ccolonescu commented 5 years ago

devtools::session_info()

─ Session info ─────────────────────────────────────
 setting  value
 version  R version 3.5.2 (2018-12-20)
 os       Windows >= 8 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       America/Denver
 date     2019-04-27

briatte commented 5 years ago

I do not understand the second example:

Round8 <- import_rounds(8)
recode_missings(Round8)
data.frame(tail(attr(Round8$edulvlb, "labels"), 5))

In this example, line 2 does nothing, because the result of recode_missings() is never assigned back to Round8. The example is therefore strictly equivalent to the first one.
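
The point about line 2 follows from R's copy semantics: a function returns a modified copy and leaves its argument untouched unless the result is assigned. Here is a minimal sketch in base R, where recode_777() and the toy data frame dat are hypothetical stand-ins and not part of essurvey:

```r
# Hypothetical stand-in for a recoding function: it returns a modified
# copy of its input and does not change the caller's object.
recode_777 <- function(df) {
  df$x[df$x == 777] <- NA
  df
}

dat <- data.frame(x = c(1, 777, 3))

# Called without assignment: the recoded copy is printed and discarded.
recode_777(dat)
unchanged <- !anyNA(dat$x)   # dat still contains 777 at this point

# Assigning the result back is what makes the recoding stick.
dat <- recode_777(dat)
recoded <- anyNA(dat$x)      # dat now contains an NA
```

Under that assumption, the second example would need Round8 <- recode_missings(Round8) to differ from the first one.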

In the third example, the reason for the disappearing labels is here:

https://github.com/ropensci/essurvey/blob/cb9c28a24b89e0ea8bf370a5845e5f52f1067ca0/R/recode_missings.R#L92-L106

… at line 103.

It is indeed a good question why the labels need to be removed, on top of recoding the missing values to NA, which is what the function claims it does.

It seems to me the function could work without that line at all.

Maybe @cimentadaj can help more, as he coded the function.

cimentadaj commented 5 years ago

Thanks for the feedback! This is indeed a design choice. To be honest, I can't clearly remember why I removed the labels. That said, I'm not sure whether the recode_missings function is still useful.

From what I've seen, the ESS now recodes these values automatically into the standard Stata extended missing values (such as .a, .b, etc.), and the latest haven version supports these and reads them as tagged missing values. See for example:

library(essurvey)
set_email("cimentadaj@gmail.com")
tst <- import_country("Spain", 1)
#> Downloading ESS1
head(tst$edulvla, 200)
#> <Labelled double>: Highest level of education
#>   [1]     1     3     2     2     1     2     1     5     3     3     5
#>  [12]     2     1     1     1     1     1     1     1     1     1     2
#>  [23]     2     1     1     1     1     1     1     3     4     3     5
#>  [34]     1     3     3     3     1     1     2     3     2     2     3
#>  [45]     3     1     3     1     5     1     1     1     1     3     1
#>  [56]     1     1     1     1     1     1     5     3     1     5     2
#>  [67]     3     3     1     3     1     1     2     3     1     1     3
#>  [78]     3     1     4     5     1     3     3     3     1     1     2
#>  [89]     2     1     1     3     2     3     5     3     5     3     1
#> [100]     2     5     1     5     1     1     1     3     5     3     5
#> [111]     2     1     4     1     3     1     2     1     3     1     2
#> [122]     1     1     3     2     2     1     1     5     4     4     1
#> [133]     1     3     1     1     3     2     1     3     3     1     1
#> [144]     1     1     3     1     1     1     1     5     1     2     2
#> [155]     1     3     1     1     2     1     1     1     2     1     1
#> [166]     1     2     1     5     1     3     3     3     2     2     4
#> [177]     2     2     4     3     3     2     3     1     2 NA(b)     1
#> [188]     2     3     1     2     2     5     1     2     3     1     3
#> [199]     1     3
#> 
#> Labels:
#>  value                                                     label
#>      0              Not possible to harmonise into 5-level ISCED
#>      1           Less than lower secondary education (ISCED 0-1)
#>      2             Lower secondary education completed (ISCED 2)
#>      3             Upper secondary education completed (ISCED 3)
#>      4 Post-secondary non-tertiary education completed (ISCED 4)
#>      5                  Tertiary education completed (ISCED 5-6)
#>     55                                                     Other
#>  NA(b)                                                   Refusal
#>  NA(c)                                                Don't know
#>  NA(d)                                                 No answer

These are now coded as NA(b), etc., instead of the old 777, etc. values. To make this clear, I've set a minimum required version for the haven package in the DESCRIPTION file and described this behavior in the documentation of recode_missings.
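
For readers who want to see how such tagged missing values behave, here is a small sketch that uses haven directly, with no ESS download involved. The vector vals and the labels are made up for illustration; only tagged_na(), na_tag(), is_tagged_na() and labelled() come from haven:

```r
library(haven)

# Toy values: three valid codes and one "Refusal" stored as tagged NA "b".
vals <- c(1, 3, tagged_na("b"), 2)

# A tagged NA is still an ordinary NA for base R functions.
missing_flags <- is.na(vals)

# The tag records which kind of missing value it is.
tags <- na_tag(vals)
is_refusal <- is_tagged_na(vals, "b")

# Labels can be attached to both valid codes and tagged NAs.
edu <- labelled(
  vals,
  labels = c(
    "Less than lower secondary education" = 1,
    "Upper secondary education completed" = 3,
    "Refusal" = tagged_na("b")
  )
)
```

Because is.na() treats the tagged value as missing, summaries such as mean(vals, na.rm = TRUE) skip it without any extra recoding step.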