ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr

haven_labelled vectors #687

Closed benzipperer closed 2 years ago

benzipperer commented 2 years ago

Thank you for the excellent package!

Do you think skimr could support haven_labelled vectors, perhaps treating them as numeric? This is related to #296 and #606, but I think the issue is still relevant.

The problem is that haven_labelled vectors seem to be lumped into character variable types. For example, after

mtcars$am <- haven::labelled(mtcars$am, c("Automatic" = 0, "Manual" = 1))
class(mtcars$am)
#> [1] "haven_labelled" "vctrs_vctr"     "double"

then skimr::skim(mtcars, am) outputs the message

Couldn't find skimmers for class: haven_labelled, vctrs_vctr, double, numeric; No user-defined `sfl` provided. Falling back to `character`.

Having a separate skim type for labelled vectors would be nice, but that might be involved. Alternatively, would you consider treating them as numeric? Something like

get_skimmers.haven_labelled <- function(column) {
  modify_default_skimmers("numeric", new_skim_type = "numeric")
}

would be helpful for datasets that include a lot of labelled vectors.

elinw commented 2 years ago

I am pretty sure that this is discussed somewhere and easily handled using the function factory. The thing is that it is very easy for you to create your own skimmer for a new data type. See https://github.com/elinw/skimrextra/blob/master/R/skim_with.R#L35

You are able to handle it exactly as you like.

benzipperer commented 2 years ago

Thanks for the suggestion! I think I'm confused. What I'm asking is if it is possible for skim() to treat haven_labelled vectors as numeric?

Currently skim() treats them as character vectors. From a data analysis point of view, this seems odd given that users analyze labelled vectors as numeric values and when "summarizing" them would generally like to see their mean, percentiles, etc. Indeed, this is how base summary() treats labelled vectors.

If I understand correctly, here you also ask why skim() does not treat them as numeric: https://github.com/ropensci/skimr/issues/606#issuecomment-670986492

michaelquinn32 commented 2 years ago

Hi Ben and Elin!

Sorry for the delay here.

The issue is that haven_labelled has some unexpected interactions with S3 and inheritance. You can see this in the code here. https://github.com/tidyverse/haven/blob/main/R/labelled.R

If haven_labelled inherited from numeric the way we expect it to in skimr, the package authors wouldn't need to redefine all of those methods. I'm sure there's a good reason for this; it just puts us in a bit of a tricky spot.

The easiest path forward is to use a helper function and create a customized skim function.

my_skim <- skim_with(haven_labelled = modify_default_skimmers("numeric", new_skim_type = "haven_labelled"))  

With that, you get the expected numeric output.

my_skim(mtcars, am)
#> ── Data Summary ────────────────────────
#>                            Values
#> Name                       mtcars
#> Number of rows             32    
#> Number of columns          11    
#> _______________________          
#> Column type frequency:           
#>   haven_labelled           1     
#> ________________________         
#> Group variables            None  
#> 
#> ── Variable type: haven_labelled
#> ───────────────────────────────────────────────
#>   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75
#> 1 am                    0             1 0.406 0.499     0     0     0     1
#>    p100 hist 
#> 1     1 ▇▁▁▁▆

benzipperer commented 2 years ago

Thanks for the reply!

You mentioned that haven_labelled is putting you in a "tricky spot" but I was wondering what is the problem with adding get_skimmers.haven_labelled to https://github.com/ropensci/skimr/blob/master/R/get_skimmers.R so that with haven_labelled columns skimr uses numeric, as opposed to default/character?

michaelquinn32 commented 2 years ago

We've been cautious about dependency creep, and this has been a problem for us in the past. But this might be a case where we have to take it on.

Here's the fundamental problem. According to the haven labelled docs, the wrapped vector can either be numeric or character. https://haven.tidyverse.org/reference/labelled.html

From the skimr standpoint, setting haven_labelled to always numeric or always character is a little problematic. Looking at the code though, it seems like there is a way to extract the underlying data from the vector. Maybe we can dispatch on that so it just works.

benzipperer commented 2 years ago

Ah... I didn't realize that haven_labelled wrapped vectors could be character. I see how that might complicate your codebase even if you can in principle grab the underlying data for column x say via class(x) <- NULL or haven::zap_labels(x) or something else.

Well, thanks for considering!
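
The point about grabbing the underlying data can be illustrated with a minimal sketch. It assumes the haven and skimr packages are installed; the character-backed `sex` vector and the `df` copy of mtcars are hypothetical examples, not part of the original report. It only shows a user-side workaround (stripping labels before skimming), not how skimr itself handles the class.

```r
library(haven)
library(skimr)

# Numeric-backed labelled vector, as in the original report.
am <- labelled(mtcars$am, c("Automatic" = 0, "Manual" = 1))

# Character-backed labelled vector (hypothetical example data).
sex <- labelled(c("M", "F", "M"), c("Male" = "M", "Female" = "F"))

# The storage type differs even though both share the haven_labelled class.
typeof(am)
typeof(sex)

# zap_labels() drops the labels and returns the plain underlying vector.
am_plain <- zap_labels(am)
sex_plain <- zap_labels(sex)
is.numeric(am_plain)
is.character(sex_plain)

# User-side workaround: strip labels from a whole data frame, then skim it.
df <- mtcars
df$am <- labelled(df$am, c("Automatic" = 0, "Manual" = 1))
skim(zap_labels(df), am)
```

With the labels removed, `am` is skimmed as an ordinary numeric column, at the cost of losing the label information in the summary.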

elinw commented 2 years ago

The reason we make it so easy to create your own handling is that, for something like haven_labelled, different users and different survey questions might call for very different handling. For example, it is common in SPSS for there to be labels only for the UNK, MIS, and NAP values (and R only handles one kind of missing). Another case: if a variable like age uses 99 to mean "99 or above", there will be a label attached to that value. So each variable really needs some thought. I work with a lot of survey data, and in the same data set you may get all kinds of labels.

In skimr, when there is no existing skimmer for a class, the fallback is character. The iterative data wrangling workflow is then to look at those columns and figure out how to get them into a format that actually makes sense.
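
As a rough sketch of that variable-by-variable workflow, assuming haven and skimr are installed: the `age` vector and its "99 or older" label are made-up illustration data, and which conversion suits which variable is a judgment call rather than a rule.

```r
library(haven)
library(skimr)

# Fully labelled variable: every value has a label, so a factor may fit best.
am <- labelled(mtcars$am, c("Automatic" = 0, "Manual" = 1))
am_factor <- as_factor(am)

# Partially labelled variable (hypothetical): only the top-coded value is
# labelled, so keeping the numbers and dropping the label may fit better.
age <- labelled(c(23, 41, 99, 67), c("99 or older" = 99))
age_numeric <- zap_labels(age)

# Each converted column now has a class skimr already knows how to summarize.
skim(data.frame(am = am_factor))
skim(data.frame(age = age_numeric))
```

Note that the top-coded 99 still enters the numeric summary as an ordinary value here, so it may need separate treatment before the statistics are interpreted.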

elinw commented 2 years ago

Here is the previous discussion, which actually mentions the numeric issue among other things: https://github.com/ropensci/skimr/issues/606

benzipperer commented 2 years ago

@michaelquinn32, to solve the numeric/character distinction problem for haven labelled, what about adding something like

get_skimmers.haven_labelled <- function(column) {
  column_class <- ifelse(typeof(column) %in% c("double", "integer"), "numeric", "character")
  modify_default_skimmers(column_class, new_skim_type = column_class)
}

to https://github.com/ropensci/skimr/blob/master/R/get_skimmers.R ?

michaelquinn32 commented 2 years ago

That's close, but I want to avoid conditional logic there and possibly support other internal data types.

I should be able to put something together later today, before @elinw starts our next release.

benzipperer commented 2 years ago

Looks great, thank you!