larmarange / labelled

Manipulating labelled vectors in R
https://larmarange.github.io/labelled/
GNU General Public License v3.0

look_for slow with big files #77

Closed shannonpileggi closed 3 years ago

shannonpileggi commented 3 years ago

I recently updated to labelled 2.7.0 and look_for(data, details = TRUE) hung and never resolved. I reverted back to 2.5.0 to get previous usage. Can you confirm that it is working as intended in 2.7.0?

larmarange commented 3 years ago

look_for() was redesigned in labelled 2.6.0.

What are you trying to do which is not working with the new version?

larmarange commented 3 years ago

Could you provide a reproducible example?

Are all your packages up to date?

shannonpileggi commented 3 years ago

Apologies for the uninformative issue! Let me try again.

I am using Windows 10 with:
R 4.0.3
RStudio 1.3.1073
tidyverse 1.3.0
here 0.4.8
haven 2.3.1
labelled 2.5.0 / 2.7.0

Under both labelled 2.5.0 & 2.7.0 the following minimal example works:

library(tidyverse)
library(labelled)

# create data ----
ex_data <- tibble(
 id = 1:10,
 ctry = c(rep(1, 5), rep(2, 5)),
 cdy = rep(c(1, 3), 5),
 text = LETTERS[1:10]
 ) %>% 
   mutate(
     ctry = factor(ctry, labels = c("US", "UK")),
     cdy = haven::labelled(
       cdy, 
       labels = c("MMs" = 1, "Skittles" = 3), 
       label = "Preferred Candy")
   )

# create dictionary ----
dictionary <- labelled::look_for(ex_data, details = TRUE)

# view dictionary ----
dictionary
#>   variable           label                              class      type levels
#> 1       id            <NA>                            integer   integer       
#> 2     ctry            <NA>                             factor   integer US; UK
#> 3      cdy Preferred Candy haven_labelled, vctrs_vctr, double    double       
#> 4     text            <NA>                          character character       
#>            value_labels unique_values n_na na_values na_range
#> 1                                  10    0                   
#> 2                                   2    0                   
#> 3 [1] MMs; [3] Skittles             2    0                   
#> 4                                  10    0

Created on 2020-12-19 by the reprex package (v0.3.0)

When I try to create the dictionary by the same method for a larger data set, the dictionary works under 2.5.0, but under 2.7.0 the command never finishes (no error message or warning, R is just running forever).

The data that I am using is sadc_2017_national.sav. As the command never finishes, I was not able to reprex this one, but here is the code I was using.

library(tidyverse)
library(labelled)
library(haven)
library(here)

# import data ----
dat_raw <- haven::read_spss(here::here("data", "sadc_2017_national.sav"))

# create dictionary ----
dictionary <- labelled::look_for(dat_raw, details = TRUE)

# view dictionary ----
dictionary

Please let me know if there is anything else I can try on my end to help troubleshoot.

Thank you!

larmarange commented 3 years ago

Sorry for not responding earlier. I took a quick look at where the problem occurs.

labelled::look_for(dat_raw, details = FALSE) runs quickly.

Your dataset is very big (many variables and many observations). The step causing the problem is computing the "range" of each variable, which is very time-consuming.

I need to explore further, maybe adding an option to deactivate that part of the computation.
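To see how much of the run time comes from the detailed summaries, the two modes can be timed on a synthetic data frame. This is only a minimal sketch: the sizes `n_obs` and `n_vars` are arbitrary assumptions, not the dimensions of the SPSS file above, and timings will vary with the machine and the installed labelled version.

```r
library(labelled)

set.seed(1)
n_obs <- 1e5   # hypothetical number of rows
n_vars <- 20   # hypothetical number of columns
big <- as.data.frame(matrix(rnorm(n_obs * n_vars), ncol = n_vars))

# Quick call without per-variable details
t_quick <- system.time(res_quick <- look_for(big, details = FALSE))

# Detailed call, which also computes per-variable summaries
t_full <- system.time(res_full <- look_for(big, details = TRUE))

# Elapsed seconds for each call
c(quick = t_quick[["elapsed"]], full = t_full[["elapsed"]])
```

Both calls return one row per variable, so the difference in elapsed time reflects only the extra summaries.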

larmarange commented 3 years ago

See #79 for a proposed evolution of look_for().

Now, by default (details = "basic"), look_for() computes only basic details (including value labels and factor levels, but not the variable range). With your big SPSS file, it takes just a few seconds.

If you want full details (as before), specify details = "full" or details = TRUE. But it will take time (several minutes) with a file like yours.

larmarange commented 3 years ago

@shannonpileggi You can test it with devtools::install_github("larmarange/labelled#79")

Do not hesitate to give me feedback.

shannonpileggi commented 3 years ago

@larmarange thank you for taking a look, identifying the problem, and proposing solutions! I did install the dev version and I have a bit of feedback.

  1. The default version of look_for works, and quickly! Thank you!
    > dictionary <- labelled::look_for(dat_raw)
    > head(dictionary)
     variable                           label
    1    sitecode                       Site code
    2    sitename                       Site name
    3    sitetype                       Site type
    4 sitetypenum 1=District, 2=State, 3=National
    5        year          4-digit Year of survey
    6    survyear                1=1991...14=2017

    I propose that the default version also include the value_codes - what do you think? To me, the variable name, variable label, and value codes are essential for navigating the data, which I consider part of the data dictionary.

In addition, I also attempted this by specifying the details argument, which resulted in an error.

> dictionary <- labelled::look_for(dat_raw, details = "basic")
Error in if (details) { : argument is not interpretable as logical

Am I using the argument as intended?

Thank you for all of your work on this package!

larmarange commented 3 years ago

Dear @shannonpileggi

It seems that you do not have the dev version installed. Have you tried devtools::install_github("larmarange/labelled")?
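A quick way to confirm which version is loaded, and whether it understands keyword values for details, is sketched below. It inspects the default of the details argument, on the assumption that releases supporting "basic"/"full" declare a character default while older ones declare a logical one. The last part reproduces the reported error message with plain base R.

```r
library(labelled)

# Version of labelled that is loaded in this session
packageVersion("labelled")

# A character default suggests keywords such as "basic" are supported;
# a logical default suggests only TRUE/FALSE is understood.
details_default <- eval(formals(look_for)$details)
accepts_keywords <- is.character(details_default)
accepts_keywords

# Why older code fails: if() cannot interpret a string such as "basic"
# as a logical condition, so it raises an error.
msg <- tryCatch(
  if ("basic") "ok",
  error = function(e) conditionMessage(e)
)
msg
```

If `accepts_keywords` is FALSE, the installed version predates the change and needs to be reinstalled.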

> library(labelled)
> library(questionr)
> data(fertility)

> look_for(children)
pos   variable      label                    col_type values        
<chr> <chr>         <chr>                    <chr>    <chr>         
1     id_child      Child Id                 dbl      ​              
2     id_woman      Mother Id                dbl      ​              
3     date_of_birth Date of birth            date     ​              
4     sex           Sex                      dbl+lbl  [1] male      
​      ​              ​                         ​         [2] female    
5     alive         Still alive?             dbl+lbl  [0] no, dead  
​      ​              ​                         ​         [1] yes, alive
6     age_at_death  Age at death (in months) dbl
> look_for(children, details = "basic")
pos   variable      label                    col_type values        
<chr> <chr>         <chr>                    <chr>    <chr>         
1     id_child      Child Id                 dbl      ​              
2     id_woman      Mother Id                dbl      ​              
3     date_of_birth Date of birth            date     ​              
4     sex           Sex                      dbl+lbl  [1] male      
​      ​              ​                         ​         [2] female    
5     alive         Still alive?             dbl+lbl  [0] no, dead  
​      ​              ​                         ​         [1] yes, alive
6     age_at_death  Age at death (in months) dbl
> look_for(children, details = "full")
pos   variable      label                    col_type values                        
<chr> <chr>         <chr>                    <chr>    <chr>                         
1     id_child      Child Id                 dbl      range: 1 - 1584               
2     id_woman      Mother Id                dbl      range: 1 - 2000               
3     date_of_birth Date of birth            date     range: 2007-01-03 - 2012-04-15
4     sex           Sex                      dbl+lbl  [1] male                      
​      ​              ​                         ​         [2] female                    
5     alive         Still alive?             dbl+lbl  [0] no, dead                  
​      ​              ​                         ​         [1] yes, alive                
6     age_at_death  Age at death (in months) dbl      range: 0 - 48  

As you can see, value labels (and factor levels) are returned by default

larmarange commented 3 years ago

library(labelled)
library(questionr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data(fertility)

look_for(children) %>% glimpse()
#> Rows: 6
#> Columns: 6
#> $ pos          <int> 1, 2, 3, 4, 5, 6
#> $ variable     <chr> "id_child", "id_woman", "date_of_birth", "sex", "alive...
#> $ label        <chr> "Child Id", "Mother Id", "Date of birth", "Sex", "Stil...
#> $ col_type     <chr> "dbl", "dbl", "date", "dbl+lbl", "dbl+lbl", "dbl"
#> $ levels       <named list> [NULL, NULL, NULL, NULL, NULL, NULL]
#> $ value_labels <named list> [NULL, NULL, NULL, <1, 2>, <0, 1>, NULL]
look_for(children, details = "none") %>% glimpse()
#> Rows: 6
#> Columns: 3
#> $ pos      <int> 1, 2, 3, 4, 5, 6
#> $ variable <chr> "id_child", "id_woman", "date_of_birth", "sex", "alive", "...
#> $ label    <chr> "Child Id", "Mother Id", "Date of birth", "Sex", "Still al...
look_for(children, details = "full") %>% glimpse()
#> Rows: 6
#> Columns: 13
#> $ pos           <int> 1, 2, 3, 4, 5, 6
#> $ variable      <chr> "id_child", "id_woman", "date_of_birth", "sex", "aliv...
#> $ label         <chr> "Child Id", "Mother Id", "Date of birth", "Sex", "Sti...
#> $ col_type      <chr> "dbl", "dbl", "date", "dbl+lbl", "dbl+lbl", "dbl"
#> $ levels        <named list> [NULL, NULL, NULL, NULL, NULL, NULL]
#> $ value_labels  <named list> [NULL, NULL, NULL, <1, 2>, <0, 1>, NULL]
#> $ class         <named list> ["numeric", "numeric", "Date", <"haven_labelle...
#> $ type          <chr> "double", "double", "double", "double", "double", "do...
#> $ na_values     <named list> [NULL, NULL, NULL, NULL, NULL, NULL]
#> $ na_range      <named list> [NULL, NULL, NULL, NULL, NULL, NULL]
#> $ unique_values <int> 1584, 1090, 1038, 2, 2, 22
#> $ n_na          <int> 0, 0, 0, 0, 0, 1442
#> $ range         <named list> [<1, 1584>, <1, 2000>, <2007-01-03, 2012-04-15...

Created on 2021-01-16 by the reprex package (v0.3.0)

shannonpileggi commented 3 years ago

Ah, thank you and sorry for the confusion! My output does match yours, now. :)

library(labelled)

library(questionr)

data(fertility)

look_for(children, details = "none")
#>   pos variable      label                   
#> <int> <chr>         <chr>                   
#>     1 id_child      Child Id                
#>     2 id_woman      Mother Id               
#>     3 date_of_birth Date of birth           
#>     4 sex           Sex                     
#>     5 alive         Still alive?            
#>     6 age_at_death  Age at death (in months)

look_for(children, details = "full")
#> pos   variable      label                  col_type values                      
#> <chr> <chr>         <chr>                  <chr>    <chr>                       
#> 1     id_child      Child Id               dbl      range: 1 - 1584             
#> 2     id_woman      Mother Id              dbl      range: 1 - 2000             
#> 3     date_of_birth Date of birth          date     range: 2007-01-03 - 2012-04~
#> 4     sex           Sex                    dbl+lbl  [1] male                    
#>                                                     [2] female                  
#> 5     alive         Still alive?           dbl+lbl  [0] no, dead                
#>                                                     [1] yes, alive              
#> 6     age_at_death  Age at death (in mont~ dbl      range: 0 - 48

look_for(children, details = "basic")
#> pos   variable      label                    col_type values        
#> <chr> <chr>         <chr>                    <chr>    <chr>         
#> 1     id_child      Child Id                 dbl
#> 2     id_woman      Mother Id                dbl
#> 3     date_of_birth Date of birth            date
#> 4     sex           Sex                      dbl+lbl  [1] male      
#>                                                       [2] female    
#> 5     alive         Still alive?             dbl+lbl  [0] no, dead  
#>                                                       [1] yes, alive
#> 6     age_at_death  Age at death (in months) dbl

Created on 2021-01-17 by the reprex package (v0.3.0)

Some follow up questions I have are:

  1. The difference that I am seeing between details = "full" and details = "basic" is that full provides both the range and codes under values, vs basic is codes only. Is that correct?
  2. Is there a reason the output is formatted in the long output, with 1 row per value code, instead of the wide output, with 1 row per variable?

I did really like your previous wide output with variable, label, class, type, and value_labels in wide format - I think it would be great if this could be maintained for the basic option, and then the full option could include the numeric ranges, missing value summary (and any other features that I am not recalling) as well. For me, that basic option would be the essence of a data dictionary, that you may want to flex to be long, and then the full option provides the same but with more details about what is observed in the values of the data in addition to the metadata.

Thank you for your work on this!

larmarange commented 3 years ago

There is some confusion here between the result returned by look_for() and how it is printed in the console. You can use as_tibble() to deactivate the default printing. For readability, the results are printed by default in a long format, and several columns are merged into a single values column.

However, the tibble returned by look_for() is wide, with some columns returned as nested lists, and value_labels are stored in a separate column from factor levels.

library(labelled)
library(questionr)
library(dplyr)
data(fertility)

look_for(children) %>% as_tibble()
#> # A tibble: 6 x 6
#>     pos variable      label                    col_type levels      value_labels
#>   <int> <chr>         <chr>                    <chr>    <named lis> <named list>
#> 1     1 id_child      Child Id                 dbl      <NULL>      <NULL>      
#> 2     2 id_woman      Mother Id                dbl      <NULL>      <NULL>      
#> 3     3 date_of_birth Date of birth            date     <NULL>      <NULL>      
#> 4     4 sex           Sex                      dbl+lbl  <NULL>      <dbl [2]>   
#> 5     5 alive         Still alive?             dbl+lbl  <NULL>      <dbl [2]>   
#> 6     6 age_at_death  Age at death (in months) dbl      <NULL>      <NULL>
look_for(children, details = "none") %>% as_tibble()
#> # A tibble: 6 x 3
#>     pos variable      label                   
#>   <int> <chr>         <chr>                   
#> 1     1 id_child      Child Id                
#> 2     2 id_woman      Mother Id               
#> 3     3 date_of_birth Date of birth           
#> 4     4 sex           Sex                     
#> 5     5 alive         Still alive?            
#> 6     6 age_at_death  Age at death (in months)
look_for(children, details = "full") %>% as_tibble()
#> # A tibble: 6 x 13
#>     pos variable label col_type levels value_labels class type  na_values
#>   <int> <chr>    <chr> <chr>    <name> <named list> <nam> <chr> <named l>
#> 1     1 id_child Chil~ dbl      <NULL> <NULL>       <chr~ doub~ <NULL>   
#> 2     2 id_woman Moth~ dbl      <NULL> <NULL>       <chr~ doub~ <NULL>   
#> 3     3 date_of~ Date~ date     <NULL> <NULL>       <chr~ doub~ <NULL>   
#> 4     4 sex      Sex   dbl+lbl  <NULL> <dbl [2]>    <chr~ doub~ <NULL>   
#> 5     5 alive    Stil~ dbl+lbl  <NULL> <dbl [2]>    <chr~ doub~ <NULL>   
#> 6     6 age_at_~ Age ~ dbl      <NULL> <NULL>       <chr~ doub~ <NULL>   
#> # ... with 4 more variables: na_range <named list>, unique_values <int>,
#> #   n_na <int>, range <named list>

Created on 2021-01-18 by the reprex package (v0.3.0)
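Because value_labels is a list column, the labels of a single variable can also be pulled out directly. The sketch below uses a small made-up data frame (modelled on the candy example earlier in the thread) rather than the fertility data, and looks the variable up by row position so that it does not rely on how the list is named.

```r
library(labelled)

# Small illustrative data frame with one labelled variable
df <- data.frame(id = 1:4, cdy = c(1, 3, 1, 3))
df$cdy <- haven::labelled(
  df$cdy,
  labels = c(MMs = 1, Skittles = 3),
  label = "Preferred Candy"
)

res <- look_for(df)

# value_labels is a list column with one element per variable
is.list(res$value_labels)

# Value labels of a single variable, found via its row position
cdy_labels <- res$value_labels[[which(res$variable == "cdy")]]
cdy_labels
```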

You can use two helper functions on the table returned by look_for(): convert_list_columns_to_character() and lookfor_to_long_format().

More information is available in the dedicated vignette: https://larmarange.github.io/labelled/articles/look_for.html#advanced-usages-of-look-for-
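As a rough illustration of the two helpers, the sketch below applies each of them to the result of look_for() on a small made-up data frame (the data are an assumption for illustration, not taken from the vignette).

```r
library(labelled)

# Small illustrative data frame with one labelled variable
df <- data.frame(id = 1:4, cdy = c(1, 3, 1, 3))
df$cdy <- haven::labelled(
  df$cdy,
  labels = c(MMs = 1, Skittles = 3),
  label = "Preferred Candy"
)

res <- look_for(df)

# Wide: still one row per variable, list columns collapsed to strings
wide <- convert_list_columns_to_character(res)
wide$value_labels

# Long: list columns unnested, so a variable with several value labels
# can occupy several rows
long <- lookfor_to_long_format(res)
c(rows_wide = nrow(wide), rows_long = nrow(long))
```

The wide form is the closer match to a one-row-per-variable dictionary, while the long form mirrors what the default print method displays.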

shannonpileggi commented 3 years ago

Ok. Thank you again for your thorough responses. I apologize, I think I am still used to the usage in version 2.5, and you have changed a lot! Apologies for not more thoroughly reading your new vignette.

However, after reading through the vignette, it is still not clear to me if there is an easy way to replicate the functionality in 2.5, where you can see the metadata in wide rather than long format. I would ideally like to see a quick solution to generate the table shown here, with variable, label, value_labels.

Again, thank you for your prompt responses and discussion on this matter! And my apologies in advance if this is in your documentation and I have yet again managed to miss it.

larmarange commented 3 years ago

You could use

df %>% look_for() %>% convert_list_columns_to_character() 

df %>% look_for() %>% convert_list_columns_to_character() %>% View()

NB: reinstall the latest dev version. I just fixed a small bug.

shannonpileggi commented 3 years ago

Ah yes that works perfectly, thank you!


Just curious - is it intentional to keep the levels column with this output? Would it serve another purpose with a different data set?

And what do you think about generate_dictionary being an alias for look_for() %>% convert_list_columns_to_character()?
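For illustration only, such a wrapper could be sketched as follows. The name `make_dictionary` is hypothetical and is not a function provided by labelled; the sketch also keeps just the variable, label, and value_labels columns, which drops levels (the column that holds factor levels, so it is only populated when the data contain factors).

```r
library(labelled)

# Hypothetical helper, not part of labelled: a wide, character-only
# dictionary limited to three columns
make_dictionary <- function(data, ...) {
  dict <- convert_list_columns_to_character(look_for(data, ...))
  dict <- dplyr::as_tibble(dict)  # plain tibble, default tibble printing
  dict[, c("variable", "label", "value_labels")]
}

# Small illustrative data frame with one labelled variable
df <- data.frame(id = 1:4, cdy = c(1, 3, 1, 3))
df$cdy <- haven::labelled(
  df$cdy,
  labels = c(MMs = 1, Skittles = 3),
  label = "Preferred Candy"
)

dict <- make_dictionary(df)
dict
```

Extra arguments are passed through to look_for(), so keywords can still be used to search within the dictionary.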