Closed shannonpileggi closed 3 years ago
look_for() has been redesigned in labelled 2.6.0.
What are you trying to do which is not working with the new version?
Could you provide a reproducible example?
Are all your packages up to date?
Apologies for the uninformative issue! Let me try again.
I am using Windows 10 with:
R 4.0.3
RStudio 1.3.1073
tidyverse 1.3.0
here 0.4.8
haven 2.3.1
labelled 2.5.0 / 2.7.0
Under both labelled 2.5.0 & 2.7.0 the following minimal example works:
library(tidyverse)
library(labelled)
# create data ----
ex_data <- tibble(
id = 1:10,
ctry = c(rep(1, 5), rep(2, 5)),
cdy = rep(c(1, 3), 5),
text = LETTERS[1:10]
) %>%
mutate(
ctry = factor(ctry, labels = c("US", "UK")),
cdy = haven::labelled(
cdy,
labels = c("MMs" = 1, "Skittles" = 3),
label = "Preferred Candy")
)
# create dictionary ----
dictionary <- labelled::look_for(ex_data, details = TRUE)
# view dictionary ----
dictionary
#> variable label class type levels
#> 1 id <NA> integer integer
#> 2 ctry <NA> factor integer US; UK
#> 3 cdy Preferred Candy haven_labelled, vctrs_vctr, double double
#> 4 text <NA> character character
#> value_labels unique_values n_na na_values na_range
#> 1 10 0
#> 2 2 0
#> 3 [1] MMs; [3] Skittles 2 0
#> 4 10 0
Created on 2020-12-19 by the reprex package (v0.3.0)
When I try to create the dictionary by the same method for a larger data set, the dictionary works under 2.5.0, but under 2.7.0 the command never finishes (no error message or warning, R is just running forever).
The data that I am using is sadc_2017_national.sav. As the command never finishes, I was not able to reprex this one, but here is the code I was using.
library(tidyverse)
library(labelled)
library(haven)
library(here)
# import data ----
dat_raw <- haven::read_spss(here::here("data", "sadc_2017_national.sav"))
# create dictionary ----
dictionary <- labelled::look_for(dat_raw, details = TRUE)
# view dictionary ----
dictionary
Please let me know if there is anything else I can try on my end to help troubleshoot.
Thank you!
Sorry for not responding earlier. I quickly explored when the problem occurs.
labelled::look_for(dat_raw, details = FALSE)
works quite quickly.
Your dataset is very big (many variables and many observations). The feature causing the problem is the computation of the "range" of each variable, which is very time consuming.
I need to explore further, maybe with an option to deactivate that part of the computation.
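To see why a value scan is so much more expensive than reading metadata, here is a minimal base R sketch. It is not the code used inside labelled: the data frame `big` is a made-up stand-in for a large survey file, and the two passes only illustrate the difference between reading an attribute and visiting every value.

```r
# Hypothetical stand-in for a large file: 100 numeric columns, 20,000 rows.
set.seed(1)
big <- as.data.frame(matrix(rnorm(100 * 20000), ncol = 100))

# Metadata-only pass: reads the "label" attribute, never scans the values.
read_label <- function(x) {
  lbl <- attr(x, "label", exact = TRUE)
  if (is.null(lbl)) NA_character_ else as.character(lbl)
}
labels_only <- vapply(big, read_label, character(1))

# Value-scanning pass: range() has to visit every cell of every column.
ranges <- lapply(big, range, na.rm = TRUE)

# Compare the elapsed time of the two passes on your own machine.
system.time(vapply(big, read_label, character(1)))
system.time(lapply(big, range, na.rm = TRUE))
```

The cost of the second pass grows with rows times columns, whereas the first grows with the number of columns only.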
See #79 for a proposed evolution of look_for().
Now, by default (details = "basic"), look_for() will compute only basic details (including value labels and factor levels but not variable range). With your big SPSS file, it takes just a few seconds.
If you want full details (as before), indicate details = "full" or details = TRUE. But it will take time (several minutes) with a file like yours.
@shannonpileggi You can test it with devtools::install_github("larmarange/labelled#79")
Do not hesitate to give me feedback.
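The idea of tiered detail levels can be sketched in base R as follows. This is only an illustration under stated assumptions: `describe_columns()` is a hypothetical helper, not a function from labelled, and the columns it computes are chosen for the example.

```r
# Hypothetical helper, not part of labelled: illustrates tiered detail levels.
describe_columns <- function(data, details = c("basic", "full")) {
  details <- match.arg(details)

  # Cheap part: variable labels are attributes, so no values are scanned.
  out <- data.frame(
    variable = names(data),
    label = vapply(
      data,
      function(x) {
        lbl <- attr(x, "label", exact = TRUE)
        if (is.null(lbl)) NA_character_ else as.character(lbl)
      },
      character(1)
    ),
    stringsAsFactors = FALSE,
    row.names = NULL
  )

  # Expensive part: computed only on request, because it visits every value.
  if (details == "full") {
    out$n_na <- vapply(data, function(x) sum(is.na(x)), integer(1))
    out$unique_values <- vapply(data, function(x) length(unique(x)), integer(1))
  }
  out
}

toy <- data.frame(id = 1:4, score = c(2.5, NA, 2.5, 4))
attr(toy$score, "label") <- "Test score"

describe_columns(toy)                    # variable and label only
describe_columns(toy, details = "full")  # adds n_na and unique_values
```

Keeping the value scan behind an explicit option means the default call stays fast on wide files, at the price of a second call when the extra columns are needed.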
@larmarange thank you for taking a look, identifying the problem, and proposing solutions! I did install the dev version and I have a bit of feedback.
look_for works, and quickly! Thank you!
> dictionary <- labelled::look_for(dat_raw)
> head(dictionary)
variable label
1 sitecode Site code
2 sitename Site name
3 sitetype Site type
4 sitetypenum 1=District, 2=State, 3=National
5 year 4-digit Year of survey
6 survyear 1=1991...14=2017
I propose that the default version also include the value_codes - what do you think? To me, the variable name, variable label, and value codes are essential parts of navigating data, which I consider part of the data dictionary.
In addition, I also attempted this by specifying the details argument, which resulted in an error.
> dictionary <- labelled::look_for(dat_raw, details = "basic")
Error in if (details) { : argument is not interpretable as logical
Am I using the argument as intended?
Thank you for all of your work on this package!
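For context, this kind of error is what R signals whenever a non-logical string is used as an `if` condition. The sketch below reproduces it with two hypothetical functions: `old_style()` and `new_style()` are not taken from labelled, and the mapping of FALSE to "none" is an assumption made for the example.

```r
# Hypothetical older style: `details` is used directly as a condition.
old_style <- function(details = TRUE) {
  if (details) "full" else "basic"
}

old_style(TRUE)  # works: a logical is a valid condition

# A string such as "basic" cannot be used as a condition, so this signals an error.
msg <- tryCatch(old_style("basic"), error = function(e) conditionMessage(e))
msg

# Hypothetical newer style: accept named levels and map a logical onto them.
new_style <- function(details = c("basic", "none", "full")) {
  if (is.logical(details)) {
    # Assumption for this sketch: TRUE means "full", FALSE means "none".
    details <- if (isTRUE(details)) "full" else "none"
  }
  match.arg(details)
}

new_style()         # default level
new_style("full")
new_style(TRUE)
```

Seeing this error after passing details = "basic" is therefore a hint that the installed version still expects a logical.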
Dear @shannonpileggi
it seems that you do not have the dev version installed. Have you tried devtools::install_github("larmarange/labelled")?
> library(labelled)
> library(questionr)
> data(fertility)
> look_for(children)
pos variable label col_type values
<chr> <chr> <chr> <chr> <chr>
1 id_child Child Id dbl
2 id_woman Mother Id dbl
3 date_of_birth Date of birth date
4 sex Sex dbl+lbl [1] male
[2] female
5 alive Still alive? dbl+lbl [0] no, dead
[1] yes, alive
6 age_at_death Age at death (in months) dbl
> look_for(children, details = "basic")
pos variable label col_type values
<chr> <chr> <chr> <chr> <chr>
1 id_child Child Id dbl
2 id_woman Mother Id dbl
3 date_of_birth Date of birth date
4 sex Sex dbl+lbl [1] male
[2] female
5 alive Still alive? dbl+lbl [0] no, dead
[1] yes, alive
6 age_at_death Age at death (in months) dbl
> look_for(children, details = "full")
pos variable label col_type values
<chr> <chr> <chr> <chr> <chr>
1 id_child Child Id dbl range: 1 - 1584
2 id_woman Mother Id dbl range: 1 - 2000
3 date_of_birth Date of birth date range: 2007-01-03 - 2012-04-15
4 sex Sex dbl+lbl [1] male
[2] female
5 alive Still alive? dbl+lbl [0] no, dead
[1] yes, alive
6 age_at_death Age at death (in months) dbl range: 0 - 48
As you can see, value labels (and factor levels) are returned by default.
library(labelled)
library(questionr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data(fertility)
look_for(children) %>% glimpse()
#> Rows: 6
#> Columns: 6
#> $ pos <int> 1, 2, 3, 4, 5, 6
#> $ variable <chr> "id_child", "id_woman", "date_of_birth", "sex", "alive...
#> $ label <chr> "Child Id", "Mother Id", "Date of birth", "Sex", "Stil...
#> $ col_type <chr> "dbl", "dbl", "date", "dbl+lbl", "dbl+lbl", "dbl"
#> $ levels <named list> [NULL, NULL, NULL, NULL, NULL, NULL]
#> $ value_labels <named list> [NULL, NULL, NULL, <1, 2>, <0, 1>, NULL]
look_for(children, details = "none") %>% glimpse()
#> Rows: 6
#> Columns: 3
#> $ pos <int> 1, 2, 3, 4, 5, 6
#> $ variable <chr> "id_child", "id_woman", "date_of_birth", "sex", "alive", "...
#> $ label <chr> "Child Id", "Mother Id", "Date of birth", "Sex", "Still al...
look_for(children, details = "full") %>% glimpse()
#> Rows: 6
#> Columns: 13
#> $ pos <int> 1, 2, 3, 4, 5, 6
#> $ variable <chr> "id_child", "id_woman", "date_of_birth", "sex", "aliv...
#> $ label <chr> "Child Id", "Mother Id", "Date of birth", "Sex", "Sti...
#> $ col_type <chr> "dbl", "dbl", "date", "dbl+lbl", "dbl+lbl", "dbl"
#> $ levels <named list> [NULL, NULL, NULL, NULL, NULL, NULL]
#> $ value_labels <named list> [NULL, NULL, NULL, <1, 2>, <0, 1>, NULL]
#> $ class <named list> ["numeric", "numeric", "Date", <"haven_labelle...
#> $ type <chr> "double", "double", "double", "double", "double", "do...
#> $ na_values <named list> [NULL, NULL, NULL, NULL, NULL, NULL]
#> $ na_range <named list> [NULL, NULL, NULL, NULL, NULL, NULL]
#> $ unique_values <int> 1584, 1090, 1038, 2, 2, 22
#> $ n_na <int> 0, 0, 0, 0, 0, 1442
#> $ range <named list> [<1, 1584>, <1, 2000>, <2007-01-03, 2012-04-15...
Created on 2021-01-16 by the reprex package (v0.3.0)
Ah, thank you and sorry for the confusion! My output does match yours, now. :)
library(labelled)
library(questionr)
data(fertility)
look_for(children, details = "none")
#> pos variable label
#> <int> <chr> <chr>
#> 1 id_child Child Id
#> 2 id_woman Mother Id
#> 3 date_of_birth Date of birth
#> 4 sex Sex
#> 5 alive Still alive?
#> 6 age_at_death Age at death (in months)
look_for(children, details = "full")
#> pos variable label col_type values
#> <chr> <chr> <chr> <chr> <chr>
#> 1 id_child Child Id dbl range: 1 - 1584
#> 2 id_woman Mother Id dbl range: 1 - 2000
#> 3 date_of_birth Date of birth date range: 2007-01-03 - 2012-04~
#> 4 sex Sex dbl+lbl [1] male
#> <U+200B> <U+200B> <U+200B> <U+200B> [2] female
#> 5 alive Still alive? dbl+lbl [0] no, dead
#> <U+200B> <U+200B> <U+200B> <U+200B> [1] yes, alive
#> 6 age_at_death Age at death (in mont~ dbl range: 0 - 48
look_for(children, details = "basic")
#> pos variable label col_type values
#> <chr> <chr> <chr> <chr> <chr>
#> 1 id_child Child Id dbl <U+200B>
#> 2 id_woman Mother Id dbl <U+200B>
#> 3 date_of_birth Date of birth date <U+200B>
#> 4 sex Sex dbl+lbl [1] male
#> <U+200B> <U+200B> <U+200B> <U+200B> [2] female
#> 5 alive Still alive? dbl+lbl [0] no, dead
#> <U+200B> <U+200B> <U+200B> <U+200B> [1] yes, alive
#> 6 age_at_death Age at death (in months) dbl <U+200B>
Created on 2021-01-17 by the reprex package (v0.3.0)
Some follow-up questions I have are:
The difference between details = "full" and details = "basic" is that full provides both the range and codes under values, whereas basic provides codes only. Is that correct?
I did really like your previous wide output with variable, label, class, type, and value_labels - I think it would be great if this could be maintained for the basic option, and then the full option could include the numeric ranges, missing value summary (and any other features that I am not recalling) as well. For me, that basic option would be the essence of a data dictionary, that you may want to flex to be long, and then the full option provides the same but with more details about what is observed in the values of the data in addition to the metadata.
Thank you for your work on this!
There is a confusion here between the result returned by look_for() and how it is printed in the console. You can use as_tibble() to deactivate the default printing. For readability, by default, the results are printed in a long format and several columns are merged into a single values column.
However, the tibble returned by look_for() is wide, with some columns returned as nested lists, and value_labels are stored in a separate column from factor levels.
library(labelled)
library(questionr)
library(dplyr)
data(fertility)
look_for(children) %>% as_tibble()
#> # A tibble: 6 x 6
#> pos variable label col_type levels value_labels
#> <int> <chr> <chr> <chr> <named lis> <named list>
#> 1 1 id_child Child Id dbl <NULL> <NULL>
#> 2 2 id_woman Mother Id dbl <NULL> <NULL>
#> 3 3 date_of_birth Date of birth date <NULL> <NULL>
#> 4 4 sex Sex dbl+lbl <NULL> <dbl [2]>
#> 5 5 alive Still alive? dbl+lbl <NULL> <dbl [2]>
#> 6 6 age_at_death Age at death (in months) dbl <NULL> <NULL>
look_for(children, details = "none") %>% as_tibble()
#> # A tibble: 6 x 3
#> pos variable label
#> <int> <chr> <chr>
#> 1 1 id_child Child Id
#> 2 2 id_woman Mother Id
#> 3 3 date_of_birth Date of birth
#> 4 4 sex Sex
#> 5 5 alive Still alive?
#> 6 6 age_at_death Age at death (in months)
look_for(children, details = "full") %>% as_tibble()
#> # A tibble: 6 x 13
#> pos variable label col_type levels value_labels class type na_values
#> <int> <chr> <chr> <chr> <name> <named list> <nam> <chr> <named l>
#> 1 1 id_child Chil~ dbl <NULL> <NULL> <chr~ doub~ <NULL>
#> 2 2 id_woman Moth~ dbl <NULL> <NULL> <chr~ doub~ <NULL>
#> 3 3 date_of~ Date~ date <NULL> <NULL> <chr~ doub~ <NULL>
#> 4 4 sex Sex dbl+lbl <NULL> <dbl [2]> <chr~ doub~ <NULL>
#> 5 5 alive Stil~ dbl+lbl <NULL> <dbl [2]> <chr~ doub~ <NULL>
#> 6 6 age_at_~ Age ~ dbl <NULL> <NULL> <chr~ doub~ <NULL>
#> # ... with 4 more variables: na_range <named list>, unique_values <int>,
#> # n_na <int>, range <named list>
Created on 2021-01-18 by the reprex package (v0.3.0)
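The long layout used for printing can be pictured as one row per value label. The base R sketch below builds such a layout from a toy named list; it only illustrates the idea and is not how labelled formats its output.

```r
# Toy value labels for two variables, stored as a named list (one element per variable).
value_labels <- list(
  sex = c(male = 1, female = 2),
  alive = c("no, dead" = 0, "yes, alive" = 1)
)

# Long layout: one row per (variable, value label) pair.
long <- do.call(
  rbind,
  lapply(names(value_labels), function(v) {
    labs <- value_labels[[v]]
    data.frame(
      variable = v,
      values = paste0("[", unname(labs), "] ", names(labs)),
      stringsAsFactors = FALSE
    )
  })
)
long
```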
You can use two helper functions on the table returned by look_for(): convert_list_columns_to_character() and lookfor_to_long_format().
More information is available in the dedicated vignette: https://larmarange.github.io/labelled/articles/look_for.html#advanced-usages-of-look-for-
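To give an idea of what collapsing a list column into text involves, here is a base R sketch on a toy dictionary. It is not the implementation of the helper named above: `collapse_one()` is a hypothetical function, and both the "[code] label" format and the use of NA for variables without labels are assumptions made for the example.

```r
# Toy dictionary with a list column, similar in shape to a wide look_for() result.
dict <- data.frame(
  variable = c("id_child", "sex", "alive"),
  stringsAsFactors = FALSE
)
dict$value_labels <- list(
  NULL,
  c(male = 1, female = 2),
  c("no, dead" = 0, "yes, alive" = 1)
)

# Hypothetical collapsing rule: "[code] label" pairs joined by "; ".
collapse_one <- function(x) {
  if (is.null(x)) return(NA_character_)
  paste0("[", unname(x), "] ", names(x), collapse = "; ")
}

# Replace the list column with a plain character column.
dict$value_labels <- vapply(dict$value_labels, collapse_one, character(1))
dict
```

Once every column is atomic, the table stays wide (one row per variable) and can be viewed or exported like any other data frame.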
Ok. Thank you again for your thorough responses. I apologize, I think I am still used to the usage in version 2.5, and you have changed a lot! Apologies for not more thoroughly reading your new vignette.
However, after reading through the vignette, it is still not clear to me if there is an easy way to replicate the functionality in 2.5, where you can see the metadata in wide rather than long format. I would ideally like to see a quick solution to generate the table shown here, with variable, label, value_labels.
Again, thank you for your prompt responses and discussion on this matter! And my apologies in advance if this is in your documentation and I have yet again managed to miss it.
You could use
df %>% look_for() %>% convert_list_columns_to_character()
df %>% look_for() %>% convert_list_columns_to_character() %>% View()
NB: reinstall the latest dev version. I just fixed a small bug.
Ah yes that works perfectly, thank you!
Just curious - is it intentional to keep the levels column with this output? Would it serve another purpose with a different data set?
And what do you think about generate_dictionary being an alias for look_for() %>% convert_list_columns_to_character()?
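Until such an alias exists, a user-side wrapper is easy to write. The sketch below is only a suggestion: `generate_dictionary()` is a hypothetical name defined here, not a function exported by labelled, and it assumes a version of labelled that provides convert_list_columns_to_character().

```r
# Hypothetical user-side wrapper; not exported by labelled.
# Extra arguments (for example details = "full") are passed on to look_for().
generate_dictionary <- function(data, ...) {
  labelled::convert_list_columns_to_character(
    labelled::look_for(data, ...)
  )
}

# Usage, guarded so the example also runs where labelled is not installed.
if (requireNamespace("labelled", quietly = TRUE)) {
  generate_dictionary(iris)
}
```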
I recently updated to labelled 2.7.0 and look_for(data, details = TRUE) hung and never resolved. I reverted back to 2.5.0 to get previous usage. Can you confirm that it is working as intended in 2.7.0?