The sjlabelled package has many functions for manipulating labels, including one to remove them:
qualtRics::fetch_survey("my_survey_key") %>% sjlabelled::remove_all_labels()
I have never been a big fan or user of the labels myself, so I wonder if there should be an option to just skip the labelling. When I first started using this package (before I became maintainer) this is what I expected `label = FALSE` to do. 🙈
I’ll just add that I get terribly confused between the label functions in the haven, labelled, and sjlabelled packages.
I do see the value of labelled data structures for surveys. I often use the SPSS download format to achieve this.
For me, the labels on response data are absolutely essential. This may be due, in part or wholly, to how my organization uses Qualtrics on the front end. I am not a front-end user -- I don't make surveys. Instead I have come in and started to build a database structure for storing our survey data, which will enable me to do programmatic analysis in R, applied across many surveys if desired, and to connect the survey data to other data sources.
Our column names when I fetch are things like `q2_2`. I want to actually know and store the "question text" so that when I'm doing analysis later I don't need to consult an external data dictionary or log in to Qualtrics, and then have to build survey-by-survey `rename` statements. Perhaps our front-end users could be using a sensible "short name" instead of q2, but they are not. (In fact, I have run into occasions where multiple questions have the same name because they duplicated a question for the convenience of reusing its structure and then adjusted it.)
And so I have chosen an architecture that stores all of this into a "question metadata" table-- one row per survey and question. There I store the qid, question text, question type, etc.
When I do analysis, I join this to the response data which is also stored in a long format (one row per question response, so many rows per survey response), get the question text, and then `pivot_wider` with `names_from = question_text`.
So that is the context in which I am saying the labels are essential-- I am processing them automatically both on the initial data storage, and then also processing automatically to rebuild the tables back into one row per survey before doing analysis.
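For readers who want to picture that workflow, here is a minimal sketch of the join-then-pivot step. The table names, column names, and data are made up for illustration and are not my real schema; it assumes dplyr, tidyr, and tibble are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical "question metadata" table: one row per survey and question
question_meta <- tibble::tibble(
  survey_id = "SV_demo",
  qid = c("q2_1", "q2_2"),
  question_text = c("How satisfied are you?", "Would you recommend us?")
)

# Hypothetical long-format responses: one row per question response
responses_long <- tibble::tibble(
  survey_id = "SV_demo",
  response_id = c("R_1", "R_1", "R_2", "R_2"),
  qid = c("q2_1", "q2_2", "q2_1", "q2_2"),
  answer = c("Very", "Yes", "Somewhat", "No")
)

# Attach the question text, then rebuild one row per survey response
responses_wide <- responses_long %>%
  left_join(question_meta, by = c("survey_id", "qid")) %>%
  select(survey_id, response_id, question_text, answer) %>%
  pivot_wider(names_from = question_text, values_from = answer)

responses_wide
```

If two questions share the same text, `pivot_wider` will produce list-columns for the clashing name, so in practice the question text may need a disambiguating suffix first.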
This would be trivial to turn off with a single parameter, passed through to `read_survey`, that tells the function whether to skip the call to
rawdata <- sjlabelled::set_label(rawdata, unlist(subquestions))
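A rough sketch of what such a switch could look like. The helper name `label_columns` and the parameter `add_labels` are hypothetical, not part of qualtRics, and the labelling is done with base R `attr<-` here as a stand-in for `sjlabelled::set_label` so the example runs without extra packages.

```r
# Hypothetical helper; `add_labels` is an assumed parameter name.
label_columns <- function(rawdata, subquestions, add_labels = TRUE) {
  # Skip the labelling step entirely when the caller opts out
  if (!add_labels) {
    return(rawdata)
  }
  labels <- unlist(subquestions)
  # Store each question text in the "label" attribute of its column
  for (i in seq_along(labels)) {
    attr(rawdata[[i]], "label") <- labels[[i]]
  }
  rawdata
}

rawdata <- data.frame(q2_1 = c(1, 2), q2_2 = c("a", "b"))
subquestions <- list("How satisfied are you?", "Any comments?")

labelled <- label_columns(rawdata, subquestions)
unlabelled <- label_columns(rawdata, subquestions, add_labels = FALSE)

attr(labelled$q2_1, "label")
attr(unlabelled$q2_1, "label")
```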
I agree that `label` is a bit unclear here. It's established, but if it were to change, maybe parameters like:
choice_data = c("labels", "recodes")
add_displaytext = TRUE
for downloading w/choice labels vs. numeric recodes, and for whether or not to add labels?
One idea I have been thinking about is to change `label` from a true/false argument to an argument which takes one of three values, perhaps something like `c("choice", "value", "remove")`. This could be added in a way that isn't entirely a breaking change, i.e. if someone passes `TRUE` then change to `"choice"` and print a warning about the change. This could be a way to gradually get users to change their workflows.
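To make the idea concrete, here is a minimal sketch of how the argument could be normalized. `normalize_label` is a hypothetical helper, and mapping `FALSE` to `"value"` is an assumption on top of what is described above (only the `TRUE` case is spelled out).

```r
# Hypothetical helper; not part of the package.
normalize_label <- function(label = c("choice", "value", "remove")) {
  if (is.logical(label)) {
    # Assumption: FALSE maps to "value", mirroring the old numeric-recode behaviour
    new_value <- if (isTRUE(label)) "choice" else "value"
    warning(
      "Logical `label` is deprecated; using \"", new_value, "\" instead.",
      call. = FALSE
    )
    return(new_value)
  }
  # Validate character input against the allowed values
  match.arg(label)
}

normalize_label("remove")
suppressWarnings(normalize_label(TRUE))
```

Anything outside the three allowed strings fails in `match.arg()` with an error that lists the valid choices.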
@juliasilge +1 for keeping a backwards-compatible UI, but I think this would combine things that don't go together. `label` in its current form is required by the API (`useLabels` param), and you select between two options analogous to "choice" and "value" to decide whether you get the actual text of the selected response in each cell, or the numeric recode. "Remove" as an option doesn't fit in that context.
The "remove" option seems to speak to the attribute "label", where qualtRics puts the question text. Someone correct me if this is wrong, but I get the sense that calling that attribute "label" is just a cultural convention in R, where certain kinds of variable metadata get put to facilitate better cross-compatibility with, say, Stata. Package `sjlabelled`, for instance, makes that easy and is what is used internally here, and I don't think that package has an option to change the name. I for one am happy to have this text added, but others don't want it.
But they both relate to the word "label", for externally-controlled reasons, so it's confusing.
Not sure, but what if, instead, we added two new params: `use_labels`, and something else (`add_questiontext`, maybe?). `use_labels` replaces `label` and gives the API what it expects. Could stay boolean like `label` or switch to `c("choice", "value")` like you suggest for increased clarity (perhaps even `c("choicetext", "recodevalue")`?). `add_questiontext` is just a boolean saying whether displayed question text gets added under the "label" attribute.
`label` then gets soft-deprecated as an alias for `use_labels`, preserving legacy functionality while minimizing confusion for new users.
Would that work?
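A minimal sketch of the soft-deprecation part, just to show the shape of it. `resolve_label_args` is a hypothetical helper, and the parameter names are the ones floated in this comment, not anything that exists in the package.

```r
# Hypothetical helper showing `label` kept as a deprecated alias.
resolve_label_args <- function(use_labels = TRUE,
                               add_questiontext = TRUE,
                               label = NULL) {
  if (!is.null(label)) {
    # The legacy argument still works, but nudges users toward the new name
    warning("`label` is deprecated; please use `use_labels` instead.",
            call. = FALSE)
    use_labels <- label
  }
  list(
    useLabels = use_labels,              # what the API's useLabels param would receive
    add_questiontext = add_questiontext  # whether to set the "label" attribute
  )
}

resolve_label_args(use_labels = FALSE)
suppressWarnings(resolve_label_args(label = FALSE))
```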
Seems worth referencing #144 here, since in the new `fetch_survey` the new column mapping is also something to be extracted from the response download. So, a new param could potentially cover whatever else gets pulled out as well.
Worth noting that there's no other way to get them later, though, if they're dropped during response fetching. So, I suppose this brings up another alternative: go ahead and process the entire column mapping when fetching into a secondary dataframe that's written as a single attribute ("columnmap" or whatever) to the entire dataframe.
DF-level attributes don't automatically print, which might be what people want here, plus it makes the column map easily accessible later w/o further processing.
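Roughly what I have in mind, as a base R sketch. The attribute name "columnmap" and the layout of the map are placeholders, not a settled design.

```r
# Toy response data standing in for a fetched survey
responses <- data.frame(q2_1 = c(1, 2), q2_2 = c("a", "b"))

# Hypothetical column map: one row per column in the response data
column_map <- data.frame(
  qname = c("q2_1", "q2_2"),
  description = c("How satisfied are you?", "Any comments?")
)

# Attach the whole map as a single data-frame-level attribute
attr(responses, "columnmap") <- column_map

# Printing shows only the data, not the attribute
print(responses)

# The map is still available later without further processing
attr(responses, "columnmap")
```

One caveat: some operations that rebuild a data frame can drop custom attributes, so the map would be most reliable if pulled out right after fetching.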
With the changes made by @jmobrien in #199, you can now more easily specify that you do not want any attributes on the output of `fetch_survey()`:
library(qualtRics)
res <- fetch_survey("SV_5BJxxxxxxxx", add_column_map = FALSE, add_var_labels = FALSE)
#> | | | 0% | |======================================================================| 100%
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> .default = col_character(),
#> StartDate = col_datetime(format = ""),
#> EndDate = col_datetime(format = ""),
#> Progress = col_double(),
#> `Duration (in seconds)` = col_double(),
#> Finished = col_logical(),
#> RecordedDate = col_datetime(format = ""),
#> RecipientLastName = col_logical(),
#> RecipientFirstName = col_logical(),
#> RecipientEmail = col_logical(),
#> ExternalReference = col_logical(),
#> LocationLatitude = col_double(),
#> LocationLongitude = col_double(),
#> Q1007 = col_double(),
#> Q1_DO_1 = col_double(),
#> Q1_DO_2 = col_double(),
#> Q1_DO_3 = col_double(),
#> Q1_DO_4 = col_double(),
#> Q1_DO_5 = col_double(),
#> SolutionRevision = col_double(),
#> FL_6_DO_FL_7 = col_double()
#> # ... with 4 more columns
#> )
#> ℹ Use `spec()` for the full column specifications.
attributes(res)
#> $names
#> [1] "StartDate" "EndDate" "Status"
#> [4] "IPAddress" "Progress" "Duration (in seconds)"
#> [7] "Finished" "RecordedDate" "ResponseId"
#> [10] "RecipientLastName" "RecipientFirstName" "RecipientEmail"
#> [13] "ExternalReference" "LocationLatitude" "LocationLongitude"
#> [16] "DistributionChannel" "UserLanguage" "Q1002"
#> [19] "Q1006" "Q1007" "Q1_1"
#> [22] "Q1_2" "Q1_3" "Q1_4"
#> [25] "Q1_5" "Q1_DO_1" "Q1_DO_2"
#> [28] "Q1_DO_3" "Q1_DO_4" "Q1_DO_5"
#> [31] "Q200" "Q300" "Q201"
#> [34] "Q301" "Q202" "Q302"
#> [37] "Q203" "Q303" "Q204"
#> [40] "Q304" "SolutionRevision" "FL_6_DO_FL_7"
#> [43] "FL_6_DO_FL_8" "FL_6_DO_FL_9" "FL_6_DO_FL_10"
#> [46] "FL_6_DO_FL_11"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
#> [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
#> [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
#> [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
#> [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
#> [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
#> [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122
#>
#> $class
#> [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Created on 2021-01-10 by the reprex package (v0.3.0.9001)
No special attributes (like labels, or the column mapping, etc.) are on the tibble if you set those two arguments to `FALSE`.
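If you want to check this programmatically on your own data, something along these lines may help. `has_extra_attributes` is a hypothetical helper written for this comment, not a qualtRics function, and it only looks for the "label" attribute at the column level.

```r
# Hypothetical check: TRUE if a data frame carries anything beyond the basics
has_extra_attributes <- function(df) {
  standard <- c("names", "row.names", "class")
  # Data-frame-level attributes other than the standard three
  df_level <- setdiff(names(attributes(df)), standard)
  # Column-level "label" attributes
  col_level <- vapply(df, function(col) !is.null(attr(col, "label")), logical(1))
  length(df_level) > 0 || any(col_level)
}

plain <- data.frame(Q1 = c(1, 2))
labelled <- plain
attr(labelled$Q1, "label") <- "How satisfied are you?"

has_extra_attributes(plain)
has_extra_attributes(labelled)
```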
If you have further questions or problems, please open a new issue! 🙌
Is there a way to get the resulting data frame without any attributes or labels?
Thanks