Open arthur-shaw opened 9 months ago
@kbjarkefur , as agreed, here's a bit more detail on which pieces of metadata are needed by use case:
variable_label
question_text
answer_value_{index}, where {index} is the order index of the value in JSON. In the example below for region (excerpt from shared data), ZIGUINCHOR is index 1 in JSON but has value 2 in the variable:
"Answers": [
  { "AnswerText": "ZIGUINCHOR", "AnswerValue": "2" },
  { "AnswerText": "SAINT-LOUIS", "AnswerValue": "4" },
  { "AnswerText": "KAOLACK", "AnswerValue": "6" },
  { "AnswerText": "FATICK", "AnswerValue": "9" },
  { "AnswerText": "KOLDA", "AnswerValue": "10" },
  { "AnswerText": "MATAM", "AnswerValue": "11" },
  { "AnswerText": "KAFFRINE", "AnswerValue": "12" },
  { "AnswerText": "KEDOUGOU", "AnswerValue": "13" },
  { "AnswerText": "SEDHIOU", "AnswerValue": "14" }
],
answer_text_{index}
The most common use case would be to construct a variable label for a multi-select question that combines some user-provided text with the value label (e.g., "Region : ZIGUINCHOR"). The user would come to this with the variable name (e.g., region), the variable value (e.g., 2), and some user-provided text.
In susometa, get_answer_options does something here that may be of interest to us: it reshapes answer_value_{index} and answer_text_{index} into a data frame from which the label for a given variable value could be plucked.
For the region example above:
# A tibble: 9 x 3
  index text        value
  <chr> <chr>       <chr>
1 1     ZIGUINCHOR  2
2 2     SAINT-LOUIS 4
3 3     KAOLACK     6
4 4     FATICK      9
5 5     KOLDA       10
6 6     MATAM       11
7 7     KAFFRINE    12
8 8     KEDOUGOU    13
9 9     SEDHIOU     14
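For anyone who wants to see the reshape-then-pluck idea in code, here is a minimal Python sketch of the logic only. It is not the susometa implementation: the function names `answer_options` and `label_for_value` are hypothetical, and the JSON is trimmed to the first three regions from the excerpt above.

```python
import json

# First three answer options from the region excerpt above.
QUESTION_JSON = """
{
  "Answers": [
    {"AnswerText": "ZIGUINCHOR", "AnswerValue": "2"},
    {"AnswerText": "SAINT-LOUIS", "AnswerValue": "4"},
    {"AnswerText": "KAOLACK", "AnswerValue": "6"}
  ]
}
"""


def answer_options(question):
    """Return one row per answer option: 1-based JSON order index, text, value."""
    return [
        {"index": str(i), "text": a["AnswerText"], "value": a["AnswerValue"]}
        for i, a in enumerate(question["Answers"], start=1)
    ]


def label_for_value(rows, value):
    """Return the answer text for a variable value, or None if no option matches."""
    for row in rows:
        if row["value"] == str(value):
            return row["text"]
    return None


rows = answer_options(json.loads(QUESTION_JSON))
print(rows[0])
print(label_for_value(rows, 2))
```

The lookup goes through the value column rather than the index, which is the distinction the ZIGUINCHOR example is making: index 1, value 2.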
Since sel_add_metadata, as of https://github.com/lsms-worldbank/selector/pull/11, adds the AnswerText value to the char answer_text, there is nothing more specific we need to do related to multi-select vars, right? Or is there something you want me to implement related to loops? See what I am suggesting below and let me know whether it covers all the use cases you have in mind.
I am thinking that we have a new command lbl_use_meta. It requires a varlist and has the required option value(string). So the most basic case would be:
lbl_use_meta my_var, value("type")
di "`r(meta_string)'"
where r(meta_string) in this case would correspond to char my_var[type]. So it is just a way to retrieve a char without knowing the syntax of chars.
Next, you can add the option template(string), as in:
lbl_use_meta my_var, value("answer_text") template("Region: {{META}}")
di "`r(meta_string)'"
di "`r(modified_string)'"
where r(meta_string) is the same as in the base case. Let's say answer_text was "XYZ" for my_var. Then r(modified_string) would be "Region: XYZ".
Then we can add the option apply(string), which accepts a few valid inputs, for example varlab. Then we have:
lbl_use_meta my_var, value("answer_text") template("Region: {{META}}") apply("varlab")
Then the variable label for my_var would be set to "Region: XYZ". Had we not used the option template(), we would have set it to r(meta_string), in this case just "XYZ".
Finally, the command should be able to handle varlists in combination with apply(). Like this:
lbl_use_meta v409a__*, value("answer_text") template("Region: {{META}}") apply("varlab")
In this case, all variables that match the pattern v409a__* will have their variable labels updated to "Region: {{META}}", where for each variable {{META}} is replaced with whatever that variable has in the char answer_text.
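To make the proposed semantics concrete, here is a rough Python sketch of the template substitution over a varlist pattern. It only models the behaviour described above and is not the Stata command itself: the function `use_meta`, the `chars` dictionary standing in for Stata chars, and the example variables and values are all assumptions for illustration.

```python
from fnmatch import fnmatchcase

PLACEHOLDER = "{{META}}"


def use_meta(chars, pattern, value, template=None):
    """Model of value()/template(): one result per variable matching pattern.

    chars maps variable name -> {char name -> char value}.
    """
    results = {}
    for var, var_chars in chars.items():
        if not fnmatchcase(var, pattern):
            continue
        meta = var_chars.get(value, "")
        # Without a template, the modified string is just the metadata value.
        modified = meta if template is None else template.replace(PLACEHOLDER, meta)
        results[var] = {"meta_string": meta, "modified_string": modified}
    return results


# Hypothetical chars, as sel_add_metadata might have stored them.
chars = {
    "v409a__1": {"answer_text": "ZIGUINCHOR"},
    "v409a__2": {"answer_text": "SAINT-LOUIS"},
    "hhsize": {"variable_label": "Household size"},
}

labels = use_meta(chars, "v409a__*", "answer_text", "Region: {{META}}")
for var, res in labels.items():
    print(var, "->", res["modified_string"])
```

With apply("varlab"), each modified_string would then be written to the matching variable's label; that step is left out here because it is Stata-specific.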
Would this satisfy everything you had in mind? Do you want to suggest other names of anything before I get started?
> Since sel_add_metadata as of https://github.com/lsms-worldbank/selector/pull/11 adds the AnswerText value to the char answer_text there is nothing more specific we need to do related to multi-select vars, right?
Nope. The PR did exactly what was needed.
While I still need to assimilate everything that your sel_add_metadata code does, the multi-select part from the PR is spot on.
> Would this satisfy everything you had in mind? Do you want to suggest other names of anything before I get started?
This is exactly what I had in mind. 🎯
In fact, this proposal is much more elegant than mine.
However, I have a few disparate (stream-of-consciousness) thoughts that I'll try to arrange below:
- varlist. To be consistent with other commands in the package, I think we should (also?) have a required varlist(varlist) option.
- value.
  - value. For the end user, I wonder if there's a way to promote discovery of the different values. For power users, there's char list and manual inspection. For everyone else, there's reading the docs. Should we provide a function that lists the char names (e.g., sel_list_metadata)? The (relevant) names are few and fairly human-friendly: answer_text and variable_label. Maybe question_text in future.
  - from, value_from, or from_meta. The command is fetching a value from the metadata.
- apply.
  - to, template_to, apply_to, or to_data. The command is applying a constructed template to some data attribute.
- template.
  - template. However, I wonder if there's a Stata vocabulary for this. If so, perhaps we should consider using it.
- lbl_use_meta. If the command is meant to be very general in scope--the command operating as a general getter/setter--the name is fine (e.g., get replace). If the command is meant to update variable labels with a templated string part of whose values come from answer_text, perhaps we should consider a more evocative name, while using a more general function behind the scenes--akin to what we did with sel_vars and filter_vars. Here, unfortunately, I'm coming up short on ideas. A few low-quality suggestions: lbl_update_var_lbl, lbl_replace_ms_var_lbl, lbl_apply_answer_text_to_label.

For users who want to do a particular thing, potentially provide a command that does that thing. These commands could help a data user learn what a variable represents without leaving Stata:
- lbl_get_question_text, var(). Get question text for a specific variable.
- lbl_get_var_label, var(). Get variable label stored in char.
- lbl_get_answer_text, var(). Get the answer text for a particular variable.

For power users and/or the back end:
- lbl_list_metadata, varlist(). List chars attached to varlist. Could also be lbl_list_chars but contain the same command definition. LSMS users will think in terms of SuSo metadata. Others may think in terms of chars.
- lbl_get_meta, var() value(). Retrieve the value of the char. Could also be lbl_get_char, var() value().

@kbjarkefur , did ☝️ help? Happy to discuss IRL if helpful.
yes, I am implementing this now - almost done with the command. Have not worked on the "other functions" yet. Then you can review as I write documentation
Problem
Several Survey Solutions question types capture data that must be exported as several separate variables (e.g., list, multi-select, GPS, geography, etc.). For those question types, the variable label consists of two components: first, the label (i.e., either the Variable label or Question text field); second, the component in a particular column (e.g., value label for a multi-select item; latitude, longitude, altitude, accuracy for GPS; etc.). Often, these composite variable labels can be longer than the max length for labels (80 characters) or may not present information in the order/way the data user desires.
To address these issues currently, users must manually create their desired labels. In doing so, they may need to search for components in Designer and copy-paste-modify their way to the desired result (e.g., open the questionnaire on Designer, navigate to/search for the source question, find the desired multi-select item, copy the value label, and paste into Stata to construct the desired label).
Solution
To minimize the manual work, create Stata functions that:
Implementation ideas
Getters
Some commands to get fields from Survey Solution's questionnaire metadata for a given:
For each of these commands:
r(qnr_attrib)
Setters
While Stata's label functions are enough to set a variable label, it might be welcome to have a command that automates some of this for cases where the same get-transform-set operation is undertaken for, say, all answer options of a multi-select question.
To expand on that idea, imagine that a data set contains var__1, var__2, ... , var__20. For each multi-select, the end user wants to:
- get the value label
- construct the desired label (e.g., "Asset owned: `val_label_1'")
- set the variable label (e.g., label variable var__1 "Asset owned: `val_label_1'")
While this could clearly be done with lbl_get_var_label and label variable and a loop, it might be better to have a function that applies this to all components of a variable (e.g., if var__1, var__2, ... , var__20 -> loop 20 times, passing the indices along), since this need arises frequently wherever multi-select questions are present.
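As a rough illustration of that get-transform-set loop, here is a Python sketch of the logic such a setter could follow. Everything in it is an assumption for illustration: the function names, the example asset labels, and in particular the choice to truncate with "..." when a label exceeds the 80-character limit mentioned in the Problem section.

```python
MAX_LABEL_LEN = 80  # max variable label length cited in the Problem section


def compose_label(prefix, value_label, max_len=MAX_LABEL_LEN):
    """Build "prefix: value_label", truncating with "..." if it is too long."""
    label = f"{prefix}: {value_label}"
    if len(label) <= max_len:
        return label
    return label[: max_len - 3] + "..."


def labels_for_components(stem, value_labels, prefix):
    """Return {variable name: label} for every component of one question.

    value_labels maps the suffix after "__" to the value label (hypothetical data).
    """
    return {
        f"{stem}__{suffix}": compose_label(prefix, text)
        for suffix, text in value_labels.items()
    }


# Hypothetical value labels for a multi-select asset question.
value_labels = {1: "Radio", 2: "Bicycle", 3: "Mobile phone"}

new_labels = labels_for_components("var", value_labels, "Asset owned")
for var, label in new_labels.items():
    print(f'label variable {var} "{label}"')
```

Printing label variable statements rather than applying them keeps the sketch neutral on how the real command would set labels.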