Fix labels of SuSo questions that export as multiple variables

arthur-shaw commented 9 months ago

Problem

Several Survey Solutions question types capture data that must be exported as several separate variables (e.g., list, multi-select, GPS, geography, etc.). For those question types, the variable label consists of two components: first, the label (i.e., either the Variable label or the Question text field); second, the component stored in a particular column (e.g., value label for a multi-select item; latitude, longitude, altitude, accuracy for GPS; etc.). Often, these composite variable labels are longer than the maximum label length (80 characters) or do not present information in the order or form the data user wants.

To address these issues currently, users must manually create their desired labels. In doing so, they may need to search for components in Designer and copy-paste-modify their way to the desired result (e.g., open the questionnaire on Designer, navigate to/search for the source question, find the desired multi-select item, copy the value label, and paste into Stata to construct the desired label).

Solution

To minimize the manual work, create Stata functions that:

Get the components from Survey Solutions' questionnaire metadata (e.g., question text, variable label)
Compose a label with the returned component and other text desired by the end user

Implementation ideas

Getters

Some commands to get fields from Survey Solutions' questionnaire metadata for a given variable:

lbl_get_question_text
lbl_get_var_label

For each of these commands:

Load the SuSo questionnaire metadata file into a frame
Find the relevant variable observation (erroring if either no or multiple variables are found)
Return the desired component to a macro (e.g. r(qnr_attrib))
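
The three steps above could be sketched roughly as follows. This is only an illustration, not the package's implementation: the program name lbl_get_question_text_sketch, the frame name qnr_meta, and the metadata columns varname and question_text are all assumptions, and a tiny in-memory table stands in for the real metadata file.

```stata
* Stand-in for the questionnaire metadata (hypothetical columns and contents)
capture frame drop qnr_meta
frame create qnr_meta str10 varname str40 question_text
frame post qnr_meta ("region") ("In which region is the household?")
frame post qnr_meta ("assets") ("Which assets does the household own?")

* Hypothetical getter: look up one variable and return its question text
capture program drop lbl_get_question_text_sketch
program define lbl_get_question_text_sketch, rclass
    syntax , var(name) [frame(name)]
    if "`frame'" == "" local frame "qnr_meta"

    frame `frame' {
        * Count matching rows before trying to extract anything
        quietly count if varname == "`var'"
        local n_found = r(N)
        if `n_found' == 1 {
            quietly levelsof question_text if varname == "`var'", local(txt) clean
        }
    }

    * Error if no row or more than one row matches
    if `n_found' == 0 {
        display as error "`var' not found in metadata"
        exit 111
    }
    if `n_found' > 1 {
        display as error "`var' matches `n_found' metadata rows"
        exit 198
    }

    return local qnr_attrib "`txt'"
end

lbl_get_question_text_sketch, var(region)
display "`r(qnr_attrib)'"
```

Question texts that themselves contain double quotes would need compound quotes, which this sketch does not handle.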

Setters

While Stata's label commands are enough to set a variable label, it might be welcome to have a command that automates this for cases where the same get-transform-set operation is repeated for, say, all answer options of a multi-select question.

To expand on that idea, imagine that a data set contains var__1, var__2, ... , var__20. For each of these multi-select variables, the end user wants to:

Extract the answer option from the questionnaire metadata (e.g., get the value label for option 1)
Construct a label that follows a pattern (e.g., "Asset owned: `val_label_1'")
Apply that label (e.g. label variable var__1 "Asset owned: `val_label_1'")

While this could clearly be done with lbl_get_var_label and label variable and a loop, it might be better to have a function that applies this to all components of a variable (e.g., if var__1, var__2, ... , var__20 -> loop 20 times, passing the indices along), since this need arises frequently wherever multi-select questions are present.
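
For reference, the manual loop version might look like the sketch below. The toy variables and the answer texts stored in val_label_1 to val_label_3 are made up for illustration. In practice the texts would come from the questionnaire metadata.

```stata
* Toy data standing in for an exported multi-select question
clear
set obs 1
forvalues i = 1/3 {
    generate byte var__`i' = 0
}

* Hypothetical answer texts; in practice these would come from the metadata
local val_label_1 "Radio"
local val_label_2 "Television"
local val_label_3 "Bicycle"

* Get-transform-set, repeated for each component variable
forvalues i = 1/3 {
    label variable var__`i' "Asset owned: `val_label_`i''"
}

describe var__*
```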

arthur-shaw commented 8 months ago

@kbjarkefur , as agreed, here's a bit more detail on which pieces of metadata are needed by use case:

Replace current variable label with the
- original variable label: variable_label
- question text: question_text

Replace current value label with another value label

value found in answer_value_{index}, where {index} is the order index of the value in JSON. In the example below for region (excerpt from shared data), ZIGUINCHOR is index 1 in JSON but has value 2 in the variable:

  "Answers": [
    {
      "AnswerText": "ZIGUINCHOR",
      "AnswerValue": "2"
    },
    {
      "AnswerText": "SAINT-LOUIS",
      "AnswerValue": "4"
    },
    {
      "AnswerText": "KAOLACK",
      "AnswerValue": "6"
    },
    {
      "AnswerText": "FATICK",
      "AnswerValue": "9"
    },
    {
      "AnswerText": "KOLDA",
      "AnswerValue": "10"
    },
    {
      "AnswerText": "MATAM",
      "AnswerValue": "11"
    },
    {
      "AnswerText": "KAFFRINE",
      "AnswerValue": "12"
    },
    {
      "AnswerText": "KEDOUGOU",
      "AnswerValue": "13"
    },
    {
      "AnswerText": "SEDHIOU",
      "AnswerValue": "14"
    }
  ],

label found in answer_text_{index}

The most common use case would be to construct a variable label for a multi-select question that combines some user-provided text and the value label (e.g., "Region: ZIGUINCHOR"). The user would come to this with the variable name (e.g., region), variable value (e.g., 2), and some user-provided text.

In susometa, the get_answer_options function does something here that may be of interest to us: it reshapes answer_value_{index} and answer_text_{index} into a data frame from which the label for a given variable value could be plucked.

For the region example above:

# A tibble: 9 x 3        
  index text        value
  <chr> <chr>       <chr>
1 1     ZIGUINCHOR  2    
2 2     SAINT-LOUIS 4    
3 3     KAOLACK     6    
4 4     FATICK      9    
5 5     KOLDA       10
6 6     MATAM       11
7 7     KAFFRINE    12
8 8     KEDOUGOU    13
9 9     SEDHIOU     14
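
A rough Stata counterpart of that lookup, assuming the metadata keeps one row per question with wide answer_text_{index} and answer_value_{index} columns stored as strings (an assumption, as is the varname column), could scan the indices until the value matches:

```stata
* One-row stand-in for the region question's metadata (first three options only)
clear
input str8 varname str12 answer_text_1 str2 answer_value_1 str12 answer_text_2 str2 answer_value_2 str12 answer_text_3 str2 answer_value_3
"region" "ZIGUINCHOR" "2" "SAINT-LOUIS" "4" "KAOLACK" "6"
end

* Pluck the answer text whose value matches the one the user supplies
local target_value "4"
local answer_text ""
forvalues i = 1/3 {
    if answer_value_`i'[1] == "`target_value'" {
        local answer_text = answer_text_`i'[1]
    }
}

display "Region: `answer_text'"
```

Note that the loop index and the answer value differ: value 4 sits at index 2.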

kbjarkefur commented 7 months ago

Since sel_add_metadata as of https://github.com/lsms-worldbank/selector/pull/11 adds the AnswerText value to the char answer_text, there is nothing more specific we need to do related to multi-select vars, right? Or is there something you want me to implement related to loops? See what I am suggesting below and let me know if it covers all the use cases you have in mind.

My suggestion:

I am thinking that we have a new command lbl_use_meta. It requires a varlist and has the required option value(string). So the most basic case would be:

lbl_use_meta my_var, value("type")
di "`r(meta_string)'"

where r(meta_string) in this case would correspond to char my_var[type]. So it is just a way to retrieve a char without knowing the syntax of chars.
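
For comparison, this is the built-in char syntax the command would hide from the user. The variable and the stored value are placeholders for illustration.

```stata
* Toy variable with a characteristic attached (the value is a placeholder)
clear
set obs 1
generate byte my_var = 1
char my_var[type] "some_type"

* Read the characteristic back with a macro extended function
local meta_string : char my_var[type]
display "`meta_string'"
```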

Next, you can add the option template(string), as in:

lbl_use_meta my_var, value("answer_text") template("Region: {{META}}")
di "`r(meta_string)'"
di "`r(modified_string)'"

where r(meta_string) is the same as in the base case. Let's say answer_text was "XYZ" for my_var. Then r(modified_string) would be "Region: XYZ".
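
One way the substitution could work internally is a plain string replacement of the {{META}} placeholder. This is a sketch of the idea using built-in macro functions, not the actual implementation of lbl_use_meta:

```stata
* Toy variable carrying the answer_text characteristic
clear
set obs 1
generate byte my_var = 1
char my_var[answer_text] "XYZ"

* Fill the placeholder in the template with the characteristic's value
local template "Region: {{META}}"
local meta_string : char my_var[answer_text]
local modified_string : subinstr local template "{{META}}" "`meta_string'", all

display "`modified_string'"
```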

Then we can add the option apply(string), which accepts a few valid inputs, for example varlab. Then we have:

lbl_use_meta my_var, value("answer_text") template("Region: {{META}}") apply("varlab")

Then the variable label for my_var would be set to "Region: XYZ". Had we not had the option template(), then we would have set it to r(meta_string), in this case just: "XYZ".

Finally, the command should be able to handle varlists in combination with apply(). Like this:

lbl_use_meta v409a__*, value("answer_text") template("Region: {{META}}") apply("varlab")

Then in this case, all variables that match the pattern v409a__* will have their variable labels updated to "Region: {{META}}", where for each variable {{META}} is replaced with whatever that variable has in the char answer_text.
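
Spelled out with built-in commands, the varlist case amounts to something like the loop below. The variables and their answer_text values are invented for the example, and the real command may be implemented differently.

```stata
* Toy multi-select variables with answer_text characteristics (made-up values)
clear
set obs 1
generate byte v409a__1 = 0
generate byte v409a__2 = 0
char v409a__1[answer_text] "XYZ"
char v409a__2[answer_text] "ABC"

local template "Region: {{META}}"
local token "{{META}}"

* For each variable: read the char, fill the template, set the variable label
foreach v of varlist v409a__* {
    local meta : char `v'[answer_text]
    local lab : subinstr local template "`token'" "`meta'", all
    label variable `v' "`lab'"
}

describe v409a__*
```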

Would this satisfy everything you had in mind? Do you want to suggest other names of anything before I get started?

arthur-shaw commented 7 months ago

> Since sel_add_metadata as of https://github.com/lsms-worldbank/selector/pull/11 adds the AnswerText value to the char answer_text there is nothing more specific we need to do related to multi-select vars, right?

Nope. The PR did exactly what was needed.

While I still need to assimilate everything that your sel_add_metadata code does, the multi-select part from the PR is spot on.

> Would this satisfy everything you had in mind? Do you want to suggest other names of anything before I get started?

This is exactly what I had in mind. 🎯

In fact, this proposal is much more elegant than mine.

However, I have a few disparate (stream-of-consciousness) thoughts that I'll try to arrange below.

Names / API

varlist. To be consistent with other commands in the package, I think we should (also?) have a required varlist(varlist) option.
value.
- Values of value. For the end user, I wonder if there's a way to promote discovery of the different values. For power users, there's char list and manual inspection. For everyone else, there's reading the docs. Should we provide a function that lists the char names (e.g., sel_list_metadata)? The (relevant) names are few and fairly human-friendly: answer_text and variable_label. Maybe question_text in future.
- Name of option. Consider from, value_from, or from_meta. The command is fetching a value from the metadata.
apply.
- Name of option. Consider to, template_to, apply_to, or to_data. The command is applying a constructed template to some data attribute.
template.
- Syntax of template. Super small suggestion that's a matter of personal preference: use a single pair of curly braces. mustache, a logicless templating language, uses two pairs of curly braces. glue, an R package for string interpolation, uses a single pair. I favor a single pair only because it makes for less typing.
- Name of option. Personally, I like template. However, I wonder if there's a Stata vocabulary for this. If so, perhaps we should consider using it.
lbl_use_meta. If the command is meant to be very general in scope--the command operating as a general getter/setter--the name is fine (e.g., get, replace). If the command is meant to update variable labels with a templated string, part of whose values come from answer_text, perhaps we should consider a more evocative name, while using a more general function behind the scenes--akin to what we did with sel_vars and filter_vars. Here, unfortunately, I'm coming up short on ideas. A few low-quality suggestions: lbl_update_var_lbl, lbl_replace_ms_var_lbl, lbl_apply_answer_text_to_label.

Other functions

For users who want to do a particular thing, potentially provide a command that does that thing. These commands could help a data user learn what a variable represents without leaving Stata:

lbl_get_question_text, var(). Get question text for a specific variable.
lbl_get_var_label, var(). Get variable label stored in char.
lbl_get_answer_text, var(). Get the answer text for a particular variable.

For power users and/or the back end:

lbl_list_metadata, varlist(). List chars attached to varlist. Could also be lbl_list_chars but contain the same command definition. LSMS users will think in terms of SuSo metadata. Others may think in terms of chars.
lbl_get_meta, var() value(). Retrieve the value of the char. Could also be lbl_get_char, var() value().
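
As a point of reference for the power-user commands, Stata can already enumerate and read chars with built-in syntax. The sketch below uses a toy variable with made-up characteristics and shows roughly what lbl_list_metadata and lbl_get_meta would wrap. Interactively, `char list my_var[]` prints similar information.

```stata
* Toy variable with two characteristics attached (made-up values)
clear
set obs 1
generate byte my_var = 1
char my_var[answer_text] "XYZ"
char my_var[variable_label] "Region"

* List every characteristic name and value for each variable in the varlist
foreach v of varlist my_var {
    local char_names : char `v'[]
    foreach c of local char_names {
        local value : char `v'[`c']
        display "`v'[`c'] = `value'"
    }
}
```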

arthur-shaw commented 7 months ago

@kbjarkefur , did ☝️ help? Happy to discuss IRL if helpful.

kbjarkefur commented 7 months ago

Yes, I am implementing this now - almost done with the command. Have not worked on the "other functions" yet. Then you can review as I write documentation.

lsms-worldbank / labeller