lsms-worldbank / labeller

Spot and fix common problems with variable and value labels🏷
https://lsms-worldbank.github.io/labeller/

Fix labels of SuSo questions that export as multiple variables #4

Open arthur-shaw opened 9 months ago

arthur-shaw commented 9 months ago

Problem

Several Survey Solutions question types capture data that must be exported as several separate variables (e.g., list, multi-select, GPS, geography). For those question types, the variable label consists of two components: first, the label (i.e., either the Variable label or the Question text field); second, the component stored in a particular column (e.g., the value label for a multi-select item; latitude, longitude, altitude, or accuracy for GPS). Often, these composite variable labels are longer than the maximum length for labels (80 characters) or do not present information in the order or way the data user desires.

To address these issues currently, users must manually create their desired labels. In doing so, they may need to search for components in Designer and copy-paste-modify their way to the desired result (e.g., open the questionnaire on Designer, navigate to/search for the source question, find the desired multi-select item, copy the value label, and paste into Stata to construct the desired label).

Solution

To minimize the manual work, create Stata functions that:

Implementation ideas

Getters

Some commands to get fields from Survey Solutions' questionnaire metadata for a given:

For each of these commands:

Setters

While Stata's label commands are enough to set a variable label, it might be useful to have a command that automates cases where the same get-transform-set operation is applied to, say, all answer options of a multi-select question.

To expand on that idea, imagine that a data set contains var__1, var__2, ... , var__20. For each multi-select, the end user wants to:

While this could clearly be done with lbl_get_var_label and label variable and a loop, it might be better to have a function that applies this to all components of a variable (e.g., if var__1, var__2, ... , var__20 -> loop 20 times, passing the indices along), since this need arises frequently wherever multi-select questions are present.

arthur-shaw commented 8 months ago

@kbjarkefur , as agreed, here's a bit more detail on which pieces of metadata are needed by use case:

The most common use case would be to construct a variable label for a multi-select question that combines some user-provided text with the value label (e.g., "Region : ZIGUINCHOR"). The user would come to this with the variable name (e.g., region), the variable value (e.g., 2), and some user-provided text.

In susometa, get_answer_options does something that may be of interest here: it reshapes answer_value_{index} and answer_text_{index} into a data frame from which the label for a given variable value can be plucked.

For the region example above:

# A tibble: 9 x 3        
  index text        value
  <chr> <chr>       <chr>
1 1     ZIGUINCHOR  2    
2 2     SAINT-LOUIS 4    
3 3     KAOLACK     6    
4 4     FATICK      9    
5 5     KOLDA       10
6 6     MATAM       11
7 7     KAFFRINE    12
8 8     KEDOUGOU    13
9 9     SEDHIOU     14
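
A rough Stata analogue of that lookup, offered only as a sketch: it assumes the value/text pairs have been loaded into a value label (here the hypothetical name region_lbl, holding just the first three rows of the table) and uses Stata's built-in extended macro function to pluck the text for a given value.

```stata
* Sketch only: region_lbl is a hypothetical value label that mirrors
* the first three rows of the table above.
clear
label define region_lbl 2 "ZIGUINCHOR" 4 "SAINT-LOUIS" 6 "KAOLACK"

* Pluck the text for value 2 and combine it with user-provided text
local answer_text : label region_lbl 2
local var_label "Region : `answer_text'"
display "`var_label'"
```

This displays Region : ZIGUINCHOR, the label from the use case described above.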

kbjarkefur commented 7 months ago

Since sel_add_metadata as of https://github.com/lsms-worldbank/selector/pull/11 adds the AnswerText value to the char answer_text, there is nothing more specific we need to do related to multi-select vars, right? Or is there something you want me to implement related to loops? See what I am suggesting below and let me know if it covers all the use cases you have in mind.

My suggestion:

I am thinking that we have a new command lbl_use_meta. It requires a varlist and has the required option value(string). So the most basic case would be:

lbl_use_meta my_var, value("type")
di "`r(meta_string)'"

where r(meta_string) in this case would correspond to char my_var[type]. So it is just a way to retrieve a char without knowing the syntax of chars.
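
For comparison, the built-in char syntax that the proposed command would spare users from can be sketched as below. The data set and the value stored in the char ("multi-select") are placeholders made up for illustration, not values that sel_add_metadata is known to write.

```stata
* Sketch only: a toy data set with a made-up characteristic value.
clear
set obs 1
generate my_var = 1

* Store a characteristic named "type" on my_var (placeholder value)
char my_var[type] "multi-select"

* Read it back with the built-in extended macro function
local meta_string : char my_var[type]
display "`meta_string'"
```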

Next, you can add the option template(string), as in:

lbl_use_meta my_var, value("answer_text") template("Region: {{META}}")
di "`r(meta_string)'"
di "`r(modified_string)'"

where r(meta_string) is the same as in the base case. Let's say answer_text was "XYZ" for my_var. Then r(modified_string) would be "Region: XYZ".

Then we can add the option apply(string), which accepts a few valid inputs, for example varlab. Then we have:
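
The substitution itself can be illustrated with Stata's built-in subinstr() function. This is a sketch of the idea under the example values above, not the implementation of lbl_use_meta; the local macro names are chosen only to echo the proposed return values.

```stata
* Sketch only: placeholder substitution using the example values above.
local template "Region: {{META}}"
local meta_string "XYZ"

* Replace every occurrence of the placeholder with the metadata value
local modified_string = subinstr("`template'", "{{META}}", "`meta_string'", .)
display "`modified_string'"
```

This displays Region: XYZ.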

lbl_use_meta my_var, value("answer_text") template("Region: {{META}}") apply("varlab")

Then the variable label for my_var would be set to "Region: XYZ". Had we not used the option template(), we would have set it to r(meta_string), in this case just "XYZ".

Finally, the command should be able to handle varlists in combination with apply(). Like this:

lbl_use_meta v409a__*, value("answer_text") template("Region: {{META}}") apply("varlab")

In this case, all variables that match the pattern v409a__* will have their variable labels updated to "Region: {{META}}", where for each variable {{META}} is replaced with whatever that variable has in the char answer_text.
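
To make the intended result concrete, here is a sketch of the equivalent manual loop using only built-in Stata commands. The two variables and their answer_text values are a toy setup assumed for illustration (the values are borrowed from the region example earlier in this thread); how lbl_use_meta does this internally is not specified here.

```stata
* Sketch only: toy data with chars set by hand for illustration.
clear
set obs 1
generate v409a__1 = 0
generate v409a__2 = 1
char v409a__1[answer_text] "ZIGUINCHOR"
char v409a__2[answer_text] "SAINT-LOUIS"

* Template defined once, outside the loop
local template "Region: {{META}}"

foreach var of varlist v409a__* {
    * Get: read the char stored on this variable
    local meta : char `var'[answer_text]
    * Transform: fill the placeholder in the template
    local new_label = subinstr("`template'", "{{META}}", "`meta'", .)
    * Set: apply the result as the variable label
    label variable `var' "`new_label'"
}

describe v409a__*
```

The single call with a varlist and apply("varlab") would stand in for this whole get-transform-set loop.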

Would this satisfy everything you had in mind? Do you want to suggest other names of anything before I get started?

arthur-shaw commented 7 months ago

Since sel_add_metadata as of https://github.com/lsms-worldbank/selector/pull/11 adds the AnswerText value to the char answer_text, there is nothing more specific we need to do related to multi-select vars, right?

Nope. The PR did exactly what was needed.

While I still need to assimilate everything that your sel_add_metadata code does, the multi-select part from the PR is spot on.

Would this satisfy everything you had in mind? Do you want to suggest other names of anything before I get started?

This is exactly what I had in mind. 🎯

In fact, this proposal is much more elegant than mine.

However, I have a few disparate (stream-of-consciousness) thoughts that I'll try to arrange below.

Names / API

Other functions

For users who want to do a particular thing, potentially provide a command that does that thing. These commands could help a data user learn what a variable represents without leaving Stata:

For power users and/or the back end:

arthur-shaw commented 7 months ago

@kbjarkefur , did ☝️ help? Happy to discuss IRL if helpful.

kbjarkefur commented 7 months ago

Yes, I am implementing this now and am almost done with the command. I have not worked on the "other functions" yet. You can review as I write the documentation.