New command: `lab_pipe`

New command that detects pipes from SuSo data (ex. %pipe%) in labels and offer a convinient way to replace them with a more human readable string.

@kbjarkefur , this is simply stellar! 🤩

One high-level comment-cum-question. Would it make sense to split lab_pipe into two commands? Inspired by R's stringr package, I thought to split detection and replacement functions into their own dedicated commands. Perhaps something like this:

##
lbl_find_pipes, [ignore()]
- print out findings
- return pipes_found in macro?
- return vars_w_pipes in macro?

lbl_replace_pipe, pipe() replacement()
- perform find-replace over all labels
- take pipe to be "fixed" (rather than RegEx) or construct pre-pend and post-pend `%` to pipe so that the right replacements are made (e.g. `var1` isn't replaced where `var12`, of which `var1` is a substring, is found)

Like some combination of str_detect and str_subset, lbl_find_pipes would be tasked with finding pipes (whose pattern we hard-code into the command rather than expose to the user). To aid in inspection, perhaps lbl_find_pipes found return a macro that could facilitate (command-line-based) work. Returning the variables with pipes--perhaps to inspect--strikes me as the most compelling option. (Not sure if this would improve on the already great lab_pipe, outputlevel("veryverbose") output to show variables with the pipes and their full labels.)

Inpired by stringr::str_replace/replace_all, lbl_replace_pipe would only replace pipes wherever they are found with some user-specified replacement text. This should contain your excellent logic problem resolution when the replace operation yields a label that's more than 80 characters long.

An addendum: I like commands whose names communicate, to the extent possible, what they do.

What do you think?

Thanks for your kind words.

The reason I made it into one command was that I am envisioning (without telling you yet) a workflow where this command can tell you if you have more unreplaced pipes. I was thinking to add an option expectclean that throws an error if any pipes are still not replaced or ignored. I like commands that after performing a task, especially cleaning tasks, then verifies that the task was complete across the dataset. This becomes a quality assurance check that checks that even if the data is changed, there are not new pipes that appear.

So in the workflow I intended you would first run your script with:

lbl_replace_pipes, expectclean

Let's say there are two pips %crop% and `%rostertitle%. Then the command would throw and error with the output that lists the pipes on the screen. Let's say the user then address those two pipes like this:

lbl_replace_pipes, expectclean  pipevalues(`" "crop [crop]" "rostertitle [name]" "')

Then the error goes away and the full script is executed each time the script is re-run. If the underlying data changes so there is a new pipe in the labels, then the user must either replace or ignore that pipe.

If we were to break this command up into two commands it is not clear where to put the option expectclean as it would look like this:

lbl_find_pipes
lbl_replace_pipe, pipe("crop") replacement("[crop]")
lbl_replace_pipe, pipe("rostertitle") replacement("[name]")

It would be possible to implement this by expecting the human to make sure that the last replace has the option expectclean but I think it is an error prone workflow to expect the human to remember to move it.

Let me know what you think. If we stick to one command I am happy to rename it to lbl_replace_pipes.

Happy to look into to updating the ustrregexm. When I am looking for the pipes I implemented my own parser. As I could not get the value of the pipe with any built in Stata command. I could either just true/false match or replace the value.

If we split into two functions, would it make sense to have an expectclean option for both functions?

With lbl_find_pipes, expectclean, this would allow for something useful for two audiences. For users, they could have a program error if any any pipes are found. That would signal to them that issues need addressing. For us as developers, we could call expect lbl_find_pipes inside of lbl_replace_pipes, passing down the expectclean flag. That would enable lbl_find_pipes to do the work of searching for pipes and erroring any are found (if expectclean flag provided), while the core of lbl_replace_pipes tackles the find-replace operation for the user-specified pipe.

Does that resolve the tension?

Also, do you think that splitting the function, but reusing one inside the other, allows us to test each function separately while ensuring that the replace command can check that it's has resolved everything? (Or would it be better for the user to sandwich calls to the find and replace functions, with the final call using the expectclean flag?)

With lbl_find_pipes, expectclean, this would allow for something useful for two audiences. For users, they could have a program error if any any pipes are found. That would signal to them that issues need addressing. For us as developers, we could call expect lbl_find_pipes inside of lbl_replace_pipes, passing down the expectclean flag. That would enable lbl_find_pipes to do the work of searching for pipes and erroring any are found (if expectclean flag provided), while the core of lbl_replace_pipes tackles the find-replace operation for the user-specified pipe.

I see how this can be implemented in the command, but I am still not sure how the workflow would be for the users perspective.

In this example, there is no way for the command on line 1 to know that the following two lines will clean the existing pipes.

lbl_find_pipes, expectclean
lbl_replace_pipe, pipe("crop") replacement("[crop]")
lbl_replace_pipe, pipe("rostertitle") replacement("[name]")

In this example, there is no way for the command on line 2 to know that next line will clean the last pipe.

lbl_find_pipes
lbl_replace_pipe, pipe("crop") replacement("[crop]") expectclean
lbl_replace_pipe, pipe("rostertitle") replacement("[name]") expectclean

The next two examples could work but they require users to have a deeper understanding of the command is intended to be used. Either to only use expectclean to the very last case of lbl_replace_pipe or to use lbl_replace_pipe before lbl_find_pipes.

lbl_find_pipes
lbl_replace_pipe, pipe("crop") replacement("[crop]")
lbl_replace_pipe, pipe("rostertitle") replacement("[name]") expectclean

lbl_replace_pipe, pipe("crop") replacement("[crop]")
lbl_replace_pipe, pipe("rostertitle") replacement("[name]")
lbl_find_pipes, expectclean

My experience from the Stata community is that users like you and me would read the helpfile and figure this out, but that users like are are a small minority.

The way the command is now, i.e. lbl_replace_pipes, expectclean pipevalues('" "rostertitle [name]" "') would, if this would be used on the the same hypothetical example data, return an error message that would suggest a solution that is applied in the same command as the command that threw the error. Meaning somthing like this: "After lbl_replace_pipes completed, there is still the pipe(s) [crop] in one or more of the variable labels. Either provide a value the pipe should be replaced with using the pipevalues() option, or add the remaining pipe(s) to the ignorepipes() option. Alternatively, the option expectclean can be removed to not show this error again."

I am still happy to go with what you think are best. But to understand what workflow you have from the user's perseptive, can you point to which of the examples above you see as being the workflow we intend. Or if I have not thought of the example you suggest, please provide that.

Great questions.

Workflow

There's no standardized workflow within the team. But looking at the code from Tanzania (and knowing that Malawi largely adopts the same approach), I think that the current workflow boils down to the following manual, high-touch process:

Look at the variable labels
Fix/tweak what needs fixing/tweaking
Rinse and repeat

To streamline the first step, I'm proposing we adopt an assert-ive approach to labels--that is, use commands to assert things about labels (e.g., contain no pipes, less than 80 characters, etc.). To streamline the second step, I'm proposing, where possible, commands that intelligently perform batch operations (e.g., replace pipes with a user-specified placeholder).

With these commands, I think the workflow would be iterative, and look something like this.

First iteration:

* assert that there are no labels with pipes
* if assertion is false, issue error and print where pipes appear
lbl_find_pipes, expectclean

Final iteration:

* knowing where pipes apepar from an earlier iteration, fix each issue with one line of code
* (perhaps we don't need an `expectclean` option for this command
lbl_replace_pipe, pipe("crop") replacement("[crop]")
lbl_replace_pipe, pipe("rostertitle") replacement("[name]")

* assert that there are no remaining labels with pipes
* if assertion is false, issue error and print where pipes appear
lbl_find_pipes, expectclean

If we're fearful that users will forget options, perhaps we can change the default behavior lbl_find_pipes to error unless an option is specified (and/or have a command lbl_no_pipes that asserts that there are no pipes and errors informatively if so).

Building on that glimmer of an idea, I wonder if we should have two classes commands for each domain where we want labeller commands to intervene:

Check that object(s) meet expectations (e.g., variable has a label, labels are less than 80 characters, labels have no SuSo pipes, labels do not contain non-ASCII characters, labels are not duplicated, etc.)
Fix the issue (if base Stata doesn't provide perfectly good/straightforward solutions)

Examples

Here, for reference, are a few approaches used by the team. One caveat: the Senegal approach is mine.

Tanzania

Among the .do files for the survey, there are a set dedicated to data preparation. In those types of files, there is a section where the label is set for most (all?) of the variables. While the file does not say so explicitly, this is to remedy problems with the labels in Survey Solutions (e.g., replace question text with a short label, replace pipes with a placeholder, etc.). This excerpt comes from 00_reference/03_lsms_projects/01_tanzania/nps_2021/DATA_SYSTEM/02-FORMAT/AG.Module.3A.do:

*********************************************
* LABELS                                    *
*********************************************

label variable plot_id    "PLOT ID"
label variable ag3a_01a   "LIST ALL PLOTS OWNED OR CULTIVATED BY THE HOUSEHOLD IN THE LONG RAINY SEASON 2020"
label variable plot_desc  "Description / Location"
label variable prevplot_id "Previous Plot Number NPSY4"
label variable ag3a_02_1  "What is the distance from [PLOT] to: HOME (KM)"
label variable ag3a_02_2  "What is the distance from [PLOT] to: ROAD (KM)"
label variable ag3a_02_3  "What is the distance from [PLOT] to: MARKET (KM)"
label variable ag3a_03    "How did you use this plot during the long rainy season 2020?"
label variable ag3a_04    "What was the total income from renting out this plot during the long rainy season 2020 (TSH)?"
label variable ag3a_07    "What was the main crop cultivated on this plot in the long rainy season 2020?"
label variable ag3a_08a_1 "HOUSEHOLD MEMBER: Decided what to plant on this plot in the long rainy season 2020?"
label variable ag3a_08a_2 "NON-HOUSEHOLD MEMBER: Decided what to plant on this plot in the long rainy season 2020?"
label variable ag3a_08b_1 "Who decided what to plant on this plot in the long rainy season 2020? (ID 1)"
label variable ag3a_08b_2 "Who decided what to plant on this plot in the long rainy season 2020? (ID 2)"
label variable ag3a_08b_3 "Who decided what to plant on this plot in the long rainy season 2020? (ID 3)"
label variable ag3a_08c1  "Who decided what to plant on this plot in the long rainy season 2020? (NETWORK ROSTER ID)"
label variable ag3a_08c2  "Where is [NETWORK] located? (NETWORK ROSTER LOCATION)"

Senegal

The final block of most "cleaning" programs checks the labels and fixes them as needed--but, unfortunately, doesn't recheck that all issues have been addessed. This excerpt comes from 00_reference\03_lsms_projects\04_senegal\01_clean\progs\mod_4_women_nutrition.do:

* ============================================================================
* Adjust labels
* ============================================================================

* ----------------------------------------------------------------------------
* Check labels
* ----------------------------------------------------------------------------

d `nutrition_vars', full

foreach var of local nutrition_vars {

    di as result "Checking -`var'-"
    di ""
    * has label
    has_label `var'

    * label contains French characters
    lbl_has_fr_chr `var'

    * label contains piping (e.g., %rostertitle%) 
    lbl_has_piping `var'

    * label does not contain question number at the beginning
    lbl_has_quest_num `var'

}

* ----------------------------------------------------------------------------
* Add labels where they are missing
* ----------------------------------------------------------------------------

* disparate variables
label variable v400a        "TIME MODULE STARTED."
label variable v400e2       "WHO IN THE HOUSEHOLD GAVE CONSENT FOR [NAME] TO BE INTERVIEWED?"
label variable w_mob        "Age calculated using birth year and month only"
label variable v404         "IS THE RESPONDENT BETWEEN THE AGES OF 15 AND 49 YEARS?"
label variable v409a__409   "Food made from grains like maize, millet, rice or fonio ?"
label variable v409a__410   "Pumpkin, carrots, squash, or sweet potatoes that are yellow or orange inside?"
label variable v409a__411   "White potatoes, white yams, or cassava, or any foods made from roots?"
label variable v409a__4111  "Any plantain or green bananas?"
label variable v409a__412   "Any dark green leafy vegetables such as moringa, baobab leaves, sweet potato ..?"
label variable v409a__413   "Any other vegetables, for example green beans, aubergine, okra, onion, ... ?"
label variable v409a__414   "Ripe mangoes, ripe papayas, cantaloupe or other foods rich in vitamin A?"
label variable v409a__415   "Any other fruits?"
label variable v409a__416   "Any liver, kidney, heart, or other organ meats from domesticated animals ..."
label variable v409a__417   "Any meat from domesticated animals, such as cow, sheep, goat, chicken, or duck?"
label variable v409a__418   "Any liver, kidney, heart, or other organ meats from wild animals such as ... ?"
label variable v409a__419   "Any flesh from wild animals, such as hare, lizards or deer?"
label variable v409a__420   "Eggs?"
label variable v409a__421   "Fresh or dried fish, shellfish, or seafood?"
label variable v409a__422   "Any foods made from beans, peas, lentils, or peanuts such as ... ?"
label variable v409a__423   "Any foods made from nuts or seeds such as cashew, wild nuts, or sesame seeds?"
label variable v409a__424   "Milk, cheese, yogurt, or other foods with milk, like milk and cereal, or ... ?"
label variable v409a__425   "Any oil, fats, or butter, or foods made with any of these?"
label variable v409a__4251  "Processed meats"
label variable v409a__4252  "Fried and salty foods such as crisps, donuts, fatayas, french fries"
label variable v409a__4253  "Soft drinks/sodas, sugary juices, chocolate flavored drinks."
label variable v409a__426   "Any sugary foods such as chocolates, sweets, candies, pastries, cakes, ... ?"
label variable v409a__427   "Condiments for flavor, such as bouillon cubes, chilies, fish powder, etc.?"
label variable v409a__428   "Grubs, snails, or insects like locusts or termites?"
label variable v409a__429   "Foods made with red palm oil, red palm nut, or red palm nut pulp sauce?"
label variable v409a__430   "ANY OTHER FOODS?"
label variable v431         "TIME MODULE FINISHED"
label variable v433_os      "OTHER MODULE OUTCOME"

* list variable
label variable v430_os__0   "OTHER FOOD 1"
label variable v430_os__1   "OTHER FOOD 2"
label variable v430_os__2   "OTHER FOOD 3"
label variable v430_os__3   "OTHER FOOD 4"
label variable v430_os__4   "OTHER FOOD 5"
label variable v430_os__5   "OTHER FOOD 6"
label variable v430_os__6   "OTHER FOOD 7"

Ok, how about this compromise. The lbl commands will be grouped into three groups.

lbl_list_*** - Commands that generate well-formatted outputs related to the label issues we want to identify
lbl_modify_*** or lbl_replace_*** - Commands that are used to make changes to these label issues
lbl_assert - one command that throws errors if labels does not meet som test, for example has no pipes

Here would be a workflow

* List variables with pipes
lbl_list_pipes

* knowing where pipes apepar from an earlier iteration, fix each issue with one line of code
* (perhaps we don't need an `expectclean` option for this command
lbl_replace_pipe, pipe("crop") replacement("[crop]")
lbl_replace_pipe, pipe("rostertitle") replacement("[name]")

* List var labels that are 80 characters (likely to have been truncated)
lbl_list_longlbl 

* No need to have a custom command for this
label variable v409a__428   "<short label>"
label variable v409a__429   "<short label>"

* Use the assert command at the bottom of the file. Use options to indicate what to test for.
* This command use the lbl_list commands to output the error thorwing cases
lbl_assert, nopipes nolonglbl

This I like more as assert means to test and throw errors so any decent Stata coder will intuitively figure out it should come after any corrections made.

If you like this approach:

Do you like the namings?
Should lbl_assert default to test for eveything and user specifies what assertions to skip?

I definitely like this. I like the naming too.

For lbl_assert_*, I'm inclined to have one separate command per thing we're testing (e.g. lbl_assert_no_long, lbl_assert_no_pipes, etc.). First, I'm sorta biased towards this Unix-like philosophy. Second, I find that functions with evocative names are more descriptive than a chain of options. And third, it gives us more freedom to develop new commands as issues arise, and a cleaner way to test functionality of focused functions. However, I do see a downside. For users that one command that tests all the things, they'll have to either accept to call one command per thing to test, or to write their own wrapper function. (Once we have a better sense of what people use and how they use it, perhaps we could write an opinionated wrapper that does all of the things worth doing in one command.)

Great!

We can have an lbl_assert_all that runs through all of them.

I will work on implmenting the lbl_*_pipes in this branch next week.

@arthur-shaw , I have today implemented the changes we discussed. We now have the commands:

lbl_list_pipes
lbl_replace_pipe
lbl_assert_no_pipes

See the draft of a test file here for intended usage: https://github.com/lsms-worldbank/labeller/blob/lab_pipe/src/tests/lab_pipe/lab_pipe.do

A few disparate questions:

Put long option strings in snake case. Some option names--like missingok and outputlevel, ignorepipes--are rather long. For readability, would it make sense to insert an underscore to break long options into their constituent words (e.g. missingok -> missing_ok)? Not sure what the conventions are in Stata. For R, snake case currently reigns supreme. For outputlevel in particular, what would you think of how_verbose or verbosity (e.g., with levels, low, medium, and high)?
*Return code for `lblassertcommands.** While I love the waylbl_assert_no_pipesfails informatively, I wonder if others would simply like the error signal to do something else (e.g., tryCatch-style programming, orcapture assert ...followed by ``if "_rc'" != 0 ....`). Based on my meager understanding (memory), Stata errors should be captured bycapture, but I couldn't capture the error code oflbl_assert_no_pipes`. How does one do this? If it's not currently possible, would it be easy to add this?
Whether to handle both "%my_pipe%" and "my_pipe" as equivalent specifications in all functions and in all options. While doing some informal testing, I noticed that both lbl_replace_pipe, pipe("my_pipe") replacement("my_marker") and lbl_replace_pipe, pipe("%my_pipe%") replacement("my_marker") yielded the same (expected) result while lbl_list_pipes, ignorepipes("foo") and lbl_list_pipes, ignorepipes("%foo%") did not (the latter didn't ignore %foo%). Ultimately, I'm fine with either supporting both specifications everywhere or only one specification. Whatever the case, we should saying something more about what the functions mean by a pipe. In function output, we seem to call both specifications as a pipe. For example, the overall inventory omits % from pipes while the per-pipe inventory uses % :

. lbl_list_pipes, ignorepipes("foo")

    Pipes found in dataset:
    - bar

    -------------------------------------------------------------------

    Pipe %bar% used in variable(s):
    - no_consent

Added snake case in: 2926634aecc2778b4436e40ca0a1c92d0b093261
I was able to capture the return code. Can you share an example of where you cannot catch it? Update the code in /src/tests/lab_pipe/lab_pipe.do to the code below. Here the error code is captured in _rc for me.

    * Create example data
    sysuse auto, clear
    label variable mpg "Mileage (%unit%)"
    label variable price "Cost of car (%unit%)"
    label variable foreign "Is (%country%)"

    * List pipes
    lbl_list_pipes, output_level(veryverbose)

    capture lbl_assert_no_pipes, output_level(veryverbose) ///
        ignore_pipes(country) 
    di _rc

c3329e52f6ee133fc8e4927c828344e7b975a901 - allows both %pipe% and pipe in ignore_pipes()
1a2d271a686edd4727ac9f603146f3f7939a1129 - standardizes how pipes are outputted. It is always outputted as %pipe% now

lsms-worldbank / labeller