reconhub / linelist

An R package to import, clean, and store case data
https://www.repidemicsconsortium.org/linelist

Enable multi-column matches in clean_variable_spelling() with .regex keyword #98

Closed by patrickbarks 4 years ago

patrickbarks commented 4 years ago

Implements a .regex keyword in clean_variable_spelling(), per #40. For example, the wordlist variable ".regex ^lab_result_" will match any column name beginning with "lab_result_". Any variable in the wordlist without a .regex keyword (or .global) will be matched literally, as in the current behaviour.

Simple example:

# create data
dat <- data.frame(
  id = formatC(1:10, width = 2, flag = "0"),
  site = sample(LETTERS[1:3], 10, replace = TRUE),
  lab_result_01 = sample(c("high", "low", "norm", "inc"), 10, replace = TRUE),
  lab_result_02 = sample(c("high", "low", "norm", "inc"), 10, replace = TRUE),
  lab_result_03 = sample(c("high", "low", "norm", "inc"), 10, replace = TRUE),
  stringsAsFactors = FALSE
)

# create wordlist
# key ".regex ^lab_result_" will match any column starting with "lab_result_"
wordlist <- data.frame(
  value = c("high", "low", "norm", "inc"),
  replacement = c("High", "Low", "Normal", "Inconclusive"),
  variable = rep(".regex ^lab_result_", 4),
  stringsAsFactors = FALSE
)

# compare original and cleaned data
head(dat)
#>   id site lab_result_01 lab_result_02 lab_result_03
#> 1 01    B           inc           low           inc
#> 2 02    C           low          high           low
#> 3 03    A           low          high          norm
#> 4 04    A           inc          norm           inc
#> 5 05    B          high          norm           inc
#> 6 06    B           low           low           inc
head(linelist::clean_variable_spelling(dat, wordlist))
#>   id site lab_result_01 lab_result_02 lab_result_03
#> 1 01    B  Inconclusive           Low  Inconclusive
#> 2 02    C           Low          High           Low
#> 3 03    A           Low          High        Normal
#> 4 04    A  Inconclusive        Normal  Inconclusive
#> 5 05    B          High        Normal  Inconclusive
#> 6 06    B           Low           Low  Inconclusive

Created on 2019-10-15 by the reprex package (v0.3.0)
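
For anyone skimming the thread, the matching rule described above can be sketched in base R roughly as follows. This is only an illustration, not the code in this PR: resolve_columns() is a hypothetical helper name, and it assumes the keyword and the pattern are separated by whitespace. The .global keyword is left out on purpose.

```r
# Hypothetical helper (not part of linelist): resolve one wordlist `variable`
# entry to the column names it would apply to.
resolve_columns <- function(variable, column_names) {
  if (startsWith(variable, ".regex ")) {
    # Strip the keyword and keep the pattern, e.g. "^lab_result_"
    pattern <- sub("^\\.regex[[:space:]]+", "", variable)
    grep(pattern, column_names, value = TRUE)
  } else {
    # No keyword: the entry must equal a column name exactly
    column_names[column_names == variable]
  }
}

cols <- c("id", "site", "lab_result_01", "lab_result_02", "lab_result_03")

# Regex entry: selects every column whose name starts with "lab_result_"
regex_cols <- resolve_columns(".regex ^lab_result_", cols)

# Literal entry: selects only the column named "site"
literal_cols <- resolve_columns("site", cols)
```

With the column names from the example data, the regex entry selects the three lab_result_ columns and the literal entry selects only site.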

I've also added some corresponding documentation and tests. Happy to make edits to the PR if you have suggestions/requests.

patrickbarks commented 4 years ago

Great, thanks for the review! I've made all the changes you recommended, plus some minor changes to the internal code documentation. I also made a minor change to the example for clean_variable_spelling(), such that the possible values for my lab_result_ columns are c('High', 'Normal', 'Inconclusive') rather than c('Positive', 'Negative', 'Inconclusive'). The mix of positive and negative results for a given patient was bothering me :)

Also, I just realized that I mistakenly referred to clean_spelling_vars() in the PR title rather than clean_variable_spelling(). Sorry about that!

zkamvar commented 4 years ago

> Great, thanks for the review! I've made all the changes you recommended, plus some minor changes to the internal code documentation. I also made a minor change to the example for clean_variable_spelling(), such that the possible values for my lab_result_ columns are c('High', 'Normal', 'Inconclusive') rather than c('Positive', 'Negative', 'Inconclusive'). The mix of positive and negative results for a given patient was bothering me :)

Great! I'll have a look at the changes today. I'm fine with how you changed the examples (though I'm not as bothered by a mix of positive and negative test results assuming the tests are different).

> Also, I just realized that I mistakenly referred to clean_spelling_vars() in the PR title rather than clean_variable_spelling(). Sorry about that!

No worries! FWIW, there should be a little "edit" button that you can use to update/fix your comment.

zkamvar commented 4 years ago

Looks good to me! One more thing before you merge: would you mind adding your name to the authors line in the function documentation?

zkamvar commented 4 years ago

Thank you!