add from and to arguments to spelling funs

zkamvar commented 4 years ago

This PR fixes #99 by adding from and to arguments to clean_variable_spelling(), which lets users import dictionaries with keys and values in any column.

new function linelist_example() to locate the example CSV files
basic documentation
functionality
tests for regression

You can try this out by installing it from this PR:

devtools::install_github("reconhub/linelist#105")

library(linelist)

wordlist <- read.csv(linelist_example("spelling-dictionary.csv"), 
                     stringsAsFactors = FALSE)
dat      <- read.csv(linelist_example("coded-data.csv"), 
                     stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)

wordlist <- wordlist[sample(4)] # shuffle the column order; from/to pick columns by name
wordlist # show the wordlist
#>          values                 grp orders  options
#> 1           Yes         readmission      1        y
#> 2            No         readmission      2        n
#> 3       Unknown         readmission      3        u
#> 4       Missing         readmission      4 .missing
#> 5           Yes             treated      1        0
#> 6            No             treated      2        1
#> 7       Missing             treated      3 .missing
#> 8   Facility  1            facility      1        1
#> 9   Facility  2            facility      2        2
#> 10  Facility  3            facility      3        3
#> 11  Facility  4            facility      4        4
#> 12  Facility  5            facility      5        5
#> 13  Facility  6            facility      6        6
#> 14  Facility  7            facility      7        7
#> 15  Facility  8            facility      8        8
#> 16  Facility  9            facility      9        9
#> 17  Facility 10            facility     10       10
#> 18      Unknown            facility     11 .default
#> 19          0-9           age_group      1        0
#> 20        10-19           age_group      2       10
#> 21        20-29           age_group      3       20
#> 22        30-39           age_group      4       30
#> 23        40-49           age_group      5       40
#> 24          50+           age_group      6       50
#> 25         High .regex ^lab_result_      1     high
#> 26       Normal .regex ^lab_result_      2     norm
#> 27 Inconclusive .regex ^lab_result_      3      inc
#> 28          yes             .global    Inf        y
#> 29           no             .global    Inf        n
#> 30      unknown             .global    Inf        u
#> 31      unknown             .global    Inf      unk
#> 32          yes             .global    Inf      oui
#> 33      missing             .global    Inf .missing
head(dat) # show the data
#>       id       date readmission treated facility age_group lab_result_01
#> 1 ef267c 2019-07-08        <NA>       0        C        10           unk
#> 2 e80a37 2019-07-07           y       0        3        10           inc
#> 3 b72883 2019-07-07           y       1        8        30           inc
#> 4 c9ee86 2019-07-09           n       1        4        40           inc
#> 5 40bc7a 2019-07-12           n       1        6         0          norm
#> 6 46566e 2019-07-14           y      NA        B        50           unk
#>   lab_result_02 lab_result_03 has_symptoms followup
#> 1          high           inc         <NA>        u
#> 2           unk          norm            y      oui
#> 3          norm           inc                   oui
#> 4           inc           unk            y      oui
#> 5           unk          norm         <NA>        n
#> 6           unk           inc         <NA>     <NA>

res1 <- clean_variable_spelling(dat,
                                wordlists = wordlist,
                                from = "options",
                                to = "values",
                                spelling_vars = "grp")
head(res1)
#>       id       date readmission treated    facility age_group lab_result_01
#> 1 ef267c 2019-07-08     missing     Yes     Unknown     10-19       unknown
#> 2 e80a37 2019-07-07         yes     Yes Facility  3     10-19  Inconclusive
#> 3 b72883 2019-07-07         yes      No Facility  8     30-39  Inconclusive
#> 4 c9ee86 2019-07-09          no      No Facility  4     40-49  Inconclusive
#> 5 40bc7a 2019-07-12          no      No Facility  6       0-9        Normal
#> 6 46566e 2019-07-14         yes missing     Unknown       50+       unknown
#>   lab_result_02 lab_result_03 has_symptoms followup
#> 1          High  Inconclusive      missing  unknown
#> 2       unknown        Normal          yes      yes
#> 3        Normal  Inconclusive      missing      yes
#> 4  Inconclusive       unknown          yes      yes
#> 5       unknown        Normal      missing       no
#> 6       unknown  Inconclusive      missing  missing

Created on 2019-12-02 by the reprex package (v0.3.0)
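
For intuition about what from and to select, here is a minimal base-R sketch of a key/value lookup where the caller names the two columns. It is an illustration only, not the code in this PR: recode_with_dictionary() and toy_dictionary are hypothetical names, and the sketch ignores the special keywords shown in the wordlist above (.missing, .default, .regex, .global).

```r
# Hypothetical helper, NOT part of linelist: recode one vector using a
# dictionary whose key and value columns are chosen by name or position.
recode_with_dictionary <- function(x, dictionary, from = 1, to = 2) {
  keys   <- as.character(dictionary[[from]])  # what to look for
  values <- as.character(dictionary[[to]])    # what to replace it with
  out    <- as.character(x)
  hit    <- match(out, keys)                  # NA where no key matches
  found  <- !is.na(hit)
  out[found] <- values[hit[found]]            # unmatched entries stay as they are
  out
}

# Toy dictionary with the value column placed BEFORE the key column,
# so positional defaults would pick the wrong columns.
toy_dictionary <- data.frame(
  values  = c("Yes", "No", "Unknown"),
  options = c("y", "n", "u"),
  stringsAsFactors = FALSE
)

recoded <- recode_with_dictionary(c("y", "n", "u", "maybe"),
                                  toy_dictionary,
                                  from = "options",
                                  to   = "values")
print(recoded)
```

Here "y", "n" and "u" are replaced by their dictionary values, while "maybe" has no key and is left unchanged. Naming the columns explicitly is what makes the lookup independent of column order.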

zkamvar commented 4 years ago

Note: I am also toying with the idea of moving this into a separate, stand-alone package: https://github.com/reconhub/matchmaker

amygimma commented 4 years ago

also nice implementation of the standalone package
