reconhub / linelist

An R package to import, clean, and store case data
https://www.repidemicsconsortium.org/linelist

add .regex keyword to clean_spelling #79

Closed: zkamvar closed this 5 years ago

zkamvar commented 5 years ago

This addresses @thibautjombart's private issue that clean_spelling needs regex capabilities. The solution is a new .regex keyword for the corrections table:

library("linelist")
# create some fake data
my_data <- c(letters[1:5], "foubar", "foobr", "fubar", NA, "", "unknown", "fumar")
cleaned_data <- c(letters[1:5], "foobar", "foobar", "foobar", "missing", "missing", "missing", "fumar")

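# You can use regular expressions to simplify your list:
# an entry in the bad column that starts with ".regex " is treated as a pattern.
# As the output below shows, ".missing" catches NA and empty strings,
# and ".na" as a replacement turns the value into NA.
corrections <- data.frame(
  bad =  c(".regex f[ou][^m].+?r$", "unknown", ".missing"),
  good = c("foobar",                ".na",     "missing"),
  stringsAsFactors = FALSE
)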

corrections
#>                     bad    good
#> 1 .regex f[ou][^m].+?r$  foobar
#> 2               unknown     .na
#> 3              .missing missing
data.frame(original = my_data, cleaned = clean_spelling(my_data, corrections))
#>    original cleaned
#> 1         a       a
#> 2         b       b
#> 3         c       c
#> 4         d       d
#> 5         e       e
#> 6    foubar  foobar
#> 7     foobr  foobar
#> 8     fubar  foobar
#> 9      <NA> missing
#> 10          missing
#> 11  unknown    <NA>
#> 12    fumar   fumar

Created on 2019-05-29 by the reprex package (v0.3.0)
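
For readers who want to see the idea without the package, here is a minimal base-R sketch of how a ".regex " prefix could be interpreted. It is not the linelist implementation: the function name clean_spelling_sketch is hypothetical, the keyword handling is inferred only from the output above, and the use of perl = TRUE is an assumption.

```r
# Hypothetical helper, NOT the linelist implementation: a base-R sketch of
# how keywords in a two-column corrections table could be interpreted.
clean_spelling_sketch <- function(x, wordlist) {
  bad  <- wordlist[[1]]  # values (or keywords) to look for
  good <- wordlist[[2]]  # replacements
  out  <- x
  for (i in seq_along(bad)) {
    if (startsWith(bad[i], ".regex ")) {
      # Everything after the ".regex " prefix is used as the pattern
      pattern <- sub("^\\.regex ", "", bad[i])
      hit <- !is.na(x) & grepl(pattern, x, perl = TRUE)
    } else if (bad[i] == ".missing") {
      # Assumed meaning: NA or empty string
      hit <- is.na(x) | x == ""
    } else {
      # Plain entries are matched exactly
      hit <- !is.na(x) & x == bad[i]
    }
    # Assumed meaning: ".na" as a replacement produces a missing value
    out[hit] <- if (good[i] == ".na") NA_character_ else good[i]
  }
  out
}

my_data <- c(letters[1:5], "foubar", "foobr", "fubar", NA, "", "unknown", "fumar")
corrections <- data.frame(
  bad  = c(".regex f[ou][^m].+?r$", "unknown", ".missing"),
  good = c("foobar",                ".na",     "missing"),
  stringsAsFactors = FALSE
)
sketch_result <- clean_spelling_sketch(my_data, corrections)
print(data.frame(original = my_data, cleaned = sketch_result))
```

Every rule is matched against the original values rather than the partially cleaned ones, so one correction cannot trigger another. "fumar" is left alone because the [^m] in the pattern rejects an "m" in third position.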

thibautjombart commented 5 years ago

You rock, this is really really really cool. Fanx!