reconhub / linelist

An R package to import, clean, and store case data
https://www.repidemicsconsortium.org/linelist

modify guess_dates to be able to output mixture of converted and unconverted dates #87

Closed: cwhittaker1000 closed this issue 5 years ago

cwhittaker1000 commented 5 years ago

When working with a linelist containing a mixture of dates in modern Excel format and other unrecognised formats, e.g.

bloop <- c("40001", "40002", "40003", "22_Aout_2019", NA)

I noticed that with the error_tolerance parameter turned up and modern_excel = TRUE, "22_Aout_2019" is coerced to NA and the first three elements are converted to their correct dates. When the error_tolerance parameter is low, the original vector is returned unchanged.

Would it be possible to add some functionality to guess_dates (possibly another argument) to allow a third type of output? Specifically, an output where all the dates that can be converted (i.e. the first three elements in the example above) are converted and returned, and all the inputs that can't be converted (the fourth element in the example above) are returned as they originally were, instead of being converted to NA?
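
To make the request concrete, here is a rough base R sketch of the kind of output meant. It is not part of linelist: `convert_or_keep()` is a hypothetical helper, and it assumes that only all-digit strings are convertible, treating them as modern Excel serial numbers (origin 1899-12-30).

```r
# Hypothetical helper, not part of linelist: convert what can be converted
# and keep everything else exactly as it was.
convert_or_keep <- function(x) {
  # Assumption: only all-digit strings are treated as modern Excel serials
  is_serial <- !is.na(x) & grepl("^[0-9]+$", x)
  out <- x
  # Dates are formatted back to text because the result has to hold
  # the unconverted strings as well
  out[is_serial] <- format(as.Date(as.numeric(x[is_serial]), origin = "1899-12-30"))
  out
}

bloop <- c("40001", "40002", "40003", "22_Aout_2019", NA)
res <- convert_or_keep(bloop)
res
class(res)
#> [1] "character"
```

Under these assumptions the first three elements come back as ISO date strings and "22_Aout_2019" comes back untouched.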

zkamvar commented 5 years ago

I'm hesitant to add functionality like this because it provides yet another layer where uncertainty can pop up. Part of the problem is that the vector you get back will be a character vector, not a date vector, so you would still have to convert it to dates before doing anything useful with it. If you want to preserve the old dates, I would suggest using guess_dates to add a new column instead of modifying the column in place, like so:

library("tidyverse") 
library("linelist")

(locale <- Sys.getlocale("LC_TIME"))
#> [1] "en_GB.UTF-8"
Sys.setlocale("LC_TIME", "fr_FR.utf8")
#> [1] "fr_FR.utf8"

bloop <- c("40001", "22_Août_2019", "22_Aout_2019", NA)
dat   <- tibble::tibble(bloop = bloop, floop = sample(bloop))

DATES <- c("bloop", "floop")
dat %>%
  mutate_at(.vars = vars(DATES), 
            .funs = list(cleaned = ~guess_dates(., error_tolerance=1)))
#> # A tibble: 4 x 4
#>   bloop        floop        bloop_cleaned floop_cleaned
#>   <chr>        <chr>        <date>        <date>       
#> 1 40001        22_Août_2019 2009-07-07    2019-08-22   
#> 2 22_Août_2019 40001        2019-08-22    2009-07-07   
#> 3 22_Aout_2019 22_Aout_2019 NA            NA           
#> 4 <NA>         <NA>         NA            NA

Sys.setlocale("LC_TIME", locale) # reset to original locale
#> [1] "en_GB.UTF-8"

Created on 2019-08-28 by the reprex package (v0.3.0)
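
With the original column kept next to the cleaned one, finding the entries that failed to convert is a simple comparison. A minimal base R sketch, assuming the cleaned values are copied by hand from the output above rather than recomputed with guess_dates; `failed` is just an illustrative name:

```r
# Original and cleaned columns side by side (cleaned values taken from
# the reprex output above, not recomputed here)
dat <- data.frame(
  bloop         = c("40001", "22_Août_2019", "22_Aout_2019", NA),
  bloop_cleaned = as.Date(c("2009-07-07", "2019-08-22", NA, NA)),
  stringsAsFactors = FALSE
)

# An entry failed if it was present in the input but is NA after cleaning
failed <- !is.na(dat$bloop) & is.na(dat$bloop_cleaned)
dat$bloop[failed]
#> [1] "22_Aout_2019"
```

The cleaned column stays a proper Date vector, and the unconverted inputs are still available for manual review.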

zkamvar commented 5 years ago

@cwhittaker1000, does this work for you?

cwhittaker1000 commented 5 years ago

@zkamvar that works for me!