reconhub / linelist

An R package to import, clean, and store case data
https://www.repidemicsconsortium.org/linelist

modify guess_dates to handle French months #86

Closed cwhittaker1000 closed 5 years ago

cwhittaker1000 commented 5 years ago

guess_dates currently doesn't appear to handle French months well. For example:

bloop <- c("40001", "40002", "40003", "22_Aug_2019", NA)
linelist::guess_dates(bloop, error_tolerance = 0.1, modern_excel = TRUE)

works perfectly (converting the first 4 elements to their respective dates and leaving the NA as is), whereas:

bloop <- c("40001", "40002", "40003", "22_Aout_2019", NA)
linelist::guess_dates(bloop, error_tolerance = 0.1, modern_excel = TRUE)

does not: it returns the original input vector unchanged. I may be missing some existing functionality, but if not, could guess_dates be modified to handle French month names (with or without accents) in addition to English ones?

zkamvar commented 5 years ago

This depends on the locale of the current R session. In a French session, the dates are parsed correctly only if the month names carry the correct accents.

For example:

# In the English locale, neither Aout nor Août is recognised
(locale <- Sys.getlocale("LC_TIME"))
#> [1] "en_GB.UTF-8"
bloop <- c("40001", "22_Août_2019", "22_Aout_2019", NA)
linelist::guess_dates(bloop, error_tolerance = 1, modern_excel = TRUE)
#> [1] "2009-07-07" NA           NA           NA

# If we set the locale to French, the accented form is recognised by lubridate.
Sys.setlocale("LC_TIME", "fr_FR.utf8")
#> [1] "fr_FR.utf8"

linelist::guess_dates(bloop, error_tolerance = 1, modern_excel = TRUE)
#> [1] "2009-07-07" "2019-08-22" NA           NA

As a bonus, here is how other date-parsing functions compare on the same input (still in the French locale):

as.Date(bloop, "%d_%B_%Y")
#> [1] NA           "2019-08-22" NA           NA
parsedate::parse_date(bloop)
#> [1] "2019-08-28 10:23:18 UTC" "2019-01-22 00:00:00 UTC"
#> [3] "2019-01-22 00:00:00 UTC" NA
anytime::anydate(bloop)
#> [1] "4000-01-01" NA           NA           NA

Sys.setlocale("LC_TIME", locale) # reset to original locale
#> [1] "en_GB.UTF-8"

Created on 2019-08-28 by the reprex package (v0.3.0)
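
Since the result depends on LC_TIME, it can help to switch the locale only for the duration of one call and restore it automatically afterwards. The following is a minimal sketch in base R; `with_time_locale` is a hypothetical helper name, not part of linelist or lubridate, and the locale string you pass (for example "fr_FR.utf8") has to exist on your system.

```r
# Hypothetical helper (not part of linelist): evaluate `code` with LC_TIME
# temporarily set to `locale`, then restore the previous setting.
with_time_locale <- function(locale, code) {
  old <- Sys.getlocale("LC_TIME")
  # Restore the original locale when the function exits, even on error
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", locale)
  # `code` is evaluated lazily, so it runs after the locale has been switched
  code
}

# The "C" locale is always available and uses English month names
month_name <- with_time_locale("C", format(as.Date("2019-08-22"), "%B"))
```

Assuming a French locale is installed, the same wrapper could be placed around the guess_dates call from the example above.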

cwhittaker1000 commented 5 years ago

Thanks for this and completely understood. Do you have a sense of whether it would be possible to handle French months without accents? I've had a look through lubridate but all of the functionality I've been able to find appears to be contingent on the accents being in place.

zkamvar commented 5 years ago

The only thing I can think of would be a dictionary that replaces the unaccented names with their accented forms:

stringr::str_replace_all("get Aout", "Aout", "Août")
#> [1] "get Août"

Created on 2019-08-28 by the reprex package (v0.3.0)
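
The dictionary idea above could be extended to all three French month names that carry accents (février, août, décembre). This is only a sketch in base R: `french_month_accents` and `restore_french_accents` are hypothetical names, not part of linelist, and the matching is a plain case-insensitive substring replacement, so it assumes these strings do not occur elsewhere in the data. The accented characters are written as Unicode escapes so the code does not depend on the file encoding.

```r
# Hypothetical lookup table: unaccented French month names -> accented forms.
# Only fevrier, aout, and decembre differ from their accented spellings.
french_month_accents <- c(
  "fevrier"  = "f\u00e9vrier",   # février
  "aout"     = "ao\u00fbt",      # août
  "decembre" = "d\u00e9cembre"   # décembre
)

# Hypothetical helper (not part of linelist): restore the accents so that a
# French-locale parser can recognise the month names. NA values stay NA.
restore_french_accents <- function(x, dictionary = french_month_accents) {
  for (plain in names(dictionary)) {
    # Case-insensitive match; the replacement is always the lowercase form
    x <- gsub(plain, dictionary[[plain]], x, ignore.case = TRUE)
  }
  x
}

bloop <- c("40001", "22_Aout_2019", "03_fevrier_2019", NA)
cleaned <- restore_french_accents(bloop)
```

The cleaned vector could then be passed to guess_dates in a French locale. Note that the replacement is lowercase ("août" rather than "Août"); whether the downstream parser accepts that capitalisation is an assumption worth checking on your own data.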

cwhittaker1000 commented 5 years ago

That makes perfect sense and is super helpful, thanks a bunch!