reconhub / linelist

An R package to import, clean, and store case data
https://www.repidemicsconsortium.org/linelist

Improve and test guess dates #67

Closed zkamvar closed 5 years ago

zkamvar commented 5 years ago

This will fix #64, fix #65, and fix #66 and address #6 with the following improvements:

Here is the new example documentation:

library("linelist")

Mixed format date

guess_dates(c("03 Jan 2018", "07/03/1982", "08/20/85")) # default
#> [1] "2018-01-03" "1982-03-07" "1985-08-20"

Prioritizing specific date formats

# The default orders prioritize world date ordering over American-style.

print(ord <- getOption("linelist_guess_orders"))
#> $world_named_months
#> [1] "Ybd" "dby"
#> 
#> $world_digit_months
#> [1] "dmy" "Ymd"
#> 
#> $US_formats
#> [1] "Omdy" "YOmd"

# if you want to prioritize American-style dates with numeric months, you
# can switch the second and third elements of the default orders

print(us_ord <- ord[c(1, 3, 2)])
#> $world_named_months
#> [1] "Ybd" "dby"
#> 
#> $US_formats
#> [1] "Omdy" "YOmd"
#> 
#> $world_digit_months
#> [1] "dmy" "Ymd"
guess_dates(c("03 Jan 2018", "07/03/1982", "08/20/85"), orders = us_ord)
#> [1] "2018-01-03" "1982-07-03" "1985-08-20"
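
To make the effect of the ordering concrete, here is a minimal base R sketch of first-match-wins parsing. It is not how `guess_dates()` is implemented: `first_match()` is a hypothetical helper, the `strptime`-style format strings are stand-ins for the `orders` list, and the second input date is made up for illustration.

```r
# Hypothetical helper (not part of linelist): try each format in order and
# keep the first one that parses for each element.
first_match <- function(x, formats) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)                        # elements still unparsed
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

dates <- c("07/03/1982", "20/08/1985")        # second value is illustrative

world_first <- c("%d/%m/%Y", "%m/%d/%Y")      # day-month-year tried first
us_first    <- rev(world_first)               # month-day-year tried first

format(first_match(dates, world_first))
#> [1] "1982-03-07" "1985-08-20"
format(first_match(dates, us_first))
#> [1] "1982-07-03" "1985-08-20"
```

The ambiguous "07/03/1982" follows whichever format comes first, while "20/08/1985" can only be read day-first (there is no month 20), so it falls through to the second format under the US-first ordering.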

Handling dates with time formats

This one is for @ffinger, addressing #64.

# If you have a format with hours, minutes and seconds, you can also add that
# to the list of formats. Note, however, that this function will drop any
# components below the day level (hours, minutes, seconds).

print(ord$ymdhms <- c("Ymdhms", "Ymdhm"))
#> [1] "Ymdhms" "Ymdhm"

guess_dates(c("2014_04_05_23:15:43", "03 Jan 2018", "07/03/1982", "08/20/85"), orders = ord)
#> [1] "2014-04-05" "2018-01-03" "1982-03-07" "1985-08-20"
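
As a rough illustration of what "dropping levels below day" means, the following base R sketch parses the same timestamp with an explicit format and then converts it to a `Date`. This only mirrors the observable result above; the format string and the use of UTC are assumptions for the example, not a description of the package internals.

```r
# Parse the full timestamp (format string is an assumption for this example).
stamp <- as.POSIXct("2014_04_05_23:15:43",
                    format = "%Y_%m_%d_%H:%M:%S", tz = "UTC")

# Converting to Date discards hours, minutes and seconds.
day_only <- as.Date(stamp, tz = "UTC")
format(day_only)
#> [1] "2014-04-05"
```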

Handling missing and nonsense data

@thibautjombart, as you can see in this section, I've added the Excel date for 2018-10-16 ("43387"), addressing #66 and #6.

# guess_dates can handle messy dates and tolerate missing data

x <- c("01-12-2001", "male", "female", "2018-10-18", NA, NA, "2018_10_17",
     "43387", "2018 10 19", "// 24/12/1989", "this is 24/12/1989!",
     "RECON NGO: 19 Sep 2018 :)", "6/9/11", "10/10/10")

guess_dates(x, error_tolerance = 1) # forced conversion
#>  [1] "2001-12-01" NA           NA           "2018-10-18" NA          
#>  [6] NA           "2018-10-17" "2018-10-16" "2018-10-19" "1989-12-24"
#> [11] "1989-12-24" "2018-09-19" "2011-09-06" "2010-10-10"
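
For readers wondering how "43387" becomes a date, here is a small base R check. The result shown above is consistent with counting days from 1900-01-01; that origin is inferred from the output, not confirmed by this PR. For comparison, the origin that R's `as.Date` documentation recommends for Excel on Windows (1899-12-30) lands two days earlier.

```r
serial <- 43387

# Origin inferred from the output above (an assumption about guess_dates).
from_1900 <- as.Date(serial, origin = "1900-01-01")
format(from_1900)
#> [1] "2018-10-16"

# Origin recommended in ?as.Date for Excel on Windows.
from_1899 <- as.Date(serial, origin = "1899-12-30")
format(from_1899)
#> [1] "2018-10-14"
```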

guess_dates(x, error_tolerance = 0.15) # only 15% errors allowed
#>  [1] "01-12-2001"                "male"                     
#>  [3] "female"                    "2018-10-18"               
#>  [5] NA                          NA                         
#>  [7] "2018_10_17"                "43387"                    
#>  [9] "2018 10 19"                "// 24/12/1989"            
#> [11] "this is 24/12/1989!"       "RECON NGO: 19 Sep 2018 :)"
#> [13] "6/9/11"                    "10/10/10"
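
The two results above can be reproduced with a simple threshold rule. The sketch below assumes the error rate is the share of non-missing entries that fail to parse. That reading is consistent with the output (2 failures out of 12 non-missing values is about 16.7%, above the 15% limit) but is not confirmed by this PR. `apply_tolerance()` is a hypothetical helper, and `parsed` is copied from the forced-conversion result rather than computed.

```r
x <- c("01-12-2001", "male", "female", "2018-10-18", NA, NA, "2018_10_17",
       "43387", "2018 10 19", "// 24/12/1989", "this is 24/12/1989!",
       "RECON NGO: 19 Sep 2018 :)", "6/9/11", "10/10/10")

# Forced-conversion result, copied from the output above.
parsed <- as.Date(c("2001-12-01", NA, NA, "2018-10-18", NA, NA, "2018-10-17",
                    "2018-10-16", "2018-10-19", "1989-12-24", "1989-12-24",
                    "2018-09-19", "2011-09-06", "2010-10-10"))

# Hypothetical helper: keep the parsed dates only if the share of
# non-missing inputs that failed to parse is within the tolerance.
apply_tolerance <- function(x, parsed, error_tolerance) {
  failed <- is.na(parsed) & !is.na(x)
  error_rate <- sum(failed) / sum(!is.na(x))
  if (error_rate > error_tolerance) x else parsed
}

error_rate <- sum(is.na(parsed) & !is.na(x)) / sum(!is.na(x))  # 2 / 12

class(apply_tolerance(x, parsed, error_tolerance = 1))     # dates are kept
class(apply_tolerance(x, parsed, error_tolerance = 0.15))  # input returned as-is
```

With a tolerance of 1 every failure is accepted and the `Date` vector is returned; with 0.15 the two nonsense entries ("male", "female") push the error rate over the limit, so the original character vector comes back unchanged.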

Created on 2019-04-08 by the reprex package (v0.2.1)

zkamvar commented 5 years ago

This is ready to merge and passes on Travis macOS (the other builds are mired by devtoolsgate: https://twitter.com/ZKamvar/status/1115523891454148609). Imma merge it.