reconhub / linelist

An R package to import, clean, and store case data
https://www.repidemicsconsortium.org/linelist
Other
25 stars 5 forks source link

convert dates stored as character numbers #66

Closed thibautjombart closed 5 years ago

thibautjombart commented 5 years ago

Some dates with inconsistent formats in Excel in a given column end up importd as a character containing a mixed bag of numbers and mis-formatted dates such as:


x <- list(anniversaire = c("43301", "43313", NA, "43312", "43315", 
"43314", "43309", "43317", "43310", NA, "43308", "43293", "43318", 
"43316", "43320", "43319", "43262", "43321", "43296", "43307", 
"43322", "43305", "43324", "43325", "43326", "43327", "43323", 
"43329", "43300", "43330", "43328", "43331", "43299", "43332", 
"43334", "43335", "43333", "43336", "43337", "43338", "43339", 
"43341", "43340", "43343", "43345", "43346", "43348", "43347", 
"43344", "43349", "43350", "43352", "43351", "43366", "43353", 
"43357", "43358", "43360", "43359", "43355", "43361", "43271", 
"3 MOIS", "43356", "43362", "43287", "43363", "43364", "43250", 
"43365", "43367", "43368", "43369", "43371", "43370", "43372", 
"43374", "43373", "43376", "43375", "43377", "43379", "43378", 
"43380", "43395", "43381", "40456", "43383", "43382", "43384", 
"43385", "43386", "43387", "43388", "43389", "43390", "41559", 
"40467", "43391", "17/20/2018", "11/10//2018", "43392", "18/2/10/2018", 
"43393", "43397", "43394", "43757", "43396", "43398", "25", "43399", 
"43400", "43401", "43402", "43403", "43404", "43405", "43432", 
"43433", "43434", "43426", "31/11/2018", "43406", "43407", "43408", 
"43409", "43438", "43410", "43411", "43412", "43044", "43413", 
"05/11/2018", "43414", "39758", "43415", "43416", "43417", "012/11/2018", 
"12/111/2018", "43418", "43419", "12//11/2018", "43420", "43422", 
"43421", "17/11/2018/", "43423", "43424", "43425", "201/11/2018", 
"43427", "43123", "43122", "18/18/2018", "20/112018", "43429", 
"43124", "43428", "24/11/208", "2311/2018", "43430", "24/11/2018M", 
"43431", "43435", "43462", "43436", "43101", "43112", "43143", 
"43171", "43232", "43439", "43437", "43440", "43441", "43202", 
"43263", "31/09/2018", "43446", "10//12/2018", "43448", "43447", 
"43449", "43451", "14//12/2018", "43450", "43452", "43453", "43464", 
"43463", "43454", "43817", "43455", "43818", "43456", "22§12§2018", 
"43457", "43458", "43459", "43460", "43461", "76326", "43465", 
"43466", "3O/12/2018", "43129", "43497", "43829", "43525", "43826", 
"43132", "43556", "43819", "43586", "43191", "43617", "43221", 
"43647", "43678", "43824", "43618", "43679", "43709", "43496", 
"43282", "2012/2019", "43739", "43648", "43770", "43815", "47119", 
"43800", "43827", "11", "9", "43825", "43828", "43688", "43478", 
"43479", "43480", "43481", "43114", "43482", "16", "10/O1/2019", 
"43483", "43484", "17", "43485", "43486", "43851", "44217", "43487", 
"43488", "22/01208", "43489", "43490", "43853", "44219", "44584", 
"43491", "43492", "43493", "43494", "43128", "26/01/21019", "43495", 
"43501", "43467", "31/01/019", "43498", "43526", "43557", "43587", 
"31/02/2019", "43499", "2", "43588", "43589", "43590", "43591", 
"43231", "43710", "43740", "43771", "43801", "43509", "43510", 
"1", "43511", "43512", "43513", "43514", "1O/02/2019", "43515", 
"43516", "43517", "43518", "43519", "43520", "43521", "43522", 
"43523", "43524", "43468", "29/02/2019", "43552", "43527", "43558", 
"43619", "43649", "43680", "43711", "43650", "43651", "43741", 
"43652", "43653", "43654", "43655", "43772", "43802", "43537", 
"43538", "43545", "43539", "43540", "43541", "43542", "43906", 
"44271", "44636", "45001", "45367", "45732", "46097", "46462", 
"46828", "45568", "45366", "43159", "45364", "45369", "45476", 
"45368", "45370", "45365", "45629", "45371", "45372", "45734", 
"46099", "46464", "46830", "43544", "43543", "43546", "43547", 
"19", "43548", "43549", "2O/03/2019", "43550", "43551", "43553", 
"43555", "43554", "43469"))

It looks like lubridate::as_date() can handle this, but we'll need to be able to pass an origin argument from guess_dates(). The origin currently used in Excel is 1900-01-01 - can't find indications of it being platform-dependent.

thibautjombart commented 5 years ago

The following hack works, but obviously the implementation will need something a bit different ;)

guess_dates <- function(x, ...) {

  x <- as.character(x)

  to_replace <- !is.na(suppressWarnings(as.integer(x)))
  replacement <- lubridate::as_date(
      as.integer(x[to_replace]),
      origin = as.Date("1900-01-01"))
  replacement <- as.character(replacement)
  x[to_replace] <- replacement

  linelist::guess_dates(x, ...)
}

guess_dates(x, error_tolerance = 1)
zkamvar commented 5 years ago

The origin currently used in Excel is 1900-01-01 - can't find indications of it being platform-dependent.

The fun thing is that there IS a caveat: on Excel for OSX pre 2011, the origin is 1904 :sweat_smile: https://support.office.com/en-us/article/date-systems-in-excel-e7fe7167-48a9-4b96-bb53-5612a800b487?ui=en-US&rs=en-US&ad=US

zkamvar commented 5 years ago

Additionally, this is similar to issue #6

zkamvar commented 5 years ago

Additional question: are the dates above real or contrived? For example: "18/2/10/2018" should not render as a date.