reconhub / linelist

An R package to import, clean, and store case data
https://www.repidemicsconsortium.org/linelist
Other
25 stars 5 forks source link

use parsedate instead of lubridate for guess_dates() internals #85

Open zkamvar opened 5 years ago

zkamvar commented 5 years ago

Background:

When I updated guess_dates, I used lubridate::parse_date_time() because it was somewhat flexible with the definitions and I was remotely familiar with it.

The process that happens is that the user specifies specific orders for dates to be parsed and these get passed to lubridate internally, returning a data frame where each column represents a separate lubridate run. These dates are then parsed first come, first served to make up the date column.

The reason for this approach is because lubridate::parse_date_time() is finnicky (see https://github.com/tidyverse/lubridate/issues/752 and https://github.com/tidyverse/lubridate/issues/749).

If any dates return as NA and can be converted to a number, then they are assumed to be excel dates, which are then parsed as as.Date(as.integer(x), origin = "1899-12-30") or 1904-01-01, if the user specifies that they have an old mac version of excel.

After this, any dates that are still NA are passed through Thibaut's regex, which will check for various European-style dates after cleaning.

Problem

This process of passing the dates through multiple rounds of lubridate makes the processing steps super slow. I've done my best to optimize this search by converting the date data frame to a matrix and reducing the call stack when subsetting, but it's still.... not optimal because it still requires users to specify the parsing options and, really, they should only have to choose between whether or not they want to have ambiguous dates like 03/07/2019 be American or European dates.

Solution

I propose that we use parsedate::parse_date() as a replacement for the lubridate hack we have.

The only quirk about this is the question of how to deal with ambiguous dates. It appears that parsedate will interpret ambiguous dates differently if it uses a slash or a dot as the separator:

parsedate::parse_date("06/08/10")
#> [1] "2010-06-08 UTC"
parsedate::parse_date("06.08.10")
#> [1] "2010-08-06 UTC"

Created on 2019-08-16 by the reprex package (v0.3.0)

so all we would need to do is to pass the date vector to epitrix::clean_labels() and condition the sep argument on whether or not the user wants to prioritize American dates or European dates.

Comparison with linelist

Accuracy Given a vector of (a bit contrived) messy dates: ``` r options(width = 100) x <- c("01-12-2001", "02 Feb, 2019", "2019", "male", "female", "2018-10-18", NA, NA, "2018_10_17", "43391", "2018 10 19", "2018 10 19 16:24:39.xlsx", "// 24/12/1989", "this is 24/12/1989!", "RECON NGO: 19 Sep 2018 :)", "6/9/11", "10/10/10") y <- epitrix::clean_labels(x, sep = "/") ``` We compare the results of: - linelist (current version) - lubridate (with the options of linelist) - parsedate (prioritizing American dates) - parsedate (prioritizing European dates) - anytime (with default options) Notes: - there is one date that would appear as if it were recorded in excel: 43391. This represents October 18, 2018 as the number of days since 1899-12-30. Only linelist can handle this one. - linelist and lubridate need to have "Y" present in the orders in order to properly parse the bare "2019" - lubridate struggles with parsing 02 Feb, 2019 and incorrectly parses it as 2019-02-20 - anytime struggles in weird ways ```r data.frame(clean = y, # Linelist uses lubridate, Thibaut's regex, and a numeric as.Date under the hood ll = linelist::guess_dates(y, err = 1), # lubridate is... not very good at this lu = lubridate::parse_date_time(y, unlist(getOption("linelist_guess_orders"))), # parsedate with xx/xx/xx will interpret this as dd/mm/yy pd = as.Date(parsedate::parse_date(y)), # parsedate with xx.xx.xx will interpret this as mm.dd.yy pd2 = as.Date(parsedate::parse_date(gsub("/", ".", y))), # anytime is... okay at parsing. at = anytime::anydate(y) ) #> Warning in linelist::guess_dates(y, err = 1): #> The following 1 dates were not in the correct timeframe (1969-08-16 -- 2019-08-16): #> #> original | parsed #> -------- | ------ #> 2019 | 1905-07-11 #> Warning: 5 failed to parse. #> clean ll lu pd pd2 at #> 1 01/12/2001 2001-12-01 2001-12-01 2001-01-12 2001-12-01 2001-01-12 #> 2 02/feb/2019 2019-02-02 2019-02-20 2019-02-02 2019-02-02 2019-02-02 #> 3 2019 2019-01-01 2019-01-01 2019-01-01 #> 4 male #> 5 female #> 6 2018/10/18 2018-10-18 2018-10-18 2018-10-18 2018-10-18 2018-10-18 #> 7 #> 8 #> 9 2018/10/17 2018-10-17 2018-10-17 2018-10-17 2018-10-17 2018-10-17 #> 10 43391 2018-10-18 2019-08-16 2019-08-16 4339-01-01 #> 11 2018/10/19 2018-10-19 2018-10-19 2018-10-19 2018-10-19 2018-10-19 #> 12 2018/10/19/16/24/39/xlsx 2018-10-19 2018-10-19 2018-10-19 2018-10-19 #> 13 24/12/1989 1989-12-24 1989-12-24 1989-12-24 1989-12-24 #> 14 this/is/24/12/1989 1989-12-24 1989-12-24 1989-12-24 1989-12-24 #> 15 recon/ngo/19/sep/2018 2018-09-19 2018-09-19 2018-09-19 2018-09-19 #> 16 6/9/11 2011-09-06 2011-09-06 2011-06-09 2011-09-06 #> 17 10/10/10 2010-10-10 2010-10-10 2010-10-10 2010-10-10 ``` Created on 2019-08-16 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)
Benchmarks Linelist takes 3x as long as the others to parse, which is partially because of the 3 lubridate runs it has to do. The best one is anytime, but that's because it gives up. ``` r x <- c("01-12-2001", "02 Feb, 2019", "2019", "male", "female", "2018-10-18", NA, NA, "2018_10_17", "43391", "2018 10 19", "2018 10 19 16:24:39.xlsx", "// 24/12/1989", "this is 24/12/1989!", "RECON NGO: 19 Sep 2018 :)", "6/9/11", "10/10/10") microbenchmark::microbenchmark( # Linelist uses lubridate, Thibaut's regex, and a numeric as.Date under the hood linelist = suppressWarnings(linelist::guess_dates(x, err = 1)), lubridate = suppressWarnings(lubridate::parse_date_time(x, c("mdy", "dmy"))), # parsedate with xx/xx/xx will interpret this as dd/mm/yy parsedate = as.Date(parsedate::parse_date(epitrix::clean_labels(x, sep = "/"))), # parsedate with xx.xx.xx will interpret this as mm.dd.yy parsedate2 = as.Date(parsedate::parse_date(epitrix::clean_labels(x, sep = "."))), # anytime is... okay at parsing. anytime = anytime::anydate(epitrix::clean_labels(x, sep = "/")) ) #> Unit: milliseconds #> expr min lq mean median uq max #> linelist 13.654571 13.972625 15.225080 14.293672 16.454477 25.16745 #> lubridate 4.447104 4.609014 6.168438 4.682035 4.998145 119.44763 #> parsedate 2.987486 3.131123 3.600410 3.223812 3.404889 18.08269 #> parsedate2 2.986551 3.111414 5.206118 3.197665 3.465905 185.35637 #> anytime 3.553340 3.689048 5.060121 3.731018 3.797623 132.79393 #> neval cld #> 100 b #> 100 a #> 100 a #> 100 a #> 100 a ``` Created on 2019-08-16 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)
thibautjombart commented 4 years ago

Thanks for the extensive comparison! Very interesting to see the differences. parsedate seems cool but I am doubtful about 2 aspects:

Depending on the overhead of refactoring code, it may be better to leave the current version for a first CRAN release (so long as the interface remains the same). This said, I agree it would be nice to have a faster, and (perhaps more importantly) more straightforward code for this feature.

zkamvar commented 4 years ago

Thanks for the extensive comparison! Very interesting to see the differences. parsedate seems cool but I am doubtful about 2 aspects:

  • 43391 being silently converted but off by 2 days

If you look closely, parsedate is converting this to the current date, it's not off by two days.

  • 2019 being silently converted to 2019-01-01; default months and days are dangerous, and probably best avoided by default

I think this falls into a grey area because it really depends on what the application is. For example, it's possible to set the orders option to include "Y" as part of the parsing and the date will correctly be parsed to the beginning of the year.

Depending on the overhead of refactoring code, it may be better to leave the current version for a first CRAN release (so long as the interface remains the same). This said, I agree it would be nice to have a faster, and (perhaps more importantly) more straightforward code for this feature.

I think one of the things to reiterate is that parsing dates is super difficult and no one has a perfect solution because of so many corner cases. The current implementation in linelist is tortuous, but gets the job done (for the most part). I think the parsedate implementation would be less tortuous because the only thing we really need to check for is dates that can be converted to numbers aren't parsed as the current date. If they are, then we know we need to parse them accordingly.

zkamvar commented 4 years ago

I'm probably going to write a blog post about all this because dates are annoying and weird.

thibautjombart commented 4 years ago

I'm probably going to write a blog post about all this because dates are annoying and weird.

I think that's a very good idea. Agreed with refactoring, we'll just need to make sure there's no loss of backward compatibility, but all good otherwise. If you see this as something that needs to be done for the first CRAN release, can you add this issue to the project?

zkamvar commented 4 years ago

Agreed with refactoring, we'll just need to make sure there's no loss of backward compatibility, but all good otherwise

Agreed. For this to happen, we would need a non-contrived set of dates to write tests before the refactor.

If you see this as something that needs to be done for the first CRAN release, can you add this issue to the project?

Because this would involve changing a dependency, I can easily see this as something that would be priority for 0.1 release (otherwise, it's a bit messy to deal with changing dependencies mid-stream).