Open zkamvar opened 5 years ago
Thanks for the extensive comparison! Very interesting to see the differences. parsedate
seems cool but I am doubtful about 2 aspects:
43391
being silently converted but off by 2 days2019
being silently converted to 2019-01-01
; default months and days are dangerous, and probably best avoided by defaultDepending on the overhead of refactoring code, it may be better to leave the current version for a first CRAN release (so long as the interface remains the same). This said, I agree it would be nice to have a faster, and (perhaps more importantly) more straightforward code for this feature.
Thanks for the extensive comparison! Very interesting to see the differences.
parsedate
seems cool but I am doubtful about 2 aspects:
43391
being silently converted but off by 2 days
If you look closely, parsedate is converting this to the current date, it's not off by two days.
2019
being silently converted to2019-01-01
; default months and days are dangerous, and probably best avoided by default
I think this falls into a grey area because it really depends on what the application is. For example, it's possible to set the orders
option to include "Y" as part of the parsing and the date will correctly be parsed to the beginning of the year.
Depending on the overhead of refactoring code, it may be better to leave the current version for a first CRAN release (so long as the interface remains the same). This said, I agree it would be nice to have a faster, and (perhaps more importantly) more straightforward code for this feature.
I think one of the things to reiterate is that parsing dates is super difficult and no one has a perfect solution because of so many corner cases. The current implementation in linelist is tortuous, but gets the job done (for the most part). I think the parsedate implementation would be less tortuous because the only thing we really need to check for is dates that can be converted to numbers aren't parsed as the current date. If they are, then we know we need to parse them accordingly.
I'm probably going to write a blog post about all this because dates are annoying and weird.
I'm probably going to write a blog post about all this because dates are annoying and weird.
I think that's a very good idea. Agreed with refactoring, we'll just need to make sure there's no loss of backward compatibility, but all good otherwise. If you see this as something that needs to be done for the first CRAN release, can you add this issue to the project?
Agreed with refactoring, we'll just need to make sure there's no loss of backward compatibility, but all good otherwise
Agreed. For this to happen, we would need a non-contrived set of dates to write tests before the refactor.
If you see this as something that needs to be done for the first CRAN release, can you add this issue to the project?
Because this would involve changing a dependency, I can easily see this as something that would be priority for 0.1 release (otherwise, it's a bit messy to deal with changing dependencies mid-stream).
Background:
When I updated guess_dates, I used
lubridate::parse_date_time()
because it was somewhat flexible with the definitions and I was remotely familiar with it.The process that happens is that the user specifies specific orders for dates to be parsed and these get passed to lubridate internally, returning a data frame where each column represents a separate lubridate run. These dates are then parsed first come, first served to make up the date column.
The reason for this approach is because
lubridate::parse_date_time()
is finnicky (see https://github.com/tidyverse/lubridate/issues/752 and https://github.com/tidyverse/lubridate/issues/749).If any dates return as NA and can be converted to a number, then they are assumed to be excel dates, which are then parsed as as.Date(as.integer(x), origin = "1899-12-30") or 1904-01-01, if the user specifies that they have an old mac version of excel.
After this, any dates that are still NA are passed through Thibaut's regex, which will check for various European-style dates after cleaning.
Problem
This process of passing the dates through multiple rounds of lubridate makes the processing steps super slow. I've done my best to optimize this search by converting the date data frame to a matrix and reducing the call stack when subsetting, but it's still.... not optimal because it still requires users to specify the parsing options and, really, they should only have to choose between whether or not they want to have ambiguous dates like 03/07/2019 be American or European dates.
Solution
I propose that we use
parsedate::parse_date()
as a replacement for the lubridate hack we have.The only quirk about this is the question of how to deal with ambiguous dates. It appears that parsedate will interpret ambiguous dates differently if it uses a slash or a dot as the separator:
Created on 2019-08-16 by the reprex package (v0.3.0)
so all we would need to do is to pass the date vector to
epitrix::clean_labels()
and condition thesep
argument on whether or not the user wants to prioritize American dates or European dates.Comparison with linelist
Accuracy
Given a vector of (a bit contrived) messy dates: ``` r options(width = 100) x <- c("01-12-2001", "02 Feb, 2019", "2019", "male", "female", "2018-10-18", NA, NA, "2018_10_17", "43391", "2018 10 19", "2018 10 19 16:24:39.xlsx", "// 24/12/1989", "this is 24/12/1989!", "RECON NGO: 19 Sep 2018 :)", "6/9/11", "10/10/10") y <- epitrix::clean_labels(x, sep = "/") ``` We compare the results of: - linelist (current version) - lubridate (with the options of linelist) - parsedate (prioritizing American dates) - parsedate (prioritizing European dates) - anytime (with default options) Notes: - there is one date that would appear as if it were recorded in excel: 43391. This represents October 18, 2018 as the number of days since 1899-12-30. Only linelist can handle this one. - linelist and lubridate need to have "Y" present in the orders in order to properly parse the bare "2019" - lubridate struggles with parsing 02 Feb, 2019 and incorrectly parses it as 2019-02-20 - anytime struggles in weird ways ```r data.frame(clean = y, # Linelist uses lubridate, Thibaut's regex, and a numeric as.Date under the hood ll = linelist::guess_dates(y, err = 1), # lubridate is... not very good at this lu = lubridate::parse_date_time(y, unlist(getOption("linelist_guess_orders"))), # parsedate with xx/xx/xx will interpret this as dd/mm/yy pd = as.Date(parsedate::parse_date(y)), # parsedate with xx.xx.xx will interpret this as mm.dd.yy pd2 = as.Date(parsedate::parse_date(gsub("/", ".", y))), # anytime is... okay at parsing. at = anytime::anydate(y) ) #> Warning in linelist::guess_dates(y, err = 1): #> The following 1 dates were not in the correct timeframe (1969-08-16 -- 2019-08-16): #> #> original | parsed #> -------- | ------ #> 2019 | 1905-07-11 #> Warning: 5 failed to parse. #> clean ll lu pd pd2 at #> 1 01/12/2001 2001-12-01 2001-12-01 2001-01-12 2001-12-01 2001-01-12 #> 2 02/feb/2019 2019-02-02 2019-02-20 2019-02-02 2019-02-02 2019-02-02 #> 3 2019Benchmarks
Linelist takes 3x as long as the others to parse, which is partially because of the 3 lubridate runs it has to do. The best one is anytime, but that's because it gives up. ``` r x <- c("01-12-2001", "02 Feb, 2019", "2019", "male", "female", "2018-10-18", NA, NA, "2018_10_17", "43391", "2018 10 19", "2018 10 19 16:24:39.xlsx", "// 24/12/1989", "this is 24/12/1989!", "RECON NGO: 19 Sep 2018 :)", "6/9/11", "10/10/10") microbenchmark::microbenchmark( # Linelist uses lubridate, Thibaut's regex, and a numeric as.Date under the hood linelist = suppressWarnings(linelist::guess_dates(x, err = 1)), lubridate = suppressWarnings(lubridate::parse_date_time(x, c("mdy", "dmy"))), # parsedate with xx/xx/xx will interpret this as dd/mm/yy parsedate = as.Date(parsedate::parse_date(epitrix::clean_labels(x, sep = "/"))), # parsedate with xx.xx.xx will interpret this as mm.dd.yy parsedate2 = as.Date(parsedate::parse_date(epitrix::clean_labels(x, sep = "."))), # anytime is... okay at parsing. anytime = anytime::anydate(epitrix::clean_labels(x, sep = "/")) ) #> Unit: milliseconds #> expr min lq mean median uq max #> linelist 13.654571 13.972625 15.225080 14.293672 16.454477 25.16745 #> lubridate 4.447104 4.609014 6.168438 4.682035 4.998145 119.44763 #> parsedate 2.987486 3.131123 3.600410 3.223812 3.404889 18.08269 #> parsedate2 2.986551 3.111414 5.206118 3.197665 3.465905 185.35637 #> anytime 3.553340 3.689048 5.060121 3.731018 3.797623 132.79393 #> neval cld #> 100 b #> 100 a #> 100 a #> 100 a #> 100 a ``` Created on 2019-08-16 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)