reconhub / linelist

An R package to import, clean, and store case data
https://www.repidemicsconsortium.org/linelist

Fix spurious warnings in guess_dates #75

Open thibautjombart opened 5 years ago

thibautjombart commented 5 years ago

Sorry, I cannot share data to reproduce this, but I get the following on the latest version of linelist:

> x <- x %>%
+   mutate_at(.vars = vars(contains("date")),
+             .funs = guess_dates)
Warning messages:
1: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 1 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-12-16  |  2019-12-16
2: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 2 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-07-04  |  2019-07-04
  2019-10-21  |  2019-10-21
3: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 9 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-08-08  |  2019-08-08
  2019-08-22  |  2019-08-22
  2019-09-02  |  2019-09-02
  2019-09-03  |  2019-09-03
  2019-09-11  |  2019-09-11
  2019-10-20  |  2019-10-20
  2019-11-02  |  2019-11-02
  2019-11-20  |  2019-11-20
  2019-12-12  |  2019-12-12
4: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 27 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-08-08  |  2019-08-08
  2019-08-15  |  2019-08-15
  2019-08-16  |  2019-08-16
  2019-08-18  |  2019-08-18
  2019-08-19  |  2019-08-19
  2019-08-22  |  2019-08-22
  2019-08-30  |  2019-08-30
  2019-09-06  |  2019-09-06
  2019-09-14  |  2019-09-14
  2019-09-16  |  2019-09-16
  2019-09-17  |  2019-09-17
  2019-09-19  |  2019-09-19
  2019-09-21  |  2019-09-21
  2019-09-27  |  2019-09-27
  2019-10-04  |  2019-10-04
  2019-10-08  |  2019-10-08
  2019-10-10  |  2019-10-10
  2019-10-12  |  2019-10-12
  2019-10-13  |  2019-10-13
  2019-10-24  |  2019-10-24
  2019-10-25  |  2019-10-25
  2019-10-30  |  2019-10-30
  2019-10-31  |  2019-10-31
  2019-11-02  |  2019-11-02
  2019-11-09  |  2019-11-09
  2019-11-13  |  2019-11-13
  2019-12-14  |  2019-12-14
5: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 16 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-08-17  |  2019-08-17
  2019-08-19  |  2019-08-19
  2019-09-04  |  2019-09-04
  2019-09-19  |  2019-09-19
  2019-09-20  |  2019-09-20
  2019-09-22  |  2019-09-22
  2019-09-28  |  2019-09-28
  2019-09-29  |  2019-09-29
  2019-10-13  |  2019-10-13
  2019-10-14  |  2019-10-14
  2019-10-16  |  2019-10-16
  2019-10-30  |  2019-10-30
  2019-11-03  |  2019-11-03
  2019-11-15  |  2019-11-15
  2019-11-17  |  2019-11-17
  2019-12-18  |  2019-12-18
>

zkamvar commented 5 years ago

I wouldn’t call this spurious. These dates are all beyond last_date. What behavior do you expect?

If you want to get rid of the warning, then set last_date = Sys.Date() + 365
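
To illustrate why widening the window silences the warning, here is a minimal base-R sketch. The helper `outside_window()` is hypothetical and only mimics the kind of timeframe check described in the warning text; it is not linelist's actual code. The reference date is fixed to the date shown in the warnings so the result is reproducible.

```r
# Hypothetical helper (an assumption, not part of linelist):
# TRUE for dates that fall outside [first_date, last_date].
outside_window <- function(dates, first_date, last_date) {
  dates < first_date | dates > last_date
}

today      <- as.Date("2019-05-19")  # upper bound shown in the warnings
first_date <- as.Date("1969-05-19")  # lower bound shown in the warnings

# A few of the dates reported in the warnings above
future <- as.Date(c("2019-07-04", "2019-10-21", "2019-12-16"))

# With the default upper bound, every one of these dates is flagged
flagged_default <- outside_window(future, first_date, last_date = today)

# Pushing the upper bound one year ahead leaves nothing to flag
flagged_extended <- outside_window(future, first_date, last_date = today + 365)
```

In the pipeline from the report, the extra argument could presumably be forwarded through the `...` of mutate_at, e.g. `mutate_at(vars(contains("date")), guess_dates, last_date = Sys.Date() + 365)`; that call is untested here.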


thibautjombart commented 5 years ago

Problem is having long lists of dates original / parsed that are identical.

zkamvar commented 5 years ago

> Problem is having long lists of dates original / parsed that are identical.

Is the problem the length or the fact that they appear to be identical?

zkamvar commented 5 years ago

To give a bit of background as to what is happening:

Because guess_dates() attempts to convert YMD, DMY, and MDY in that specific order, it's possible for some dates to fail because they were parsed incorrectly (e.g. the DMY date 11/02/2019 is interpreted as 2019-11-02 under the MDY system). These results are collected as they are parsed and then presented in a table as you saw. Usually it looks something like this:

library("linelist")
x <- c("04 Feb 1982", "19 Sep 2018", "2001-01-01", "2011.12.13",
       "ba;abb;a: 03:11:2012!", "haha... 2013-12-13..",
       "that's a NA", "gender", "not a date", "01__Feb__1999___", 
       "19/09/18", "09/08/18", "2018-08-09")
last_date <- as.Date("2012-11-05")
first_date <- as.Date("1962-11-05")
res <- guess_dates(x, error_tolerance = 1, last_date = last_date)
#> Warning in guess_dates(x, error_tolerance = 1, last_date = last_date): 
#> The following 5 dates were not in the correct timeframe (1962-11-05 -- 2012-11-05):
#> 
#>   original              |  parsed    
#>   --------              |  ------    
#>   09/08/18              |  2018-08-09
#>   09/08/18              |  2018-09-08
#>   19 Sep 2018           |  2018-09-19
#>   19/09/18              |  2018-09-19
#>   2018-08-09            |  2018-08-09
#>   haha... 2013-12-13..  |  2013-12-13
res
#>  [1] "1982-02-04" NA           "2001-01-01" "2011-12-13" "2012-11-03"
#>  [6] NA           NA           NA           NA           "1999-02-01"
#> [11] NA           NA           NA

Created on 2019-05-20 by the reprex package (v0.3.0)

Do you want me to get rid of this warning altogether?

ffinger commented 4 years ago

I think it would already be a step forward if the warning could say which column is concerned. If guess_dates is called by clean_data you don't necessarily know which part of the warning comes from which column.

zkamvar commented 4 years ago

> I think it would already be a step forward if the warning could say which column is concerned. If guess_dates is called by clean_data you don't necessarily know which part of the warning comes from which column.

Thank you for adding this clarification, @ffinger, and I agree with you. Collecting warnings in a loop is not a straightforward problem, but luckily, I've already written some code to handle this situation in clean_variable_spelling() (see below) and can implement it in clean_dates() if you want.

I think adopting the warning pattern that readr::parse_date() uses will be helpful: https://readr.tidyverse.org/reference/parse_datetime.html

library("linelist")
my_data_frame <- data.frame(
  raboof    = c(letters[1:5], "foubar", "foobr", "fubar", "", "unknown", "fumar"),
  treatment = c(letters[5:1], "Y", "Yes", "N", NA, "No", "yes"),
  region    = state.name[1:11]
)
corrections <- data.frame(
  bad = c("foubar", "foobr", "fubar", ".missing", "unknown", "Yes", "Y", "No", "N", ".missing"),
  good = c("foobar", "foobar", "foobar", "missing", "missing", "yes", "yes", "no", "no", "missing"),
  column = c(rep("raboof", 5), rep("treatment", 5)),
  orders = c(1:5, 5:1),
  stringsAsFactors = FALSE
)
corr <- data.frame(bad = c(".default", ".default"),
                   good = c("check data", "check data"),
                   column = c("raboof", "treatment"),
                   orders = Inf,
                   stringsAsFactors = FALSE
)
corr <- rbind(corrections, corr)
clean_variable_spelling(my_data_frame, corr, warn = TRUE)
#> Warning in clean_variable_spelling(my_data_frame, corr, warn = TRUE): The following warnings were found...
#>   raboof_____:
#>   .... 'a', 'b', 'c', 'd', 'e', 'fumar' were changed to the default value ('check data')
#>   treatment__:
#>   .... 'a', 'b', 'c', 'd', 'e' were changed to the default value ('check data')
#>        raboof  treatment      region
#> 1  check data check data     Alabama
#> 2  check data check data      Alaska
#> 3  check data check data     Arizona
#> 4  check data check data    Arkansas
#> 5  check data check data  California
#> 6      foobar        yes    Colorado
#> 7      foobar        yes Connecticut
#> 8      foobar         no    Delaware
#> 9     missing    missing     Florida
#> 10    missing         no     Georgia
#> 11 check data        yes      Hawaii

Created on 2019-10-28 by the reprex package (v0.3.0)
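
For readers wondering what collecting warnings per column might look like, here is a rough base-R sketch. It is not the code used in clean_variable_spelling(); the names `collect_warnings()` and `toy_parse()` are hypothetical, and `toy_parse()` merely stands in for a date parser that warns on failure.

```r
# Hypothetical helper (an assumption, not linelist code): apply `f` to every
# column of `df`, muffle the warnings it raises, and store them by column name.
collect_warnings <- function(df, f, ...) {
  warns <- list()
  for (col in names(df)) {
    df[[col]] <- withCallingHandlers(
      f(df[[col]], ...),
      warning = function(w) {
        # Record the message under the current column, then suppress it
        warns[[col]] <<- c(warns[[col]], conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    )
  }
  list(data = df, warnings = warns)
}

# Toy stand-in for a date parser: warns about entries it cannot parse
toy_parse <- function(x) {
  out <- as.Date(x, format = "%Y-%m-%d")
  bad <- x[is.na(out) & !is.na(x)]
  if (length(bad) > 0) {
    warning("could not parse: ", paste(bad, collapse = ", "))
  }
  out
}

df <- data.frame(
  date_onset  = c("2019-01-02", "oops"),
  date_report = c("2019-01-05", "2019-01-06"),
  stringsAsFactors = FALSE
)

res <- collect_warnings(df, toy_parse)

# One combined message that names each affected column; it could be passed
# to warning() once, after all columns have been processed.
summary_msg <- paste(
  paste0(names(res$warnings), ": ",
         vapply(res$warnings, paste, character(1), collapse = "; ")),
  collapse = "\n"
)
```

The point of the design is that each message is tagged with its column before being muffled, so a single warning at the end can report which column each problem came from.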

thibautjombart commented 4 years ago

I am getting warnings which look like they may not be appropriate. Example below:

dates <- c("18_03_2020", "19_03_2020", "20_03_2020", "21_03_2020", "22_03_2020", 
"23_03_2020", "24_03_2020", "25_03_2020", "26_03_2020", "27_03_2020", 
"28_03_2020", "29_03_2020", "30_03_2020", "31_03_2020", "01_04_2020", 
"02_04_2020", "03_04_2020", "04_04_2020", "05_04_2020", "06_04_2020", 
"07_04_2020", "08_04_2020")

res <- linelist::guess_dates(dates)

gives the following warning:


Warning message:
In linelist::guess_dates(dates) : 
The following 4 dates were not in the correct timeframe (1970-04-10 -- 2020-04-10):

  original    |  parsed    
  --------    |  ------    
  05_04_2020  |  2020-05-04
  06_04_2020  |  2020-06-04
  07_04_2020  |  2020-07-04
  08_04_2020  |  2020-08-04

This would suggest that the conversion did not go as planned, but that is actually not the case:

> res
 [1] "2020-03-18" "2020-03-19" "2020-03-20" "2020-03-21" "2020-03-22"
 [6] "2020-03-23" "2020-03-24" "2020-03-25" "2020-03-26" "2020-03-27"
[11] "2020-03-28" "2020-03-29" "2020-03-30" "2020-03-31" "2020-04-01"
[16] "2020-04-02" "2020-04-03" "2020-04-04" "2020-04-05" "2020-04-06"
[21] "2020-04-07" "2020-04-08"
> range(res)
[1] "2020-03-18" "2020-04-08"

zkamvar commented 4 years ago

The warnings come from the fact that it's trying out both the "mdy" and "dmy" versions of the dates. If you only expect dmy versions of dates, then set orders = "dmy".
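
The effect can be reproduced with base R alone. The sketch below is only an illustration of the two readings, not linelist's parsing code; the underscore formats and the fixed upper bound (taken from the warning above) are assumptions made for the example.

```r
# The 22 dates from the example above, rebuilt in day_month_year form
dates <- format(
  seq(as.Date("2020-03-18"), as.Date("2020-04-08"), by = "day"),
  "%d_%m_%Y"
)

last_date <- as.Date("2020-04-10")  # upper bound shown in the warning

# Read every string both ways; strings that are invalid under a format give NA
dmy <- as.Date(dates, format = "%d_%m_%Y")
mdy <- as.Date(dates, format = "%m_%d_%Y")

# Under the month-first reading, some strings land beyond the upper bound
flagged_mdy <- dates[!is.na(mdy) & mdy > last_date]

# Under the day-first reading alone, nothing is out of range
flagged_dmy <- dates[!is.na(dmy) & dmy > last_date]
```

Here `flagged_mdy` holds the same four strings listed in the warning, while `flagged_dmy` is empty. With linelist itself, the suggestion above would presumably be written as `linelist::guess_dates(dates, orders = "dmy")`.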