reconhub / linelist

An R package to import, clean, and store case data
https://www.repidemicsconsortium.org/linelist
Other
25 stars 5 forks source link

origin in guess_dates #73

Closed scottyaz closed 5 years ago

scottyaz commented 5 years ago

It seems like the origin for dates from Excel is dependent on the OS with origin in windows being "1899-12-30" rather than "1900-01-01". Current version of guess_dates() assumes "1900-01-01" if modern excel flag is on.

zkamvar commented 5 years ago

Hi @scottyaz,

Thank you for the info! I completely forgot about that really frustrating curse, which is absolutely NOT documented in their help page on the issue: https://support.office.com/en-us/article/date-systems-in-excel-e7fe7167-48a9-4b96-bb53-5612a800b487

Can you give the output of R.version, so that I can see what version of windows you use?

R.version
#>                _                           
#> platform       x86_64-pc-linux-gnu         
#> arch           x86_64                      
#> os             linux-gnu                   
#> system         x86_64, linux-gnu           
#> status                                     
#> major          3                           
#> minor          6.0                         
#> year           2019                        
#> month          04                          
#> day            26                          
#> svn rev        76424                       
#> language       R                           
#> version.string R version 3.6.0 (2019-04-26)
#> nickname       Planting of a Tree

Created on 2019-05-14 by the reprex package (v0.2.1)

scottyaz commented 5 years ago

I am not using windows but the excel file I was using was generated on a windows machine (like most that will come from MoHs). I forked the repo and if I manage to find some time, I’ll try to come up with a solution (though I can’t promise I’ll deliver).

On 14 May 2019, at 12:34, Zhian N. Kamvar wrote:

Hi @scottyaz,

Thank you for the info! I completely forgot about that really frustrating curse, which is absolutely NOT documented in their help page on the issue: https://support.office.com/en-us/article/date-systems-in-excel-e7fe7167-48a9-4b96-bb53-5612a800b487

Can you give the output of R.version, so that I can see what version of windows you use?

R.version
#>                _
#> platform       x86_64-pc-linux-gnu
#> arch           x86_64
#> os             linux-gnu
#> system         x86_64, linux-gnu
#> status
#> major          3
#> minor          6.0
#> year           2019
#> month          04
#> day            26
#> svn rev        76424
#> language       R
#> version.string R version 3.6.0 (2019-04-26)
#> nickname       Planting of a Tree

Created on 2019-05-14 by the reprex package (v0.2.1)

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/reconhub/linelist/issues/73#issuecomment-492183294

andrew azman

scottyaz commented 5 years ago

I wonder if there is useful meta-data in an excel file (recognising that you expect a data.frame or tidy variant, not an excel file) that can be used to figure out the OS.

xmlInternalTreeParse(unzip('my_linelist.xlsx','xl/workbook.xml')) 

## or
xmlInternalTreeParse(unzip('my_linelist.xlsx','docProps/app.xml')) 

Doubt you want to go here but...

zkamvar commented 5 years ago

Hi @scottyaz, I definitely do not want to go down the route of attempting to peer into excel files themselves. I've proposed a fix (#74) by changing the modern_excel flag to excel and having three options:

(modern <- as.character(as.integer(as.Date("2019-05-15") - as.Date("1900-01-01"))))
#> [1] "43598"
(oldmac <- as.character(as.integer(as.Date("2019-05-15") - as.Date("1904-01-01"))))
#> [1] "42138"
(oldwin <- as.character(as.integer(as.Date("2019-05-15") - as.Date("1899-12-30"))))
#> [1] "43600"
guess_dates(modern)
#> [1] "2019-05-15"
guess_dates(oldmac, excel = "old-mac")
#> [1] "2019-05-15"
guess_dates(oldwin, excel = "old-win")
#> [1] "2019-05-15"

Created on 2019-05-15 by the reprex package (v0.2.1)

Ultimately, it will be up to the user to know which version of excel their data was created with and include some positive checks to make sure things are working smoothly (especially since these dates can be off by only a few days).

zkamvar commented 5 years ago

Let me know if you think this will work, and I'll merge it.

scottyaz commented 5 years ago

That is totally reasonable!

zkamvar commented 5 years ago

Huh.... I think actually there are only two dates and 1900-01-01 ain't one of them :thinking:

Both readxl and janitor use 1899-12-30: https://github.com/tidyverse/readxl/blob/15fde898887a8941bb3839892bf30fa9866290ee/src/utils.h#L7-L40, which has to do with the leap year bug, but it's absolutely not clear to me why it need 12-30 and not 12-31 :weary: I'm going to change it back to modern_excel and just change the date to 1899-12-30

scottyaz commented 5 years ago

documentation still says 1900-01-01...probably want to update.

scottyaz commented 5 years ago

Also, unless I am missing something, still doesn't work properly when the 1899 origin is being used:

old <- as.character(as.integer(as.Date("2019-05-20") - as.Date("1899-12-30"))))
[1] "43605"
guess_dates(oldwin,modern_excel = TRUE) # obviously doesn't give correct answer
[1] "2019-05-22"
guess_dates(old,modern_excel = FALSE)
[1] "43605"
Warning message:
In guess_dates(old, modern_excel = FALSE) : 
The following 1 dates were not in the correct timeframe (1969-05-24 -- 2019-05-24):

  original  |  parsed    
  --------  |  ------    
  43605     |  2023-05-21
zkamvar commented 5 years ago

Also, unless I am missing something, still doesn't work properly when the 1899 origin is being used:

This is because it's a contrived example. As I explained in https://github.com/reconhub/linelist/issues/73#issuecomment-492700196, both readxl and janitor use this system and they both have EXTENSIVE tests. It has something to do with the leap year bug that was introduced in the 90's.

I think I have an excel file lying around somewhere that needed some of this cleaning.... let me check

scottyaz commented 5 years ago

I was trying this on real data and couldn't get the correct date and this is why I posted...See what you come up with.

zkamvar commented 5 years ago

I hate everything

zkamvar commented 5 years ago

Would you mind trying the janitor function and see if it gives you the same shitty result?

zkamvar commented 5 years ago

Also, check to make sure you have the latest version of linelist

scottyaz commented 5 years ago

Just installed from GitHub so I am pretty sure I do.

On 24 May 2019, at 15:09, Zhian N. Kamvar wrote:

Also, check to make sure you have the latest version of linelist

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/reconhub/linelist/issues/73#issuecomment-495619968

andrew azman

zkamvar commented 5 years ago

This is the sheet that has mixed date formats: https://github.com/everhartlab/SscPhenoProj/blob/54c8d75ddb917866e94c06f359f3bc2b61b8de7f/Analysis.R#L194-L207

zkamvar commented 5 years ago

For clarity, here is how guess_dates works on a wild excel sheet:

x <- "https://osf.io/2yfre/download"
tmp <- tempdir()
f <- file.path(tmp, "test.xlsx")
download.file(x, f)
ex <- readxl::read_excel(f)
ex
#> # A tibble: 95 x 10
#>    MCG   `AP-GenoID` InventoryID `in-JRS-collect… `AP-Continent_C…
#>    <chr> <chr>       <chr>       <chr>            <chr>           
#>  1 A     143         143         TRUE             North America_U…
#>  2 K     202         202         TRUE             North America_U…
#>  3 K     264         264         TRUE             North America_U…
#>  4 H     265         265         TRUE             North America_U…
#>  5 A     266         266         TRUE             North America_U…
#>  6 I     267         267         TRUE             North America_U…
#>  7 J     268         268         TRUE             North America_U…
#>  8 A     276         276         TRUE             North America_U…
#>  9 K     289         289         TRUE             North America_U…
#> 10 C     293A        293         FALSE            South America_A…
#> # … with 85 more rows, and 5 more variables: `JRS-Isolate #` <chr>,
#> #   `JRS-Collection Date` <chr>, `JRS-Source (Host)` <chr>,
#> #   `JRS-Geographical Location` <chr>, `JRS-Notes` <chr>
ex[["JRS-Collection Date"]]
#>  [1] "1977"     "33117"    "34516"    "34547"    "34547"    "34578"   
#>  [7] "34578"    "35309"    "1996"     "1996"     "1996"     "1996"    
#> [13] "1996"     "1996"     "36373"    "36373"    "41175"    "41175"   
#> [19] "8/1/2009" "8/1/2009" "2012"     "2014"     "2014"     "2014"    
#> [25] "2014"     "2014"     "2012"     "2012"     "2012"     "2012"    
#> [31] "2012"     "2012"     "2012"     "2012"     "2012"     "2010"    
#> [37] "2010"     "2010"     "2010"     "2010"     "2010"     "2010"    
#> [43] "2010"     "2010"     "2010"     "2009"     "2009"     "2009"    
#> [49] "2009"     "2009"     "2009"     "2009"     "2010"     "2010"    
#> [55] "2010"     "2010"     "2010"     "2010"     "NA"       "2012"    
#> [61] "2012"     "2012"     "2012"     "2012"     "2012"     "2012"    
#> [67] "2012"     "2012"     "2012"     "2012"     "2012"     "2012"    
#> [73] "2012"     "2012"     "2012"     "2012"     "2012"     "2012"    
#> [79] "2012"     "2012"     "2012"     "2012"     "2012"     "2012"    
#> [85] "2012"     "2012"     "2012"     "2012"     "2012"     "2012"    
#> [91] "2012"     "2012"     "2012"     "2012"     "2012"
linelist::guess_dates(ex[["JRS-Collection Date"]], err = 1, orders = list("Y", "mdy"))
#>  [1] "1977-01-01" "1990-09-01" "1994-07-01" "1994-08-01" "1994-08-01"
#>  [6] "1994-09-01" "1994-09-01" "1996-09-01" "1996-01-01" "1996-01-01"
#> [11] "1996-01-01" "1996-01-01" "1996-01-01" "1996-01-01" "1999-08-01"
#> [16] "1999-08-01" "2012-09-23" "2012-09-23" "2009-08-01" "2009-08-01"
#> [21] "2012-01-01" "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01"
#> [26] "2014-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [31] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [36] "2010-01-01" "2010-01-01" "2010-01-01" "2010-01-01" "2010-01-01"
#> [41] "2010-01-01" "2010-01-01" "2010-01-01" "2010-01-01" "2010-01-01"
#> [46] "2009-01-01" "2009-01-01" "2009-01-01" "2009-01-01" "2009-01-01"
#> [51] "2009-01-01" "2009-01-01" "2010-01-01" "2010-01-01" "2010-01-01"
#> [56] "2010-01-01" "2010-01-01" "2010-01-01" NA           "2012-01-01"
#> [61] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [66] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [71] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [76] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [81] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [86] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [91] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"

Created on 2019-05-24 by the reprex package (v0.3.0)

zkamvar commented 5 years ago

If you compare this to the original sheet, you'll see that the dates have been rendered correctly: https://osf.io/2yfre/

zkamvar commented 5 years ago

Also, unless I am missing something, still doesn't work properly when the 1899 origin is being used:

old <- as.character(as.integer(as.Date("2019-05-20") - as.Date("1899-12-30"))))
[1] "43605"
guess_dates(oldwin,modern_excel = TRUE) # obviously doesn't give correct answer
[1] "2019-05-22"
guess_dates(old,modern_excel = FALSE)
[1] "43605"
Warning message:
In guess_dates(old, modern_excel = FALSE) : 
The following 1 dates were not in the correct timeframe (1969-05-24 -- 2019-05-24):

  original  |  parsed    
  --------  |  ------    
  43605     |  2023-05-21

I just noticed, the first iteration you show uses "oldwin" and not "old" as the input. When I try this, I get the following:

old <- as.character(as.integer(as.Date("2019-05-20") - as.Date("1899-12-30")))
linelist::guess_dates(old, modern_excel = TRUE)
#> [1] "2019-05-20"

Created on 2019-05-24 by the reprex package (v0.3.0)

It's an understandable mistake, but please use reprex in the future when making examples.

scottyaz commented 5 years ago

Strange….

R> old <- as.character(as.integer(as.Date("2019-05-20") - 
as.Date("1899-12-30")))
R> linelist::guess_dates(old, modern_excel = TRUE)
[1] "2019-05-22"

On 24 May 2019, at 15:55, Zhian N. Kamvar wrote:

Also, unless I am missing something, still doesn't work properly when the 1899 origin is being used:

old <- as.character(as.integer(as.Date("2019-05-20") - 
as.Date("1899-12-30"))))
[1] "43605"
guess_dates(oldwin,modern_excel = TRUE) # obviously doesn't give 
correct answer
[1] "2019-05-22"
guess_dates(old,modern_excel = FALSE)
[1] "43605"
Warning message:
In guess_dates(old, modern_excel = FALSE) :
The following 1 dates were not in the correct timeframe (1969-05-24 
-- 2019-05-24):

  original  |  parsed
  --------  |  ------
  43605     |  2023-05-21

I just noticed, the first iteration you show uses "oldwin" and not "old" as the input. When I try this, I get the following:

old <- as.character(as.integer(as.Date("2019-05-20") - 
as.Date("1899-12-30")))
linelist::guess_dates(old, modern_excel = TRUE)
#> [1] "2019-05-20"

Created on 2019-05-24 by the reprex package (v0.3.0)

It's an understandable mistake, but please use reprex in the future when making examples.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/reconhub/linelist/issues/73#issuecomment-495639813

andrew azman

zkamvar commented 5 years ago

Can you post the package description?

packageDescription("linelist")
#> Package: linelist
#> Title: Tools to Import and Tidy Case Linelist Data
#> Version: 0.0.33.9000
#> Authors@R: c(person("Thibaut", "Jombart", email =
#>        "thibautjombart@gmail.com", role = c("aut", "cre")),
#>        person("Zhian N.", "Kamvar", email = "zkamvar@gmail.com",
#>        role = c("aut")))
#> Description: A collection of wrappers for importing case linelist
#>        data from usual formats, and tools for cleaning data,
#>        detecting dates, and storing meta-information on the
#>        content of the resulting data frame.
#> Depends: R (>= 3.0.0)
#> License: MIT + file LICENSE
#> Encoding: UTF-8
#> LazyData: true
#> Suggests: outbreaks, incidence, testthat, knitr, roxygen2, covr,
#>        tibble, dplyr, magrittr
#> RoxygenNote: 6.1.1
#> Imports: lubridate, rio, epitrix, tidyselect, rlang, forcats,
#>        utils, stats, crayon
#> URL: https://github.com/reconhub/linelist
#> BugReports: https://github.com/reconhub/linelist/issues
#> VignetteBuilder: knitr
#> Roxygen: list(markdown = TRUE)
#> RemoteType: github
#> RemoteHost: api.github.com
#> RemoteRepo: linelist
#> RemoteUsername: reconhub
#> RemoteRef: master
#> RemoteSha: fda9e18b02f5853cd311ddcc513c427244b21dd7
#> GithubRepo: linelist
#> GithubUsername: reconhub
#> GithubRef: master
#> GithubSHA1: fda9e18b02f5853cd311ddcc513c427244b21dd7
#> NeedsCompilation: no
#> Packaged: 2019-05-24 13:12:12 UTC; zkamvar
#> Author: Thibaut Jombart [aut, cre], Zhian N. Kamvar [aut]
#> Maintainer: Thibaut Jombart <thibautjombart@gmail.com>
#> Built: R 3.6.0; ; 2019-05-24 13:12:13 UTC; unix
#> 
#> -- File: /home/zkamvar/Documents/RLibrary/linelist/Meta/package.rds

Created on 2019-05-24 by the reprex package (v0.3.0)

scottyaz commented 5 years ago
R> packageDescription("linelist")
Package: linelist
Title: Tools to Import and Tidy Case Linelist Data
Version: 0.0.33.9000
Authors@R: c(person("Thibaut", "Jombart", email = 
"thibautjombart@gmail.com", role =
           c("aut", "cre")), person("Zhian N.", "Kamvar", email = 
"zkamvar@gmail.com", role =
           c("aut")))
Description: A collection of wrappers for importing case linelist data 
from usual formats,
           and tools for cleaning data, detecting dates, and storing 
meta-information on the
           content of the resulting data frame.
Depends: R (>= 3.0.0)
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Suggests: outbreaks, incidence, testthat, knitr, roxygen2, covr, tibble, 
dplyr, magrittr
RoxygenNote: 6.1.1
Imports: lubridate, rio, epitrix, tidyselect, rlang, forcats, utils, 
stats, crayon
URL: https://github.com/reconhub/linelist
BugReports: https://github.com/reconhub/linelist/issues
VignetteBuilder: knitr
Roxygen: list(markdown = TRUE)
RemoteType: github
RemoteHost: api.github.com
RemoteRepo: linelist
RemoteUsername: reconhub
RemoteRef: master
RemoteSha: fda9e18b02f5853cd311ddcc513c427244b21dd7
GithubRepo: linelist
GithubUsername: reconhub
GithubRef: master
GithubSHA1: fda9e18b02f5853cd311ddcc513c427244b21dd7
NeedsCompilation: no
Packaged: 2019-05-24 14:04:18 UTC; andrewa
Author: Thibaut Jombart [aut, cre], Zhian N. Kamvar [aut]
Maintainer: Thibaut Jombart <thibautjombart@gmail.com>
Built: R 3.5.2; ; 2019-05-24 14:04:18 UTC; unix

-- File: 
/Library/Frameworks/R.framework/Versions/3.5/Resources/library/linelist/Meta/package.rds

I wonder if it is something about dependency versions?

R> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3

Matrix products: default
BLAS: 
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: 
/Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] linelist_0.0.33.9000 stringr_1.4.0        aweek_0.2.0          
tidyr_0.8.3
[5] here_0.1             dplyr_0.8.0.1        readr_1.3.1          
readxl_1.3.1
[9] pacman_0.5.1

loaded via a namespace (and not attached):
  [1] tidyselect_0.2.5  remotes_2.0.2     purrr_0.3.1       haven_2.1.0  
      usethis_1.4.0
  [6] yaml_2.2.0        utf8_1.1.4        rlang_0.3.1       
pkgbuild_1.0.2    pillar_1.3.1
[11] foreign_0.8-71    glue_1.3.0        withr_2.1.2       
sessioninfo_1.1.1 cellranger_1.1.0
[16] zip_2.0.1         devtools_2.0.1    memoise_1.1.0     rio_0.5.16    
     forcats_0.4.0
[21] callr_3.1.1       ps_1.3.0          curl_3.3          fansi_0.4.0   
     Rcpp_1.0.0
[26] backports_1.1.3   desc_1.2.0        pkgload_1.0.2     fs_1.2.6      
     hms_0.4.2
[31] digest_0.6.18     stringi_1.3.1     openxlsx_4.1.0    
processx_3.2.1    rprojroot_1.3-2
[36] cli_1.0.1         tools_3.5.2       magrittr_1.5      tibble_2.0.1  
     crayon_1.3.4
[41] pkgconfig_2.0.2   data.table_1.12.0 prettyunits_1.0.2 
lubridate_1.7.4   assertthat_0.2.0
[46] rstudioapi_0.9.0  R6_2.4.0          compiler_3.5.2    git2r_0.24.0

On 24 May 2019, at 16:00, Zhian N. Kamvar wrote:

Can you post the package description?

packageDescription("linelist")
#> Package: linelist
#> Title: Tools to Import and Tidy Case Linelist Data
#> Version: 0.0.33.9000
#> Authors@R: c(person("Thibaut", "Jombart", email =
#>        "thibautjombart@gmail.com", role = c("aut", "cre")),
#>        person("Zhian N.", "Kamvar", email = "zkamvar@gmail.com",
#>        role = c("aut")))
#> Description: A collection of wrappers for importing case linelist
#>        data from usual formats, and tools for cleaning data,
#>        detecting dates, and storing meta-information on the
#>        content of the resulting data frame.
#> Depends: R (>= 3.0.0)
#> License: MIT + file LICENSE
#> Encoding: UTF-8
#> LazyData: true
#> Suggests: outbreaks, incidence, testthat, knitr, roxygen2, covr,
#>        tibble, dplyr, magrittr
#> RoxygenNote: 6.1.1
#> Imports: lubridate, rio, epitrix, tidyselect, rlang, forcats,
#>        utils, stats, crayon
#> URL: https://github.com/reconhub/linelist
#> BugReports: https://github.com/reconhub/linelist/issues
#> VignetteBuilder: knitr
#> Roxygen: list(markdown = TRUE)
#> RemoteType: github
#> RemoteHost: api.github.com
#> RemoteRepo: linelist
#> RemoteUsername: reconhub
#> RemoteRef: master
#> RemoteSha: fda9e18b02f5853cd311ddcc513c427244b21dd7
#> GithubRepo: linelist
#> GithubUsername: reconhub
#> GithubRef: master
#> GithubSHA1: fda9e18b02f5853cd311ddcc513c427244b21dd7
#> NeedsCompilation: no
#> Packaged: 2019-05-24 13:12:12 UTC; zkamvar
#> Author: Thibaut Jombart [aut, cre], Zhian N. Kamvar [aut]
#> Maintainer: Thibaut Jombart <thibautjombart@gmail.com>
#> Built: R 3.6.0; ; 2019-05-24 13:12:13 UTC; unix
#>
#> -- File: /home/zkamvar/Documents/RLibrary/linelist/Meta/package.rds

Created on 2019-05-24 by the reprex package (v0.3.0)

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/reconhub/linelist/issues/73#issuecomment-495641978

andrew azman

zkamvar commented 5 years ago

The only think I can think of why there would be a discrepancy is that you may have an unclean R session. Can you restart R and try again?

scottyaz commented 5 years ago

arggggg. i'll shut up now.