Closed scottyaz closed 5 years ago
Hi @scottyaz,
Thank you for the info! I completely forgot about that really frustrating curse, which is absolutely NOT documented in their help page on the issue: https://support.office.com/en-us/article/date-systems-in-excel-e7fe7167-48a9-4b96-bb53-5612a800b487
Can you give the output of R.version
, so that I can see what version of windows you use?
R.version
#> _
#> platform x86_64-pc-linux-gnu
#> arch x86_64
#> os linux-gnu
#> system x86_64, linux-gnu
#> status
#> major 3
#> minor 6.0
#> year 2019
#> month 04
#> day 26
#> svn rev 76424
#> language R
#> version.string R version 3.6.0 (2019-04-26)
#> nickname Planting of a Tree
Created on 2019-05-14 by the reprex package (v0.2.1)
I am not using windows but the excel file I was using was generated on a windows machine (like most that will come from MoHs). I forked the repo and if I manage to find some time, I’ll try to come up with a solution (though I can’t promise I’ll deliver).
On 14 May 2019, at 12:34, Zhian N. Kamvar wrote:
Hi @scottyaz,
Thank you for the info! I completely forgot about that really frustrating curse, which is absolutely NOT documented in their help page on the issue: https://support.office.com/en-us/article/date-systems-in-excel-e7fe7167-48a9-4b96-bb53-5612a800b487
Can you give the output of
R.version
, so that I can see what version of windows you use?R.version #> _ #> platform x86_64-pc-linux-gnu #> arch x86_64 #> os linux-gnu #> system x86_64, linux-gnu #> status #> major 3 #> minor 6.0 #> year 2019 #> month 04 #> day 26 #> svn rev 76424 #> language R #> version.string R version 3.6.0 (2019-04-26) #> nickname Planting of a Tree
Created on 2019-05-14 by the reprex package (v0.2.1)
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/reconhub/linelist/issues/73#issuecomment-492183294
andrew azman
I wonder if there is useful meta-data in an excel file (recognising that you expect a data.frame or tidy variant, not an excel file) that can be used to figure out the OS.
xmlInternalTreeParse(unzip('my_linelist.xlsx','xl/workbook.xml'))
## or
xmlInternalTreeParse(unzip('my_linelist.xlsx','docProps/app.xml'))
Doubt you want to go here but...
Hi @scottyaz, I definitely do not want to go down the route of attempting to peer into excel files themselves. I've proposed a fix (#74) by changing the modern_excel
flag to excel
and having three options:
(modern <- as.character(as.integer(as.Date("2019-05-15") - as.Date("1900-01-01"))))
#> [1] "43598"
(oldmac <- as.character(as.integer(as.Date("2019-05-15") - as.Date("1904-01-01"))))
#> [1] "42138"
(oldwin <- as.character(as.integer(as.Date("2019-05-15") - as.Date("1899-12-30"))))
#> [1] "43600"
guess_dates(modern)
#> [1] "2019-05-15"
guess_dates(oldmac, excel = "old-mac")
#> [1] "2019-05-15"
guess_dates(oldwin, excel = "old-win")
#> [1] "2019-05-15"
Created on 2019-05-15 by the reprex package (v0.2.1)
Ultimately, it will be up to the user to know which version of excel their data was created with and include some positive checks to make sure things are working smoothly (especially since these dates can be off by only a few days).
Let me know if you think this will work, and I'll merge it.
That is totally reasonable!
Huh.... I think actually there are only two dates and 1900-01-01 ain't one of them :thinking:
Both readxl and janitor use 1899-12-30: https://github.com/tidyverse/readxl/blob/15fde898887a8941bb3839892bf30fa9866290ee/src/utils.h#L7-L40, which has to do with the leap year bug, but it's absolutely not clear to me why it need 12-30 and not 12-31 :weary: I'm going to change it back to modern_excel and just change the date to 1899-12-30
documentation still says 1900-01-01...probably want to update.
Also, unless I am missing something, still doesn't work properly when the 1899 origin is being used:
old <- as.character(as.integer(as.Date("2019-05-20") - as.Date("1899-12-30"))))
[1] "43605"
guess_dates(oldwin,modern_excel = TRUE) # obviously doesn't give correct answer
[1] "2019-05-22"
guess_dates(old,modern_excel = FALSE)
[1] "43605"
Warning message:
In guess_dates(old, modern_excel = FALSE) :
The following 1 dates were not in the correct timeframe (1969-05-24 -- 2019-05-24):
original | parsed
-------- | ------
43605 | 2023-05-21
Also, unless I am missing something, still doesn't work properly when the 1899 origin is being used:
This is because it's a contrived example. As I explained in https://github.com/reconhub/linelist/issues/73#issuecomment-492700196, both readxl and janitor use this system and they both have EXTENSIVE tests. It has something to do with the leap year bug that was introduced in the 90's.
I think I have an excel file lying around somewhere that needed some of this cleaning.... let me check
I was trying this on real data and couldn't get the correct date and this is why I posted...See what you come up with.
I hate everything
Would you mind trying the janitor function and see if it gives you the same shitty result?
Also, check to make sure you have the latest version of linelist
Just installed from GitHub so I am pretty sure I do.
On 24 May 2019, at 15:09, Zhian N. Kamvar wrote:
Also, check to make sure you have the latest version of linelist
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/reconhub/linelist/issues/73#issuecomment-495619968
andrew azman
This is the sheet that has mixed date formats: https://github.com/everhartlab/SscPhenoProj/blob/54c8d75ddb917866e94c06f359f3bc2b61b8de7f/Analysis.R#L194-L207
For clarity, here is how guess_dates works on a wild excel sheet:
x <- "https://osf.io/2yfre/download"
tmp <- tempdir()
f <- file.path(tmp, "test.xlsx")
download.file(x, f)
ex <- readxl::read_excel(f)
ex
#> # A tibble: 95 x 10
#> MCG `AP-GenoID` InventoryID `in-JRS-collect… `AP-Continent_C…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 A 143 143 TRUE North America_U…
#> 2 K 202 202 TRUE North America_U…
#> 3 K 264 264 TRUE North America_U…
#> 4 H 265 265 TRUE North America_U…
#> 5 A 266 266 TRUE North America_U…
#> 6 I 267 267 TRUE North America_U…
#> 7 J 268 268 TRUE North America_U…
#> 8 A 276 276 TRUE North America_U…
#> 9 K 289 289 TRUE North America_U…
#> 10 C 293A 293 FALSE South America_A…
#> # … with 85 more rows, and 5 more variables: `JRS-Isolate #` <chr>,
#> # `JRS-Collection Date` <chr>, `JRS-Source (Host)` <chr>,
#> # `JRS-Geographical Location` <chr>, `JRS-Notes` <chr>
ex[["JRS-Collection Date"]]
#> [1] "1977" "33117" "34516" "34547" "34547" "34578"
#> [7] "34578" "35309" "1996" "1996" "1996" "1996"
#> [13] "1996" "1996" "36373" "36373" "41175" "41175"
#> [19] "8/1/2009" "8/1/2009" "2012" "2014" "2014" "2014"
#> [25] "2014" "2014" "2012" "2012" "2012" "2012"
#> [31] "2012" "2012" "2012" "2012" "2012" "2010"
#> [37] "2010" "2010" "2010" "2010" "2010" "2010"
#> [43] "2010" "2010" "2010" "2009" "2009" "2009"
#> [49] "2009" "2009" "2009" "2009" "2010" "2010"
#> [55] "2010" "2010" "2010" "2010" "NA" "2012"
#> [61] "2012" "2012" "2012" "2012" "2012" "2012"
#> [67] "2012" "2012" "2012" "2012" "2012" "2012"
#> [73] "2012" "2012" "2012" "2012" "2012" "2012"
#> [79] "2012" "2012" "2012" "2012" "2012" "2012"
#> [85] "2012" "2012" "2012" "2012" "2012" "2012"
#> [91] "2012" "2012" "2012" "2012" "2012"
linelist::guess_dates(ex[["JRS-Collection Date"]], err = 1, orders = list("Y", "mdy"))
#> [1] "1977-01-01" "1990-09-01" "1994-07-01" "1994-08-01" "1994-08-01"
#> [6] "1994-09-01" "1994-09-01" "1996-09-01" "1996-01-01" "1996-01-01"
#> [11] "1996-01-01" "1996-01-01" "1996-01-01" "1996-01-01" "1999-08-01"
#> [16] "1999-08-01" "2012-09-23" "2012-09-23" "2009-08-01" "2009-08-01"
#> [21] "2012-01-01" "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01"
#> [26] "2014-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [31] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [36] "2010-01-01" "2010-01-01" "2010-01-01" "2010-01-01" "2010-01-01"
#> [41] "2010-01-01" "2010-01-01" "2010-01-01" "2010-01-01" "2010-01-01"
#> [46] "2009-01-01" "2009-01-01" "2009-01-01" "2009-01-01" "2009-01-01"
#> [51] "2009-01-01" "2009-01-01" "2010-01-01" "2010-01-01" "2010-01-01"
#> [56] "2010-01-01" "2010-01-01" "2010-01-01" NA "2012-01-01"
#> [61] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [66] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [71] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [76] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [81] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [86] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [91] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
Created on 2019-05-24 by the reprex package (v0.3.0)
If you compare this to the original sheet, you'll see that the dates have been rendered correctly: https://osf.io/2yfre/
Also, unless I am missing something, still doesn't work properly when the 1899 origin is being used:
old <- as.character(as.integer(as.Date("2019-05-20") - as.Date("1899-12-30")))) [1] "43605" guess_dates(oldwin,modern_excel = TRUE) # obviously doesn't give correct answer [1] "2019-05-22" guess_dates(old,modern_excel = FALSE) [1] "43605" Warning message: In guess_dates(old, modern_excel = FALSE) : The following 1 dates were not in the correct timeframe (1969-05-24 -- 2019-05-24): original | parsed -------- | ------ 43605 | 2023-05-21
I just noticed, the first iteration you show uses "oldwin" and not "old" as the input. When I try this, I get the following:
old <- as.character(as.integer(as.Date("2019-05-20") - as.Date("1899-12-30")))
linelist::guess_dates(old, modern_excel = TRUE)
#> [1] "2019-05-20"
Created on 2019-05-24 by the reprex package (v0.3.0)
It's an understandable mistake, but please use reprex in the future when making examples.
Strange….
R> old <- as.character(as.integer(as.Date("2019-05-20") -
as.Date("1899-12-30")))
R> linelist::guess_dates(old, modern_excel = TRUE)
[1] "2019-05-22"
On 24 May 2019, at 15:55, Zhian N. Kamvar wrote:
Also, unless I am missing something, still doesn't work properly when the 1899 origin is being used:
old <- as.character(as.integer(as.Date("2019-05-20") - as.Date("1899-12-30")))) [1] "43605" guess_dates(oldwin,modern_excel = TRUE) # obviously doesn't give correct answer [1] "2019-05-22" guess_dates(old,modern_excel = FALSE) [1] "43605" Warning message: In guess_dates(old, modern_excel = FALSE) : The following 1 dates were not in the correct timeframe (1969-05-24 -- 2019-05-24): original | parsed -------- | ------ 43605 | 2023-05-21
I just noticed, the first iteration you show uses "oldwin" and not "old" as the input. When I try this, I get the following:
old <- as.character(as.integer(as.Date("2019-05-20") - as.Date("1899-12-30"))) linelist::guess_dates(old, modern_excel = TRUE) #> [1] "2019-05-20"
Created on 2019-05-24 by the reprex package (v0.3.0)
It's an understandable mistake, but please use reprex in the future when making examples.
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/reconhub/linelist/issues/73#issuecomment-495639813
andrew azman
Can you post the package description?
packageDescription("linelist")
#> Package: linelist
#> Title: Tools to Import and Tidy Case Linelist Data
#> Version: 0.0.33.9000
#> Authors@R: c(person("Thibaut", "Jombart", email =
#> "thibautjombart@gmail.com", role = c("aut", "cre")),
#> person("Zhian N.", "Kamvar", email = "zkamvar@gmail.com",
#> role = c("aut")))
#> Description: A collection of wrappers for importing case linelist
#> data from usual formats, and tools for cleaning data,
#> detecting dates, and storing meta-information on the
#> content of the resulting data frame.
#> Depends: R (>= 3.0.0)
#> License: MIT + file LICENSE
#> Encoding: UTF-8
#> LazyData: true
#> Suggests: outbreaks, incidence, testthat, knitr, roxygen2, covr,
#> tibble, dplyr, magrittr
#> RoxygenNote: 6.1.1
#> Imports: lubridate, rio, epitrix, tidyselect, rlang, forcats,
#> utils, stats, crayon
#> URL: https://github.com/reconhub/linelist
#> BugReports: https://github.com/reconhub/linelist/issues
#> VignetteBuilder: knitr
#> Roxygen: list(markdown = TRUE)
#> RemoteType: github
#> RemoteHost: api.github.com
#> RemoteRepo: linelist
#> RemoteUsername: reconhub
#> RemoteRef: master
#> RemoteSha: fda9e18b02f5853cd311ddcc513c427244b21dd7
#> GithubRepo: linelist
#> GithubUsername: reconhub
#> GithubRef: master
#> GithubSHA1: fda9e18b02f5853cd311ddcc513c427244b21dd7
#> NeedsCompilation: no
#> Packaged: 2019-05-24 13:12:12 UTC; zkamvar
#> Author: Thibaut Jombart [aut, cre], Zhian N. Kamvar [aut]
#> Maintainer: Thibaut Jombart <thibautjombart@gmail.com>
#> Built: R 3.6.0; ; 2019-05-24 13:12:13 UTC; unix
#>
#> -- File: /home/zkamvar/Documents/RLibrary/linelist/Meta/package.rds
Created on 2019-05-24 by the reprex package (v0.3.0)
R> packageDescription("linelist")
Package: linelist
Title: Tools to Import and Tidy Case Linelist Data
Version: 0.0.33.9000
Authors@R: c(person("Thibaut", "Jombart", email =
"thibautjombart@gmail.com", role =
c("aut", "cre")), person("Zhian N.", "Kamvar", email =
"zkamvar@gmail.com", role =
c("aut")))
Description: A collection of wrappers for importing case linelist data
from usual formats,
and tools for cleaning data, detecting dates, and storing
meta-information on the
content of the resulting data frame.
Depends: R (>= 3.0.0)
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Suggests: outbreaks, incidence, testthat, knitr, roxygen2, covr, tibble,
dplyr, magrittr
RoxygenNote: 6.1.1
Imports: lubridate, rio, epitrix, tidyselect, rlang, forcats, utils,
stats, crayon
URL: https://github.com/reconhub/linelist
BugReports: https://github.com/reconhub/linelist/issues
VignetteBuilder: knitr
Roxygen: list(markdown = TRUE)
RemoteType: github
RemoteHost: api.github.com
RemoteRepo: linelist
RemoteUsername: reconhub
RemoteRef: master
RemoteSha: fda9e18b02f5853cd311ddcc513c427244b21dd7
GithubRepo: linelist
GithubUsername: reconhub
GithubRef: master
GithubSHA1: fda9e18b02f5853cd311ddcc513c427244b21dd7
NeedsCompilation: no
Packaged: 2019-05-24 14:04:18 UTC; andrewa
Author: Thibaut Jombart [aut, cre], Zhian N. Kamvar [aut]
Maintainer: Thibaut Jombart <thibautjombart@gmail.com>
Built: R 3.5.2; ; 2019-05-24 14:04:18 UTC; unix
-- File:
/Library/Frameworks/R.framework/Versions/3.5/Resources/library/linelist/Meta/package.rds
I wonder if it is something about dependency versions?
R> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3
Matrix products: default
BLAS:
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK:
/Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] linelist_0.0.33.9000 stringr_1.4.0 aweek_0.2.0
tidyr_0.8.3
[5] here_0.1 dplyr_0.8.0.1 readr_1.3.1
readxl_1.3.1
[9] pacman_0.5.1
loaded via a namespace (and not attached):
[1] tidyselect_0.2.5 remotes_2.0.2 purrr_0.3.1 haven_2.1.0
usethis_1.4.0
[6] yaml_2.2.0 utf8_1.1.4 rlang_0.3.1
pkgbuild_1.0.2 pillar_1.3.1
[11] foreign_0.8-71 glue_1.3.0 withr_2.1.2
sessioninfo_1.1.1 cellranger_1.1.0
[16] zip_2.0.1 devtools_2.0.1 memoise_1.1.0 rio_0.5.16
forcats_0.4.0
[21] callr_3.1.1 ps_1.3.0 curl_3.3 fansi_0.4.0
Rcpp_1.0.0
[26] backports_1.1.3 desc_1.2.0 pkgload_1.0.2 fs_1.2.6
hms_0.4.2
[31] digest_0.6.18 stringi_1.3.1 openxlsx_4.1.0
processx_3.2.1 rprojroot_1.3-2
[36] cli_1.0.1 tools_3.5.2 magrittr_1.5 tibble_2.0.1
crayon_1.3.4
[41] pkgconfig_2.0.2 data.table_1.12.0 prettyunits_1.0.2
lubridate_1.7.4 assertthat_0.2.0
[46] rstudioapi_0.9.0 R6_2.4.0 compiler_3.5.2 git2r_0.24.0
On 24 May 2019, at 16:00, Zhian N. Kamvar wrote:
Can you post the package description?
packageDescription("linelist") #> Package: linelist #> Title: Tools to Import and Tidy Case Linelist Data #> Version: 0.0.33.9000 #> Authors@R: c(person("Thibaut", "Jombart", email = #> "thibautjombart@gmail.com", role = c("aut", "cre")), #> person("Zhian N.", "Kamvar", email = "zkamvar@gmail.com", #> role = c("aut"))) #> Description: A collection of wrappers for importing case linelist #> data from usual formats, and tools for cleaning data, #> detecting dates, and storing meta-information on the #> content of the resulting data frame. #> Depends: R (>= 3.0.0) #> License: MIT + file LICENSE #> Encoding: UTF-8 #> LazyData: true #> Suggests: outbreaks, incidence, testthat, knitr, roxygen2, covr, #> tibble, dplyr, magrittr #> RoxygenNote: 6.1.1 #> Imports: lubridate, rio, epitrix, tidyselect, rlang, forcats, #> utils, stats, crayon #> URL: https://github.com/reconhub/linelist #> BugReports: https://github.com/reconhub/linelist/issues #> VignetteBuilder: knitr #> Roxygen: list(markdown = TRUE) #> RemoteType: github #> RemoteHost: api.github.com #> RemoteRepo: linelist #> RemoteUsername: reconhub #> RemoteRef: master #> RemoteSha: fda9e18b02f5853cd311ddcc513c427244b21dd7 #> GithubRepo: linelist #> GithubUsername: reconhub #> GithubRef: master #> GithubSHA1: fda9e18b02f5853cd311ddcc513c427244b21dd7 #> NeedsCompilation: no #> Packaged: 2019-05-24 13:12:12 UTC; zkamvar #> Author: Thibaut Jombart [aut, cre], Zhian N. Kamvar [aut] #> Maintainer: Thibaut Jombart <thibautjombart@gmail.com> #> Built: R 3.6.0; ; 2019-05-24 13:12:13 UTC; unix #> #> -- File: /home/zkamvar/Documents/RLibrary/linelist/Meta/package.rds
Created on 2019-05-24 by the reprex package (v0.3.0)
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/reconhub/linelist/issues/73#issuecomment-495641978
andrew azman
The only think I can think of why there would be a discrepancy is that you may have an unclean R session. Can you restart R and try again?
arggggg. i'll shut up now.
It seems like the origin for dates from Excel is dependent on the OS with origin in windows being "1899-12-30" rather than "1900-01-01". Current version of
guess_dates()
assumes "1900-01-01" if modern excel flag is on.