Closed dfalster closed 8 years ago
@RemkoDuursma, can you please run the following to help me diagnose an error, then paste results in here? You'll need to change the variable
path <- tempfile()
library(baad.data)
d <- baad_data("1.0.0")
storr:::hash_object(d)
d <- baad_data("1.0.0", path)
storr:::hash_object(d)
baad_data_del("1.0.0", path)
sessionInfo()
> path <- tempfile()
> library(baad.data)
> d <- baad_data("1.0.0")
> storr:::hash_object(d)
[1] "c1acf54690ec511801477ab3e83f0b95"
> d <- baad_data("1.0.0", path)
|======================================================================================================| 100%
> storr:::hash_object(d)
[1] "21aaa8f37193d849b36c7346476afa13"
> baad_data_del("1.0.0", path)
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] baad.data_1.0.1
loaded via a namespace (and not attached):
[1] httr_1.2.1 R6_2.1.2 rsconnect_0.4.3 tools_3.3.1 curl_0.9.7 rappdirs_0.3.1
[7] datastorr_0.0.3 jsonlite_1.0 digest_0.6.9 bibtex_0.4.0 storr_1.0.1
Hi @richfitz, I need your help to work out why one of our tests is failing. The origins of the problem seem to trace to deep within storr and datastorr, in particular the unpacking of downloaded products, so it requires the main architect's eyes.
The failing test is the one for the object hash (Line 12). This test was presumably added to verify the integrity of the products, i.e. whether there are significant changes. Well, a breaking change has arisen, so I want to know whether it is serious.
I suspect the problem may have arisen following replumbing of storr or datastorr (e.g. in history here), but only surfaced now when I reran the tests locally and on a Windows machine -- they still pass on Travis.
Anyway, here is the issue. Run the following:
d <- baad_data("1.0.0")
storr:::hash_object(d)
We were expecting a value of "7c59e15a5d56752775e8f8e9748e3556". We do get this value on Travis, also in a docker container.
But on my mac (and also James's mac) we get:
> d <- baad_data("1.0.0")
> storr:::hash_object(d)
[1] "a8c493844b2054aba696fce3f13ddd9d"
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_US.UTF-8/en_AU.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] baad.data_1.0.1
loaded via a namespace (and not attached):
[1] R6_2.1.2 tools_3.3.0 rappdirs_0.3.1 datastorr_0.0.3
[5] digest_0.6.9 bibtex_0.4.0 storr_1.0.1
Then on Remko's Windows machine we get a value of "21aaa8f37193d849b36c7346476afa13", which is different from the value obtained on Appveyor of "e4c3df9544f6312ba6b77bc0909ed8a4" (see previous comment for sessionInfo).
It's also worth noting that I USED to get the same result on my Mac as on Linux machines. You can see that results have also changed for Remko: his comment above shows a different hash between the version cached on his machine (probably created some time ago) and a fresh download.
So in summary:
(and for posterity, here is code for running in docker)
Launch docker container
docker run -it dfalster/baad
Then inside the container, launch R and run
devtools::install_github("richfitz/datastorr")
devtools::install_github("traitecoevo/baad.data")
d <- baad.data::baad_data("1.0.0")
storr:::hash_object(d)
Interesting, and quite worrying. I'll try and replicate this today, and hopefully get onto it this week. My bet, if it involves unzipping, is that there have been some changes to R's unzip functions. The digest stuff should be pretty solid as it's depended on by heaps of packages.
Thanks, I'll await your findings.
I get the expected hash on my Linux machine and on OS X with R 3.2.3.
On Windows with R 3.3.1 I get the same hash as Remko.
Traced the issue so far to a change in what bibtex::read.bib(bib_file) has produced. The error is coming from the baad_unpack function. You can explore this with:
debug(baad.data:::baad_unpack)
path <- tempfile()
d <- baad.data::baad_data("1.0.0", path)
and step through (with n) until you've got to the point where baad[["bib"]] has been created.
Here is the culprit:
[1] "Component “Markesteijn2009”: Component “abstract”: 1 string mismatch"
So, on Windows, bibtex is (I think) converting the \r\n to \n in the abstract for Markesteijn2009.
As to what to do: you could try tweaking the baad_data function so that it agrees across all platforms? You're highly unlikely to recover the exact hash as before, but it should not matter that much as you won't change the upstream data at all.
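One way to approach the cross-platform agreement suggested above is to normalise line endings in every character field before hashing. What follows is a minimal base-R sketch, not the fix that went into baad.data: normalise_eol and normalise_eol_recursive are hypothetical names, and the two small lists merely stand in for the same bib entry read with and without carriage returns. It has only been thought through for plain nested lists; classed objects such as those returned by bibtex::read.bib may need extra care.

```r
# Hypothetical helper: turn Windows (\r\n) and bare \r line endings into \n
normalise_eol <- function(x) {
  x <- gsub("\r\n", "\n", x, fixed = TRUE)
  gsub("\r", "\n", x, fixed = TRUE)
}

# Apply the helper to every character element of a nested list,
# leaving numbers and other types untouched
normalise_eol_recursive <- function(obj) {
  rapply(obj, normalise_eol, classes = "character", how = "replace")
}

# Stand-ins for one bib entry with and without carriage returns
with_cr    <- list(abstract = "line one\r\nline two", year = 2009)
without_cr <- list(abstract = "line one\nline two", year = 2009)

identical(with_cr, without_cr)                           # the raw entries differ
identical(normalise_eol_recursive(with_cr), without_cr)  # they agree once normalised
```

Normalising at unpack time would change the stored object, so the previously recorded hash would not be recovered, as noted above.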
Wow, well found rich. Thanks very much for investigating. I'll implement a fix next week.
Indeed, there are some bad line endings in bib files. Taking hash of just the data component of 1.0.0 produces consistent results for me on OSX and linux (via docker):
> d <- baad.data::baad_data("1.0.0")
|======================================================================| 100%
> storr:::hash_object(d[["data"]])
[1] "16e346bcc5a49c10a3974b6ac149749f"
So I will adjust the test to check that hash on v 1.0.0, and add a check on the hash for the entire object in a later release.
We had some routines for cleaning line endings in baad, but these were only being applied to csv files.
But it appears (to me) that it is not in fact the line endings (or possibly there are two different things going on), but rather bibtex's formatting of authors, which is behaving differently on OSX and Linux. In particular, the way names like "van Breugel", "De Reffye", "von Lüpke" are handled.
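To illustrate why a name particle can change the hash without changing the printed author, here is a small sketch using utils::person from base R. The two objects are hand-built stand-ins for the two parses of "Philippe De Reffye", not output from bibtex, and full_name is a hypothetical helper.

```r
# Two ways of splitting the same name into given and family parts
split_particle  <- person(given = c("Philippe", "De"), family = "Reffye")
joined_particle <- person(given = "Philippe", family = "De Reffye")

# The underlying objects differ, so any hash of them would differ too
identical(split_particle, joined_particle)

# Hypothetical helper: collapse each person to a single display string,
# which no longer depends on where the particle was attached
full_name <- function(p) format(p, include = c("given", "family"))

identical(full_name(split_particle), full_name(joined_particle))
```

Comparing or hashing the collapsed strings rather than the person objects would sidestep the parsing difference, at the cost of discarding the given/family structure.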
I've just been comparing the contents of d <- baad.data::baad_data("1.0.0") obtained on my Mac and in docker (i.e. Linux):
> d <- baad.data::baad_data("1.0.0") #docker
|======================================================================| 100%
> storr:::hash_object(d)
[1] "7c59e15a5d56752775e8f8e9748e3556"
> d2 <- readRDS("/root/data/1.0.0.OSX.rds") #OSX
> storr:::hash_object(d2)
[1] "a8c493844b2054aba696fce3f13ddd9d"
> storr:::hash_object(d[["data"]])
[1] "16e346bcc5a49c10a3974b6ac149749f"
> storr:::hash_object(d2[["data"]])
[1] "16e346bcc5a49c10a3974b6ac149749f"
> all.equal(d2, d)
[1] "Component “bib”: Component “Petritan2009”: Component “author”: Component 2: Component 1: Lengths (1, 2) differ (string compare on first 1)"
[2] "Component “bib”: Component “Petritan2009”: Component “author”: Component 2: Component 2: 1 string mismatch"
[3] "Component “bib”: Component “vanBreugel2011”: Component “author”: Component 1: Component 1: Lengths (1, 2) differ (string compare on first 1)"
[4] "Component “bib”: Component “vanBreugel2011”: Component “author”: Component 1: Component 2: 1 string mismatch"
[5] "Component “bib”: Component “Wang2011”: Component “author”: Component 7: Component 1: Lengths (1, 2) differ (string compare on first 1)"
[6] "Component “bib”: Component “Wang2011”: Component “author”: Component 7: Component 2: 1 string mismatch"
> compare_hash_elements <- function(x1, x2) {
+
+ i <- sapply(x1, storr:::hash_object) == sapply(x2, storr:::hash_object)
+ names(x1)[!i]
+ }
> compare_hash_elements(d, d2)
[1] "bib"
> compare_hash_elements(d[["bib"]], d2[["bib2"]])
character(0)
> compare_hash_elements(d[["bib"]], d2[["bib"]])
[1] "Petritan2009" "vanBreugel2011" "Wang2011"
> d[["bib"]][["Wang2011"]]$author
[1] "Feng Wang" "Mengzhen Kang" "Qi Lu"
[4] "Véronique Letort" "Hui Han" "Yan Guo"
[7] "Philippe De Reffye" "Baoguo Li"
> d2[["bib"]][["Wang2011"]]$author
[1] "Feng Wang" "Mengzhen Kang" "Qi Lu"
[4] "Véronique Letort" "Hui Han" "Yan Guo"
[7] "Philippe De Reffye" "Baoguo Li"
> unlist(d2[["bib"]][["Wang2011"]]$author)
given family given family given family
"Feng" "Wang" "Mengzhen" "Kang" "Qi" "Lu"
given family given family given family
"Véronique" "Letort" "Hui" "Han" "Yan" "Guo"
given family given family
"Philippe" "De Reffye" "Baoguo" "Li"
> unlist(d[["bib"]][["Wang2011"]]$author)
given family given family given family
"Feng" "Wang" "Mengzhen" "Kang" "Qi" "Lu"
given family given family given family
"Véronique" "Letort" "Hui" "Han" "Yan" "Guo"
given1 given2 family given family
"Philippe" "De" "Reffye" "Baoguo" "Li"
> unlist(d[["bib"]][["vanBreugel2011"]]$author)
given1 given2 family given family given
"Michiel" "van" "Breugel" "Johannes" "Ransijn" "Dylan"
family given family given1 given2 family
"Craven" "Frans" "Bongers" "Jefferson" "S." "Hall"
> unlist(d2[["bib"]][["vanBreugel2011"]]$author)
given family given family given
"Michiel" "van Breugel" "Johannes" "Ransijn" "Dylan"
family given family given1 given2
"Craven" "Frans" "Bongers" "Jefferson" "S."
family
"Hall"
> unlist(d[["bib"]][["Petritan2009"]]$author)
given family given1 given2 family given family
"Any" "Petriţan" "Burghard" "von" "Lüpke" "Ion" "Petriţan"
> unlist(d2[["bib"]][["Petritan2009"]]$author)
given family given family given family
"Any" "Petriţan" "Burghard" "von Lüpke" "Ion" "Petriţan"
Alas, even when hashing just the data element of version 1.0.0, the hash on Windows differs from Mac and Linux (Appveyor test still failing).
@RemkoDuursma can you please run the following and upload the two RDS files in a zip below? (github doesn't like rds but accepts zip)
library(baad.data)
d <- baad_data("1.0.0")
storr:::hash_object(d)
saveRDS(d, "win1.rds")
path <- tempfile()
d2 <- baad_data("1.0.0", path)
storr:::hash_object(d2)
saveRDS(d2, "win2.rds")
Done. win12.zip
Thanks heaps! Can you also confirm the output of the following:
library(baad.data)
d <- baad_data("1.0.0")
storr:::hash_object(d[["data"]])
path <- tempfile()
d2 <- baad_data("1.0.0", path)
storr:::hash_object(d2[["data"]])
> library(baad.data)
> d <- baad_data("1.0.0")
> storr:::hash_object(d[["data"]])
[1] "16e346bcc5a49c10a3974b6ac149749f"
> path <- tempfile()
> d2 <- baad_data("1.0.0", path)
|=========================================================================================| 100%
> storr:::hash_object(d2[["data"]])
[1] "16e346bcc5a49c10a3974b6ac149749f"
Good, that's what we get everywhere else! So why is appveyor giving a different result? (It's returning "bbb1a095d02a931e852d75e057340c71")
And just to confirm, @richfitz was right -- there is an issue with line endings comparing windows output to linux and mac:
> d3 <- readRDS("/root/data/win2.rds") #windows fresh download
> storr:::hash_object(d3)
[1] "21aaa8f37193d849b36c7346476afa13"
> storr:::hash_object(d3[["data"]])
[1] "16e346bcc5a49c10a3974b6ac149749f"
> all.equal(d3, d)
[1] "Component “bib”: Component “Markesteijn2009”: Component “abstract”: 1 string mismatch"
[2] "Component “bib”: Component “Petritan2009”: Component “author”: Component 2: Component 1: Lengths (1, 2) differ (string compare on first 1)"
[3] "Component “bib”: Component “Petritan2009”: Component “author”: Component 2: Component 2: 1 string mismatch"
[4] "Component “bib”: Component “vanBreugel2011”: Component “author”: Component 1: Component 1: Lengths (1, 2) differ (string compare on first 1)"
[5] "Component “bib”: Component “vanBreugel2011”: Component “author”: Component 1: Component 2: 1 string mismatch"
[6] "Component “bib”: Component “Wang2011”: Component “author”: Component 7: Component 1: Lengths (1, 2) differ (string compare on first 1)"
[7] "Component “bib”: Component “Wang2011”: Component “author”: Component 7: Component 2: 1 string mismatch"
> all.equal(d3, d2)
[1] "Component “bib”: Component “Markesteijn2009”: Component “abstract”: 1 string mismatch"
> compare_hash_elements(d, d3)
[1] "bib"
> compare_hash_elements(d2, d3)
[1] "bib"
> compare_hash_elements(d2[["bib"]], d3[["bib"]])
[1] "Markesteijn2009"
> compare_hash_elements(d[["bib"]], d3[["bib"]])
[1] "Markesteijn2009" "Petritan2009" "vanBreugel2011" "Wang2011"
> unlist(d3[["bib"]][["Wang2011"]]$author)
given family given family given family
"Feng" "Wang" "Mengzhen" "Kang" "Qi" "Lu"
given family given family given family
"Véronique" "Letort" "Hui" "Han" "Yan" "Guo"
given family given family
"Philippe" "De Reffye" "Baoguo" "Li"
>
> all.equal(d[["data"]], d3[["data"]])
[1] TRUE
> all.equal(d2[["data"]], d3[["data"]])
[1] TRUE
@richfitz Can I get your opinion on the solution implemented here?
A summary of the problem so far is:
Line endings differ in the bib element between Windows and other platforms. In addition, names like "van Breugel", "De Reffye", "von Lüpke" are handled differently on Linux, giving a second reason for hashes to differ.
Because of this, the test was changed to hash only the data component. But this was behaving differently on Appveyor to all other platforms. (I get the same result for storr:::hash_object(baad[["data"]]) on my machine, Remko's machine, and Travis.) The comparison script further down shows that the differences arise in columns where NAs are present, on the Appveyor platform compared to others.
The solution implemented was to hash just the 'data' component, after converting to character.
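The idea of hashing after converting to character can be sketched in base R as below. This is an illustrative stand-in, not the code used in the baad.data tests: hash_as_character is a hypothetical name, the hash comes from tools::md5sum on a temporary text file rather than from storr:::hash_object, and the small data frame is made up.

```r
# Hypothetical helper, not part of baad.data: hash a data frame through a
# plain-text representation of its columns
hash_as_character <- function(df) {
  # as.character() maps every numeric NA to NA_character_, so the result no
  # longer depends on how a numeric NA happens to be stored
  cols <- lapply(df, as.character)
  # Flatten to lines: a marker with the column name, then the values
  lines <- unlist(lapply(names(cols), function(nm) {
    vals <- cols[[nm]]
    vals[is.na(vals)] <- "NA"
    c(paste0("#", nm), vals)
  }), use.names = FALSE)
  tmp <- tempfile()
  on.exit(unlink(tmp))
  con <- file(tmp, open = "wb")  # binary mode, so "\n" is written as-is
  writeLines(lines, con, sep = "\n")
  close(con)
  unname(tools::md5sum(tmp))
}

# Made-up example data with a missing value
df <- data.frame(h.t = c(1.5, NA, 3), studyName = c("a", "b", "c"),
                 stringsAsFactors = FALSE)
hash_as_character(df)
```

A trade-off of this design is that as.character() prints doubles to a limited number of significant digits, so very small numerical differences would go unnoticed, and its formatting of doubles may differ between R versions.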
By saving artefacts on Appveyor, I compared the data as loaded on Appveyor to what I was getting on my machine:
compare_hash_elements <- function(x1, x2) {
i <- sapply(x1, storr:::hash_object) == sapply(x2, storr:::hash_object)
names(x1)[!i]
}
compare_hash_vals <- function(x1, x2) {
i <- sapply(x1, storr:::hash_object) == sapply(x2, storr:::hash_object)
x1[!i]
}
# download https://ci.appveyor.com/api/buildjobs/qjnops93iewg2xw2/artifacts/baad.data.Rcheck/tests/testthat/baad_1.0.0.rds
library(baad.data)
library(testthat)
d3 <- readRDS("~/Downloads/baad_1.0.0.rds") # from appveyor
d <- baad_data("1.0.0") #local mac
for(x in names(d)){
expect_identical(d[[x]], d3[[x]])
}
# note only differences noted is in the bib, due to both line endings and processing of names like "van ..., de ..."
# Now compare hashes of the
compare_hash_elements(d, d3)
# Note hash for `data` differs
storr:::hash_object(d[["data"]])
storr:::hash_object(d3[["data"]])
# But R tells us the contents are identical
all.equal(d[["data"]], d3[["data"]])
# Looking at elements of "data" we can see that it's all the numerical values
compare_hash_elements(d[["data"]], d3[["data"]])
# And same with "dictionary" -- it's the numerical vals that differ
compare_hash_elements(d[["dictionary"]], d3[["dictionary"]])
# here's an example
storr:::hash_object(d[["data"]][["studyName"]])
storr:::hash_object(d3[["data"]][["studyName"]])
storr:::hash_object(d[["data"]][["latitude"]])
storr:::hash_object(d3[["data"]][["latitude"]])
# Now let's look at specific elements, seems NA's get treated differently
compare_hash_vals(d[["data"]][["latitude"]], d3[["data"]][["latitude"]])
i <- !is.na(d[["data"]][["latitude"]])
compare_hash_vals(d[["data"]][["latitude"]][!i], d3[["data"]][["latitude"]][!i])
compare_hash_vals(d[["data"]][["latitude"]][i], d3[["data"]][["latitude"]][i])
# Just to verify, let's check another variable
xvar <- "h.t"
i <- !is.na(d[["data"]][[xvar]])
compare_hash_vals(d[["data"]][[xvar]][!i], d3[["data"]][[xvar]][!i])
compare_hash_vals(d[["data"]][[xvar]][i], d3[["data"]][[xvar]][i])
# Now let's try it as a character (they're the same!)
xvar <- "h.t"
i <- !is.na(d[["data"]][[xvar]])
compare_hash_vals(as.character(d[["data"]][[xvar]][!i]), as.character(d3[["data"]][[xvar]][!i]))
compare_hash_vals(as.character(d[["data"]][[xvar]][i]), as.character(d3[["data"]][[xvar]][i]))
And here it is with output:
> compare_hash_elements <- function(x1, x2) {
+ i <- sapply(x1, storr:::hash_object) == sapply(x2, storr:::hash_object)
+ names(x1)[!i]
+ }
>
> compare_hash_vals <- function(x1, x2) {
+ i <- sapply(x1, storr:::hash_object) == sapply(x2, storr:::hash_object)
+ x1[!i]
+ }
>
>
> # download https://ci.appveyor.com/api/buildjobs/qjnops93iewg2xw2/artifacts/baad.data.Rcheck/tests/testthat/baad_1.0.0.rds
>
> library(baad.data)
> library(testthat)
>
> d3 <- readRDS("~/Downloads/baad_1.0.0.rds")
> d <- baad_data("1.0.0")
>
>
> for(x in names(d)){
+ expect_identical(d[[x]], d3[[x]])
+ }
Error: d[[x]] not identical to d3[[x]].
Component “Markesteijn2009”: Component “abstract”: 1 string mismatch
Component “Petritan2009”: Component “author”: Component 2: Component 1: Lengths (2, 1) differ (string compare on first 1)
Component “Petritan2009”: Component “author”: Component 2: Component 2: 1 string mismatch
Component “vanBreugel2011”: Component “author”: Component 1: Component 1: Lengths (2, 1) differ (string compare on first 1)
Component “vanBreugel2011”: Component “author”: Component 1: Component 2: 1 string mismatch
Component “Wang2011”: Component “author”: Component 7: Component 1: Lengths (2, 1) differ (string compare on first 1)
Component “Wang2011”: Component “author”: Component 7: Component 2: 1 string mismatch
>
> # note only differences noted is in the bib, due to both line endings and processing of names like "van ..., de ..."
>
> # Now compare hashes of the
> compare_hash_elements(d, d3)
[1] "data" "dictionary" "bib"
>
> # Note hash for `data` differs
> storr:::hash_object(d[["data"]])
[1] "16e346bcc5a49c10a3974b6ac149749f"
> storr:::hash_object(d3[["data"]])
[1] "bbb1a095d02a931e852d75e057340c71"
>
> # But R tells us the contents are identical
> all.equal(d[["data"]], d3[["data"]])
[1] TRUE
>
> # Looking at elements of "data" we can see that it's all the numerical values
> compare_hash_elements(d[["data"]], d3[["data"]])
[1] "latitude" "longitude" "map" "mat" "lai" "age"
[7] "a.lf" "a.ssba" "a.ssbh" "a.ssbc" "a.shba" "a.shbh"
[13] "a.shbc" "a.sbbh" "a.stba" "a.stbh" "a.stbc" "a.cp"
[19] "a.cs" "h.t" "h.c" "d.ba" "d.bh" "h.bh"
[25] "d.cr" "c.d" "m.lf" "m.ss" "m.sh" "m.sb"
[31] "m.st" "m.so" "m.br" "m.rf" "m.rc" "m.rt"
[37] "m.to" "a.ilf" "ma.ilf" "r.st" "n.lf" "n.ss"
[43] "n.sb" "n.sh" "n.rf" "n.rc"
>
> # And same with "dictionary" -- it's the numerical vals that differ
> compare_hash_elements(d[["dictionary"]], d3[["dictionary"]])
[1] "minValue" "maxValue"
>
> # here's an example
>
> storr:::hash_object(d[["data"]][["studyName"]])
[1] "d5fffd896d2d985ecb4d2c9b22a0f6d7"
> storr:::hash_object(d3[["data"]][["studyName"]])
[1] "d5fffd896d2d985ecb4d2c9b22a0f6d7"
>
> storr:::hash_object(d[["data"]][["latitude"]])
[1] "03286f419bc93e0755b4d7f48d3b7ade"
> storr:::hash_object(d3[["data"]][["latitude"]])
[1] "58350df6d4342fd7d8c020e2ba7156bc"
>
>
> # Now let's look at specific elements, seems NA's get treated differently
>
> compare_hash_vals(d[["data"]][["latitude"]], d3[["data"]][["latitude"]])
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[51] NA NA NA NA NA NA NA NA NA NA NA NA NA
>
> i <- !is.na(d[["data"]][["latitude"]])
> compare_hash_vals(d[["data"]][["latitude"]][!i], d3[["data"]][["latitude"]][!i])
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[51] NA NA NA NA NA NA NA NA NA NA NA NA NA
>
> compare_hash_vals(d[["data"]][["latitude"]][i], d3[["data"]][["latitude"]][i])
numeric(0)
>
> # Just to verify, let's check another variable
> xvar <- "h.t"
> i <- !is.na(d[["data"]][[xvar]])
> compare_hash_vals(d[["data"]][[xvar]][!i], d3[["data"]][[xvar]][!i])
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[25] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
...
> compare_hash_vals(d[["data"]][[xvar]][i], d3[["data"]][[xvar]][i])
numeric(0)
>
> # Now let's try it as a character (they're the same!)
> xvar <- "h.t"
> i <- !is.na(d[["data"]][[xvar]])
> compare_hash_vals(as.character(d[["data"]][[xvar]][!i]), as.character(d3[["data"]][[xvar]][!i]))
character(0)
> compare_hash_vals(as.character(d[["data"]][[xvar]][i]), as.character(d3[["data"]][[xvar]][i]))
character(0)
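A caveat when reading the all.equal() results above: all.equal() compares numeric values within a tolerance, so TRUE on its own does not show that two objects are bit-for-bit the same, whereas a hash of the serialised object is sensitive to any difference. The made-up vectors below show the gap between the two checks.

```r
# Made-up vectors that differ by a tiny amount in the first element
x <- c(1, 2, NA)
y <- x + c(1e-10, 0, 0)

isTRUE(all.equal(x, y))  # TRUE: the difference is within the default tolerance
identical(x, y)          # FALSE: the values are not exactly equal
```

Note also that identical() by default treats all numeric NAs as one value (its single.NA argument), so it can pass even where the stored representation of NA differs. That would be consistent with the observation above that only NA elements produce different hashes, though the thread does not confirm this as the cause.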
>
Tests of baad.data are failing on Windows machines (via Appveyor). In particular, the test checking that the objects return the same hash is failing. From Appveyor build1.07, we have
Line 12 is the test for object hash.
Running on my machine, I get the same output as is encoded in the test.
@RemkoDuursma can you confirm what you get on your Windows machine when you run the above two lines?
It would be great if you could also download the baad.data repo and run the tests to see if they pass.