Closed by florisvdh 3 years ago
BTW `openssl` is an imported package, assuming it will eventually be called (indirectly) by most functions.
@hansvancalster I just saw that `n2khab::download_zenodo()` (yep, the original one, which is much quicker than zen4R and the one I prefer) uses `tools::md5sum()`. What's your opinion for this PR?
`openssl`, which wraps the OpenSSL library, can also be used to calculate other file checksums in R, such as SHA-256, which I would also like to store within the package next to MD5 (independently of Zenodo), even though we'll check against MD5 by default. The `digest` package refers to OpenSSL for the most rigorous applications. `tools::md5sum()` is a hardcoded, standalone implementation of the MD5 algorithm in R itself, dating from 2001. Its docs say that binary mode is used on Windows and text mode on other platforms, but as I understand from here, on *nix systems there should be no difference between binary and text mode. Indeed, we had no problems using `download_zenodo()` on various systems. Still, reading the docs of `tools::md5sum()`, the function feels a bit like a tailor-made approach.

Now, I was intending to suggest switching `download_zenodo()` to `openssl::md5()` as well (through `n2khab::md5sum()`), for the sake of conformity. However, I now see that `tools::md5sum()` is a tiny bit speedier for a 1.2 GB file (see below). That keeps me hesitating to use `n2khab::md5sum()`, so I'm interested to hear your opinion.
Note: the reason for not using `openssl::md5()` or `openssl::sha256()` directly is that they work on a single connection object, not on a vector of file paths, and also that their output format is a little different. These things are tackled by `n2khab` in this PR.
```r
soilmapdbf <- file.path(n2khab::fileman_up("n2khab_data"),
                        "10_raw/soilmap/soilmap.dbf")
invisible(n2khab::md5sum(soilmapdbf)) # load into memory
system.time(openssl::md5(file(soilmapdbf)))
#>    user  system elapsed
#>   2.885   0.292   3.180
system.time(openssl::md5(file(soilmapdbf)))
#>    user  system elapsed
#>   2.699   0.236   2.935
system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed
#>   2.745   0.240   2.986
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed
#>   2.516   0.220   2.737
system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed
#>   2.668   0.252   2.921
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed
#>   2.555   0.199   2.755
system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed
#>   2.651   0.315   2.967
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed
#>   2.513   0.256   2.769
system.time(openssl::md5(file(soilmapdbf)))
#>    user  system elapsed
#>   2.693   0.312   3.005
system.time(openssl::md5(file(soilmapdbf)))
#>    user  system elapsed
#>   2.769   0.244   3.013
```
Created on 2021-03-19 by the reprex package (v1.0.0)
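To make the interface difference mentioned above concrete, here is a minimal sketch (assuming the `openssl` package is installed; the temporary file is illustration-only): `tools::md5sum()` is vectorised over file paths, while `openssl::md5()` takes a single connection and returns a raw hash object.

```r
f <- tempfile()
writeLines("some text", f)

# tools::md5sum(): vectorised over file paths, returns a named
# character vector of hex digests
tools::md5sum(f)

# openssl::md5(): takes one connection, returns a raw vector of class "hash";
# convert byte-wise to a hex string for comparison
h <- openssl::md5(file(f))
paste(as.character(h), collapse = "")
```

Both end up with the same hex digest for the same file; wrapping the `openssl` call per path (as this PR does) is what restores the vectorised, named-character interface.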
One way could be to assign `tools::md5sum` to `n2khab::md5sum` in order to provide direct access to the function, and leave `n2khab::sha256sum` as it is. And because `n2khab` functions will call the former, we could drop `openssl` as a dependency and move it to Suggests (just like `tools`, although anyone with R has `tools` installed, I think). That leaves us with a mixed approach for checksum calculations in n2khab (`tools` + `openssl`), but one package fewer to load, and the fastest implementation.
If speed is a concern, you could also consider https://github.com/Cyan4973/xxHash, which is implemented via `digest`; see https://github.com/ropensci/targets/blob/main/R/utils_digest.R (I always check the targets package; its author often chooses the best performing tools).

EDIT: but that algorithm is not available via `openssl`.
Note that `openssl::multihash()` could be useful:

> When using a connection input in the multihash function, the data is only read once while streaming to multiple hash functions simultaneously. Therefore several hashes are calculated simultaneously, without the need to store any data or download it multiple times.
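A minimal sketch of what that looks like (assuming the `openssl` package is installed; the temporary file is illustration-only):

```r
f <- tempfile()
writeLines("some text", f)

# The file is streamed once; both digests are computed from that single pass:
openssl::multihash(file(f), algos = c("md5", "sha256"))
```

This would fit the "generate all hashes at once after creating new files" use case mentioned below, since the file is read from disk only once.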
Thanks @hansvancalster! I'll have another look at xxHash; I encountered it when exploring this subject, but missed its R implementation. Indeed, I tend to choose more established (more widely screened) algorithms such as MD5 / SHA-256. For verifying larger amounts of files (e.g. the whole `n2khab_data` folder) it may be worth it though. On the other hand, I estimate that > 80% of the time goes to reading from disk (with my HDD; it will be much less on your fast SSD), and we cannot improve that part.

`openssl::multihash()` is indeed interesting, thanks for investigating! We'll only use one hash for checking files, but it could be useful to generate all hashes at once after creating new files.
Note that `download_zenodo()` uses md5 by design: it checks against the md5 metadata from Zenodo. So there remains the question of `tools` vs. `openssl`.
> Note that `download_zenodo()` uses md5 by design: it checks against the md5 metadata from Zenodo. So there remains the question of `tools` vs. `openssl`.
Indeed; for that case, the speed difference is too minor IMO to decide among them, so I have no strong opinion. Maybe you are right that `openssl` is a better implementation.
OK, it appears that `digest` is also equipped to hash files, next to R objects (it advertises the capability for R objects) :+1:. I didn't catch that; it makes one more package to consider.
See also https://github.com/inbo/n2khab/issues/112#issuecomment-804123547, where some findings about xxHash have been added.
Some timings:
```r
soilmapdbf <- file.path(n2khab::fileman_up("n2khab_data"),
                        "10_raw/soilmap/soilmap.dbf")
invisible(n2khab::md5sum(soilmapdbf)) # load into memory
system.time(digest::digest(soilmapdbf, algo = "xxhash64", file = TRUE))
#>    user  system elapsed
#>   0.285   0.208   0.494
system.time(digest::digest(soilmapdbf, algo = "md5", file = TRUE))
#>    user  system elapsed
#>   3.539   0.280   3.819
system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed
#>   2.690   0.292   2.983
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed
#>   2.521   0.264   2.786
system.time(digest::digest(soilmapdbf, algo = "xxhash64", file = TRUE))
#>    user  system elapsed
#>   0.296   0.200   0.496
system.time(digest::digest(soilmapdbf, algo = "md5", file = TRUE))
#>    user  system elapsed
#>   3.663   0.152   3.816
system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed
#>   2.715   0.272   2.987
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed
#>   2.521   0.264   2.785
```
Created on 2021-03-22 by the reprex package (v1.0.0)
Sidenote: the md5 hashing by `digest` is clearly slower than the implementations in `tools` and `openssl`.
I am quite attracted by the much higher speed of xxHash, so I might want to take the small risk of using XXH64 for now, over MD5 or SHA-256. ('Risk': its hashes are shorter and hence less unique, its R implementation is tailor-made (not linked to the reference library), and the technology is not as well known.) Even though disk reading time still matters more. We can continue to store the MD5/SHA-256 as a backup, and perhaps expose the algorithm choice as a user setting.

In order to maintain future flexibility with regard to the hash algorithm choice, I intend to rewrite this PR a bit so that a more generic hash function is provided, e.g. with xxh64, md5 and sha256 as options and one of them set as default. If we use this function throughout n2khab, a single change to its default is enough to switch the whole package to another (default) hash.
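The idea of a single switchable default could look roughly like this (a hypothetical sketch, not the actual PR code; it assumes the `digest` and `openssl` packages are installed, and `checksum_sketch` is an illustration-only name):

```r
# Generic file hasher: the package-wide default algorithm lives in ONE place
# (the default of `hash`), so changing it here switches the whole package.
checksum_sketch <- function(files, hash = c("xxh64", "md5", "sha256")) {
  hash <- match.arg(hash)
  result <- vapply(files, function(f) {
    switch(hash,
      xxh64  = digest::digest(f, algo = "xxhash64", file = TRUE),
      md5    = paste(as.character(openssl::md5(file(f))), collapse = ""),
      sha256 = paste(as.character(openssl::sha256(file(f))), collapse = "")
    )
  }, character(1))
  # name the results by file, like tools::md5sum() does
  setNames(result, basename(files))
}
```

Exported wrappers such as `md5sum()` can then just call this function with a fixed `hash` value, while internal callers rely on the default.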
Thanks for the useful addition!
I cannot comment on the choice of the package since I am not experienced enough, but I tested the new functions.
I tried with single and multiple files, and with different file types; `md5sum()` and `sha256sum()` seem to work as expected, except for the `assert_that_allfiles_exist` part.

As far as I understand, `assert_that` checks whether a file *or a folder* exists, but this can be a problem in this case because `md5sum()` and `sha256sum()` expect files and will give an error if used with folders.
```r
myfile1 <- file.path(fileman_up("n2khab_data"), "10_raw/watersurfaces/watersurfaces.gpkg")
# a folder containing a shp (recognized by read_sf, but not usable with
# md5sum() or sha256sum()):
myfile2 <- file.path(fileman_up("n2khab_data"), "10_raw/habitatstreams")
md5sum(c(myfile1, myfile2))
# Error in open.connection(con, "rb") : cannot open the connection
# does not trigger assert_that_allfiles_exist
# (because assert_that_allfiles_exist(myfile2) is TRUE)
```
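A base-R guard along these lines could reject folders up front, before any connection is opened (a hypothetical sketch; `assert_all_files` is an illustration-only name, not the PR's actual helper):

```r
# Error early and explicitly when a path is a directory or does not exist,
# instead of failing later with an obscure connection error.
assert_all_files <- function(paths) {
  is_dir <- dir.exists(paths)
  if (any(is_dir)) {
    stop("Expected files, but got folder(s): ",
         paste(paths[is_dir], collapse = ", "))
  }
  if (!all(file.exists(paths))) {
    stop("Some paths do not exist.")
  }
  invisible(TRUE)
}

f <- tempfile()
file.create(f)
assert_all_files(f)            # passes silently
# assert_all_files(tempdir())  # would error: folder, not a file
```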
A side note: because of this difference in usage between `md5sum()` / `sha256sum()` and the files/paths recognized by `read_sf` in the `read_xxx` functions (with `read_sf` you don't need to mention `n2khab_data/10_raw/watersurfaces/watersurfaces.shp`; `n2khab_data/10_raw/watersurfaces` will also work), we might have to be more explicit in the definition of the `file` argument of the `read_xxx` functions.
When I trigger an error (in this case on purpose), the next time I use the function I get a warning (even though the second try works smoothly). Not really a problem, but it might be intriguing/disturbing for users.
```r
# trigger an error:
myfile <- file.path(fileman_up("n2khab_data"), "10_raw/watersurfaces")
md5sum(myfile)
# Error in open.connection(con, "rb") : cannot open the connection

# run for an existing file:
myfile <- file.path(fileman_up("n2khab_data"), "10_raw/watersurfaces/watersurfaces.shp")
md5sum(myfile)
#                  watersurfaces.shp
# "935ab52ef0dc32dd8b028894c08c3e79"
# Warning message:
# In match.fun(FUN) :
#   closing unused connection 3 (C:\Users\cecile_herr\GitHubRepos\n2khab_data/10_raw/watersurfaces)
```
Thank you for testing @cecileherr!
> `md5sum()` and `sha256sum()` expect files and will give an error if used with folders
Yes, that's intended behaviour: file hashes are calculated for files only. I'll add a check that errors on folders.
> When I encounter an error (in this case on purpose), the next time I use the function I will get a warning
Well spotted :+1:. I overlooked the case where the function would error because a folder is provided instead of a file. Although the hashing functions are primarily intended for internal usage, I exported them so it's better to harden them as well. Will try to avoid dangling connections (in case of an error) by re-adopting dropped parts of 1460b88.
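One common pattern for avoiding dangling connections is to open the connection inside the function and register the cleanup with `on.exit()`, so it runs even when the hashing step errors. A sketch under the assumption that `openssl` leaves caller-opened connections to the caller (`hash_file_safely` is an illustration-only name, not the PR's code):

```r
hash_file_safely <- function(path) {
  con <- file(path, open = "rb")
  # on.exit() runs even if openssl::md5() errors below,
  # so the connection is never left dangling:
  on.exit(close(con), add = TRUE)
  openssl::md5(con)
}
```

With this shape, an error inside the hashing call still closes the connection, so no "closing unused connection" warning surfaces on the next call.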
> we might have to be more explicit in the definition of the file argument in the read_xxx functions
For shapefiles, we could indeed replace the folder with the `*.shp` file as the default; that should also work for `sf`, and it may appear more logical. Nevertheless, this approach cannot cover cases where a multi-layer data source is actually defined by several files (`GRTSmh_diffres` is such a case), so for that reason we could just as well leave all current default folder arguments as they are. I would rather not invest too much in this, primarily because of the following.
For shapefiles (and things like `GRTSmh_diffres`) you would need to provide all files as the default argument if you also want to use that argument for file hashing. I believe that is not ideal; for now it would unnecessarily complicate things IMHO. It would also duplicate the information in https://docs.google.com/spreadsheets/d/1E8ERlfYwP3OjluL8d7_4rR1W34ka4LRCE35JTxf3WMI/edit#gid=2100595853. The latter is intended to end up as an internal table in `n2khab`, and it will already be queried by an (internal) function to know which files (and expected hashes) are linked to which data source versions (the versions also need to be defined in a local table). Hardcoding those data and doing file integrity checking inside an `n2khab` function should from now on be limited to cases where it is really necessary, as it will become obsolete once package-wide file/version handling is in place, i.e. the case you referred to where one `read_xxx()` function calls another; it is probably a single case in `n2khab`.
So, in the end all functions should rather get a `file = NULL` argument (kept just for backward compatibility). The file integrity check will then happen in the context of another, generic check function, which all functions can call while passing the data source and version as arguments; something along those lines (and more). We can discuss it more elaborately if you want.
> `md5sum()` and `sha256sum()` expect files and will give an error if used with folders
Should be solved by 0cec424.
> Will try to avoid dangling connections (in case of an error) by re-adopting dropped parts of 1460b88.
Update: I think this is not needed. Currently only actual files are accepted, by 0cec424. So the connection error should never happen, unless on rare occasions like I/O errors or something like that (but that would yield more error messages anyway). Do you agree @cecileherr?
A general function `checksum()` has been added; this is the one to be used in other functions, so that the default hash algorithm can be set in the source code of `checksum()` alone. It currently defaults to the XXH64 hash function.
```r
library(n2khab)
# creating two different temporary files:
file1 <- tempfile()
file2 <- tempfile()
files <- c(file1, file2)
file.create(files)
#> [1] TRUE TRUE
con <- file(file2)
writeLines("some text", con)
close(con)
# computing alternative checksums:
checksum(files)
#>   filebd3f1b8e63c4   filebd3f116546f0
#> "ef46db3751d8e999" "b563efb2061ae502"
xxh64sum(files)
#>   filebd3f1b8e63c4   filebd3f116546f0
#> "ef46db3751d8e999" "b563efb2061ae502"
md5sum(files)
#>                   filebd3f1b8e63c4                   filebd3f116546f0
#> "d41d8cd98f00b204e9800998ecf8427e" "4d93d51945b88325c213640ef59fc50b"
sha256sum(files)
#>                                                   filebd3f1b8e63c4
#> "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
#>                                                   filebd3f116546f0
#> "a23e5fdcd7b276bdd81aa1a0b7b963101863dd3f61ff57935f8c5ba462681ea6"
```
Created on 2021-03-24 by the reprex package (v1.0.0)
@cecileherr here are the XXH64 hashes that you'll probably need (MD5 provided as check against reference list / Zenodo):
```r
library(n2khab)
library(magrittr)
hms_2018 <- file.path(n2khab::fileman_up("n2khab_data"),
                      "20_processed/habitatmap_stdized_2018_v2/habitatmap_stdized.gpkg")
hms_2020 <- file.path(n2khab::fileman_up("n2khab_data"),
                      "20_processed/habitatmap_stdized/habitatmap_stdized.gpkg")
files <- c(hms_2018, hms_2020)
md5sum(files) %>% set_names(c(2018, 2020))
#>                               2018                               2020
#> "7f89d4a6bc0b2080a0510169497259ff" "e2f0bff28016bd525ab652349c555141"
xxh64sum(files) %>% set_names(c(2018, 2020))
#>               2018               2020
#> "b80f469f33636c8b" "3109c26f0a27a0f3"
```
Thanks for your review @cecileherr.
@cecileherr can you test these, e.g. by running the example or for a Zenodo file?
Adding these functions was more of an adventure than originally expected. After writing them, I discovered there is also `tools::md5sum()` :thinking:. However, in general I believe it's best to lean on OpenSSL as a standard; `tools::md5sum()` seems to be hardcoded in R ~~, and from its docs, may not always work the same on all platforms~~. I have also struggled with silently closing connections, until I found that the function itself was not the cause, but the example I ran... At least, that's what I think now (please look out for warnings). `sha256sum()` can be used to generate and store sha256sums as file metadata; I think this implementation is not available in core R packages.