inbo / n2khab

R package with preprocessing functions and standard reference data for Flemish Natura 2000 (N2K) habitat (HAB) analyses
https://inbo.github.io/n2khab
GNU General Public License v3.0
2 stars 1 forks source link

Adding checksum functions #122

Closed florisvdh closed 3 years ago

florisvdh commented 3 years ago

@cecileherr can you test these, e.g. by running the example or for a Zenodo file?

Adding these functions was more of an adventure than originally expected. After writing, I discovered there is also tools::md5sum() :thinking:. However in general I believe it's best to lean on OpenSSL as a standard; tools::md5sum() seems to be hardcoded in R ~, and from its docs, may not always work the same on all platforms~. Also have struggled with silently closing connections until I found the function itself was not the cause, but the example I ran... At least I think that now (please look for warnings).

sha256sum() can be used to generate and store sha256sums as file metadata. I think this implementation is not available in core R packages.

florisvdh commented 3 years ago

BTW openssl is an imported package, assuming it will eventually be called (indirectly) by most functions.

florisvdh commented 3 years ago

@hansvancalster I just saw that n2khab::download_zenodo() - yep the original one, which is much quicker than zen4R and I prefer the most - uses tools::md5sum(). What's your opinion for this PR?

Now, I was intending to suggest switching download_zenodo() to using openssl::md5() as well (through n2khab::md5sum()), for the sake of conformity. However I now see tools::md5sum() is a tiny bit speedier, for a 1.2 GB file (see below). That keeps me hesitating to use n2khab::md5sum() so I'm interested to hear you.

Note, the reason for not using openssl::md5() or openssl::sha256() directly is because those work on a single connection object, not a vector of file paths, and also because their output format is a little different. These things are tackled by n2khab in this PR.

soilmapdbf <- file.path(n2khab::fileman_up("n2khab_data"),
                        "10_raw/soilmap/soilmap.dbf")
invisible(n2khab::md5sum(soilmapdbf)) # load in memory

system.time(openssl::md5(file(soilmapdbf)))
#>    user  system elapsed 
#>   2.885   0.292   3.180
system.time(openssl::md5(file(soilmapdbf)))
#>    user  system elapsed 
#>   2.699   0.236   2.935

system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.745   0.240   2.986
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.516   0.220   2.737

system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.668   0.252   2.921
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.555   0.199   2.755

system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.651   0.315   2.967
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.513   0.256   2.769

system.time(openssl::md5(file(soilmapdbf)))
#>    user  system elapsed 
#>   2.693   0.312   3.005
system.time(openssl::md5(file(soilmapdbf)))
#>    user  system elapsed 
#>   2.769   0.244   3.013

Created on 2021-03-19 by the reprex package (v1.0.0)

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.0.4 (2021-02-15) #> os Linux Mint 20 #> system x86_64, linux-gnu #> ui X11 #> language nl_BE:nl #> collate nl_BE.UTF-8 #> ctype nl_BE.UTF-8 #> tz Europe/Brussels #> date 2021-03-19 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> askpass 1.1 2019-01-13 [1] CRAN (R 4.0.2) #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2) #> class 7.3-18 2021-01-24 [4] CRAN (R 4.0.3) #> classInt 0.4-3 2020-04-07 [1] CRAN (R 4.0.2) #> cli 2.3.0 2021-01-31 [1] CRAN (R 4.0.3) #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3) #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3) #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3) #> dplyr 1.0.4 2021-02-02 [1] CRAN (R 4.0.3) #> e1071 1.7-4 2020-10-14 [1] CRAN (R 4.0.3) #> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2) #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2) #> forcats 0.5.1 2021-01-27 [1] CRAN (R 4.0.3) #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2) #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3) #> git2r 0.28.0 2021-01-10 [1] CRAN (R 4.0.3) #> git2rdata 0.3.1 2021-01-21 [1] CRAN (R 4.0.3) #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2) #> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2) #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3) #> KernSmooth 2.23-18 2020-10-29 [4] CRAN (R 4.0.3) #> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.3) #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3) #> n2khab 0.4.0.9000 2021-03-19 [1] Github (inbo/n2khab@395e504) #> openssl 1.4.3 2020-09-18 [1] CRAN (R 4.0.2) #> pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.3) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2) #> plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.2) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2) #> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3) #> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3) #> reprex 1.0.0 2021-01-27 [1] CRAN (R 4.0.3) #> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3) #> rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3) #> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.3) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2) #> sf 0.9-8 2021-03-04 [1] Github (florisvdh/sf@34434d1) #> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2) #> tibble 3.0.6 2021-01-29 [1] CRAN (R 4.0.3) #> tidyr 1.1.2 2020-08-27 [1] CRAN (R 4.0.2) #> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2) #> units 0.6-7 2020-06-13 [1] CRAN (R 4.0.2) #> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.3) #> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3) #> xfun 0.21 2021-02-10 [1] CRAN (R 4.0.4) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2) #> #> [1] /home/floris/lib/R/library #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library ```
florisvdh commented 3 years ago

One way could be to allocate tools::md5sum to n2khab::md5sum in order to provide direct access to the function, and leave n2khab::sha256sum as is. And because n2khab functions will call the former, drop openssl as a dependency - move it to Suggests (just like tools, but anyone has tools installed with R I think). That leaves us with a mixed approach for checksum calculations in n2khab (tools + openssl), but one package less to load and fastest implementation.

hansvancalster commented 3 years ago

If speed is a concern, you could also consider https://github.com/Cyan4973/xxHash, which is implemented via digest, see https://github.com/ropensci/targets/blob/main/R/utils_digest.R (I always check the targets package - the author of the package choses often the best performing tools).

EDIT: but that algorithm is not available via openssl

hansvancalster commented 3 years ago

Note that openssl::multihash() could be useful:

When using a connection input in the multihash function, the data is only read only once while streaming to multiple hash functions simultaneously. Therefore several hashes are calculated simultanously, without the need to store any data or download it multiple times.

florisvdh commented 3 years ago

Thanks @hansvancalster! I'll have another look at xxHash - I've encountered it when exploring this subject, but missed its R implementation. Indeed I tend to choose more established (more widely screened) ones such as md5 / sha256. For verifying larger amounts of files (e.g. the whole n2khab_data folder) it may be worth it though. On the other hand I estimate > 80% time goes to reading from disk (with my HDD; will be much less on your fast SSD), so we cannot improve that part.

openssl::multihash() is indeed interesting, thanks for investigating! We'll only use one hash for checking files though, but it could be useful to generate all hashes at once after creating new files.

florisvdh commented 3 years ago

Note that download_zenodo() uses md5 by design: it checks against the md5 metadata from Zenodo. So there remains the question about tools vs. openssl.

hansvancalster commented 3 years ago

Note that download_zenodo() uses md5 by design: it checks against the md5 metadata from Zenodo. So there remains the question about tools vs. openssl.

Indeed; for that case, the speed difference is too minor IMO to decide among them. So I have no strong opinion. Maybe you are right that openssl is a better implementation.

florisvdh commented 3 years ago

OK, it appears that digest is also equipped to hash files, next to R objects (it advertises the capability for R objects) :+1:. Didn't catch that - it makes one package more to consider.

See also https://github.com/inbo/n2khab/issues/112#issuecomment-804123547, where some findings about xxHash have been added.

Some timings:

soilmapdbf <- file.path(n2khab::fileman_up("n2khab_data"),
                        "10_raw/soilmap/soilmap.dbf")
invisible(n2khab::md5sum(soilmapdbf)) # load in memory

system.time(digest::digest(soilmapdbf, algo = "xxhash64", file = TRUE))
#>    user  system elapsed 
#>   0.285   0.208   0.494
system.time(digest::digest(soilmapdbf, algo = "md5", file = TRUE))
#>    user  system elapsed 
#>   3.539   0.280   3.819
system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.690   0.292   2.983
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.521   0.264   2.786

system.time(digest::digest(soilmapdbf, algo = "xxhash64", file = TRUE))
#>    user  system elapsed 
#>   0.296   0.200   0.496
system.time(digest::digest(soilmapdbf, algo = "md5", file = TRUE))
#>    user  system elapsed 
#>   3.663   0.152   3.816
system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.715   0.272   2.987
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.521   0.264   2.785

Created on 2021-03-22 by the reprex package (v1.0.0)

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.0.4 (2021-02-15) #> os Linux Mint 20 #> system x86_64, linux-gnu #> ui X11 #> language nl_BE:nl #> collate nl_BE.UTF-8 #> ctype nl_BE.UTF-8 #> tz Europe/Brussels #> date 2021-03-22 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> askpass 1.1 2019-01-13 [1] CRAN (R 4.0.2) #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2) #> class 7.3-18 2021-01-24 [4] CRAN (R 4.0.3) #> classInt 0.4-3 2020-04-07 [1] CRAN (R 4.0.2) #> cli 2.3.0 2021-01-31 [1] CRAN (R 4.0.3) #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3) #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3) #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3) #> dplyr 1.0.4 2021-02-02 [1] CRAN (R 4.0.3) #> e1071 1.7-4 2020-10-14 [1] CRAN (R 4.0.3) #> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2) #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2) #> forcats 0.5.1 2021-01-27 [1] CRAN (R 4.0.3) #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2) #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3) #> git2r 0.28.0 2021-01-10 [1] CRAN (R 4.0.3) #> git2rdata 0.3.1 2021-01-21 [1] CRAN (R 4.0.3) #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2) #> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2) #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3) #> KernSmooth 2.23-18 2020-10-29 [4] CRAN (R 4.0.3) #> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.3) #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3) #> n2khab 0.4.0.9000 2021-03-19 [1] Github (inbo/n2khab@395e504) #> openssl 1.4.3 2020-09-18 [1] CRAN (R 4.0.2) #> pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.3) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2) #> plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.2) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2) #> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3) #> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3) #> reprex 1.0.0 2021-01-27 [1] CRAN (R 4.0.3) #> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3) #> rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3) #> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.3) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2) #> sf 0.9-8 2021-03-04 [1] Github (florisvdh/sf@34434d1) #> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2) #> tibble 3.0.6 2021-01-29 [1] CRAN (R 4.0.3) #> tidyr 1.1.2 2020-08-27 [1] CRAN (R 4.0.2) #> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2) #> units 0.6-7 2020-06-13 [1] CRAN (R 4.0.2) #> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.3) #> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3) #> xfun 0.21 2021-02-10 [1] CRAN (R 4.0.4) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2) #> #> [1] /home/floris/lib/R/library #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library ```

Sidenote: the md5 hashing by digest is clearly slower than the implementations of tools and openssl.

I am quite attracted by the much higher speed of xxHash. So I might want to take the small risk of using XXH64 for now, over MD5 or SHA256. ('Risk': its hash is less unique, its R implementation is tailor-made - not linked, and the technology is not as well-known). Even though disk reading time still matters more. We can still continue to store the MD5/SHA256 as a backup, and perhaps expose the algorithm choice as a user setting.

In order to maintain future flexibility with reference to hash algorithm choice, I intend to rewrite this PR a bit so that a more generic hash function is provided e.g. with xxh64, md5 and sha256 as options, and one set as default. When using this function throughout n2khab, we can code it such that it just needs a change in its default, to let the whole package switch to using another hash (by default).

cecileherr commented 3 years ago

Thanks for the useful addition!

I cannot comment on the choice of the package since I am not experienced enough, but I tested the new functions.

I tried with single and multiple files, and with different file types, and md5sum()and sha256sum() seem to work as expected, except for the assert_that_allfiles_exist part.

As far as I understand assert_that checks whether a file or a folder exists, but this can be a problem in this case because md5sum() and sha256sum() expect files and will give an error if used with folders.

myfile1 <- file.path(fileman_up("n2khab_data"), "10_raw/watersurfaces/watersurfaces.gpkg")
myfile2 <- file.path(fileman_up("n2khab_data"), "10_raw/habitatstreams") 
# a folder with a shp (would be recognized by read_sf but will not work with md5sum() and sha256sum() )

md5sum(c(myfile1, myfile2))
# Error in open.connection(con, "rb") : cannot open the connection 
# do not trigger assert_that_allfiles_exist (because assert_that_allfiles_exist(myfile2) = TRUE)

A side note: because of this difference in usage between md5sum() and sha256sum()and the files/paths recognized by read_sf in the read_xxx functions (with read_sf you don't need to mention n2khab_data/10_raw/watersurfaces/watersurfaces.shp, n2khab_data/10_raw/watersurfaces will also work), we might have to be more explicit in the defintion of the file argument in the read_xx functions.

When I encounter an error (in this case on purpose), the next time I use the function I will get a warning (even though the second try works smoothly). Not really a problem, but it might be intriguing/disturbing for users.

# trigger an error:
myfile <- file.path(fileman_up("n2khab_data"), "10_raw/watersurfaces")
md5sum(myfile)
 # Error in open.connection(con, "rb") : cannot open the connection 

# run for an existing file:
myfile <- file.path(fileman_up("n2khab_data"), "10_raw/watersurfaces/watersurfaces.shp")
md5sum(myfile)
#                 watersurfaces.shp 
# "935ab52ef0dc32dd8b028894c08c3e79"
#Warning message:
#In match.fun(FUN) :
#  closing unused connection 3 (C:\Users\cecile_herr\GitHubRepos\n2khab_data/10_raw/watersurfaces)
florisvdh commented 3 years ago

Thank you for testing @cecileherr!

md5sum() and sha256sum() expect files and will give an error if used with folders

Yes, that's intended behaviour. File hashes are calculated for files only. I'll add a check to error on folders.

When I encounter an error (in this case on purpose), the next time I use the function I will get a warning

Well spotted :+1:. I overlooked the case where the function would error because a folder is provided instead of a file. Although the hashing functions are primarily intended for internal usage, I exported them so it's better to harden them as well. Will try to avoid dangling connections (in case of an error) by re-adopting dropped parts of 1460b88.

we might have to be more explicit in the definition of the file argument in the read_xx functions

For shapefiles, we could indeed replace the folder with the *.shp file as default - that should also work for sf and it may appear more logical. Nevertheless the approach cannot cover cases where a multi-layer data source is actually defined by different files (GRTSmh_diffres is in that case) - so for that reason we could as well leave all current default folder arguments as-is. I would rather not invest too much in that, primarily because of the following.

For shapefiles (and things like GRTSmh_diffres) you would need to provide all files as default argument if you also want to use that argument for file hashing. I believe that is not ideal - for now it would unnecessarily complicate things IMHO. It would duplicate the information in https://docs.google.com/spreadsheets/d/1E8ERlfYwP3OjluL8d7_4rR1W34ka4LRCE35JTxf3WMI/edit#gid=2100595853. The latter is intended to end up as an internal table in n2khab, and will already be queried with an (internal) function to know which files (and expected hashes) are linked to which data source versions (the latter also need definition in a local table). Hardcoding those data and doing file integrity checking inside an n2khab function should now be limited to cases where it is rather necessary, as it will become obsolete once package-wide file/version handling is in place. I.e. the case you referred to where one read_xxx() function calls another - it is probably a single case in n2khab.

So, in the end all functions should rather have a file = NULL argument (just kept for backward compatibility). The file integrity check will happen in the context of another generic check function, which all functions can call while passing data source and version as argument to it - something along that approach (and more). We can discuss it more elaborately if you want.

florisvdh commented 3 years ago

md5sum() and sha256sum() expect files and will give an error if used with folders

Should be solved by 0cec424.

Will try to avoid dangling connections (in case of an error) by re-adopting dropped parts of 1460b88.

Update: I think this is not needed. Currently only actual files are accepted, by 0cec424. So the connection error should never happen, unless on rare occasions like I/O errors or something like that (but that would yield more error messages anyway). Do you agree @cecileherr?

florisvdh commented 3 years ago

General function checksum() has been added, this is the one to be used in other functions - then the default hash algorithm can be set in the source code of checksum().

Currently defaults to the XXH64 hash function.

library(n2khab)

# creating two different temporary files:
file1 <- tempfile()
file2 <- tempfile()
files <- c(file1, file2)
file.create(files)
#> [1] TRUE TRUE
con <- file(file2)
writeLines("some text", con)
close(con)

# computing alternative checksums:
checksum(files)
#>   filebd3f1b8e63c4   filebd3f116546f0 
#> "ef46db3751d8e999" "b563efb2061ae502"
xxh64sum(files)
#>   filebd3f1b8e63c4   filebd3f116546f0 
#> "ef46db3751d8e999" "b563efb2061ae502"
md5sum(files)
#>                   filebd3f1b8e63c4                   filebd3f116546f0 
#> "d41d8cd98f00b204e9800998ecf8427e" "4d93d51945b88325c213640ef59fc50b"
sha256sum(files)
#>                                                   filebd3f1b8e63c4 
#> "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" 
#>                                                   filebd3f116546f0 
#> "a23e5fdcd7b276bdd81aa1a0b7b963101863dd3f61ff57935f8c5ba462681ea6"

Created on 2021-03-24 by the reprex package (v1.0.0)

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.0.4 (2021-02-15) #> os Linux Mint 20 #> system x86_64, linux-gnu #> ui X11 #> language nl_BE:nl #> collate nl_BE.UTF-8 #> ctype nl_BE.UTF-8 #> tz Europe/Brussels #> date 2021-03-24 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> askpass 1.1 2019-01-13 [1] CRAN (R 4.0.2) #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2) #> class 7.3-18 2021-01-24 [4] CRAN (R 4.0.3) #> classInt 0.4-3 2020-04-07 [1] CRAN (R 4.0.2) #> cli 2.3.0 2021-01-31 [1] CRAN (R 4.0.3) #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3) #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3) #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3) #> dplyr 1.0.4 2021-02-02 [1] CRAN (R 4.0.3) #> e1071 1.7-4 2020-10-14 [1] CRAN (R 4.0.3) #> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2) #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2) #> forcats 0.5.1 2021-01-27 [1] CRAN (R 4.0.3) #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2) #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3) #> git2r 0.28.0 2021-01-10 [1] CRAN (R 4.0.3) #> git2rdata 0.3.1 2021-01-21 [1] CRAN (R 4.0.3) #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2) #> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2) #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3) #> KernSmooth 2.23-18 2020-10-29 [4] CRAN (R 4.0.3) #> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.3) #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3) #> n2khab * 0.4.0.9000 2021-03-24 [1] local #> openssl 1.4.3 2020-09-18 [1] CRAN (R 4.0.2) #> pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.3) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2) #> plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.2) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2) #> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3) #> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3) #> reprex 1.0.0 2021-01-27 [1] CRAN (R 4.0.3) #> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3) #> rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3) #> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.3) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2) #> sf 0.9-8 2021-03-04 [1] Github (florisvdh/sf@34434d1) #> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2) #> tibble 3.0.6 2021-01-29 [1] CRAN (R 4.0.3) #> tidyr 1.1.2 2020-08-27 [1] CRAN (R 4.0.2) #> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2) #> units 0.6-7 2020-06-13 [1] CRAN (R 4.0.2) #> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.3) #> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3) #> xfun 0.21 2021-02-10 [1] CRAN (R 4.0.4) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2) #> #> [1] /home/floris/lib/R/library #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library ```

@cecileherr here are the XXH64 hashes that you'll probably need (MD5 provided as check against reference list / Zenodo):

library(n2khab)
library(magrittr)
hms_2018 <- file.path(n2khab::fileman_up("n2khab_data"),
                      "20_processed/habitatmap_stdized_2018_v2/habitatmap_stdized.gpkg")
hms_2020 <- file.path(n2khab::fileman_up("n2khab_data"),
                      "20_processed/habitatmap_stdized/habitatmap_stdized.gpkg")
files <- c(hms_2018, hms_2020)
md5sum(files) %>% set_names(c(2018, 2020))
#>                               2018                               2020 
#> "7f89d4a6bc0b2080a0510169497259ff" "e2f0bff28016bd525ab652349c555141"
xxh64sum(files) %>% set_names(c(2018, 2020))
#>               2018               2020 
#> "b80f469f33636c8b" "3109c26f0a27a0f3"
florisvdh commented 3 years ago

Thanks for your review @cecileherr.