florisvdh opened 3 years ago
@w-jan @hansvancalster @cecileherr @ToonHub any preference? Zenodo metadata provide md5, Git LFS uses sha256; we can compute and test both, although one suffices. The most important difference is with respect to security (hash collision resistance), see above, while both do a perfect job with respect to file integrity verification. But verifying both checksums is absolute overkill; in that case you might as well restrict to sha256 alone, the most secure.
What we could do is store both checksums in the future built-in checksums table of n2khab (as we already do in a Google Sheet for now), in order to have independent data ourselves if ever needed, but only use md5 in functions and (data-generating/checking) scripts, because md5 is also the one used by Zenodo. That's my current suggestion.
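For illustration, a minimal sketch (with a hypothetical file path) of computing both checksums with the openssl package, which streams the file through a connection:

``` r
# Hypothetical sketch: compute both checksums for one file with openssl.
# Hashing via file() streams the file instead of keeping it in an R object;
# paste(..., collapse = "") turns the raw hash into a hex string.
library(openssl)
path <- "data/some_raw_data_file.gpkg"   # hypothetical file path
checksums <- c(
  md5    = paste(md5(file(path)), collapse = ""),
  sha256 = paste(sha256(file(path)), collapse = "")
)
checksums
```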
I vote for md5, as security is not an issue here - only file integrity - and the difference in speed between the algorithms is also not important here.
I don't really have a strong opinion on this question; I will follow the opinion of the majority.
Continuation of @hansvancalster's suggestion to have a look at xxHash; the specifics of xxHash fit better in this issue.
Independent documentation and testing of xxHash appears to be limited. Most information can be found from the project itself at https://github.com/Cyan4973/xxHash.
xxHash provides a collection of non-cryptographic hash functions (XXH32, XXH64, and recently, XXH3 and its 128-bit variant XXH128). Compared to md5 and sha256 (cryptographic hash functions) this means, IIUC, that the hash does not aim to obscure the original data, i.e. to make it infeasible to recover or guess the input. For verifying file integrity, this doesn't matter.
What matters more, is the sensitivity to small changes, and uniqueness.
In R, XXH3 and XXH128 are not yet implemented in the digest package. BTW, they will never be in openssl: OpenSSL implements cryptographic hash functions.
Importantly, xxHash has far superior speed to md5 and sha256 (with XXH3/XXH128 even much faster, but not available in R for file hashing). See https://github.com/inbo/n2khab/pull/122#issuecomment-804125207 for timings in R.
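As an illustration (not the timings from the linked comment), a sketch of how such a comparison can be run with digest, which already supports XXH32/XXH64 for files; the file path is hypothetical:

``` r
# Hypothetical sketch: compare wall-clock time of file hashing algorithms
# with digest(); file = TRUE hashes the file on disk directly.
library(digest)
path <- "large_file.zip"   # hypothetical large file
system.time(digest(path, algo = "md5", file = TRUE))
system.time(digest(path, algo = "sha256", file = TRUE))
system.time(digest(path, algo = "xxhash64", file = TRUE))
```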
The current R implementations hardcode the xxHash code rather than binding to the (external) libxxhash library, either in digest (last update: 3 years ago: xxHash 0.6.5, providing XXH32 & XXH64) or in the very recent xxhashlite, which is on par with xxHash (0.8.0). The latter has the 128-bit XXH128 algorithm, but it is only implemented for R objects, not for files, and to me it's unclear how long it will remain maintained. On the other hand, openssl binds to the (external) libssl library, so it does not hardcode the algorithms itself, which is a more robust approach IMHO.
If we'd have an implementation of xxHash in R, analogous to openssl, that actually binds to the libxxhash library rather than hardcoding an older version of xxHash, I would hesitate less, with a preference for the XXH128 algorithm (currently not in digest).
So in the end, more confusion :thinking:?
I think we should at least support using more recent and faster file hashing algorithms, if not use them by default. And we should take into account that our preferred choice in n2khab may evolve, i.e. not hardcode that choice in more than one place (see the sketch below).
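A hypothetical sketch of what 'one place' could look like; the constant and helper names are made up for illustration:

``` r
# Hypothetical sketch: keep the algorithm choice in a single internal constant,
# so functions and scripts call checksum() and never hardcode "md5" themselves.
library(digest)

n2khab_hash_algo <- "xxhash64"   # the single place where the choice can evolve

checksum <- function(files) {
  # return a named character vector of file hashes
  vapply(files,
         function(f) digest(f, algo = n2khab_hash_algo, file = TRUE),
         character(1))
}
```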
Update after a small experiment with XXH64 and XXH128, using a large file of 2.7 GB, to test how sensitive the hash function is to the smallest possible change (one bit) in a large file. It seems very promising: once the file is loaded from disk, the xxh64 and xxh128 calculations take approximately a second (or less), and the hash value is completely different after flipping one bit (tried three times, with different bits). It is also demonstrated that flipping the bit back restores the original hash values.
The bitflip utility is the executable Python script from https://unix.stackexchange.com/a/196263; its numeric argument is the number of the selected bit.
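For reference, a comparable check can be mimicked entirely in R; below is a minimal sketch using the xxhash64 algorithm from digest (XXH128 is not available for files in R, as noted above), with a hypothetical file path and flipping the lowest bit of the first byte:

``` r
# Hypothetical sketch of the bit-flip sensitivity check, done in R itself.
library(digest)

flip_first_bit <- function(path) {
  # read the first byte, flip its lowest bit and write it back in place
  byte <- readBin(path, what = "raw", n = 1)
  con <- file(path, open = "r+b")      # open for in-place binary update
  on.exit(close(con))
  writeBin(xor(byte, as.raw(0x01)), con)
}

path <- "large_file.zip"                      # hypothetical large file
digest(path, algo = "xxhash64", file = TRUE)  # original hash

flip_first_bit(path)
digest(path, algo = "xxhash64", file = TRUE)  # completely different hash

flip_first_bit(path)                          # flip the same bit back
digest(path, algo = "xxhash64", file = TRUE)  # original hash restored
```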
A bit of elaboration on the file checksum topic as a reference for later - comments most welcome! The discussion originated in https://github.com/inbo/n2khab-preprocessing/pull/50 but is of broader relevance given the future n2khab intentions. Currently we still keep track of both checksums, e.g. with compute_filehashes.R at a7fafb8.
Experiments
First, a small experiment, repeated 3 times, on a 1.2 GiB file. Not shown is the first run, where md5 (because it was run first) took about 15 s, which is simply due to reading the file into memory for the first (and only) time - from my slow HDD, that is.
Created on 2021-02-11 by the reprex package (v1.0.0)
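As an illustration of how such a comparison can be set up with openssl (hypothetical file path, not the original reprex code):

``` r
# Hypothetical sketch of the timing runs with openssl, repeated 3 times.
library(openssl)
path <- "file_of_1.2_GiB"   # hypothetical file path

for (i in 1:3) {
  print(system.time(md5(file(path))))
  print(system.time(sha256(file(path))))
}
```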
Session info
``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting  value
#> version  R version 4.0.3 (2020-10-10)
#> os       Linux Mint 20
#> system   x86_64, linux-gnu
#> ui       X11
#> language nl_BE:nl
#> collate  nl_BE.UTF-8
#> ctype    nl_BE.UTF-8
#> tz       Europe/Brussels
#> date     2021-02-11
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package     * version    date       lib source
#> askpass       1.1        2019-01-13 [1] CRAN (R 4.0.2)
#> assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.2)
#> class         7.3-18     2021-01-24 [4] CRAN (R 4.0.3)
#> classInt      0.4-3      2020-04-07 [1] CRAN (R 4.0.2)
#> cli           2.3.0      2021-01-31 [1] CRAN (R 4.0.3)
#> crayon        1.4.1      2021-02-08 [1] CRAN (R 4.0.3)
#> DBI           1.1.1      2021-01-15 [1] CRAN (R 4.0.3)
#> digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.3)
#> dplyr         1.0.4      2021-02-02 [1] CRAN (R 4.0.3)
#> e1071         1.7-4      2020-10-14 [1] CRAN (R 4.0.3)
#> ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.2)
#> forcats       0.5.1      2021-01-27 [1] CRAN (R 4.0.3)
#> fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)
#> generics      0.1.0      2020-10-31 [1] CRAN (R 4.0.3)
#> git2r         0.28.0     2021-01-10 [1] CRAN (R 4.0.3)
#> git2rdata     0.3.1      2021-01-21 [1] CRAN (R 4.0.3)
#> glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)
#> highr         0.8        2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.0.3)
#> KernSmooth    2.23-18    2020-10-29 [4] CRAN (R 4.0.3)
#> knitr         1.31       2021-01-27 [1] CRAN (R 4.0.3)
#> lifecycle     0.2.0      2020-03-06 [1] CRAN (R 4.0.2)
#> magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.0.3)
#> n2khab      * 0.3.1.9000 2021-02-02 [1] local
#> openssl     * 1.4.3      2020-09-18 [1] CRAN (R 4.0.2)
#> pillar        1.4.7      2020-11-20 [1] CRAN (R 4.0.3)
#> pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.2)
#> plyr          1.8.6      2020-03-03 [1] CRAN (R 4.0.2)
#> purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.2)
#> R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.3)
#> Rcpp          1.0.6      2021-01-15 [1] CRAN (R 4.0.3)
#> reprex        1.0.0      2021-01-27 [1] CRAN (R 4.0.3)
#> rlang         0.4.10     2020-12-30 [1] CRAN (R 4.0.3)
#> rmarkdown     2.6        2020-12-14 [1] CRAN (R 4.0.3)
#> rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.0.3)
#> rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.0.3)
#> sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.2)
#> sf            0.9-7      2021-01-06 [1] CRAN (R 4.0.3)
#> stringi       1.5.3      2020-09-09 [1] CRAN (R 4.0.2)
#> stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.2)
#> tibble        3.0.6      2021-01-29 [1] CRAN (R 4.0.3)
#> tidyr         1.1.2      2020-08-27 [1] CRAN (R 4.0.2)
#> tidyselect    1.1.0      2020-05-11 [1] CRAN (R 4.0.2)
#> units         0.6-7      2020-06-13 [1] CRAN (R 4.0.2)
#> vctrs         0.3.6      2020-12-17 [1] CRAN (R 4.0.3)
#> withr         2.4.1      2021-01-26 [1] CRAN (R 4.0.3)
#> xfun          0.20       2021-01-06 [1] CRAN (R 4.0.3)
#> yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
```

So actually it won't make much difference in terms of time to calculate - you need a very large file (as above) to notice the difference (about 1.3 s). Compare with a 95.4 MiB file - the difference is not large (0.1 s).
Created on 2021-02-11 by the reprex package (v1.0.0)
So calculation times will differ appreciably between md5 and sha256 only when handling a bunch of (larger) files at once, or for a much larger file (which we currently don't use).
Which one to choose?
Background information
Some background information comes from Wikipedia (especially here).
- both are examples of cryptographic hash functions. The hash is not unique by design, so theoretically, for a given hash, it is possible - not necessarily feasible - to create different inputs that produce the same hash (= checksum): a hash collision.
- for use in security applications (e.g. when hashing a password), it is important that it is infeasible to calculate a list of colliding inputs - hence to try guessing the possible input.
- again for security purposes, SHA-2 (of which SHA-256 is one algorithm) has much better collision resistance (compared to MD5, but also to SHA-1), i.e. in terms of the feasibility of finding collisions. It is used where security is important, e.g.:
On the topic of data integrity:
See also https://en.wikipedia.org/wiki/File_verification, which describes the difference between integrity and authenticity verification.
Concluding thoughts
So, just for verifying file integrity in a trusted context (like ours), it does not actually matter which one we choose.
Opinions do differ on this, e.g. in https://stackoverflow.com/q/14139727. It mainly comes down to the question: do you want it to be secure as well? The following states it rather well IMO: