inbo / n2khab

R package with preprocessing functions and standard reference data for Flemish Natura 2000 (N2K) habitat (HAB) analyses
https://inbo.github.io/n2khab
GNU General Public License v3.0

Checking file integrity: choosing between md5 and sha256 (or other options) #112

Open florisvdh opened 3 years ago

florisvdh commented 3 years ago

A bit of elaboration on the file checksum topic, as a reference for later - comments most welcome! The discussion originated in https://github.com/inbo/n2khab-preprocessing/pull/50 but is of broader relevance given the future n2khab intentions. Currently we still keep track of both checksums, e.g. with compute_filehashes.R at a7fafb8.

Experiments

First, a small experiment, repeated 3 times, on a 1.2 GiB file. The very first run is not shown: there md5 (being run first) took about 15 s, simply because the file had to be read into memory for the first (and only) time - from my slow HDD, that is.

suppressPackageStartupMessages(library(openssl))
library(n2khab)
soilmapdbf <- file.path(fileman_up("n2khab_data"),
                        "10_raw/soilmap/soilmap.dbf")
system.time(md5(file(soilmapdbf)))
#>    user  system elapsed 
#>   2.653   0.352   3.005
system.time(sha256(file(soilmapdbf)))
#>    user  system elapsed 
#>   4.029   0.300   4.330

Created on 2021-02-11 by the reprex package (v1.0.0)

suppressPackageStartupMessages(library(openssl))
library(n2khab)
soilmapdbf <- file.path(fileman_up("n2khab_data"),
                        "10_raw/soilmap/soilmap.dbf")
system.time(md5(file(soilmapdbf)))
#>    user  system elapsed 
#>   2.711   0.304   3.016
system.time(sha256(file(soilmapdbf)))
#>    user  system elapsed 
#>   4.104   0.304   4.410


suppressPackageStartupMessages(library(openssl))
library(n2khab)
soilmapdbf <- file.path(fileman_up("n2khab_data"),
                        "10_raw/soilmap/soilmap.dbf")
system.time(md5(file(soilmapdbf)))
#>    user  system elapsed 
#>   2.692   0.312   3.005
system.time(sha256(file(soilmapdbf)))
#>    user  system elapsed 
#>   4.027   0.312   4.341


Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.0.3 (2020-10-10)
#>  os       Linux Mint 20
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language nl_BE:nl
#>  collate  nl_BE.UTF-8
#>  ctype    nl_BE.UTF-8
#>  tz       Europe/Brussels
#>  date     2021-02-11
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date       lib source
#>  askpass       1.1        2019-01-13 [1] CRAN (R 4.0.2)
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.2)
#>  class         7.3-18     2021-01-24 [4] CRAN (R 4.0.3)
#>  classInt      0.4-3      2020-04-07 [1] CRAN (R 4.0.2)
#>  cli           2.3.0      2021-01-31 [1] CRAN (R 4.0.3)
#>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.0.3)
#>  DBI           1.1.1      2021-01-15 [1] CRAN (R 4.0.3)
#>  digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.3)
#>  dplyr         1.0.4      2021-02-02 [1] CRAN (R 4.0.3)
#>  e1071         1.7-4      2020-10-14 [1] CRAN (R 4.0.3)
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.2)
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.2)
#>  forcats       0.5.1      2021-01-27 [1] CRAN (R 4.0.3)
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)
#>  generics      0.1.0      2020-10-31 [1] CRAN (R 4.0.3)
#>  git2r         0.28.0     2021-01-10 [1] CRAN (R 4.0.3)
#>  git2rdata     0.3.1      2021-01-21 [1] CRAN (R 4.0.3)
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.2)
#>  htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.0.3)
#>  KernSmooth    2.23-18    2020-10-29 [4] CRAN (R 4.0.3)
#>  knitr         1.31       2021-01-27 [1] CRAN (R 4.0.3)
#>  lifecycle     0.2.0      2020-03-06 [1] CRAN (R 4.0.2)
#>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.0.3)
#>  n2khab      * 0.3.1.9000 2021-02-02 [1] local
#>  openssl     * 1.4.3      2020-09-18 [1] CRAN (R 4.0.2)
#>  pillar        1.4.7      2020-11-20 [1] CRAN (R 4.0.3)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.2)
#>  plyr          1.8.6      2020-03-03 [1] CRAN (R 4.0.2)
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.2)
#>  R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.3)
#>  Rcpp          1.0.6      2021-01-15 [1] CRAN (R 4.0.3)
#>  reprex        1.0.0      2021-01-27 [1] CRAN (R 4.0.3)
#>  rlang         0.4.10     2020-12-30 [1] CRAN (R 4.0.3)
#>  rmarkdown     2.6        2020-12-14 [1] CRAN (R 4.0.3)
#>  rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.0.3)
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.0.3)
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.2)
#>  sf            0.9-7      2021-01-06 [1] CRAN (R 4.0.3)
#>  stringi       1.5.3      2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.2)
#>  tibble        3.0.6      2021-01-29 [1] CRAN (R 4.0.3)
#>  tidyr         1.1.2      2020-08-27 [1] CRAN (R 4.0.2)
#>  tidyselect    1.1.0      2020-05-11 [1] CRAN (R 4.0.2)
#>  units         0.6-7      2020-06-13 [1] CRAN (R 4.0.2)
#>  vctrs         0.3.6      2020-12-17 [1] CRAN (R 4.0.3)
#>  withr         2.4.1      2021-01-26 [1] CRAN (R 4.0.3)
#>  xfun          0.20       2021-01-06 [1] CRAN (R 4.0.3)
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
```

So in practice the choice makes little difference to computation time: you need a very large file (as above) to notice the difference (about 1.3 s). Compare this with a 95.4 MiB file, where the difference is only about 0.1 s.

suppressPackageStartupMessages(library(openssl))
library(n2khab)
watercourse_100mseg_gpkg <- file.path(fileman_up("n2khab_data"),
                        "20_processed/watercourse_100mseg/watercourse_100mseg.gpkg")
system.time(md5(file(watercourse_100mseg_gpkg)))
#>    user  system elapsed 
#>   0.197   0.056   0.254
system.time(sha256(file(watercourse_100mseg_gpkg)))
#>    user  system elapsed 
#>   0.330   0.021   0.350


suppressPackageStartupMessages(library(openssl))
library(n2khab)
watercourse_100mseg_gpkg <- file.path(fileman_up("n2khab_data"),
                        "20_processed/watercourse_100mseg/watercourse_100mseg.gpkg")
system.time(md5(file(watercourse_100mseg_gpkg)))
#>    user  system elapsed 
#>   0.211   0.037   0.248
system.time(sha256(file(watercourse_100mseg_gpkg)))
#>    user  system elapsed 
#>   0.303   0.052   0.355


So the calculation time of md5 and sha256 will only differ noticeably when handling a bunch of (larger) files at once, or for a much larger file (which we currently don't use).
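To put the timings above in perspective, the elapsed times can be converted into throughput. This is only a back-of-the-envelope sketch: it averages the elapsed values reported above and assumes the stated file sizes (1.2 GiB and 95.4 MiB) are exact. The object names are made up for illustration.

```r
# Elapsed times (s) copied from the reprex output above
elapsed <- list(
  large = list(md5 = c(3.005, 3.016, 3.005), sha256 = c(4.330, 4.410, 4.341)),
  small = list(md5 = c(0.254, 0.248), sha256 = c(0.350, 0.355))
)
# File sizes in MiB as stated in the text (assumed to be exact)
size_mib <- c(large = 1.2 * 1024, small = 95.4)

summary_tbl <- do.call(rbind, lapply(names(elapsed), function(f) {
  m <- sapply(elapsed[[f]], mean)  # mean elapsed time per algorithm
  data.frame(
    file = f,
    algo = names(m),
    mean_elapsed_s = unname(m),
    throughput_mib_s = unname(size_mib[[f]] / m)
  )
}))
print(summary_tbl, digits = 3)

# Relative cost of sha256 compared to md5, per file
ratio <- sapply(elapsed, function(x) mean(x$sha256) / mean(x$md5))
print(round(ratio, 2))
```

Under these assumptions md5 processes roughly 380 to 410 MiB/s and sha256 roughly 270 to 280 MiB/s on this machine, so sha256 takes about 1.4 times as long, more or less independent of file size.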

Which one to choose? Background information

Some background information comes from Wikipedia (especially here).

Concluding thoughts

So for merely verifying file integrity in a trusted context (such as ours), the choice does not really matter.

Opinions do differ on this, e.g. in https://stackoverflow.com/q/14139727. The main concern seems to be: do you want the check to be secure as well? The following states it rather well IMO:

So, if you are simply looking to check for file corruption or file differences, when the source of the file is trusted, MD5 should be sufficient. If you are looking to verify the integrity of a file coming from an untrusted source, or over from a trusted source over an unencrypted connection, MD5 is not sufficient.
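For the trusted-source case described in the quote, an integrity check boils down to comparing a freshly computed checksum with a stored one. A minimal sketch, using `tools::md5sum()` from base R rather than openssl. The function name `verify_md5()` and the temporary file are hypothetical and only serve as illustration.

```r
# Hypothetical helper: TRUE if the file's md5 matches the expected hex string
verify_md5 <- function(path, expected) {
  stopifnot(file.exists(path))
  actual <- unname(tools::md5sum(path))
  identical(tolower(actual), tolower(expected))
}

# Demonstration on a throwaway file
tmp <- tempfile(fileext = ".txt")
writeLines("some reference data", tmp)
stored_md5 <- unname(tools::md5sum(tmp))  # what a checksum table would hold

ok_before <- verify_md5(tmp, stored_md5)  # file unchanged

cat("an accidental extra line\n", file = tmp, append = TRUE)
ok_after <- verify_md5(tmp, stored_md5)   # file modified

unlink(tmp)
```

Here `ok_before` is `TRUE` and `ok_after` is `FALSE`: the check detects accidental modification, but it says nothing about deliberate tampering.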

florisvdh commented 3 years ago

@w-jan @hansvancalster @cecileherr @ToonHub any preference? Zenodo metadata provide md5, Git LFS uses sha256; we can compute and test both, although one suffices. The most important difference concerns security (hash collision resistance), see above, while both are fully adequate for verifying file integrity. Verifying both checksums is absolute overkill though: in that case you can just as well restrict yourself to sha256 alone, the more secure of the two.

What we could do is store both checksums in the future built-in checksums table of n2khab (as we already do in a Google Sheet for now), in order to have independent data ourselves if ever needed, but only use md5 in functions and (data-generating/checking) scripts, because md5 is also the one used by Zenodo. That's my current suggestion.
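A rough sketch of what this suggestion could look like, assuming the openssl package as used in the experiments above. The names `hex()`, `checksum_table()` and `check_file()` are hypothetical and the table layout is an assumption, not the actual n2khab design.

```r
# Convert an openssl hash (a raw vector) into a plain hex string
hex <- function(h) paste(as.character(unclass(h)), collapse = "")

# Hypothetical builder for a checksum table that stores both checksums
checksum_table <- function(paths) {
  data.frame(
    filename = basename(paths),
    md5 = vapply(paths, function(p) hex(openssl::md5(file(p))),
                 character(1), USE.NAMES = FALSE),
    sha256 = vapply(paths, function(p) hex(openssl::sha256(file(p))),
                    character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Hypothetical check that consults only the md5 column
check_file <- function(path, table) {
  expected <- table$md5[table$filename == basename(path)]
  length(expected) == 1 && identical(hex(openssl::md5(file(path))), expected)
}

# Demonstration with two throwaway files
dir_tmp <- tempfile("checksums_demo")
dir.create(dir_tmp)
paths <- file.path(dir_tmp, c("a.txt", "b.txt"))
writeLines("first file", paths[1])
writeLines("second file", paths[2])

tbl <- checksum_table(paths)
intact <- check_file(paths[1], tbl)   # unchanged file
writeLines("tampered", paths[2])
changed <- check_file(paths[2], tbl)  # modified file

unlink(dir_tmp, recursive = TRUE)
```

The sha256 column is never read by `check_file()`; it is only kept as independent reference data.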

hansvancalster commented 3 years ago

I vote for md5 as security is not an issue here - only file integrity - and difference in speed of the algorithms is also not important here.

cecileherr commented 3 years ago

I don't really have a strong opinion on this question; I will follow the opinion of the majority.

florisvdh commented 3 years ago

Following up on @hansvancalster's suggestion to have a look at xxHash; the specifics of xxHash fit better in this issue.

Discussion of some properties

Independent documentation and testing of xxHash appears to be limited. Most information comes from the project itself at https://github.com/Cyan4973/xxHash.

xxHash provides a collection of non-cryptographic hash functions (XXH32, XXH64, and recently, XXH3 and its 128-bit variant XXH128). Unlike md5 and sha256 (cryptographic hash functions), they are not designed to 'obscure' the relation between the hash and the original data, i.e. to prevent the hash from revealing anything about its input, IIUC. For verifying file integrity, this doesn't matter.

What matters more is the sensitivity to small changes, and uniqueness (a low probability that different files get the same hash).
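To make 'uniqueness' a bit more tangible, the following sketch computes two probabilities for different hash widths. It assumes an idealized hash function whose outputs are uniformly distributed, which is an assumption and not a measured property of xxHash. The function names and the number of files are made up.

```r
# Chance that one specific corrupted file still yields the stored hash value
p_missed_corruption <- function(bits) 2^-bits

# Birthday approximation: chance that at least two out of n distinct files
# share a hash value, i.e. 1 - exp(-n * (n - 1) / 2^(bits + 1))
p_any_collision <- function(n, bits) -expm1(-n * (n - 1) / 2^(bits + 1))

bits <- c(XXH32 = 32, XXH64 = 64, XXH128 = 128)  # md5 is also 128 bits wide
n_files <- 1e6                                   # arbitrary illustration

result <- data.frame(
  algo = names(bits),
  missed_corruption = p_missed_corruption(bits),
  any_collision = p_any_collision(n_files, bits),
  row.names = NULL
)
print(result, digits = 3)
```

With these idealized numbers, a 32-bit hash is too narrow to distinguish a million files reliably, whereas 64 bits or more leave a negligible collision probability.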

In R, the digest package provides XXH32 and XXH64, but XXH3 and XXH128 are not yet implemented there. BTW, they will never be in openssl: OpenSSL only implements cryptographic hash functions.

Importantly, xxHash is far faster than md5 and sha256 (XXH3/XXH128 are faster still, but not available in R for file hashing). See https://github.com/inbo/n2khab/pull/122#issuecomment-804125207 for timings in R.

Stability of the implementation in R

The xxHash support in digest hardcodes an older version of xxHash inside the package. openssl, on the other hand, binds to the (external) libssl library, so it does not hardcode the algorithms itself, which is a more robust approach IMHO.

If we had an implementation of xxHash in R, analogous to openssl, that actually binds to the libxxhash library rather than hardcoding an older version of xxHash, I would hesitate less, with a preference for the XXH128 algorithm (currently not in digest).

Concluding remark

So in the end, more confusion :thinking:?

I think we should at least support the use of more recent and faster file hashing algorithms, if not use them by default. We should also take into account that our preferred choice in n2khab may evolve, i.e. avoid hardcoding that choice in more than one place.
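One way to avoid hardcoding the choice in more than one place is a single wrapper that all other code calls. The sketch below is hypothetical: the option name `n2khab.hash_algo` and the functions `default_hash_algo()` and `file_hash()` do not exist in n2khab. It relies on `digest::digest()` with `file = TRUE`, which supports `"md5"`, `"sha256"` and `"xxhash64"`, among others.

```r
# Hypothetical single point of configuration for the preferred algorithm.
# Assumption: algorithm names are those accepted by digest::digest().
default_hash_algo <- function() {
  getOption("n2khab.hash_algo", default = "md5")  # option name is made up
}

# Hypothetical wrapper, to be used instead of calling a hash function directly
file_hash <- function(path, algo = default_hash_algo()) {
  stopifnot(file.exists(path))
  digest::digest(path, algo = algo, file = TRUE)
}

# Demonstration: switching the default without touching any calling code
tmp <- tempfile()
writeLines("some reference data", tmp)

h_md5 <- file_hash(tmp)                       # uses the md5 default
same_as_base <- identical(h_md5, unname(tools::md5sum(tmp)))

old <- options(n2khab.hash_algo = "xxhash64")
h_xxh <- file_hash(tmp)                       # now uses xxhash64
options(old)                                  # restore the previous setting

unlink(tmp)
```

Changing the preferred algorithm later would then only require changing the default in `default_hash_algo()`.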

florisvdh commented 3 years ago

Update after a small experiment with XXH64 and XXH128, using a large file of 2.7 GB, to test how sensitive the hash function is to the smallest possible change (one bit) in a large file. It seems very promising: once the file is loaded from disk, the xxh64 and xxh128 calculations not only take just about a second (or less), but the hash value is also completely different after flipping one bit (tried three times, with different bits). It is also demonstrated that flipping the bit back restores the original hash values.

The bitflip utility is the executable python script from https://unix.stackexchange.com/a/196263 and its numeric argument is the number of the selected bit.

flipping bits of large file and effect on xxHash value

```bash
$ xxhsum --version
xxhsum 0.7.3 (64-bit x86_64 + SSE2 little endian), GCC 9.3.0, by Yann Collet
$
$ ls -lh hirsute-desktop-amd64.iso
-rw-rw---- 1 floris floris 2,7G mrt 29 16:52 hirsute-desktop-amd64.iso
$
$ xxh64sum hirsute-desktop-amd64.iso # original checksum
d4252122926c351a  hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
d8a90b17b19941634efd03e4633a31bb  hirsute-desktop-amd64.iso
$
$ bitflip hirsute-desktop-amd64.iso 73513453 # flip bit
$
$ xxh64sum hirsute-desktop-amd64.iso
b934ff48481a4fe9  hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
dabffcb9b6ef8d9665d035af0af7d179  hirsute-desktop-amd64.iso
$
$ bitflip hirsute-desktop-amd64.iso 73513453 # restore bit
$
$ xxh64sum hirsute-desktop-amd64.iso
d4252122926c351a  hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
d8a90b17b19941634efd03e4633a31bb  hirsute-desktop-amd64.iso
$
$ bitflip hirsute-desktop-amd64.iso 12345654 # flip bit
$
$ xxh64sum hirsute-desktop-amd64.iso
b594dc513f3c7321  hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
541819a95d06d4b5ffa8fca70d8e9971  hirsute-desktop-amd64.iso
$
$ bitflip hirsute-desktop-amd64.iso 12345654 # restore bit
$
$ xxh64sum hirsute-desktop-amd64.iso
d4252122926c351a  hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
d8a90b17b19941634efd03e4633a31bb  hirsute-desktop-amd64.iso
$
$ bitflip hirsute-desktop-amd64.iso 147852369 # flip bit
$
$ xxh64sum hirsute-desktop-amd64.iso
79f15e384d23459e  hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
d7692da02a17da15f8cb4779b8b68f8e  hirsute-desktop-amd64.iso
$
$ bitflip hirsute-desktop-amd64.iso 147852369 # restore bit
$
$ xxh64sum hirsute-desktop-amd64.iso
d4252122926c351a  hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
d8a90b17b19941634efd03e4633a31bb  hirsute-desktop-amd64.iso
```
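The same experiment can be approximated in R on a small scale. The sketch below is not the procedure used above: `flip_bit()` is a hypothetical helper that reads the whole file into memory (fine for a small temporary file, not for a 2.7 GB one), its zero-based bit numbering is an assumption that may differ from the bitflip script, and only XXH64 is used, via `digest::digest()` with `algo = "xxhash64"`.

```r
# Hypothetical helper: flip one bit of a (small) file.
# `bit` is the zero-based bit index counted from the start of the file;
# within each byte, bit 0 is taken to be the least significant bit.
flip_bit <- function(path, bit) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  i <- bit %/% 8 + 1  # 1-based index of the affected byte
  mask <- as.raw(bitwShiftL(1L, as.integer(bit %% 8)))
  bytes[i] <- xor(bytes[i], mask)
  writeBin(bytes, path)
  invisible(path)
}

xxh64 <- function(path) digest::digest(path, algo = "xxhash64", file = TRUE)

# Throwaway file with 100 000 random bytes
set.seed(1)
tmp <- tempfile(fileext = ".bin")
writeBin(as.raw(sample(0:255, 1e5, replace = TRUE)), tmp)

h_original <- xxh64(tmp)
flip_bit(tmp, 12345)     # flip bit
h_flipped <- xxh64(tmp)
flip_bit(tmp, 12345)     # restore bit
h_restored <- xxh64(tmp)

unlink(tmp)
```

As in the shell session, the flipped file should give a different hash value, and `h_restored` equals `h_original`.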