florisvdh opened 3 years ago
@w-jan @hansvancalster @cecileherr @ToonHub any preference? Zenodo metadata provide md5, Git LFS uses sha256; we can compute and test both, although one suffices. The most important difference is with respect to security (hash collision resistance), see above, while both do a perfect job with respect to file integrity verification. But verifying both checksums is absolute overkill; in that case you might as well restrict to sha256 alone, the most secure.
What we could do is store both checksums in the future built-in checksums table of n2khab (as we already do in a Google Sheet for now), in order to have independent data ourselves if ever needed, but only use md5 in functions and (data-generating/checking) scripts, because md5 is also the one used by Zenodo. That's my current suggestion.
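For illustration, a minimal sketch (with a hypothetical file path) of computing both checksums with the openssl package, which streams the file through a connection:

``` r
# Hypothetical sketch: compute both checksums for one file with openssl.
# Hashing via file() streams the file instead of keeping it in an R object;
# paste(..., collapse = "") turns the raw hash into a hex string.
library(openssl)
path <- "data/some_raw_data_file.gpkg"   # hypothetical file path
checksums <- c(
  md5    = paste(md5(file(path)), collapse = ""),
  sha256 = paste(sha256(file(path)), collapse = "")
)
checksums
```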
I vote for md5, as security is not an issue here - only file integrity - and the difference in speed between the algorithms is also not important here.
I don't really have a strong opinion on this question; I will follow the opinion of the majority.
Continuation of @hansvancalster's suggestion to have a look at xxHash; the specifics of xxHash fit better in this issue.
Independent documentation and testing of xxHash appears to be limited. Most information can be found from the project itself at https://github.com/Cyan4973/xxHash.
xxHash provides a collection of non-cryptographic hash functions (XXH32, XXH64, and recently, XXH3 and its 128-bit variant XXH128). Compared to md5 and sha256 (cryptographic hash functions) this means, IIUC, that the hash does not aim to obscure the original data, i.e. to make it infeasible to recover or guess the input. For verifying file integrity, this doesn't matter.
What matters more, is the sensitivity to small changes, and uniqueness.
In R, XXH3 and XXH128 are not yet implemented in the digest package. BTW, they will never be in openssl: OpenSSL implements cryptographic hash functions.
Importantly, xxHash has far superior speed to md5 and sha256 (with XXH3/XXH128 even much faster, but not available in R for file hashing). See https://github.com/inbo/n2khab/pull/122#issuecomment-804125207 for timings in R.
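As an illustration (not the timings from the linked comment), a sketch of how such a comparison can be run with digest, which already supports XXH32/XXH64 for files; the file path is hypothetical:

``` r
# Hypothetical sketch: compare wall-clock time of file hashing algorithms
# with digest(); file = TRUE hashes the file on disk directly.
library(digest)
path <- "large_file.zip"   # hypothetical large file
system.time(digest(path, algo = "md5", file = TRUE))
system.time(digest(path, algo = "sha256", file = TRUE))
system.time(digest(path, algo = "xxhash64", file = TRUE))
```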
The current R implementations hardcode the xxHash code rather than binding to the (external) libxxhash library, either in digest (last update: 3 years ago: xxHash 0.6.5, providing XXH32 & XXH64) or in the very recent xxhashlite, which is on par with xxHash (0.8.0). The latter has the 128-bit XXH128 algorithm, but it is only implemented for R objects, not for files, and to me it's unclear how long it will remain maintained. On the other hand, openssl binds to the (external) libssl library, so it does not hardcode the algorithms itself, which is a more robust approach IMHO.
If we'd have an implementation of xxHash in R, analogous to openssl, that actually binds to the libxxhash library rather than hardcoding an older version of xxHash, I would hesitate less, with a preference for the XXH128 algorithm (currently not in digest).
So in the end, more confusion :thinking:?
I think we should at least support using more recent and faster file hashing algorithms, if not use them by default. And we should take into account that our preferred choice in n2khab may evolve, i.e. not hardcode that choice in more than one place (see the sketch below).
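A hypothetical sketch of what 'one place' could look like; the constant and helper names are made up for illustration:

``` r
# Hypothetical sketch: keep the algorithm choice in a single internal constant,
# so functions and scripts call checksum() and never hardcode "md5" themselves.
library(digest)

n2khab_hash_algo <- "xxhash64"   # the single place where the choice can evolve

checksum <- function(files) {
  # return a named character vector of file hashes
  vapply(files,
         function(f) digest(f, algo = n2khab_hash_algo, file = TRUE),
         character(1))
}
```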
Update after a small experiment with XXH64 and XXH128, using a large file of 2.7 GB, to test how sensitive the hash function is to the smallest possible change (one bit) in a large file. It seems very promising: once the file is loaded from disk, the xxh64 and xxh128 calculations take approximately a second (or less), and the hash value is completely different after flipping one bit (tried three times, with different bits). It is also demonstrated that flipping the bit back restores the original hash values.
The bitflip utility is the executable Python script from https://unix.stackexchange.com/a/196263; its numeric argument is the number of the selected bit.
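For reference, a comparable check can be mimicked entirely in R; below is a minimal sketch using the xxhash64 algorithm from digest (XXH128 is not available for files in R, as noted above), with a hypothetical file path and flipping the lowest bit of the first byte:

``` r
# Hypothetical sketch of the bit-flip sensitivity check, done in R itself.
library(digest)

flip_first_bit <- function(path) {
  # read the first byte, flip its lowest bit and write it back in place
  byte <- readBin(path, what = "raw", n = 1)
  con <- file(path, open = "r+b")      # open for in-place binary update
  on.exit(close(con))
  writeBin(xor(byte, as.raw(0x01)), con)
}

path <- "large_file.zip"                      # hypothetical large file
digest(path, algo = "xxhash64", file = TRUE)  # original hash

flip_first_bit(path)
digest(path, algo = "xxhash64", file = TRUE)  # completely different hash

flip_first_bit(path)                          # flip the same bit back
digest(path, algo = "xxhash64", file = TRUE)  # original hash restored
```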
A bit of elaboration on the file checksum topic as a reference for later - comments most welcome! The discussion originated in https://github.com/inbo/n2khab-preprocessing/pull/50 but is of broader relevance given the future n2khab intentions. Currently we still keep track of both checksums, e.g. with compute_filehashes.R at a7fafb8.
Experiments
First, a small experiment, repeated 3 times, on a 1.2 GiB file. Not shown is the first run, where md5 (because it was run first) took about 15 s, which is simply due to reading the file into memory for the first (and only) time - from my slow HDD, that is.
Created on 2021-02-11 by the reprex package (v1.0.0)
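As an illustration of how such a comparison can be set up with openssl (hypothetical file path, not the original reprex code):

``` r
# Hypothetical sketch of the timing runs with openssl, repeated 3 times.
library(openssl)
path <- "file_of_1.2_GiB"   # hypothetical file path

for (i in 1:3) {
  print(system.time(md5(file(path))))
  print(system.time(sha256(file(path))))
}
```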
Session info
``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting  value
#> version  R version 4.0.3 (2020-10-10)
#> os       Linux Mint 20
#> system   x86_64, linux-gnu
#> ui       X11
#> language nl_BE:nl
#> collate  nl_BE.UTF-8
#> ctype    nl_BE.UTF-8
#> tz       Europe/Brussels
#> date     2021-02-11
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package     * version    date       lib source
#> askpass       1.1        2019-01-13 [1] CRAN (R 4.0.2)
#> assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.2)
#> class         7.3-18     2021-01-24 [4] CRAN (R 4.0.3)
#> classInt      0.4-3      2020-04-07 [1] CRAN (R 4.0.2)
#> cli           2.3.0      2021-01-31 [1] CRAN (R 4.0.3)
#> crayon        1.4.1      2021-02-08 [1] CRAN (R 4.0.3)
#> DBI           1.1.1      2021-01-15 [1] CRAN (R 4.0.3)
#> digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.3)
#> dplyr         1.0.4      2021-02-02 [1] CRAN (R 4.0.3)
#> e1071         1.7-4      2020-10-14 [1] CRAN (R 4.0.3)
#> ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.2)
#> forcats       0.5.1      2021-01-27 [1] CRAN (R 4.0.3)
#> fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)
#> generics      0.1.0      2020-10-31 [1] CRAN (R 4.0.3)
#> git2r         0.28.0     2021-01-10 [1] CRAN (R 4.0.3)
#> git2rdata     0.3.1      2021-01-21 [1] CRAN (R 4.0.3)
#> glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)
#> highr         0.8        2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.0.3)
#> KernSmooth    2.23-18    2020-10-29 [4] CRAN (R 4.0.3)
#> knitr         1.31       2021-01-27 [1] CRAN (R 4.0.3)
#> lifecycle     0.2.0      2020-03-06 [1] CRAN (R 4.0.2)
#> magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.0.3)
#> n2khab      * 0.3.1.9000 2021-02-02 [1] local
#> openssl     * 1.4.3      2020-09-18 [1] CRAN (R 4.0.2)
#> pillar        1.4.7      2020-11-20 [1] CRAN (R 4.0.3)
#> pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.2)
#> plyr          1.8.6      2020-03-03 [1] CRAN (R 4.0.2)
#> purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.2)
#> R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.3)
#> Rcpp          1.0.6      2021-01-15 [1] CRAN (R 4.0.3)
#> reprex        1.0.0      2021-01-27 [1] CRAN (R 4.0.3)
#> rlang         0.4.10     2020-12-30 [1] CRAN (R 4.0.3)
#> rmarkdown     2.6        2020-12-14 [1] CRAN (R 4.0.3)
#> rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.0.3)
#> rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.0.3)
#> sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.2)
#> sf            0.9-7      2021-01-06 [1] CRAN (R 4.0.3)
#> stringi       1.5.3      2020-09-09 [1] CRAN (R 4.0.2)
#> stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.2)
#> tibble        3.0.6      2021-01-29 [1] CRAN (R 4.0.3)
#> tidyr         1.1.2      2020-08-27 [1] CRAN (R 4.0.2)
#> tidyselect    1.1.0      2020-05-11 [1] CRAN (R 4.0.2)
#> units         0.6-7      2020-06-13 [1] CRAN (R 4.0.2)
#> vctrs         0.3.6      2020-12-17 [1] CRAN (R 4.0.3)
#> withr         2.4.1      2021-01-26 [1] CRAN (R 4.0.3)
#> xfun          0.20       2021-01-06 [1] CRAN (R 4.0.3)
#> yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
```

So actually it won't make much difference in terms of time to calculate - you need a very large file (as above) to notice the difference (about 1.3 s). Compare with a 95.4 MiB file - the difference is not large (0.1 s).
Created on 2021-02-11 by the reprex package (v1.0.0)
So calculation times will differ appreciably between md5 and sha256 only when handling a bunch of (larger) files at once, or for a much larger file (which we currently don't use).
Which one to choose?
Background information
Some background information comes from Wikipedia (especially here).
- both are examples of cryptographic hash functions. The hash is not unique by design, so theoretically, for a given hash, it is possible - not necessarily feasible - to create different inputs that produce the same hash (= checksum): a hash collision.
- for use in security applications (e.g. when hashing a password), it is important that it is infeasible to calculate a list of colliding inputs - hence to try guessing the possible input.
- again for security purposes, SHA-2 (of which SHA-256 is one algorithm) has much better collision resistance (compared to MD5, but also to SHA-1), i.e. in terms of the feasibility of finding collisions. It is used where security is important, e.g.:
On the topic of data integrity:
See also https://en.wikipedia.org/wiki/File_verification, which describes the difference between integrity and authenticity verification.
Concluding thoughts
So, just for verifying file integrity in a trusted context (like ours), it does not actually matter which one we choose.
Opinions do differ on this, e.g. in https://stackoverflow.com/q/14139727. It mainly comes down to the question: do you want it to be secure as well? The following states it rather well IMO: