hibari / hibari-brick-rs

A fast, embedded, ordered key-value store for big and small values
Other
10 stars 1 forks source link

Replace MD5 checksum on blob with a safe hasher #9

Closed tatsuya6502 closed 7 years ago

tatsuya6502 commented 7 years ago

In order to verify data integrity in blobs (large binary values), we currently use MD5 digest. MD5 algorithm was chosen because:

  1. Historical reason: Hibari's original storage layer (gdss-brick), which was developed between 2004 and 2010, employed MD5 for this purpose.
  2. MD5 is fast to calculate and small (128 bit).
  3. Calculated MD5 could be also used as Amazon S3 style Etag.

However, MD5 is no longer recommended as secure hash because hash collisions can be easily created today (example).

Replace MD5 with a safe hasher, such as SHA-2, SHA-3 or Blake2. Not decided yet but maybe we will prefer 256 bit digest over 512 bit digest to save storage spaces (so SHA-256, SHA-3-256, or Blake2s).

References:

tatsuya6502 commented 7 years ago

Not decided yet but maybe we will prefer 256 bit digest over 512 bit digest to save storage spaces (so SHA-256, SHA-3-256, or Blake2s).

I am not a security pro; so I will just assume these hash functions provide similar level of cryptographic strength. If this assumption is correct, then I would choose the fastest hash function. I wrote a simple benchmark program in Rust and ran it on hibari-brick-rs's main target platform, a modern x86_64 processor running 64-bit Linux. I found Blake2s is the fastest.

Input data length: 8.00 GB
SHA-256   - 38.16 seconds (38162777286 nano-seconds), digest: "69de2109a91cd2dccf6d1ec447fa3975c0c44ab32459e178e31569bff09ce9d1"
SHA-3-256 - 74.51 seconds (74507281950 nano-seconds), digest: "8e41cb790bc5c80b3fe339f17d1b1c8871c0afe888a0e2692efbdc7e55656203"
Blake2s   - 18.57 seconds (18571153963 nano-seconds), digest: "ccae52d11f9e49ba565635be5e88533444392f869f9d0fc1b66d76bc85b33042"
SHA-256: 1.00x, SHA-3-256: 0.51x, Blake2s: 2.05x

For SHA-256 implementation, I also tried its asm feature to enable hand-written assembly implementation. However, in my environment, the assembly implementation always ran slightly slower than Rust implementation.

% cargo run --release
   Compiling gcc v0.3.46
   Compiling sha2-asm v0.2.1
   Compiling sha2 v0.5.2
   Compiling hashes v0.1.0 (file:///home/tatsuya/.../hashes)
    Finished release [optimized] target(s) in 4.72 secs
     Running `target/release/hashes`
Input data length: 8.00 GB
SHA-256   - 39.89 seconds (39887634463 nano-seconds), digest: "69de2109a91cd2dccf6d1ec447fa3975c0c44ab32459e178e31569bff09ce9d1"

Blake2 algorithm is optimized for software implementation on modern processors. For example, the above Rust implementation utilizes SIMD instructions. I think that is why it was the fastest.

In contrast, it is said that SHA-3 (Keccak) algorithm is good for hardware implementation, such as ASIC and FPGA.

It would be interesting to make the hash function pluggable in this project (hibari-brich-rs). But for now, I will just pick Blake2s for the hash function for storage checksum.

Crates Used

They are written by the same author.

Environment

/home/tatsuya% uname -a
Linux mini-arch 4.10.13-1-ARCH #1 SMP PREEMPT Thu Apr 27 12:15:09 CEST 2017 x86_64 GNU/Linux
/home/tatsuya% nproc 
8
/home/tatsuya% cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
stepping    : 9
microcode   : 0x15
cpu MHz     : 3249.523
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
bugs        :
bogomips    : 5190.21
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
...
tatsuya6502 commented 7 years ago

I was just curious how fast Blake2s is compared to MD5. I added MD5 to the benchmark and got a competing result. Note that the hand-written assembly implementation of MD5 runs ~5% faster than Rust implementation of MD5.

MD5 Rust Implementation (md-5 crate)

% cargo run --release
    Updating registry `https://github.com/rust-lang/crates.io-index`
   Compiling md-5 v0.4.3
   Compiling hashes v0.1.0 (file:///home/tatsuya/.../hashes)
    Finished release [optimized] target(s) in 0.48 secs
     Running `target/release/hashes`
Input data length: 8.00 GB
MD5       - 17.96 seconds (17963788000 nano-seconds), digest: "705018a32b2a413ec7fd8dedc5dd101a"
SHA-256   - 39.00 seconds (38995530431 nano-seconds), digest: "69de2109a91cd2dccf6d1ec447fa3975c0c44ab32459e178e31569bff09ce9d1"
SHA-3-256 - 75.52 seconds (75517671669 nano-seconds), digest: "8e41cb790bc5c80b3fe339f17d1b1c8871c0afe888a0e2692efbdc7e55656203"
Blake2s   - 19.80 seconds (19796446068 nano-seconds), digest: "ccae52d11f9e49ba565635be5e88533444392f869f9d0fc1b66d76bc85b33042"
MD5: 2.17x, SHA-256: 1.00x, SHA-3-256: 0.52x, Blake2s: 1.97x

MD5 Hand-written Assembly Implementation (md-5 crate with md-5-asm crate)

% cargo run --release
 Downloading md5-asm v0.2.1
   Compiling md5-asm v0.2.1
   Compiling md-5 v0.4.3
   Compiling hashes v0.1.0 (file:///home/tatsuya/.../hashes)
    Finished release [optimized] target(s) in 1.20 secs
     Running `target/release/hashes`
Input data length: 8.00 GB
MD5       - 16.85 seconds (16850785623 nano-seconds), digest: "705018a32b2a413ec7fd8dedc5dd101a"
SHA-256   - 38.40 seconds (38404376383 nano-seconds), digest: "69de2109a91cd2dccf6d1ec447fa3975c0c44ab32459e178e31569bff09ce9d1"
SHA-3-256 - 75.77 seconds (75772153318 nano-seconds), digest: "8e41cb790bc5c80b3fe339f17d1b1c8871c0afe888a0e2692efbdc7e55656203"
Blake2s   - 18.64 seconds (18638122420 nano-seconds), digest: "ccae52d11f9e49ba565635be5e88533444392f869f9d0fc1b66d76bc85b33042"
MD5: 2.28x, SHA-256: 1.00x, SHA-3-256: 0.51x, Blake2s: 2.06x
tatsuya6502 commented 7 years ago

One more thing.

In addition to Blake2s, I tried Blake2b algorithm. Blake2b is optimized for 64-bit processors while Blake2s is optimized for 32-bit processors.

I tried two different Rust implementations, one from blake2 crate, and another from blake2-rfc crate. Blake2b specification supports output digest length from 1 byte to 64 bytes (512 bits). The one in blake2 crate supports only 64-byte digest. The one in blake2-rfc crate supports other digest lengths.

Here is the result.

Generating a random input data (length: 8.00 GB)
Generated the random input data in 55.60 seconds (55603870343 nano-seconds)

MD5 (asm)      - 16.69 seconds (16692586716 nano-seconds), digest: "705018a32b2a413ec7fd8dedc5dd101a"
SHA-256 (asm)  - 38.67 seconds (38669932913 nano-seconds), digest: "69de2109a91cd2dccf6d1ec447fa3975c0c44ab32459e178e31569bff09ce9d1"
SHA-3-256      - 74.75 seconds (74753649469 nano-seconds), digest: "8e41cb790bc5c80b3fe339f17d1b1c8871c0afe888a0e2692efbdc7e55656203"
Blake2s        - 18.61 seconds (18611652197 nano-seconds), digest: "ccae52d11f9e49ba565635be5e88533444392f869f9d0fc1b66d76bc85b33042"
Blake2b (256): - 11.71 seconds (11712793972 nano-seconds), digest: "7ce3ae3057db72f0061f6708917fe59dc3a78fc93621672371a4bf24aa491a34"
SHA-512 (asm)  - 23.83 seconds (23832185958 nano-seconds), digest: "f6c6b185590487c438e2edfc86e64b32839907514c2f05a70c1bb5715859003c3bd27d2f6fc3de6f2628458e2a99f4740f53dc212674c656ac8da450629e099c"
SHA-3-512      - 133.40 seconds (133402463870 nano-seconds), digest: "d5a730ca9d5a0081974c9dab0c86b8852b0ba8e056a645d10d1daded84eae35d2ad9e59369300d34400a55d30aa3bf5e3c1ba5d4aabbd2c4cdf679707f1cca27"
Blake2b        - 11.72 seconds (11721107983 nano-seconds), digest: "2fc585a49cc8e54f335498b13198768bb4c2cd26e55c0b72d2fb48cde49fb4847e20f0e1184b49492275abe643849f29c0eb3140cefd0ded2e3428df86492dec"

Speed ups:
  MD5 (asm):     2.32x
  SHA-256 (asm): 1.00x
  SHA-3-256:     0.52x
  Blake2s (256)  2.08x
  Blake2b (256): 3.30x
  SHA-512 (asm): 1.62x
  SHA-3-512:     0.29x
  Blake2b (512): 3.30x

Blake2b from both crates outperformed MD5. The digest length seems to have no impact to performance; they both 3.3 times faster than the assembly implementation of SHA-256.

Conclusion

We will use Blake2b algorithm for the best performance on 64-bit processors, with 32 bytes (256 bits) digest for space efficiency.