sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0
1.91k stars 132 forks source link

should add the blake3 hash algorithm #424

Closed divinity76 closed 3 years ago

divinity76 commented 4 years ago

Blake3 is the latest iteration of the blake/blake2 hash, and it's *much* faster than the competition (except KangarooTwelve at least): image

i think it will be a good alternative for people who want something faster than blake2b but still cryptographically secure (of which metro/murmur/xxhash is not)

blake3's homepage: https://github.com/BLAKE3-team/BLAKE3/

it's a cryptographically-secure hash designed for speed, some highlights:

sahib commented 4 years ago

Sure, more hash options are always good. I would welcome a PR for this.

jernkuan commented 4 years ago

Initial test with an integrated blake3 is slower than a blake2b running on a 2015 mbp(SSE41 instructions) with a 4GB file. Will have to dig abit more on blake2b and blake3 as a standalone library to test.

divinity76 commented 4 years ago

@jernkuan well that was unexpected, the official repo has 3 different implementations, an optimized rust implementation, an optimized C implementation, and an unoptimized-but-easy-to-read "reference" rust implementation, any chance you're rolling the unoptimized version? also what do you get from

rm -f 4gtest;
truncate -s 4G 4gtest;
time b2sum 4gtest;
time b2sum 4gtest;
time b3sum 4gtest;
time b3sum 4gtest;

? on an ancient-ish 2x Xeon L5520 CPUs from 2009 with SSE4.2 (whoa, didn't know SSE4.2 was that old!) i get:

root@honningas:~/test# rm -f 4gtest;
root@honningas:~/test# truncate -s 4G 4gtest;
root@honningas:~/test# time b2sum 4gtest;
645572ca5756f9104329ed543735fc11904f0c18c4df8adf930f22d07f3094919a519ff34fd240ae3f5d5b4c8042225c109fb951036fdc99e7d2cd0c1d36b267  4gtest

real    0m11.721s
user    0m9.498s
sys     0m2.215s
root@honningas:~/test# time b2sum 4gtest;
645572ca5756f9104329ed543735fc11904f0c18c4df8adf930f22d07f3094919a519ff34fd240ae3f5d5b4c8042225c109fb951036fdc99e7d2cd0c1d36b267  4gtest

real    0m10.405s
user    0m9.502s
sys     0m0.900s
root@honningas:~/test# time b3sum 4gtest;
7dde7c9fed144013fedbe2b0bbf2d82f004b60b589485851cdec29b27be408d7  4gtest

real    0m0.635s
user    0m4.237s
sys     0m0.220s
root@honningas:~/test# time b3sum 4gtest;
7dde7c9fed144013fedbe2b0bbf2d82f004b60b589485851cdec29b27be408d7  4gtest

real    0m0.621s
user    0m4.217s
sys     0m0.216s

(could also point out that there's a huge difference between 0.6 seconds and 10.4 seconds, but it's not exactly fair because blake2b used only 1 core, but blake3 used all 8 cores, i can tell blake3 to only use 1 core for comparison if wanted)

on a modern-ish i7-8565U laptop CPU from 2018 with AVX2+SSE4.2 (but no AVX512) i get:

$ rm -f 4gtest;
$ truncate -s 4G 4gtest;
$ time b2sum 4gtest;
645572ca5756f9104329ed543735fc11904f0c18c4df8adf930f22d07f3094919a519ff34fd240ae3f5d5b4c8042225c109fb951036fdc99e7d2cd0c1d36b267 *4gtest

real    0m6.283s
user    0m5.624s
sys     0m0.625s
$ time b2sum 4gtest;
645572ca5756f9104329ed543735fc11904f0c18c4df8adf930f22d07f3094919a519ff34fd240ae3f5d5b4c8042225c109fb951036fdc99e7d2cd0c1d36b267 *4gtest

real    0m6.217s
user    0m5.531s
sys     0m0.671s
$ time b3sum 4gtest;
7dde7c9fed144013fedbe2b0bbf2d82f004b60b589485851cdec29b27be408d7  4gtest

real    0m1.925s
user    0m0.000s
sys     0m0.016s
$ time b3sum 4gtest;
7dde7c9fed144013fedbe2b0bbf2d82f004b60b589485851cdec29b27be408d7  4gtest

real    0m2.036s
user    0m0.000s
sys     0m0.016s
jernkuan commented 4 years ago

@divinity76
I finally got around testing the actual blake2b and blake3 code standalone, it was much faster on SSE4.1, but not so much without SSE4.1.

Digging a bit deeper, the blake3 algo when compiled(SSE41) together within rmlint is running faster compared to blake2b. However when integrated into rmlint, there seems to be some synchronization that causing it to be slower. Still digging into rmlint code to understand better what's happening.

The integration code is actually quite straightforward(assuming i did things correctly :) ), i can push to a fork if you want to dig in as well.

SeeSpotRun commented 3 years ago

Have done a trial implementation at https://github.com/SeeSpotRun/rmlint/tree/blake3

Comparing with the 256-bit BLAKE2 algorithms:

Quick benchmarks for 16 * truncate -s 1G files:

$ /usr/bin/time rmlint 1gtest* -o summary -a <algo>

BLAKE2S: 109.42user 8.90system 0:44.73elapsed 264%CPU (0avgtext+0avgdata 88044maxresident)k BLAKE2SP: 115.38user 3.88system 0:46.57elapsed 256%CPU (0avgtext+0avgdata 92464maxresident)k BLAKE3: 6.66user 4.15system 0:02.79elapsed 387%CPU (0avgtext+0avgdata 139644maxresident)k

so more than 10 times quicker in pure hashing speed.

Quick benchmarks for real-world I/O-bound (SSD) application:

$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches 
$ /usr/bin/time rmlint /usr -o summary -a <algo>
Average of 20 runs: ALGO Elapsed CPU% Max Res
BLAKE2S 27.2 140 397k
BLAKE2SP 26.1 174 851k
BLAKE3 26.6 71 895k

So in real-world I think it's fair to say that execution time is I/O bound (if the files aren't cached), but blake3 has a significantly lower cpu load.

I'm reluctant to switch the default hash to blake3 because it's only half the length of the current default blake2b. For comparison: ALGO Elapsed CPU% Max Res
BLAKE3 26.6 71 895k
BLAKE2B 26.4 119 446k

All of the parallelized variants seem pretty memory-hungry.

Also worth noting that we are currently using the 'reference' BLAKE2 implementations, not the SSE-optimised ones.

Edit: above benches on AMD Ryzen 5 5600X 6-Core Processor

SeeSpotRun commented 3 years ago

there seems to be some synchronization that causing it to be slower

I suspect the overhead for md-scheduler. This give significant gains when reading from rotational disks and/or files from multiple disks, but in simple cases it adds some overhead.

divinity76 commented 3 years ago

I'm reluctant to switch the default hash to blake3 because it's only half the length of the current default blake2b.

first off, blake3 is variable-length, and they are all supposed to be cryptographically secure, so the concern should be the birthday paradox, if we assume that someone is running rmlint on a maxed-out 64bit inode filesystem (like btrfs), here's the chance of a birthday collision with 2**64 files based on hash length:

2 bytes (16 bits): 100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
3 bytes (24 bits): 100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
4 bytes (32 bits): 100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
5 bytes (40 bits): 100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
6 bytes (48 bits): 100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
7 bytes (56 bits): 100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
8 bytes (64 bits): 100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
9 bytes (72 bits): 100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
10 bytes (80 bits): 100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
11 bytes (88 bits): 100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
12 bytes (96 bits): 100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
13 bytes (104 bits): 100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
14 bytes (112 bits): 100.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
15 bytes (120 bits): 99.9999999999999999999999999999999999999999999999999999974277906273575851562551628356447378966618610200
16 bytes (128 bits): 39.3469340287366576382767925051732039111443153312804614295448573882265327272181291266477621149279583500
17 bytes (136 bits): 0.1951218892524527289843252381047865603764957222291271541480405006250432867062615884385798395379814800
18 bytes (144 bits): 0.0007629365427493557992860364860221112540909113715577924612454205239852919524216540352421599644821300
19 bytes (152 bits): 0.0000029802321943606107060124716601695555770480242114789127059106834577558114278058404382818849390000
20 bytes (160 bits): 0.0000000116415321820158550869336177541161192708042329396354372821407046099857585335768804584013669800
21 bytes (168 bits): 0.0000000000454747350886360721356472707553077383794388148388467031092149898899242894843511891589892000
22 bytes (176 bits): 0.0000000000001776356839400248886867060739748432641003887840996012082625529387068758566288828246623200
23 bytes (184 bits): 0.0000000000000006938893903907228353122013439546439654554534861148024868828951568601096253132705644900
24 bytes (192 bits): 0.0000000000000000027105054312137610848504151844244226582645015901891537094464883977706983309067526600
25 bytes (200 bits): 0.0000000000000000000105879118406787542380559818153552580761590493204820090151213597342039664118779500
26 bytes (208 bits): 0.0000000000000000000000413590306276513837412019011322033428139921844672864993463944210874339175696900
27 bytes (216 bits): 0.0000000000000000000000001615587133892632177363699581708083450556373782285492737794813680820086690800
28 bytes (224 bits): 0.0000000000000000000000000006310887241768094442873768259672103428738043488813587964420920500179336500
29 bytes (232 bits): 0.0000000000000000000000000000024651903288156618917904277536152490969617930133521422178837855248989900
30 bytes (240 bits): 0.0000000000000000000000000000000096296497219361792646668335845222979331712244829793887159929935484900
31 bytes (248 bits): 0.0000000000000000000000000000000000376158192263132002532368596032914379779622178807755678094035179900
32 bytes (256 bits): 0.0000000000000000000000000000000000001469367938527859384926580837241064625725400753694854192313141500
33 bytes (264 bits): 0.0000000000000000000000000000000000000005739718509874450722276514189516525055697225162842123281183600
34 bytes (272 bits): 0.0000000000000000000000000000000000000000022420775429197073133502632535983656519454594026157465249600
35 bytes (280 bits): 0.0000000000000000000000000000000000000000000087581154020301066928862714348183649721268304345914487000
36 bytes (288 bits): 0.0000000000000000000000000000000000000000000000342113882891801042684266036993809630797496883272904900
37 bytes (296 bits): 0.0000000000000000000000000000000000000000000000001336382355046097823016484384004566655778216477125900
38 bytes (304 bits): 0.0000000000000000000000000000000000000000000000000005220243574398819621132515461399494807301758736400
39 bytes (312 bits): 0.0000000000000000000000000000000000000000000000000000020391576462495389144755614630195814541560310100
40 bytes (320 bits): 0.0000000000000000000000000000000000000000000000000000000079654595556622613846940286552600807168740500
41 bytes (328 bits): 0.0000000000000000000000000000000000000000000000000000000000311150763893057085342114209801516894280100
42 bytes (336 bits): 0.0000000000000000000000000000000000000000000000000000000000001215432671457254239609298286726606327700
43 bytes (344 bits): 0.0000000000000000000000000000000000000000000000000000000000000004747783872879899373468521962698428000
44 bytes (352 bits): 0.0000000000000000000000000000000000000000000000000000000000000000018546030753437106927740795504373200
45 bytes (360 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000072445432630613698935821187794700
46 bytes (368 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000282989971213334761469181164800
47 bytes (376 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000001105429575052088912010822600
48 bytes (384 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000004318084277547222312489700
49 bytes (392 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000016867516709168837158600
50 bytes (400 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000065888737145190770100
51 bytes (408 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000257377879473401400
52 bytes (416 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000001005382341693000
53 bytes (424 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000003927274772200
54 bytes (432 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000015340917100
55 bytes (440 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000059925500
56 bytes (448 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000234100
57 bytes (456 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000900
58 bytes (464 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
59 bytes (472 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
60 bytes (480 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
61 bytes (488 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
62 bytes (496 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
63 bytes (504 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
64 bytes (512 bits): 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

so, according to my calculations, there's a 0.0000000000000000000000000000000000001% chance of a birthday paradox collision with a 32 bytes (256bit) cryptographically secure hash (aka blake3) on a maxed-out 64bit filesystem, i think that's sufficient, the hash length is not a problem. (even if we went down to a 20 bytes/160 bit hash, there would only be a 0.00000001% chance)

SeeSpotRun commented 3 years ago

the hash length is not a problem

True, and pointed out in the manual. But it's still a backwards step from the current (admittedly overkill) position.

first off, blake3 is variable-length

Thanks, I'd missed that. So it looks like we can get a 512-bit BLAKE3 output for essentially the same computational cost as 256-bit BLAKE3.

divinity76 commented 3 years ago

admittedly overkill

to put it in age-of-the-universe terms, if you hashed a maxed-out 64bit-inode filesystem every millisecond, you would need about 780 quadrillion times the age of the universe to have a 50% chance of finding a single collision in a single filesystem at 256 bits. the only way it's ever going to happen is if there's a severe weakness in the algorithm.

Thanks, I'd missed that. So it looks like we can get a 512-bit BLAKE3 output for essentially the same computational cost as 256-bit BLAKE3.

yeah pretty much (i believe it does an extra pseudo-round for every multiple of 16 bytes requested, so it's somewhat slower to request 33-48 bytes, and again a bit slower to request 49-64 bytes, and so on)

SeeSpotRun commented 3 years ago

780 quadrillion times the age of the universe

Good context.

Will plan to move default hash to either BLAKE3 or BLAKE3_512 at the next major point release.

SeeSpotRun commented 3 years ago

Small update: with AVX and SSE extensions enabled, I'm seeing additional time reductions (when not IO-bound) of around 30%.