pkolaczk / fclones

Efficient Duplicate File Finder
MIT License

add crypto hash and by byte content checking - enhancement suggestion #153

Closed kapitainsky closed 1 year ago

kapitainsky commented 1 year ago

Overall I think the MetroHash you use is fine, but some of us are more paranoid than others, and sometimes there are regulatory requirements we have to follow.

What about making it optional to use e.g. SHA512 (not SHA256, as SHA512 is faster on 64-bit CPUs) and, for the most paranoid, a simple byte-by-byte content compare?

ATM the fclones defaults are perfectly reasonable, but some of us would be happy to have an even bigger choice.

kapitainsky commented 1 year ago

I have been thinking and reading a bit about hashing and I would like to argue here that fclones should allow other hash options.

  1. Metro hash

It works and it is fast, but its biggest weakness is that it is a one-person creation without easily available peer review. Google it and there is very little out there to support the author's claims. That does not matter much for the prefix/suffix matching phases (even CRC would do) - it is the perfect place for a very fast hash - but not for the final full-file hash. To be honest, this is the single biggest objection I have to using fclones. I am not questioning MetroHash's quality, as I am not qualified to; it just looks like not many qualified people have investigated it.

I have also noticed chatter on various forums that this is the one thing many people question about fclones. People want to trust proven and tested technology.

  2. Cryptographic hash speed in the context of deduplication

I was curious what the real-life speed of various established hashes is, so I ran tests on my computer (2019 Intel CPU laptop).

single thread:

sha1       - 770 MB/sec
md5        - 630 MB/sec
sha512-256 - 625 MB/sec
sha512     - 625 MB/sec
blake2s256 - 450 MB/sec
sha256     - 420 MB/sec
sha3-256   - 360 MB/sec

Indeed, they are much slower than fast non-crypto hashes. I tried e.g. xxh128 and it was around 5000 MB/sec. I do not have a metro implementation to hand, but I guess it is in the same range.

Among them I think the most suitable candidate would be sha512-256. sha1 and md5 are broken and collisions can be created; the others are slower. Full sha512 output would dramatically increase memory usage for mathematically insignificant benefit.

Good explanation in this paper:

https://eprint.iacr.org/2010/548.pdf

Key message from it:

"From our performance analysis, and experimentation on 64 bit Intel Architecture, we have shown that the cost of implementing a SHA-512 algorithm delivers a 50% performance improvement over similar implementations of SHA-256. We also showed that the storage costs for implementing SHA-512 can be reduced by adding a small amount of one-off computation to compute the SHA-512 constants - which we believe will be useful for constrained implementation environment."

And truncated SHA512 is already implemented in Rust:

https://docs.rs/sha2/0.4.1/sha2/struct.Sha512Trunc256.html
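For illustration only - a rough sketch of mine, not fclones code - streaming a file through SHA-512/256 with the sha2 crate could look roughly like this (note that in newer sha2 releases the type is named Sha512_256 rather than Sha512Trunc256):

use std::fs::File;
use std::io::Read;
use sha2::{Digest, Sha512_256};

// Stream a file through SHA-512/256 and return the 32-byte digest as hex.
fn hash_file(path: &str) -> std::io::Result<String> {
    let mut file = File::open(path)?;
    let mut hasher = Sha512_256::new();
    let mut buf = [0u8; 64 * 1024]; // 64 KiB read buffer
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finalize().iter().map(|b| format!("{:02x}", b)).collect())
}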

My computer has an 8-core CPU, so I also tested multithreading:

8 threads sha512-256 - 4000 MB/sec

To put these numbers into the context of deduplication: a single thread is faster than any HDD can deliver and fast enough to crunch data from a NAS over a 10G network, while multithreaded hashing is faster than my NVMe SSD. A 256-bit hash would make the hash index table grow and increase the time of both the hash-generation and matching stages, but it would not be a big penalty. Slower than metro? Yes, but I think not by much on any modern computer doing file deduplication.

I am only thinking here about the last deduplication step - the earlier stages can use whatever is fastest.

  3. Byte-by-byte content checking

This is not really needed when a good hash is used (except for paranoid people who do not trust the math).

  4. Competition

Other de-duplicators offer options, e.g.:

rmlint - blake2b by default, but allows highway256, metro256, metro, or byte-by-byte
yadf   - AHash by default, but also offers Highway, MetroHash, SeaHash, XxHash
rdfind - sha1 by default, but also offers md5, sha256

  5. Summary

An optional hash for the final-stage file checking is in my opinion the most desirable missing fclones feature. Thanks to its robust multithreading implementation, fclones is clearly the fastest de-duplicator out there. That would also keep the impact of much slower hashes on deduplication speed minimal, since on modern hardware the bottleneck is still mostly disk I/O. An alternative-hash option would, in my opinion, convince many doubters and make fclones much more popular.

IMHO this is low-hanging fruit for a significant functionality improvement. People would have the option to balance their own requirements.

kapitainsky commented 1 year ago

Tested fclones vs. others on a 100 GB folder with many duplicates - it is actually the same 50 GB dataset twice, which puts stress on the grouping-by-contents phase, as all data has to be checked by reading full file contents.

Only finding duplicates without any action (fclones group style):

fclones (metro, multi-thread)        - 42s
rdfind (sha1, single thread)         - 4m25s
jdupes (byte-by-byte, single thread) - 1m2s

My thoughts:

Disk I/O is the bottleneck for this duplicate-checking phase - all programs have to read all files and compare them, either using a hash or directly. Interestingly, byte-by-byte comparison is not much slower than a fast hash, but in this case only because all duplicates are doubles. Bigger groups would make hash comparison faster; e.g., when I quadruple the duplicates, the advantage of hash comparison becomes more apparent:

fclones (metro, multi-thread)        - 1m32s
rdfind (sha1, single thread)         - 9m2s
jdupes (byte-by-byte, single thread) - 3m2s

jdupes gets slower, while fclones and rdfind scale nicely.

Calculating crypto hashes has a cost, but it should not make much difference if multithreading is utilised - fclones was 6x faster than rdfind, and on my machine multithreaded sha512-256 is 6x faster than single-threaded - meaning the end result would be similar with a much stronger hash function.
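To illustrate what I mean by utilising multithreading, here is a rough sketch (assuming the rayon and sha2 crates - this is not how fclones actually schedules its work): hashing candidate files in parallel multiplies single-thread SHA-512/256 throughput by the number of worker threads.

use rayon::prelude::*;
use sha2::{Digest, Sha512_256};

// Hash each candidate file on its own rayon worker thread;
// files that cannot be read are simply skipped in this sketch.
fn hash_candidates(paths: &[&str]) -> Vec<(String, Vec<u8>)> {
    paths
        .par_iter()
        .filter_map(|p| {
            let data = std::fs::read(p).ok()?;
            Some((p.to_string(), Sha512_256::digest(&data).to_vec()))
        })
        .collect()
}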

The fclones metro hash does not provide an advantage once multithreading is used - which raises many questions. fclones' strength is its multithreading: it makes looking for duplicates in a massive dataset much faster (this test dataset is small, so the advantage is not highlighted). You can see it in your NVMe SSD benchmark where 1.5M files are used.

kapitainsky commented 1 year ago

Surprise, surprise - yadf leaves everybody behind: for quadrupled duplicates it finishes in 43s, 2x faster than fclones (for doubles it is the same). But it is only a proof of concept, and they advise fclones for real jobs.

kapitainsky commented 1 year ago

And in addition, a SHA512-256 test on the latest MacBook Air (M2 ARM CPU):

single thread 1400 MB/sec

This is only to highlight that hash processing speed, compared to disk I/O, is becoming increasingly irrelevant.

pkolaczk commented 1 year ago

Yes, I agree adding more hashing algorithms is a good idea, in particular adding support for well-known cryptographic hashes. MetroHash was chosen not because it was the fastest, but because it was fast enough and I could easily compile it on most platforms, including ARM, without too much hassle with conditional compilation. Earlier I used one of the hashes from the fasthash crate, but it turned out to have some portability issues (see: https://github.com/pkolaczk/fclones/issues/3).

As far as performance is concerned, hashing performance does not matter much for wall-clock time, and I agree with your conclusion that I/O is what affects the run time the most. However, there is a bit more to efficiency than just the amount of time you need to wait. I actually don't want to load the CPU too much while searching for dupes (particularly on a fast SSD), because that would render my machine useless for as long as the dupe search is running. This is what Ubuntu's dejadup does whenever I'm backing up the system, and I literally hate it (yup, somehow it consumes all my CPU, not so much I/O, fans go crazy, etc.).

To summarize, there are two things here, and both would be really good to implement:

reconsidering some other more mature hash to be used as the default
adding more hash choices, to allow the user to explore the tradeoffs by themselves

BTW: I'm curious how you got yadf to be so much faster. Can you share your repro steps? This is a surprise, because in my testing yadf was actually quite a bit slower than fclones, so maybe there is some edge case I missed. Did you evict the page cache before each run to make the comparison fair?

kapitainsky commented 1 year ago

BTW: I'm curious how you got yadf to be so much faster. Can you share your repro steps? This is a surprise, because in my testing yadf was actually quite a bit slower than fclones, so maybe there is some edge case I missed. Did you evict the page cache before each run to make the comparison fair?

I use a 50 GB folder with 24k files - a mix of small and big. As I was interested in the speed of the last deduplication step, I made two or four copies of this folder, creating 24k duplicate groups and forcing the programs to read all content.

In your tests (https://github.com/pkolaczk/fclones#ssd-benchmark) yadf is actually second fastest on SSD and much slower on HDD.

In yadf's own tests (https://github.com/jRimbault/yadf#benchmarks) yadf is faster with a warm cache and marginally second fastest with a cold cache - they only test on SSD and explicitly say: "I recommend fclones as it has more hardware heuristics. and in general more features. yadf on HDDs is terrible."

In my tests yadf is significantly faster than fclones for both warm and cold cache (I tested both; sudo purge on macOS to clear the cache).

Why? No idea. Maybe it scales better with the number of cores? You used a 4-core CPU, the yadf tests used 6 cores, and I used 8 cores. Also, I use macOS while the two other tests were run on Ubuntu. We all most likely used different SSDs. Too many factors to say for sure why :)

Also, I tested the latest versions. In your tests yadf was v0.13, while now it is v1.0.

kapitainsky commented 1 year ago

As far as performance is concerned, hashing performance does not matter much for wall-clock time, and I agree with your conclusion that I/O is what affects the run time the most. However, there is a bit more to efficiency than just the amount of time you need to wait. I actually don't want to load the CPU too much while searching for dupes (particularly on a fast SSD), because that would render my machine useless for as long as the dupe search is running. This is what Ubuntu's dejadup does whenever I'm backing up the system, and I literally hate it (yup, somehow it consumes all my CPU, not so much I/O, fans go crazy, etc.).

Agreed regarding an application's impact on system performance. But clearly dejadup is not written well - a backup process should not consume all resources when that is absolutely not needed for such a task. In general it is something that should be optional - run the program at full speed by default, but give the user the option to start it at lower priority. I do this with my own backups - they run during the day at very low priority, but the late-night run uses all available resources and does extra tasks like verification.

kapitainsky commented 1 year ago

To summarize, there are two things here, and both would be really good to implement:

reconsidering some other more mature hash to be used as the default
adding more hash choices, to allow the user to explore the tradeoffs by themselves

I would not change the default - definitely not for the prefix and suffix steps. Metro seems to be a good tradeoff - especially in light of your comment on overall system performance.

EDIT: unless BLAKE3 provides similarly good tradeoffs

For the second point: as soon as the option is available, e.g. for hashes with 256-bit output, it should be super easy to add some reasonable choices:

SHA512-256 - the fastest on 64-bit architectures
SHA-256 - the fastest on 32-bit architectures
BLAKE2s-256, SHA3-256 - for people who believe the NSA is spying on them

Also, there is a "new" kid on the block - BLAKE3 - which could maybe be an alternative to metro for the default full-content hashing? It is of course slower, as it is designed as a crypto hash, but a quick test and read tell me it is more than fast enough for deduplication.

https://github.com/BLAKE3-team/BLAKE3 https://crates.io/crates/blake3
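For reference, a minimal sketch of the crate API (my example, not fclones code); blake3::hash is the single-threaded path, and the crate also exposes Hasher::update_rayon behind its "rayon" feature, which as far as I understand is what gives b3sum its multithreaded speed on a single big file:

// Hash a whole file with BLAKE3 (256-bit output) and return the hex digest.
// Reading the whole file into memory is fine for a test; a real tool would stream.
fn blake3_digest(path: &str) -> std::io::Result<String> {
    let data = std::fs::read(path)?;
    Ok(blake3::hash(&data).to_hex().to_string())
}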

It was added to OpenZFS last year for file hashes, and they are very conservative about adding anything unproven - that alone is a good recommendation.

On my machine it is more than 2x faster than SHA512-256 on a single thread: b3sum --num-threads 1 10GBtestFile - 1400 MB/sec

Using the built-in multithreading: b3sum 10GBtestFile - 10,800 MB/sec

And hashing 8 files using 8 cores (b3sum single-threaded) - 10,000 MB/sec:

b3sum --num-threads 1 1GBtestFile &
b3sum --num-threads 1 1GBtestFile2 &
b3sum --num-threads 1 1GBtestFile3 &
b3sum --num-threads 1 1GBtestFile4 &
b3sum --num-threads 1 1GBtestFile5 &
b3sum --num-threads 1 1GBtestFile6 &
b3sum --num-threads 1 1GBtestFile7 &
b3sum --num-threads 1 1GBtestFile8 &
wait

I ran the test multiple times to make sure the files were in cache - as I was interested in hash speed, not my disk speed.

These numbers mean that it would never really need 100% CPU to do its job (no fans spinning) - it is simply faster than even a fast NVMe SSD's I/O can supply.

All of them are probably already implemented in Rust. EDIT: they are already implemented.

kapitainsky commented 1 year ago

I think any performance and efficiency questions can only be answered by testing. If you add a configurable hash option I am happy to test it thoroughly.

kapitainsky commented 1 year ago

Why? No idea. Maybe it scales better with the number of cores? You used a 4-core CPU, the yadf tests used 6 cores, and I used 8 cores. Also, I use macOS while the two other tests were run on Ubuntu. We all most likely used different SSDs. Too many factors to say for sure why :)

Also, I tested the latest versions. In your tests yadf was v0.13, while now it is v1.0.

Actually, I have an idea. I looked at htop, because yadf is faster in any test I try... yadf is using all my 8 cores vs fclones only using 4, which explains why it is about 2x faster.

It makes much better use of multiple cores:

(screenshots: yadf vs fclones CPU core usage)

With fclones I can see that all 8 cores are used during the initial stages - but then only 4 when the content hashes are calculated.

pkolaczk commented 1 year ago

What happens if you force the number of threads to 8 with -t 8?

pkolaczk commented 1 year ago

Anyway I think this is worth opening a separate issue.

kapitainsky commented 1 year ago

The good thing is that with proper parallelism and proper hashes fclones can become the best.

kapitainsky commented 1 year ago

For the new hash algorithms I can also test on ARM - RPi and Apple M2 - so no surprises when released.

pkolaczk commented 1 year ago

Hey, I got an hour today and added more hashes :) I tested it only very quickly - blake3 is just as fast as metro on my laptop, but the SHA family is visibly slower, and you were right that sha256 is slower than sha512.

kapitainsky commented 1 year ago

SHA256 will be faster if somebody runs a 32-bit CPU - the blake3 crate is really well done, with Intel and ARM optimizations.

BTW - this is really great news!!!! I am going to test it later today

kapitainsky commented 1 year ago

Could not wait - blake3 is a bit slower than metro for me - memory usage is up (about 2.5x), but it is a 256-bit hash so no issue.

I tested it on 220 GB of duplicates:

[2022-09-10 18:24:59.815] fclones: info: Found 75449 (220.6 GB) redundant files

metro  - 75s
blake3 - 80s
sha512 - 100s
sha256 - 190s

kapitainsky commented 1 year ago

In some ways it is amazing that 220 GB can be read and processed in a little over a minute... computers are crazy fast nowadays.

kapitainsky commented 1 year ago

It is a great milestone! Now we have a choice.

Thank you again for this great project.

kapitainsky commented 1 year ago

One thing:

you are using Sha512_256 but:

# fclones group --help

        --hash-fn <hash-fn>
            Hash function [default: metro128]  [possible values: metro128, blake3, sha256,
            sha512]

calls it sha512 (which might be misleading)

sha512 and sha512-256 are not the same

kapitainsky commented 1 year ago

And with proper crypto hashes it is still much faster than jdupes :)

jdupes --summarize -r -Q .

takes 190s for my test data

BTW - the competition has noticed fclones :)

https://www.jodybruchon.com/2020/07/11/a-challenger-appears-fclones-fastest-duplicate-scanner-ever-its-complicated/

kapitainsky commented 1 year ago

One thing:

you are using Sha512_256 but:

# fclones group --help

        --hash-fn <hash-fn>
            Hash function [default: metro128]  [possible values: metro128, blake3, sha256,
            sha512]

calls it sha512 (which might be misleading)

sha512 and sha512-256 are not the same

sha512 has a 512-bit output; sha512-256 has its output truncated to 256 bits - but it is not as simple as taking half of the sha512 output. To maintain the required mathematical properties it is calculated in a special way, with different initial values.

Both are from the same family of hashes - SHA-2. They are very close relatives, but not the same. Their speed is the same; the difference is the memory required to juggle 512-bit vs 256-bit hashes. As per the paper I linked, sha512-256 is, as of today, a very good practical balance. 256 bits for file dedup is super safe.
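This is easy to check with a few lines against the sha2 crate (again just my sketch): the SHA-512/256 digest does not even share a prefix with the plain SHA-512 digest of the same input, because the two start from different initial values.

use sha2::{Digest, Sha512, Sha512_256};

fn main() {
    let data = b"fclones";
    let sha512 = Sha512::digest(data);         // 64-byte digest
    let sha512_256 = Sha512_256::digest(data); // 32-byte digest
    // Truncating SHA-512 to 256 bits does NOT give the SHA-512/256 digest.
    assert_ne!(&sha512[..32], &sha512_256[..]);
}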

It is very frustrating to see so much software using sha256 because the authors assume sha512 is always slower. And 32-bit hardware is a niche.

kapitainsky commented 1 year ago

I have also tested on 32-bit ARM with Debian - this is usually where problems appear with a lot of modern software :) It compiles OK and all hashes work.

kapitainsky commented 1 year ago

Another observation - at least on my computer blake3 is super fast, but when run on 8 cores it makes my computer very slow.

The good thing is that, thanks to the options available in fclones, when I run it as ssd:8,6 all is fine... and deduplication is only marginally slower.

Maybe in the future the default thread strategy could be better - as I said, new Mac computers now have up to 20 cores. In such cases the I/O bottleneck makes it pointless to use all of them, even with their insanely fast SSDs - 7 GB/sec transfers.

Anyway, it is a good problem to have.

kapitainsky commented 1 year ago

It might be worth mentioning in the README that some hashes are computationally more intensive, and the user should plan accordingly.

Again, thanks to the fclones options this is easy, even when ploughing through TBs of data:

I can use --cache and a lower number of threads during the day - then Ctrl-C before going to bed and change the strategy.

kapitainsky commented 1 year ago

I have also tested on 32-bit ARM with Debian - this is usually where problems appear with a lot of modern software :) It compiles OK and all hashes work.

Some speed results:

metro128   - 32s
blake3     - 41s
sha512-256 - 75s
sha256     - 54s

As expected, on a 32-bit CPU sha256 is faster than sha512.

blake3 is doing very well for such small hardware.

kapitainsky commented 1 year ago

A last comment: for the sake of making the available set of hashes complete, I would also include sha3-256.

https://en.wikipedia.org/wiki/SHA-3 https://docs.rs/sha3/0.7.3/sha3/struct.Sha3_256.html

Speed is very hardware-dependent - probably slower on our laptops today, but things are changing quickly.

It is a standard some people might have to adhere to.

pkolaczk commented 1 year ago

sha512 has a 512-bit output; sha512-256 has its output truncated to 256 bits - but it is not as simple as taking half of the sha512 output.

Good catch, indeed. For now I switched the code to use sha512, because the speed is just the same as sha512/256. I'm actually truncating the output to 128 bits anyway, so this is still not finished. I need to change the FileHash representation to allow hashes of varying sizes.
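Roughly, I'm thinking of something along these lines (just a sketch, the final representation may well end up different):

// Hypothetical variable-length hash value - length depends on the chosen hash function.
#[derive(Clone, PartialEq, Eq, Hash, PartialOrd, Ord, Debug)]
struct FileHash(Box<[u8]>);

impl FileHash {
    fn from_digest(digest: &[u8]) -> Self {
        FileHash(digest.to_vec().into_boxed_slice())
    }
}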

kapitainsky commented 1 year ago

Good catch, indeed. For now I switched the code to use sha512, because the speed is just the same as sha512/256. I'm actually truncating the output to 128 bits anyway, so this is still not finished. I need to change the FileHash representation to allow hashes of varying sizes.

Awesome - so in the future it will be easy to add any new hash regardless of its output size. But it is indeed important to use the full output, not a truncated part.

kapitainsky commented 1 year ago

I think I would keep sha512 - it is a bit of a paranoid hash (it is 512 bits long) given the tradeoffs of speed and memory - but maybe somebody likes it. In the end it is optional now.

kapitainsky commented 1 year ago

I have tested the latest commit from the more-hashes branch, using the full hash output length, and so far all works for me. I worried about peak memory usage for sha512, but the increase, at least for me, is minimal.

Below are results for my dataset of 100k files / 260 GB, of which 75k files / 220 GB are duplicates.

hash - max mem - time

====

metro128 - 55 MB - 77s
blake3   - 61 MB - 82s
sha512   - 63 MB - 144s
sha256   - 58 MB - 256s

The previous version of fclones (without the hash choice) also finishes in about 77s - so there is no impact on existing performance.

kapitainsky commented 1 year ago

FYI - bug:

# fclones group . 

or 

# fclones group . --hash-fn metro128

Invalid value for '--hash-fn <hash-fn>': Unknown hash algorithm: metro128

kapitainsky commented 1 year ago

A correction for the last documentation commit:

blake3 is a 256-bit hash, not 128.

kapitainsky commented 1 year ago

Also, for consistency, maybe add some links:

to sha256 and sha512 - https://en.wikipedia.org/wiki/SHA-2

and sha3-256 and sha3-512 - https://en.wikipedia.org/wiki/SHA-3

pkolaczk commented 1 year ago

Corrected. Is everything ok now? I think it's time to merge it.

kapitainsky commented 1 year ago

I did tests of the last commits - all works for me - 64-bit Intel CPU / 8 cores on macOS and 32-bit ARM on Debian:

Some tests (Intel/macOS only) - it is my dataset and my setup, skewed heavily towards hashing speed rather than deduplication:

time for the group run, max memory usage, approximate hashing speed

metro       -   82 sec,     60 MB       -   2,988 MB/sec
xxhash3     -   83 sec,     58 MB       -   2,952 MB/sec
blake3      -   80 sec,     56 MB       -   3,063 MB/sec
sha-512     -   143 sec,    56 MB       -   1,713 MB/sec
sha-256     -   253 sec,    59 MB       -   968 MB/sec
sha3-512    -   354 sec,    60 MB       -   692 MB/sec
sha3-256    -   195 sec,    53 MB       -   1,256 MB/sec

some other similar software:

jdupes          -   384 sec,    23 MB (byte-by-byte)    -   638 MB/sec
jdupes -Q       -   200 sec,    22 MB (xxhash64)        -   1,225 MB/sec
yadf            -   80 sec,     33 MB (ahash)           -   3,063 MB/sec
rdfind          -   977 sec,    38 MB (sha1)            -   251 MB/sec
rmlint          -   237 sec,    150 MB (blake2b)        -   1,034 MB/sec
czkawka         -   too complicated to install on macOS, can't be bothered

and fclones single-threaded:

metro       -   181 sec,    56 MB (-t 1)        -   1,340 MB/sec
blake3      -   277 sec,    48 MB (-t 1)        -   884 MB/sec
sha-512     -   1140 sec,   56 MB (-t 1)        -   215 MB/sec

kapitainsky commented 1 year ago

It is fantastic that fclones now has options for different hashes. Speed all depends on the platform and the hash implementation; what is clear is that multithreading makes a massive difference on modern hardware.

I had not expected that adding the new hashes would require so much work - thank you.

kapitainsky commented 1 year ago

And thank you for mentioning me in your release. Very kind.