pkolaczk / fclones

Efficient Duplicate File Finder
MIT License
1.82k stars · 70 forks

Feedback, showdown against 3 other tools #251

Open Sanmayce opened 7 months ago

Sanmayce commented 7 months ago

Hi @pkolaczk, could you share why your superfast tool reports different counts than the other tools? All contenders are on GitHub, downloadable.

This scriptlet (attached) shows the differences between 'rmlint' and 'DIFFTREE' on the latest Linux kernel tree. Bottom line: the former gives 26+391=417 duplicates, whereas my script gives 434; who knows what causes the discrepancy?! My email: sanmayce@sanmayce.com

First, it is good to run several such tools, the more the merrier. Note that the tool below scans only files 1 byte or larger, while there are 26 files of 0 bytes (see further below), which amounts to 25 duplicates; adding those gives 409+25=434 duplicates, so DIFFTREE is closer to the right count.
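The zero-byte arithmetic can be sanity-checked with plain `find` in a throwaway directory (a sketch; the directory and file names here are made up, not from the kernel tree):

```shell
#!/bin/sh
# Illustration only: N identical files form one duplicate group contributing
# N-1 "duplicates" beyond the first copy, hence 26 zero-byte files -> 25 dupes.
set -e
d=$(mktemp -d)
touch "$d/a" "$d/b" "$d/c"                 # three zero-byte files
n=$(find "$d" -type f -size 0c | wc -l)    # count files of exactly 0 bytes
echo "zero-byte files: $n, extra duplicates: $((n - 1))"
rm -rf "$d"
```

A tool with a 1-byte minimum size filter never sees these files at all, which alone accounts for a difference of 25 in the totals.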

[root@djudjeto2 tree_bench]# echo 3 > /proc/sys/vm/drop_caches
[root@djudjeto2 tree_bench]# ./linux_czkawka_cli dup -m 1 -d TreeUnderDeduplication/
Results of searching ["/home/sanmayce/WorkTemp/tree_bench/TreeUnderDeduplication"] with excluded directories [] and excluded items []
-------------------------------------------------Files with same hashes-------------------------------------------------
Found 409 duplicated files which in 274 groups which takes 2.06 MiB.

Testdataset: linux-6.6.1 tree (untarred archive to TreeUnderDeduplication/) OS: Fedora release 38 (Thirty Eight) x86_64 Host: 20LRS04700 ThinkPad 11e 5th Gen Kernel: 6.2.12-300.fc38.x86_64 CPU: Intel Celeron N4100 (4) @ 2.400GHz SSD: nvme Transcend 1TB bufferless Filesystem: ext4

+---------------------------+-------------------------+------------------+------------------+
| Deduplicator              |                    Time | Memory Footprint | Duplicates Found |
+---------------------------+-------------------------+------------------+------------------+
| fclones v.0.34.0          |                  6.60 s |        26,384 KB |              384 |
| linux_czkawka_cli v.6.1.0 |                  7.69 s |       118,448 KB |              434 |
| rmlint v.2.10.1           |                 11.95 s |        61,952 KB |              391 |
| DIFFTREE r.4++            | 1*60*60+49*60+51=6591 s |        88,768 KB |              434 |
+---------------------------+-------------------------+------------------+------------------+
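The DIFFTREE wall-clock figure in the table is just hours, minutes, and seconds flattened into seconds; as a quick arithmetic check:

```shell
# 1 h 49 min 51 s expressed in seconds, matching the table entry:
echo $((1*60*60 + 49*60 + 51))   # prints 6591
```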

The actual scriptlet in use:

# echo 3 > /proc/sys/vm/drop_caches
# /bin/time -v ./DIFFTREE_BLAKE3_r4++.sh TreeUnderDeduplication/
# echo 3 > /proc/sys/vm/drop_caches
# /bin/time -v rmlint TreeUnderDeduplication/
# echo 3 > /proc/sys/vm/drop_caches
# /bin/time -v ./linux_czkawka_cli dup -m 1 -d TreeUnderDeduplication/
# echo 3 > /proc/sys/vm/drop_caches
# /bin/time -v ./fclones-0.34.0-linux-musl-x86_64 group TreeUnderDeduplication/

The full script 'SpeedShowdown.sh' is attached. SpeedShowdown.sh.tar.gz

pkolaczk commented 7 months ago

FClones doesn't scan hidden files by default; you must add the --hidden flag to make the runs equivalent. Other things to check are the settings for following links and min/max file sizes. Different tools have different defaults, so it is best to set them explicitly.
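The effect of the hidden-file default is easy to reproduce with plain `find` (a sketch using a throwaway directory; the file names are made up):

```shell
#!/bin/sh
# Dotfiles are ordinary files to the filesystem; a scanner that skips them
# by default will undercount duplicates whenever a copy hides behind a dot.
set -e
d=$(mktemp -d)
printf 'same' > "$d/.hidden"               # hidden copy
printf 'same' > "$d/visible"               # visible copy, identical content
echo "without dotfiles: $(find "$d" -type f ! -name '.*' | wc -l) file(s)"
echo "with dotfiles:    $(find "$d" -type f | wc -l) file(s)"
rm -rf "$d"
```

The first count misses the duplicate pair entirely; the second sees both copies.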

Sanmayce commented 7 months ago

Oh, after adding --hidden -s 0 the duplicate count is 433, still 1 short of the expected 434?!

pkolaczk commented 6 months ago

Maybe one of them is a hard link? Hard links are not considered duplicates by default, unless you explicitly ask for them to be counted.

Sanmayce commented 6 months ago

> Maybe one of them is a hard link? Hard links are not considered duplicates by default, unless you explicitly ask for them to be counted.

Not sure; as far as I know, this is how hard links are found, no?:

$ find -type f -links +1

Running the above in the root folder of the kernel 6.6.2 tree resulted in an empty list.
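For reference, that `find` expression does catch hard links once one actually exists; a throwaway-directory sketch (not the kernel tree):

```shell
#!/bin/sh
# A hard link shares its inode with the original, so both directory entries
# carry a link count of 2 and both match -links +1.
set -e
d=$(mktemp -d)
printf 'payload' > "$d/orig"
ln "$d/orig" "$d/alias"                    # create a hard link
find "$d" -type f -links +1 | wc -l        # prints 2: both names match
rm -rf "$d"
```

So an empty result from `find -type f -links +1` is reasonable evidence the tree contains no hard-linked regular files, and the 1-file discrepancy likely lies elsewhere.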