sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

Feedback, showdown against 3 other tools #641

Open Sanmayce opened 7 months ago

Sanmayce commented 7 months ago

Hi @sahib, could you explain why your superfast tool reports a different duplicate count than the other tools? All contenders are downloadable from GitHub.

This scriptlet (attached) shows the differences between 'rmlint' and 'DIFFTREE' on the latest Linux kernel tree. Bottom line: rmlint gives 26+391=417 duplicates, whereas my script gives 434. What causes the discrepancy? My email: sanmayce@sanmayce.com

First, it is good to run more such tools, the more the merrier. The tool below scans only files of 1 byte or larger, while the tree contains 26 zero-byte files (see further below), which amount to 25 duplicates. Its reported 409 therefore becomes 409+25=434 duplicates in the end, so DIFFTREE is closer to the right count.

[root@djudjeto2 tree_bench]# echo 3 > /proc/sys/vm/drop_caches
[root@djudjeto2 tree_bench]# ./linux_czkawka_cli dup -m 1 -d TreeUnderDeduplication/
Results of searching ["/home/sanmayce/WorkTemp/tree_bench/TreeUnderDeduplication"] with excluded directories [] and excluded items []
-------------------------------------------------Files with same hashes-------------------------------------------------
Found 409 duplicated files which in 274 groups which takes 2.06 MiB.
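
The zero-byte arithmetic can be cross-checked independently of the four tools. What follows is a minimal sketch, not the method used by DIFFTREE or any other contender: the function name `count_dups` and its `nonempty` mode are hypothetical, and SHA-256 via coreutils `sha256sum` is a stand-in hash picked only because it ships with Fedora. A duplicate is counted as every file in a same-hash group beyond the first, so 26 empty files would contribute 25.

```shell
# count_dups DIR [nonempty]
# Prints "<duplicates> <groups>" for regular files under DIR.
# Assumptions: sha256sum is installed, file names contain no newlines or
# backslashes, and hard links are counted as ordinary duplicates.
count_dups() {
    dir=$1
    mode=${2:-all}
    if [ "$mode" = nonempty ]; then
        # Skip zero-byte files, similar in spirit to a 1-byte minimum size.
        find "$dir" -type f -size +0c -exec sha256sum {} +
    else
        find "$dir" -type f -exec sha256sum {} +
    fi | awk '
        { n[$1]++ }                      # files seen per hash
        END {
            for (h in n) if (n[h] > 1) { d += n[h] - 1; g++ }
            print d + 0, g + 0           # duplicates, groups
        }'
}
```

Running `count_dups TreeUnderDeduplication/` and `count_dups TreeUnderDeduplication/ nonempty` should differ by exactly the empty-file contribution (25 duplicates in one group, if the tree really holds 26 zero-byte files).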

Testdataset: linux-6.6.1 tree (untarred archive to TreeUnderDeduplication/)
OS: Fedora release 38 (Thirty Eight) x86_64
Host: 20LRS04700 ThinkPad 11e 5th Gen
Kernel: 6.2.12-300.fc38.x86_64
CPU: Intel Celeron N4100 (4) @ 2.400GHz
SSD: nvme Transcend 1TB bufferless
Filesystem: ext4

+---------------------------+-------------------------+------------------+------------------+
| Deduplicator              |                    Time | Memory Footprint | Duplicates Found |
+---------------------------+-------------------------+------------------+------------------+
| fclones v.0.34.0          |                  6.60 s |        26,384 KB |              384 |
| linux_czkawka_cli v.6.1.0 |                  7.69 s |       118,448 KB |              434 |
| rmlint v.2.10.1           |                 11.95 s |        61,952 KB |              391 |
| DIFFTREE r.4++            | 1*60*60+49*60+51=6591 s |        88,768 KB |              434 |
+---------------------------+-------------------------+------------------+------------------+
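
GNU time's `-v` report prints elapsed wall-clock time as `h:mm:ss` or `m:ss`, which is presumably where the `1*60*60+49*60+51` in the DIFFTREE row comes from. A small sketch of that conversion follows; `hms_to_seconds` is a hypothetical helper name and is not part of SpeedShowdown.sh.

```shell
# hms_to_seconds H:MM:SS | M:SS[.ff]
# Folds colon-separated fields left to right: s = s * 60 + field.
hms_to_seconds() {
    printf '%s\n' "$1" |
        awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}
```

For example, `hms_to_seconds 1:49:51` reproduces the 6591 s shown in the table.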

The actual scriptlet in use:

# echo 3 > /proc/sys/vm/drop_caches
# /bin/time -v ./DIFFTREE_BLAKE3_r4++.sh TreeUnderDeduplication/
# echo 3 > /proc/sys/vm/drop_caches
# /bin/time -v rmlint TreeUnderDeduplication/
# echo 3 > /proc/sys/vm/drop_caches
# /bin/time -v ./linux_czkawka_cli dup -m 1 -d TreeUnderDeduplication/
# echo 3 > /proc/sys/vm/drop_caches
# /bin/time -v ./fclones-0.34.0-linux-musl-x86_64 group TreeUnderDeduplication/

The full script 'SpeedShowdown.sh' is attached. SpeedShowdown.sh.tar.gz