sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

--hash-unmatched seems to scan the whole dataset, like --hash-uniques #614

Open intelfx opened 1 year ago

intelfx commented 1 year ago

rmlint version

dataset

I have a 30-something TB dataset that consists of ~20 TB of uniques and ~11 TB of size-twins (files whose byte size is shared with at least one other file):

$ du -hs /mnt/data
32T     /mnt/data

$ find /mnt/data -type f -printf '%s\n' | sort | uniq -c | awk -c '
function bscalc(_in) { "bscalc -H " _in | getline _out; return _out; }
$1 == 1 { nr_uniqs += $1; size_uniqs += $1 * $2; }
$1 != 1 { nr_twins += $1; size_twins += $1 * $2; }
END { 
  printf "Uniques: total %d size %s\n", nr_uniqs, bscalc(size_uniqs);
  printf "Twins: total %d size %s\n", nr_twins, bscalc(size_twins);
}'
Uniques: total 202799 size 19.76 TiB
Twins: total 3074218 size 11.78 TiB

actual behavior

Baseline rmlint invocation without --hash-unmatched (ignore --without-fiemap, which is only there to speed up preprocessing; progress bars have been trimmed from the output):

$ rmlint -T df,dd -j --progress -o pretty -c sh:handler=clone --hidden --without-fiemap /mnt/data
Traversing (3276566 usable files / 0 + 0 ignored files / folders)
Preprocessing (reduces files to 3034739 / found 33504 other lint)
Matching (100 dupes of 63 originals; 12058,91 GB to scan in 3067241 files, ETA:  7d 14h 55m 44s)
^C

Control rmlint invocation with --hash-uniques:

$ rmlint -T df,dd -j --progress -o pretty -c sh:handler=clone --hidden --without-fiemap --hash-uniques /mnt/data
Traversing (3276566 usable files / 0 + 0 ignored files / folders)
Preprocessing (reduces files to 3237535 / found 33504 other lint)
Matching (7 dupes of 7 originals; 32301,25 GB to scan in 3270955 files, ETA: 108d  8h 40m 45s)
^C

Now, --hash-unmatched:

$ rmlint -T df,dd -j --progress -o pretty -c sh:handler=clone --hidden --without-fiemap --hash-unmatched /mnt/data
Traversing (3276566 usable files / 0 + 0 ignored files / folders)
Preprocessing (reduces files to 3237535 / found 33504 other lint)
Matching (7 dupes of 7 originals; 32301,25 GB to scan in 3270955 files, ETA: 120d  9h 31m 56s)
^C

expected behavior

Isn't --hash-unmatched supposed to scan only the size-twins (i.e. at most ~12 TB)? Instead it schedules exactly the same workload as --hash-uniques: 32301,25 GB across 3270955 files.
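
For reference, here is the per-size-group decision I would expect. This is a minimal sketch with hypothetical names (SizeGroup, Cfg and should_hash_group are mine), not rmlint's actual shredder code:

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t num_files;        /* files sharing this exact byte size */
} SizeGroup;

typedef struct {
    bool hash_uniques;       /* --hash-uniques: hash everything */
    bool hash_unmatched;     /* --hash-unmatched: hash size-twins only */
} Cfg;

/* Should the files in this size group be hashed at all? */
static bool should_hash_group(const SizeGroup *group, const Cfg *cfg) {
    if (group->num_files > 1) {
        /* Potential duplicates are always hashed; --hash-unmatched merely
         * keeps the checksums of files that end up without a match. */
        return true;
    }
    /* A single-file group can never contain a duplicate, so only
     * --hash-uniques should pull it in. On this dataset that is the
     * difference between ~12 TB and ~32 TB of hashing. */
    return cfg->hash_uniques;
}

In other words, under this reading --hash-unmatched should never grow the hashing workload beyond the size-twin set.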

intelfx commented 1 year ago

I can make --hash-unmatched do what it says on the tin with this code, but it feels hacky:

https://github.com/sahib/rmlint/blob/675089dee9453134d2347ef00222f5f6d1f30979/lib/shredder.c#L839-L842

I wonder if there is something else subtly wrong in the code.


It appears that when --hash-unmatched is used in an unmodified rmlint, this condition is responsible for hashing all the single-file groups:

https://github.com/sahib/rmlint/blob/675089dee9453134d2347ef00222f5f6d1f30979/lib/shredder.c#L855-L859

Could someone please explain what exactly is being done here, and what the idea behind this special case is?
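
My (quite possibly mistaken) reading is that, once --hash-unmatched is set, the condition degenerates into --hash-uniques and queues single-file groups too. A deliberately simplified rendering of that shape, reusing the hypothetical types from the sketch above and not the actual code behind the link:

/* Illustrative only: if the start-hashing test treats --hash-unmatched
 * the same as --hash-uniques, every unique-size file is queued as well,
 * which would explain the identical 32301,25 GB estimates above. */
static bool buggy_start_hashing(const SizeGroup *group, const Cfg *cfg) {
    return group->num_files > 1
        || cfg->hash_uniques
        || cfg->hash_unmatched;   /* suspect term */
}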

intelfx commented 1 year ago

Disregard the comment above (the suggested fix is wrong); see the proper analysis in the linked PR.