sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0
1.86k stars 128 forks

Reporting - This equals XXX GB of duplicates which could be removed - "incorrect" size for duplicates reported if files already hardlinked elsewhere #546

Closed james-cook closed 1 year ago

james-cook commented 2 years ago

Related e.g. to https://github.com/sahib/rmlint/issues/430

Doing a lot of "heavy duty" rmlinting, I notice that the reporting is sometimes "strange":

This equals XXXX of duplicates which could be removed

In my current case, the incorrect reporting seems to happen when the rmlint command is given two directories (Dir1 and Dir2).

Dir1 and Dir2 both contain "Dupefile1" and "Dupefile2". In addition, "Dupefile1" and "Dupefile2" in Dir1 are already hardlinked to files OUTSIDE the scope of the rmlint command (i.e. not in Dir1 or Dir2).

Overview of the directories:

Overview Dir1: almost ALL files are already hardlinked (links +1):

$ find Dir1 -type f -links +1 -printf "%s\n" | gawk -M '{t+=$1}END{print t}' | numfmt --to=si
387G
$ find Dir1 -type f -links 1 -printf "%s\n" | gawk -M '{t+=$1}END{print t}' | numfmt --to=si
110M

Overview Dir2: NO files are already hardlinked (-links +1 produces a numfmt error because its output is empty):

$ find Dir2 -type f -links +1 -printf "%s\n" | gawk -M '{t+=$1}END{print t}' | numfmt --to=si
numfmt: invalid number: ‘’
$ find Dir2 -type f -links 1 -printf "%s\n" | gawk -M '{t+=$1}END{print t}' | numfmt --to=si
48G
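One caveat when sizing hardlinked trees this way (an aside, not from the thread): `find -links +1` prints one size per *name*, so an inode hardlinked several times inside the same tree is summed several times. Deduplicating on the inode number gives the on-disk figure. A minimal sketch, using throwaway paths and assuming GNU find/awk:

```shell
#!/bin/sh
# Sketch: naive per-name sum vs. per-inode sum for hardlinked files.
set -e
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/a" bs=1M count=1 status=none
ln "$tmp/a" "$tmp/b"   # second name for the same inode

# Naive: every directory entry contributes its size.
naive=$(find "$tmp" -type f -links +1 -printf "%s\n" | awk '{t+=$1}END{print t}')
# Deduplicated: count each inode once via %i.
dedup=$(find "$tmp" -type f -links +1 -printf "%i %s\n" | sort -u | awk '{t+=$2}END{print t}')
echo "naive: $naive"   # 2097152 (the 1 MiB file counted twice)
echo "dedup: $dedup"   # 1048576 (counted once per inode)
rm -rf "$tmp"
```

So a figure like "387G of hardlinked files" can overstate the unique data if some of the links point back into the same directory tree.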

Running rmlint:

rmlint -c sh:hardlink --progress -S dma -s -1TB --keep-all-tagged Dir1 // Dir2

==> In total 515 files, whereof 49 are duplicates in 29 groups.
==> This equals 57.28 MB of duplicates which could be removed.
==> Scanning took in total  2h  2m 29.321s.

More or less reversing the command (with Dir2 first and Dir1 second) correctly reports:

rmlint -c sh:hardlink --progress -s 1K -S dma -s -1TB 'Dir2' // 'Dir1'

==> In total 515 files, whereof 49 are duplicates in 29 groups.
==> This equals 44.57 GB of duplicates which could be removed.
==> Scanning took in total  1h 12m 13.958s. 

The generated rmlint.sh correctly contains commands to hardlink (in my case) the 49 dupes. It is just the reporting "This equals 57.28 MB" in the first run that is incorrect, or at least confusing.

My impression is that you would not get the correct report at all without the "//" source/target tagging.

cebtenzzre commented 2 years ago

The default rank criteria "pOma" tries to preserve files with more external hardlinks, but other than that rmlint doesn't concern itself with external link counts. --keep-all-tagged is working as designed: it will never modify Dir2 (replace files with hardlinks) to free up space. Maybe you want to tag Dir1 instead? To correctly reverse the command you would use --keep-all-untagged, which should report the same as the first run. The report looks correct to me for both commands you ran.
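If the goal is to free space by replacing the copies in Dir2 with hardlinks into Dir1, the tagging would look roughly like this (a sketch based on the suggestion above, not a verified command line):

```shell
# Keep everything in the tagged Dir1 (after //); the generated rmlint.sh
# then replaces the untagged copies in Dir2 with hardlinks to Dir1's files.
rmlint -c sh:hardlink --progress --keep-all-tagged Dir2 // Dir1

# Reversed equivalent: keep the untagged side instead.
rmlint -c sh:hardlink --progress --keep-all-untagged Dir1 // Dir2
```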

james-cook commented 2 years ago

Thanks for your reply.

I am probably just looking at the report in the wrong way. I use the reports to give me an indication of whether space "can" be saved - i.e. whether to actually run the rmlint.sh script.

I arrived at using the tagging options because just listing the directories did not report the ca. 45GiB or more savings I was expecting - and that would actually be made if you ran the generated script.

So, stepping back, here is the situation. The first thing to note is that these files are currently all on NTFS, mounted with ntfs-3g. This may be significant.

Dir1: Dir1 contains 357GB. Some of its files already share inodes with other files on the same drive, and some of those shared inodes occur within Dir1 itself.

$ sudo du -hs 
357G    .
$ sudo find . -type f -links +1 -printf "%s\n" | gawk -M '{t+=$1}END{print t}' | numfmt --to=si
387G
$ sudo find . -type f -links 1 -printf "%s\n" | gawk -M '{t+=$1}END{print t}' | numfmt --to=si
110M

Note that among the files with a link count above 1 (i.e. files with hardlinks), some of the other hardlinked names, i.e. copies, are inside Dir1 itself.

Dir2: Dir2 contains 26 files, total size 48GB. None of these files are hardlinked, and ALL of them are also present in Dir1 (FWIW, not only are the files copies, they share the same relative paths and filenames). We can see below that Dir2 contains no hardlinks, so the files in Dir2 must be copies.

$ sudo du -hs
44G     .
$ sudo find . -type f -links 1 -printf "%s\n" | gawk -M '{t+=$1}END{print t}' | numfmt --to=si
48G
$ sudo find . -type f -links +1 -printf "%s\n" | gawk -M '{t+=$1}END{print t}' | numfmt --to=si
numfmt: invalid number: ‘’ <-- i.e. there are no files with hardlinks in this directory

Note: in other words Dir2 contains no hardlinks yet.

I am seeking to replace the copies in Dir2 with hardlinks to the same files in Dir1.
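What the generated script ultimately achieves per duplicate can be sketched like this (a simplified illustration with throwaway paths; rmlint's actual cp_hardlink function in rmlint.sh is more elaborate than a bare ln -f, this only shows the end state):

```shell
#!/bin/sh
# End state: Dir2's copy becomes another name for Dir1's inode.
set -e
tmp=$(mktemp -d)
mkdir "$tmp/Dir1" "$tmp/Dir2"
printf 'payload' > "$tmp/Dir1/f"
cp "$tmp/Dir1/f" "$tmp/Dir2/f"       # the duplicate copy (own inode)
ln -f "$tmp/Dir1/f" "$tmp/Dir2/f"    # replace the copy with a hardlink
i1=$(stat -c %i "$tmp/Dir1/f")
i2=$(stat -c %i "$tmp/Dir2/f")
echo "same inode: $([ "$i1" = "$i2" ] && echo yes || echo no)"   # same inode: yes
rm -rf "$tmp"
```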

Just to "confirm" the situation further with a sample duplicate file:

In Dir1:

$ ls -ial DIR1/all/derek.and.clive-001-0019.avi
70132 -rwxrwxrwx 2 root root 582139904 Oct 10  2009 'DIR1/all/derek.and.clive-001-0019.avi'
(note the hardlink count of 2, and the inode number 70132)

In Dir2:

$ ls -ial DIR2/all/derek.and.clive-001-0019.avi
1221083 -rwxrwxrwx 1 root root 582139904 Oct 10  2009 'DIR2//all/derek.and.clive-001-0019.avi'
(note the hardlink count of 1, and the different inode number 1221083)

The md5sum for the above 2 files (separate inodes) is identical.
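That is exactly the duplicate-candidate situation: identical bytes, distinct inodes. The same pattern can be reproduced with a throwaway check (illustrative only, not the files from this thread):

```shell
#!/bin/sh
# Two separate inodes with identical content hash the same.
set -e
tmp=$(mktemp -d)
printf 'same bytes' > "$tmp/one"
cp "$tmp/one" "$tmp/two"            # copy: new inode, identical content
h1=$(md5sum < "$tmp/one" | cut -d' ' -f1)
h2=$(md5sum < "$tmp/two" | cut -d' ' -f1)
i1=$(stat -c %i "$tmp/one")
i2=$(stat -c %i "$tmp/two")
echo "hashes match: $([ "$h1" = "$h2" ] && echo yes || echo no)"   # hashes match: yes
echo "inodes differ: $([ "$i1" != "$i2" ] && echo yes || echo no)" # inodes differ: yes
rm -rf "$tmp"
```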

When I run rmlint without tagging of any sort on DIR1 and DIR2:

rmlint -c sh:hardlink --progress -S dma -s -1TB DIR1 DIR2

[#########################]                     Traversing (515 usable files / 0 + 0 ignored files / folders)
[#########################]                         Preprocessing (reduces files to 451 / found 0 other lint)
[#########################]         Matching (49 dupes of 29 originals; 0 B to scan in 0 files, ETA:  1m 20s)

==> In total 515 files, whereof 49 are duplicates in 29 groups.
==> This equals 5.29 GB of duplicates which could be removed.
==> Scanning took in total  1h 14m 25.385s.

Notes:

There are 49 cp_hardlink lines in rmlint.sh. Using various grep commands I found that:

Re-run: to try to understand the reporting, I opted for tagging (as originally mentioned). By keeping all tagged files, there are fewer cp_hardlink commands to deal with.

rmlint -c sh:hardlink --progress -S dma -s -1TB --keep-all-tagged DIR2 // DIR1

[#########################]                     Traversing (515 usable files / 0 + 0 ignored files / folders)
[#########################]                         Preprocessing (reduces files to 418 / found 0 other lint)
[#########################]         Matching (26 dupes of 26 originals; 0 B to scan in 0 files, ETA:  1m 46s)

==> In total 515 files, whereof 26 are duplicates in 26 groups.
==> This equals 44.52 GB of duplicates which could be removed.
==> Scanning took in total  1h 41m 13.504s.

rmlint reports 44.52 GB of duplicates, which is much better. There are 26 cp_hardlink lines in rmlint.sh, and 26 files total in Dir2. This result is completely as expected.


Why does the 1st run only report 5.29 GB of duplicates?

This "could" also explain the total shown in the report from the 1st run(!?)

In other words, what might be happening is that only the size of the first argument to cp_hardlink is added to the report total. However, if the first argument to cp_hardlink is itself already a hardlink, then its size is not added to the report total.

cebtenzzre commented 1 year ago

Either rmlint -c sh:hardlink -km DIR2 // DIR1 or rmlint -c sh:hardlink --keep-hardlinked DIR1 DIR2 is the basic structure you need. rmlint will not consider space to be freed if the result of running the script doesn't remove all hardlinks to a given inode. It's trying to tell you that the result would be sub-optimal because of the options you chose. https://github.com/sahib/rmlint/blob/854af40f87b366837ad3851243c06fbfb634b153/lib/shredder.c#L1320-L1326
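The accounting rule described above is easy to reproduce outside rmlint: deleting one name of a multiply-linked inode frees (almost) nothing, because the data stays referenced by the remaining links. A self-contained demonstration with temporary files (this models the rule, it is not rmlint's code):

```shell
#!/bin/sh
# Removing an in-scope hardlink does not free space while an
# out-of-scope name still references the same inode.
set -e
tmp=$(mktemp -d)
mkdir "$tmp/Dir1"
dd if=/dev/zero of="$tmp/outside" bs=1M count=1 status=none
ln "$tmp/outside" "$tmp/Dir1/dupe"   # Dir1's file shares its inode with a file outside Dir1
before=$(du -s "$tmp" | cut -f1)
rm "$tmp/Dir1/dupe"                  # remove only the in-scope name...
after=$(du -s "$tmp" | cut -f1)
echo "freed: $((before - after)) KiB"   # ~0: the inode is still referenced by 'outside'
rm -rf "$tmp"
```

This would suggest the first run's small total is arguably honest: with Dir2 tagged as the keep side, relinking Dir1's already-hardlinked files would leave their old inodes referenced from outside, so little space would actually be freed.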