sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0
1.86k stars 128 forks source link

skip_hardlink not generated where expected. Already hardlinked files have "cp_hardlink" generated - even if seen as dupes #545

Closed james-cook closed 1 year ago

james-cook commented 2 years ago

I have already run rmlint over some directories on a drive. Now I am running rmlint on the whole drive to see what remains to be linted. I'm running with "sh:hardlink".

Strangely already hardlinked files (same inode) have "cp_hardlink" instead of "skip_hardlink" generated in rmlint.sh. cp_hardlink itself does no sanity check (check to see whether inodes already the same) and does a "rm" and "ln" all over again.

For example: These files share the same inode:

pi@rpiomv: $ ls -ali '/tmp/temp_E_4TB/HSa.p1/2008.01.13.hs.p-004-01.avi'
895822 -rwxrwxrwx 2 root root 1070006272 Aug 18  2010 /tmp/temp_E_4TB/HSa.p1/2008.01.13.hs.p-004-01.avi
pi@rpiomv: $ xattr -l '/tmp/temp_E_4TB/HSa.p1/2008.01.13.hs.p-004-01.avi'
user.rmlint.blake2b.cksum:
0000   36 66 64 62 39 34 30 33 33 61 35 62 32 33 61 35    6fdb94033a5b23a5
0010   33 38 34 64 37 62 39 64 34 65 34 65 63 64 38 39    384d7b9d4e4ecd89
0020   32 63 65 30 63 31 30 36 38 61 30 62 62 32 35 65    2ce0c1068a0bb25e
0030   32 32 63 34 32 65 65 63 63 35 32 35 62 33 30 62    22c42eecc525b30b
0040   66 33 36 35 63 61 31 64 32 36 30 33 61 36 31 65    f365ca1d2603a61e
0050   64 63 65 30 34 64 38 38 37 34 61 65 38 64 30 64    dce04d8874ae8d0d
0060   37 32 39 35 33 36 35 34 30 33 63 39 38 39 61 39    7295365403c989a9
0070   33 61 33 61 34 62 66 37 31 64 34 37 62 39 38 63    3a3a4bf71d47b98c
0080   00                                                 .

user.rmlint.blake2b.mtime: 1282169807.7567737
pi@rpiomv: $ ls -ali '/tmp/temp_E_4TB/HSa.p2/2008.01.13.hs.p-004-01.avi'
809730 -rwxrwxrwx 2 root root 1070006272 Aug 18  2010 /tmp/temp_E_4TB/HSa.p2/2008.01.13.hs.p-004-01.avi
pi@rpiomv: $ xattr -l '/tmp/temp_E_4TB/HSa.p2/2008.01.13.hs.p-004-01.avi'
user.rmlint.blake2b.cksum:
0000   36 66 64 62 39 34 30 33 33 61 35 62 32 33 61 35    6fdb94033a5b23a5
0010   33 38 34 64 37 62 39 64 34 65 34 65 63 64 38 39    384d7b9d4e4ecd89
0020   32 63 65 30 63 31 30 36 38 61 30 62 62 32 35 65    2ce0c1068a0bb25e
0030   32 32 63 34 32 65 65 63 63 35 32 35 62 33 30 62    22c42eecc525b30b
0040   66 33 36 35 63 61 31 64 32 36 30 33 61 36 31 65    f365ca1d2603a61e
0050   64 63 65 30 34 64 38 38 37 34 61 65 38 64 30 64    dce04d8874ae8d0d
0060   37 32 39 35 33 36 35 34 30 33 63 39 38 39 61 39    7295365403c989a9
0070   33 61 33 61 34 62 66 37 31 64 34 37 62 39 38 63    3a3a4bf71d47b98c
0080   00                                                 .

user.rmlint.blake2b.mtime: 1282169807.7567737

I run the command:

rmlint -c sh:hardlink -T "all -emptyfiles -emptydirs" --progress -S dma -s 1G-1TB --xattr '/tmp/temp_E_4TB/HSa.p1' '/tmp/temp_E_4TB/HSa.p2'

This actually reports 69 dupes but correctly notices that these dupes have zero size (because they are hardlinks of each other):

==> In total 273 files, whereof 69 are duplicates in 69 groups. ==> This equals 0 B of duplicates which could be removed. ==> Scanning took in total 20.342s.

But the generated rmlint.sh file contains:

 :
    461
    462 original_cmd  '/tmp/temp_E_4TB/HSa.p2/2008.01.13.hs.p-004-01.avi' # original
    463 cp_hardlink   '/tmp/temp_E_4TB/HSa.p1/2008.01.13.hs.p-004-01.avi' '/tmp/temp_E_4TB/HSa.p2/2008.01.13.hs.p-004-01.avi' # duplicate
    464
 :

I'm not sure if this is potentially a source of problems down the line but it does seems strange. Any ideas why skip_hardlink is not generated? Or, in fact, why ANY cp_hardlink lines are generated for these 69 files in the first place.

cebtenzzre commented 2 years ago

AFAIK you need to pass --keep-hardlinked for rmlint to explicitly mark hardlinks as originals. skip_hardlink is specific to files that were marked as hardlinks in the preprocessing step, maybe there is a false negative in rm_pp_bundle_hardlinks. Are you asking for an enhancment to cp_hardlink like if original_check "$1" "$2" && ! [ "$1" -ef "$2" ]; then?

james-cook commented 2 years ago

In my command I use -S dma to select the hardlink inode to preserve, and this works fine.

My "issue" is that having looked any very many rmlint.sh files I have often seen skip_hardlink after the "original" line. This makes absolute sense to me. I had assumed this is generated instead of cp_hardlink when the inode is already identical. So I was surprised in this case to see cp_hardlinks generated instead. (I have never used --keep-hardlinked explicitly)

If you then run your "final" rmlint.sh on "critical files" in PARANOID mode (as I do since https://github.com/sahib/rmlint/issues/436) then the cp_hardlink does a cmp all over again even though the inodes are already identical.

My suggestion would be to have an extra test in cp_hardlink to compare inode values (in fact in original_check). This is the existing code:

original_check() {
    if [ ! -e "$2" ]; then
        printf "${COL_RED}^^^^^^ Error: original has disappeared - cancelling.....${COL_RESET}\n"
        return 1
    fi

    if [ ! -e "$1" ]; then
        printf "${COL_RED}^^^^^^ Error: duplicate has disappeared - cancelling.....${COL_RESET}\n"
        return 1
    fi

    # Check they are not the exact same file (hardlinks allowed):
    if [ "$1" = "$2" ]; then
        printf "${COL_RED}^^^^^^ Error: original and duplicate point to the *same* path - cancelling.....${COL_
RESET}\n"
        return 1
    fi

    # Do double-check if requested:
    if [ -z "$DO_PARANOID_CHECK" ]; then
        return 0
    else
        if ! check_for_equality "$1" "$2"; then
            printf "${COL_RED}^^^^^^ Error: files no longer identical - cancelling.....${COL_RESET}\n"
            return 1
        fi
    fi
}

In the above, the test they are not the exact same file contains the comment "hardlinks allowed" - interesting... what does it mean?

As can be seen, no inode check is made at any point. If not at the "top level" one could at least be placed before the call to check_for_equality. Or more universally in check_for_equality itself (not sure of ripple effects here though)

Existing code, check_for_equality:

check_for_equality() {
    if [ -f "$1" ]; then
        # Use the more lightweight builtin `cmp` for regular files:
        cmp -s -- "$1" "$2"
    else
        # Fallback to `rmlint --equal` for directories:
        "$RMLINT_BINARY" -p --equal  -- "$1" "$2"
    fi
}

An inode test could be placed before the cmp command

cebtenzzre commented 2 years ago

Which version of rmlint are you using? The develop branch had some changes last year including 5ffc2a99b6f3a9200f68079b8ca296f720331995 that affect the hardlink preprocessing code, maybe one branch works as expected and the other doesn't?

cebtenzzre commented 1 year ago

I have been able to reproduce this on both master and develop. It's a problem with --merge-directories:

$ mkdir -p hardlink-test/a
$ echo xxx >hardlink-test/a/1
$ cp -rl hardlink-test/a hardlink-test/b
$ echo yyy >hardlink-test/a/2
$ rmlint -o sh -c sh:hardlink hardlink-test -S a | grep -B1 '_hardlink '
original_cmd  '/tmp/hardlink-test/a/1' # original
skip_hardlink '/tmp/hardlink-test/b/1' '/tmp/hardlink-test/a/1' # duplicate
$ rmlint -D -o sh -c sh:hardlink hardlink-test -S a | grep -B1 '_hardlink '
original_cmd  '/tmp/hardlink-test/b/1' # original
cp_hardlink   '/tmp/hardlink-test/a/1' '/tmp/hardlink-test/b/1' # duplicate

edit: btw, the files in your example have do in fact have different inodes, 895822 vs 809730, but their sizes match. ls -lhi might make this more obvious (or just use stat).