sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

--unmatched-basename is (1) crashing and (2) misbehaving when hardlinks are part of the dataset #616

Open intelfx opened 1 year ago

intelfx commented 1 year ago

While I was investigating hardlinked file handling for #614 (see comments), I encountered several issues with --unmatched-basename.

problem 1 testcase

# mkdir dir1 dir2
# for n in {1..20}; do dd if=/dev/urandom of=dir1/file$n bs=1M count=1024; done
# cp -a --reflink=never dir1 -T dir2
# ln -f dir1/file20 dir2/file20
# rmlint -T df --unmatched-basename .
ERROR: Aborting due to a fatal error. (signal received: Segmentation fault)
ERROR: Please file a bug report (See rmlint -h)

This one can be fixed by commit 594121fa8252b31028781e774954e66905b83431.

problem 2 testcase

# mkdir dir1 dir2
# for n in {1..20}; do dd if=/dev/urandom of=dir1/file$n bs=1M count=1024; done
# cp -a --reflink=never dir1 -T dir2
# ln -f dir1/file20 dir2/file20
# ln -f dir1/file20 dir2/file20_1
# rmlint -T df -o pretty -o summary --unmatched-basename .
==> In total 43 files, whereof 0 are duplicates in 0 groups.
==> This equals 0 B of duplicates which could be removed.
==> 3 other suspicious item(s) found, which may vary in size.
==> Scanning took in total 0.635s.

Expected behavior: 1 group, with dir1/file20 and dir2/file20 as originals and dir2/file20_1 as a duplicate. Likewise with a group of just 2 suitable hardlinks (drop dir2/file20; rmlint still won't see the group).

Basically, as I see it, the --unmatched-basename handling code does not consider hardlinks at all; it only considers a single name for every inode.
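To make the suspected bug concrete, here is a minimal sketch (in Python, not rmlint's actual C code; the function name is hypothetical) of what the filter would need to do: an inode's group should survive --unmatched-basename if *any two* of its hardlink names differ, whereas looking at a single arbitrary path per inode drops the group entirely.

```python
import os

def qualifies_unmatched_basename(hardlink_paths):
    """Return True if this inode's hardlink set should survive the
    --unmatched-basename filter, i.e. at least two of its names differ.

    Suspected bug: the filter inspects only one (arbitrary) path per
    inode, so a group whose only differing name comes from an extra
    hardlink (e.g. dir2/file20_1) is silently dropped.
    """
    basenames = {os.path.basename(p) for p in hardlink_paths}
    return len(basenames) > 1
```

Applied to problem 2's testcase, `["dir1/file20", "dir2/file20", "dir2/file20_1"]` should qualify (file20_1 differs), while a set containing only `file20` names should not.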

cebtenzzre commented 1 year ago

I am aware of several issues with --unmatched-basename, at least on master. I'll have to check whether either of these is new to me, although I don't think I've seen it crash before.