sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

I feel many of the options take the wrong default #595

Open graemev opened 1 year ago

graemev commented 1 year ago

I feel many of the options default to danger:

bad_group/bad_userid: If I share the filesystem, e.g. via NFS, it is often the case that a UID/GID won't be found locally (or, if I use NIS, it may be temporarily unavailable), so chown(2)ing/chgrp(2)ing files by default seems a bad choice.

bad_symlinks: These are often "bad" simply because I don't have a filesystem mounted at the moment, so again removing these by default seems a bad choice.

-L ... this just seems plain wrong. I often have more than one directory entry pointing at the same file (inode), and I can see no way these can be considered duplicates. I ran this on a 1.5TB filesystem and was horrified when it suggested deleting many thousands of carefully crafted directories (e.g. a TODO directory contains just (hard)links to other files; ditto a BIG directory might link the big files ... typically each inode is linked to by about 3 directory entries). So I think -L should default the other way: a file with multiple directory entries is still just one file; being known by multiple names does not make it a duplicate.
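To make the hard-link point concrete, here is a small illustration using plain coreutils (not rmlint itself; the file names are made up): two directory entries created with ln(1) share one inode, so the data exists exactly once on disk no matter how many names refer to it.

```shell
# Two names, one inode: a hard link is not a duplicate.
tmp=$(mktemp -d)
echo "invoice data" > "$tmp/bill.pdf"
ln "$tmp/bill.pdf" "$tmp/topay.pdf"   # second directory entry, same inode

# Both lines report the same inode number and a link count of 2.
stat -c 'inode=%i links=%h name=%n' "$tmp"/*.pdf
rm -r "$tmp"
```

Deleting either name removes only that directory entry; the content survives until the last link is gone, which is exactly why removing "extra" links destroys a layout without reclaiming space.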

Stripping binaries ... this is really something I never want to do. I'm always frustrated to find the symbols missing when I kick off gdb(1); for the sake of a few bytes I can't debug the running version ... arrgh. Again, I feel NOT stripping should be the default.

Empty directories: I have large numbers of these (e.g. the TODO above) which are only empty right now; they will have links added at the appropriate time. Finding all my target directories removed would be a disaster. I think the default should be NOT to remove these.
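One way to review what an empty-directory cleanup would touch before letting any tool act on it (a sketch using only find(1), not an rmlint feature):

```shell
# List empty directories without deleting anything, so intentional
# placeholders (like the TODO directories described above) can be
# reviewed by hand first.
find . -type d -empty
```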

In general there seem to be far too many "fail to danger" choices. Simply running the generated script will damage the filesystem.

I feel a "don't change anything unless the user asks to change it" would be a good philosophy.

My personal choice would be to default to -n (dry run), and then when it was run say "to actually do this, add the -X option".

cebtenzzre commented 1 year ago

rmlint actually doesn't strip binaries by default, it has to be enabled with e.g. -T nonstripped. I suppose the comment about --hardlinked is valid, since hardlinks are likely created explicitly or by rmlint itself and should probably be kept by default. As for the rest, it really depends on whether you use rmlint as a duplicate file finder, or as a handy way to find junk on your filesystem. I personally use -T df most of the time, and maybe it should be documented more prominently that you should use that option if all you care about is duplicate files.

graemev commented 1 year ago

Thanks, I noticed the nonstripped behaviour while rechecking the man page. I was reading the generated script while posting.

In general I'm incredibly wary of a script which can do a huge amount of filesystem damage with a single command. For background, the very first place I ran it actually had about 6 real duplicate files. It had been regularly cleaned with fslint-gui. Right now it contains no dupes, running this with no (few) options:

$ rmlint -o sh:nopts-output
$ wc nopts-output 
  36984  250839 6053290 nopts-output

So almost 37,000 lines, mostly deleting meticulously created hard links, e.g.:

cp  bill.pdf   $BILLID.pdf
ln  $BILLID.pdf  ../topay
ln  $BILLID.pdf  ../$YEAR/$MONTH
ln  $BILLID.pdf  ../$SUPPLIER
....
# Later
mv ../topay/$BILLID ../paid

So ../paid might be empty, as may be 2023/Jan, 2023/Feb etc. ... losing all those links would be a disaster.
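A cautious pre-flight check in the same spirit (a sketch using GNU find, not part of rmlint): list every file that has more than one hard link before trusting any generated deletion script.

```shell
# Files with link count > 1: removing one of these names does not free
# any space, it only destroys one entry in a carefully built layout.
# Format: link-count inode path
find . -type f -links +1 -printf '%n %i %p\n' | sort -k2 -n
```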

BTW, it's probably a different git repo but "shredder" always just says "Nothing found" in all locations

Atrate commented 1 year ago

Also, IMO, traversing filesystems should be disabled by default, but that may be a matter of personal preference.

graemev commented 1 year ago

Thinking about "enhancements" rather than the "scary nature of the default values", it strikes me as relatively simple to make this a much more powerful command.

Simply make the concept of "same" configurable:

e.g. I use FINDIMAGEDUPES(1p) to match "similar" photos; two photos taken at exactly the same time & date are probably the same photo; two videos might be "the same" if the contents are the same, even if the container differs ... many similar such ideas.

Rather than try to second-guess what people might want, simply allow a user exit (a dynamic call to a shared-library routine), e.g.:

extern int rmlint_match(FILE *fd1, FILE *fd2, int flags); /* return 0 if not the same */

So I could simply add a library to my LD_LIBRARY_PATH (possibly the function name could be an argument to rmlint(1) rather than fixed as shown here, so I could maintain a library of useful matchers). Typically all I'd do is call some other library code, e.g. from the FINDIMAGEDUPES package or ffmpeg, and just make sure to return 0 or 1 as it determined.

Also, a quick "speed up" choice (unrelated to the above) is "two files are NOT the same if the filesize differs" ... this bypasses the need to even open the file itself. (There are some edge cases, e.g. sparse files, where this assumption might be considered wrong.)
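That size pre-filter can be sketched with coreutils alone (a hypothetical illustration; per the project's own description, rmlint already groups files by equal size internally before hashing):

```shell
# Print file sizes that occur more than once; only files in these size
# groups can possibly be byte-identical, so only they need their
# contents read and hashed. Subject to the sparse-file caveat above.
find . -type f -printf '%s\n' | sort -n | uniq -d
```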

cebtenzzre commented 1 year ago

@graemev That is definitely out of scope for this project. The entirety of rmlint is built around finding files with identical content, not similar content. It uses a powerful incremental hashing algorithm in order to match pairs of files, by preprocessing and sorting them into groups of equal filesize, then running incremental matching passes to split the groups, and finally postprocessing the matches. There is no one function in rmlint that simply compares two files, so it is not feasible to patch in another comparison function without essentially rewriting the program from the ground up.

Maybe you are looking for a program like dupeGuru instead.