sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0
1.89k stars 132 forks source link

Misbehavior in fdupes with tagged #244

Closed amerlyq closed 6 years ago

amerlyq commented 7 years ago

I abuse rmlint to search exported headers from build tree into ./build/staging_dir/usr/include. General usage looks like this:

echo "${whole_list_of_package_headers}" | sort | find_dups ./build/staging_dir/usr/include 

Contents of find_dups is as follow:

cmd=( rmlint --hidden --types=df --output=fdupes --rank-by=pdlma --match-without-extension )

## Keep ./usr/include, remove package/*.h
cmd+=( -kM - // -- "$@" )   #1 BUG: -kM don't work as expected
# cmd+=( -Km -- "$@" // - ) #2 OK

## Keep package/*.h, remove ./usr/include
# cmd+=( -KM - // -- "$@" ) #3 OK
# cmd+=( -km -- "$@" // - ) #4 OK

exec "${cmd[@]}"

Results are next (some minor differences are visible only on big output, sorry for that)

#1

./build/jpeg-6b/jconfig.h

./build/staging_dir/usr/include/jconfig.h

./build/jpeg-6b/jpegint.h

./build/staging_dir/usr/include/jpegint.h

./build/jpeg-6b/jerror.h

./build/staging_dir/usr/include/jerror.h

./build/jpeg-6b/jmorecfg.h

./build/staging_dir/usr/include/jmorecfg.h

./build/jpeg-6b/jpeglib.h

./build/staging_dir/usr/include/jpeglib.h
#2

./build/staging_dir/usr/include/jconfig.h
./build/jpeg-6b/jconfig.h

./build/staging_dir/usr/include/jerror.h
./build/jpeg-6b/jerror.h

./build/staging_dir/usr/include/jmorecfg.h
./build/jpeg-6b/jmorecfg.h

./build/staging_dir/usr/include/jpegint.h
./build/jpeg-6b/jpegint.h

./build/staging_dir/usr/include/jpeglib.h
./build/jpeg-6b/jpeglib.h
#3

./build/jpeg-6b/jconfig.h
./build/staging_dir/usr/include/jconfig.h

./build/jpeg-6b/jpegint.h
./build/staging_dir/usr/include/jpegint.h

./build/jpeg-6b/jmorecfg.h
./build/staging_dir/usr/include/jmorecfg.h

./build/jpeg-6b/jerror.h
./build/staging_dir/usr/include/jerror.h

./build/jpeg-6b/jpeglib.h
./build/staging_dir/usr/include/jpeglib.h
#4

./build/jpeg-6b/jconfig.h
./build/staging_dir/usr/include/jconfig.h

./build/jpeg-6b/jpegint.h
./build/staging_dir/usr/include/jpegint.h

./build/jpeg-6b/jmorecfg.h
./build/staging_dir/usr/include/jmorecfg.h

./build/jpeg-6b/jpeglib.h
./build/staging_dir/usr/include/jpeglib.h

./build/jpeg-6b/jerror.h
./build/staging_dir/usr/include/jerror.h

I assume results #1 and #2 must be similar to each other, as do #3 and #4. But you can see, that #1 is strictly different -- fdupes treats all files as completely equal! Is it a bug, or fdupes spec requires that strange behaviour?

SeeSpotRun commented 7 years ago

Interesting; I never really considered -kM or -Km as valid options.

Case #1, -kM, should mean only report duplicates of untagged files but don't delete any tagged files. So it should only report untagged files that have untagged duplicates. In your case it shouldn't report any duplicates; instead it's reporting pairs of files but reporting both as originals, which is kind of correct.

Case #2, -Km shouldn't report any duplicates but unfortunately it is.

I think my preferred solution for this would be to make -kM and -Km invalid option combinations, unless you can convince me that these are useful combinations.

amerlyq commented 7 years ago

Two situations

Three usecases

Options -k/-K/-m/-M/-S looked like solution exactly to my usecase (with little post-processing of fdupes output). I don't have lingering attachment to these options, but man creates impression of them being independent.

       -k --keep-all-tagged / -K --keep-all-untagged
              Don’t delete any duplicates that are in tagged paths (-k) or that are in non-tagged paths (-K).  (Tagged paths are those that were named after //).
       -m --must-match-tagged / -M --must-match-untagged
              Only look for duplicates of which at least one is in one of the tagged paths.  (Paths that were named after //).
       -S --rank-by=criteria (default: pOma)
              ... Note: original path criteria (specified using //) will always take first priority over -S options.

Actual usage notes resulting from man of those three options:

But, in the next step I assumed algorithms for searching/comparison (w/o delving into code, of course). If my assumption correct, performance would be better if comparison subset for -m/-M is small. In such way invalid (by your logic) forms -kM/-Km were formed. -k/-Kstill controls which side of files/destinations to output, but -m/-M used to prefer known in advance smaller subset of files between sides. It's only my assumption about algorithms performance. So, if my assumptions are irrelevant to actual code, those option combinations are better to remove as you wished to avoid confusion similar to mine (however then we don't need two different options -- one would be enough to represent -km/-KM). Otherwise them must be fixed in suggestion of choosing lesser subset of files for comparison algorithms.

amerlyq commented 7 years ago

Actually, I'm beginning to incline to your proposal to simply prohibit usage of -kM/-Km. Which leaves us with the only independently valid variants: -k/-K/-m/-M/-km/-KM. It will be the most reasonable solution (of course, if the inner implementation itself auto-chooses smaller subset for comparison ops).

SeeSpotRun commented 7 years ago

Ok I'll go ahead and add the code to prohibit kM|Km

sahib commented 6 years ago

I'll added a check to prohibit -kM and -Km as of af47630. If tests passes, I'll close this ticket.