sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

Usage of "-S p" with "rmlint abc // dev" syntax and -k K m M with simple (undivided) pathname list #597

Closed rnsc5jjjjj closed 1 year ago

rnsc5jjjjj commented 1 year ago

I am attacking my big mess, and I am stalled by not understanding how the -S p and P and -k K m M options work with respect to the two syntaxes for the command line paths. I expect that the situation I am struggling with is simple, but I don't see a clear explanation in the documentation. I guess this could be called a shortfall in the documentation. If this can be cleared up, I will propose an update to the documentation. This has to be one of the most thoroughly thought-out and careful packages I have ever seen, so I am confident that there is very likely a good and intentional design here.

I think it boils down to:

The sixth paragraph of the man page equates "Preferred" with "Tagged". This seems quite clear, but I am listing it as an assumption to be corrected if appropriate. Further the "Original" is selected from "Preferred" and "Tagged" places if possible. If not possible, it is selected from the "UnPreferred" "UnTagged" "Not-Original" group. I take all of this as given, though not joined in the option documentation.

The -k, -K, -m, and -M options are described relative to the // syntax. If the first named path in the simple syntax is the same as a single element list after the // in the // syntax, then the first element in the simple list syntax is "tagged" and the -k, -K, -m, and -M functionality can be used with this understanding with the simple list syntax. If the first named path in the simple syntax is NOT "Tagged", then -k, -K, -m, and -M cannot be used with the simple syntax, which is a huge loss in functionality, and I think also an unreasonable conclusion. The documentation does not say anywhere that these options cannot be used with the simple syntax, but I don't see how to escape from this conclusion, since their definition is given in terms of the // syntax and there is no verbiage telling how the two syntaxes relate.

Turning to -S, the p and P options specify keeping the first or last "named path". The fifth paragraph of the man page describes using the "first-named path on the command line" as the original. This is the only other place that I could find that used the terminology "named path," and this wording seems very consistent with the -S p option always "keeping the first named path" (or -S P always "keeping the last named path"). It does not address how this relates to the // syntax, where there are two lists separated by //: the first list of Non-Preferred/Non-Tagged paths, and the second list of Preferred/Tagged paths. With the simple list syntax the designated "original" path is only a single path, but the "non-original" is a group consisting of the remainder of the list, a group of multiple paths. This isn't much different, though the order of the lists has been reversed. The simple list has the "Original" aka "Preferred" aka "Tagged" path first; the // syntax has the "Preferred" aka "Tagged" list last. But both define two distinct lists with the same two roles.

For the simple list syntax, the -S functionality is used to select between candidates within one of the two groups: Either the single "Original" path which could have multiple candidates, or the "non-original" group which could also have multiple candidates. If there are candidates in the "Original" group one of its candidates is selected as the official "original". If not, one of the candidates in the "non-original" group is selected as the official "Original". The -S criteria are applied in order until it is narrowed down to a single candidate (Of course there is the case where there could be multiple candidates after all of the -S criteria are exhausted, but that is another topic and one that I am not concerned with here. In any case the decision can only be arbitrary, e.g. in these situations typically the outcome is "undefined" or "first seen" or "last seen". Don't know what rmlint does, but perhaps in this case it does not matter!)

Given my assumption that the simple list syntax and // syntax are simply different ways to express the two groups of paths, the exact same -S functionality is needed regardless of how the paths are specified.

For the // syntax, a selection to be the "Original" also has to be made, either within the "Preferred/Tagged" group if the duplicate set has a candidate there, otherwise within the "UnPreferred/UnTagged" group. But the -S option does not speak to how the p or P criteria relate when the paths have been specified with the // syntax. It does not have words referring to the // syntax like -k -K -m -M do, just as they did not have words applicable to the simple list syntax. Perhaps "-S p" and "-S P" cannot be used with the // syntax? This seems unreasonable: they are needed. If "-S p" (and "-S P") CAN be used with the // syntax, does the reference to first and last bridge the list on the command line across the //? Or is it applied within the Original/Tagged/Preferred group if it has a candidate, and within the Non-Original/UnTagged/UnPreferred group if that is the only group with candidates? If it is the "first named path" on the command line and the "last named path" on the command line, consistent with paragraph 5 of the man page where the -S option is introduced, that would bridge the //, with the first being on the left side and the last being on the right. With the 'p' criterion this would give preference to selecting a candidate to be the Original from the first of the non-tagged (UnPreferred) set, which is counter-intuitive, and the last with 'P' would give preference to the last of the Tagged (Preferred) set, which does indeed make sense. But it would make more sense to select within the appropriate group, not the first or last of a list of peer paths spanning both groups. Or perhaps "-S p" and "-S P" cannot be used with the // syntax, but similar to above, it does not say this anywhere, and this would leave a big hole in the functionality, so that must not be true.

Lots of guesses could be made here, but I am hoping that someone can make definitive statements regarding how the two syntaxes relate, how the -S p or P and how the -k K m M options relate to the path specifications for those two syntaxes.

Thank you!

cebtenzzre commented 1 year ago

Tagged paths are described here and at the beginning of the manpage. -k, -K, -m, and -M apply to tagged paths. Tagging a path is always higher priority than any ranking options specified with -S.

Preferred and tagged do in fact have the same meaning in the source code and documentation. There can be any number of paths before or after the // separator. rmlint doesn't have a complicated algorithm for picking originals, it just has ways of selecting which files have a higher priority for being considered original - mainly tagged paths and ranking (-S, which is pOma by default), but it is also affected by options like --keep-hardlinked.
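As an illustration of the two path lists described above, here is a minimal Python sketch of how a command line could be divided at the // separator. It is not rmlint's parser: the function name `split_paths` is hypothetical, and the sketch assumes at most one // on the command line.

```python
def split_paths(args):
    """Split path arguments into (untagged, tagged) lists.

    Hypothetical helper, not part of rmlint. Assumes at most one
    "//" separator: everything before it is untagged, everything
    after it is tagged. Without "//" nothing is tagged.
    """
    if "//" in args:
        i = args.index("//")
        return list(args[:i]), list(args[i + 1:])
    return list(args), []
```

For example, `split_paths(["dir1", "dir2", "//", "dir3"])` yields `(["dir1", "dir2"], ["dir3"])`, while a list without // yields an empty tagged list.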

If you have any specific questions, let me know, but I think the documentation explains it fairly well. I can't really parse all of the guesses you're making, it's a fairly simple interface.

rnsc5jjjjj commented 1 year ago

Thank you, I will study this and come back with bulleted questions. I am grateful for your help.

rnsc5jjjjj commented 1 year ago

A couple of related questions:

With: rmlint dir1 dir2 dir3

If dir1 contains one or more of a group of duplicates, it is assumed that one of the duplicates in dir1 is the original. The original is chosen from this group if possible using the -S criteria. If it is not possible to select an original within dir1 for whatever reason, one will be selected from dir2 or dir3. This is described in the paragraph beginning "rmlint tries to be helpful by guessing..." on the first page of the man-page.

Question 1: In the example above, is dir1 "Tagged" aka "Preferred" with all of the same implications, such that -k, -K, -m, -M all function in this simple list (a list without //) usage as described in their definition with respect to paths identified as "Tagged" paths with the // usage?

Question 2: Is: rmlint dir1 dir2 dir3 the same as rmlint dir2 dir3 // dir1 ?

If not, how is it different?

Thank you.

cebtenzzre commented 1 year ago

No, paths are never implicitly tagged/preferred. You must use // for -kmKM to have any meaning. The only reason the order of paths matters without tagging is because the default rank is pOma, and p will choose the first named path. If there are any tagged paths that match then they are the only ones considered for being marked original.
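The rule in this reply can be modelled in a few lines of Python. This is only a sketch of the behaviour as described here, not rmlint's code: `Candidate`, `path_index`, and `pick_original` are hypothetical names, and only the `p` criterion (first named path) is modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    path: str        # file path of one duplicate
    path_index: int  # position of its directory on the command line (assumed 0-based)
    tagged: bool     # True only if the directory was listed after "//"


def pick_original(group):
    """Pick the original from one group of duplicates.

    If any candidate is tagged, only tagged candidates are considered.
    Within the remaining pool, 'p' prefers the first named path.
    """
    tagged = [c for c in group if c.tagged]
    pool = tagged if tagged else list(group)
    return min(pool, key=lambda c: c.path_index)
```

With `rmlint dir1 dir2 dir3` nothing is tagged, so a duplicate under dir1 wins purely because of `p`. With `rmlint dir2 dir3 // dir1` the dir1 copy wins because it is tagged, even though dir1 is named last.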

rnsc5jjjjj commented 1 year ago

Good, thanks.

Moving on to understanding -S p and -S P in the // context. I think I understand, and don't have any more questions. Could I ask you to follow my train of thought below to see if there are any errors or anything you would add? This will not only help me but also provide a stamp of credibility for this thread (assuming that we will be done!).

rmlint dir1 dir2 dir3 // dir4 dir5 dir6

dir1, dir2, dir3 are untagged aka unpreferred.

dir4, dir5, dir6 are tagged aka preferred.

If possible an original is selected from a tagged (preferred) path. Which path, and which file within that path is driven by the -S criteria's impact on both path and file.

p, P, d, D, l, L, r, R are all criteria for judging a path in a set of paths and so could raise questions regarding which group of paths we are talking about, as I did in my earlier note (But they won't, as explained below).

Absent the -m flag, the original could be under a tagged (preferred) path or from under an untagged (unpreferred) path.

But per Gentle Guide 2.5.1 "Tagging always takes precedence over the -S options..." so we start with the Tagged group.

To identify the original, we first apply p or P to the right-hand (Tagged/Preferred) group: 'p' selects the first named path of the Tagged group, 'P' the last named path of the Tagged group. d and D, l and L, r and R, as well as the non-path-oriented criteria, then layer their qualifiers on whatever restriction was imposed by the previous criteria, but always within the Tagged group.

IF and Only IF nothing is found in the Tagged group, the same process will be followed for the UnTagged group. This would happen for groups of duplicates where there is no file meeting the path and file criteria within the Tagged group. Of course this could also be true of the UnTagged group, and then no Original would be found. Not sure what would happen in this situation, that is another topic. Only the rRxX criteria can result in this situation.

-m and -M could modify this and restrict the search to one group or the other.

-k and -K if specified could inhibit deletions of duplicates, but that is not related to selection of the "Original".

cebtenzzre commented 1 year ago

Yeah, that's basically how it works. All ranking options work the same way though - rRxX are not special, they are just more likely to result in all candidates having the same resulting priority (none matching the regex), and the next criteria being used instead. Or if there are no other criteria, something is picked at random. '-S' is only used for ordering, so an original will always be picked for groups of more than one file, even if none of the criteria help to select one.
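The "ordering, not filtering" point can be sketched in Python as a lexicographic sort key. This is a rough model under stated assumptions, not rmlint's implementation: the dictionary fields and all function names are hypothetical, tagging is modelled as the highest-priority key, and a tie across every criterion falls back to input order as a stand-in for the arbitrary pick mentioned above.

```python
import re


def by_tagged(f):
    # Tagged files sort ahead of untagged ones.
    return 0 if f["tagged"] else 1


def by_first_named_path(f):
    # Model of 'p': lower command-line position wins.
    return f["path_index"]


def by_oldest_mtime(f):
    # Model of 'm': lower (older) modification time wins.
    return f["mtime"]


def by_regex_basename(pattern):
    # Model of 'r': basenames matching the pattern win; if none match,
    # every file gets the same value and the next criterion decides.
    rx = re.compile(pattern)
    return lambda f: 0 if rx.search(f["path"].rsplit("/", 1)[-1]) else 1


def pick_original(group, criteria):
    """Return the best-ranked file; some file is always returned."""
    keys = [by_tagged, *criteria]
    return min(group, key=lambda f: tuple(k(f) for k in keys))
```

Because the criteria only build a sort key, a regex that matches nothing does not empty the group; it just leaves the decision to the next criterion, and a non-empty group always yields an original.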