sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

Treating files as equal only if path matches #651

Closed torarnv closed 3 months ago

torarnv commented 3 months ago

The -b option is useful, but is there any way to also include the path in the comparison?

Say I copy dir foo to dir bar/baz/foo. I'd like to check if the foo that's inside bar/baz is 100% equal to the foo in the original dir.

With the current -b behavior, it will, for example, treat HEAD files from different git repos as equal just because they all contain "master", but of course these can't be deduplicated, as that would break their respective git repos.

torarnv commented 3 months ago

I've used git diff --no-index for this use case, but I suspect git also tries to compute a hash of the changes. All I need to know is whether the files are different, not how they are different.
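For illustration, a yes/no check like the one described here can be sketched in a few lines of Python. This is only a sketch, not something rmlint or git provides: the helper names `relative_files` and `trees_identical` are made up for this example, and it assumes that "equal" means "same relative file paths with identical bytes" (empty directories, permissions and timestamps are ignored).

```python
import filecmp
import os


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def trees_identical(dir_a, dir_b):
    """True if both trees hold the same relative paths with identical bytes."""
    files_a = relative_files(dir_a)
    if files_a != relative_files(dir_b):
        return False  # some file exists on only one side
    for rel in sorted(files_a):
        # shallow=False forces a content comparison instead of trusting
        # the size and mtime reported by os.stat().
        if not filecmp.cmp(os.path.join(dir_a, rel),
                           os.path.join(dir_b, rel), shallow=False):
            return False  # stop at the first difference
    return True
```

Calling `trees_identical("foo", "bar/baz/foo")` would then answer the question from the first comment. Because files are only ever compared with the file at the same relative path, two unrelated HEAD files that happen to contain the same text are never matched against each other.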

cebtenzzre commented 3 months ago

Say I copy dir foo to dir bar/baz/foo. I'd like to check if the foo that's inside bar/baz is 100% equal to the foo in the original dir.

If you just want to know whether the contents and structure of two directories are equal, you are probably looking for rmlint -Dj or even rmlint -T dd -j. From man rmlint:

-D --merge-directories (default: disabled) Makes rmlint use a special mode where all found duplicates are collected and checked if whole directory trees are duplicates. Use with caution: You always should make sure that the investigated directory is not modified during rmlint's or its removal scripts run.

-j --honour-dir-layout (default: disabled) Only recognize directories as duplicates that have the same path layout. In other words: All duplicates that build the duplicate directory must have the same path from the root of each respective directory. This flag makes no sense without --merge-directories.

Does that cover your needs, or are you trying to do something very specific here?

these can't be deduplicated, as it would break their respective git repos

From the manpage:

-r --hidden / -R --no-hidden (default) / --partial-hidden Also traverse hidden directories? This is often not a good idea, since directories like .git/ would be investigated, possibly leading to the deletion of internal git files which in turn break a repository. With --partial-hidden hidden files and folders are only considered if they're inside duplicate directories (see --merge-directories) and will be deleted as part of it.

So your example of rmlint corrupting a git repo is explicitly handled correctly by the default options - use --hidden at your own risk. If you have .git folders that are actually near duplicates, using git worktree is a much better idea than trying to run rmlint on them.

cebtenzzre commented 3 months ago

And honestly, rmlint is the wrong tool for finding the difference between two specific directories (files added, removed, modified, etc.). I suggest using something like rsync -ain --delete dir_a/ dir_b/ for such a task, especially since it's cheaper to compare a pair of files directly than to hash them first.
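To make the "compare a pair of files directly" idea concrete, here is a rough Python sketch that matches files by relative path first and only then compares contents, sorting the results into added, removed and modified. It is an illustration under assumptions, not a replacement for rsync: the names `list_files` and `compare_trees` are invented for this example, only regular file contents are considered, and empty directories and metadata are ignored. Unlike rsync's default quick check, which relies on size and modification time, this sketch reads the file contents whenever the sizes match.

```python
import filecmp
import os


def list_files(root):
    """Map each file's path relative to root onto its full path."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            files[os.path.relpath(full, root)] = full
    return files


def compare_trees(dir_a, dir_b):
    """Classify relative paths as only in dir_a, only in dir_b, or modified."""
    files_a = list_files(dir_a)
    files_b = list_files(dir_b)
    common = files_a.keys() & files_b.keys()
    return {
        "only_in_a": sorted(files_a.keys() - files_b.keys()),
        "only_in_b": sorted(files_b.keys() - files_a.keys()),
        # Paths present on both sides are compared by content, no hashing.
        "modified": sorted(
            rel for rel in common
            if not filecmp.cmp(files_a[rel], files_b[rel], shallow=False)
        ),
    }
```

For example, `compare_trees("dir_a", "dir_b")` returns a dictionary with three sorted lists of relative paths; if all three are empty, the two trees contain the same files with the same contents.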

torarnv commented 3 months ago

Much appreciated @cebtenzzre, I missed --honour-dir-layout as I only looked at https://rmlint.readthedocs.io/en/master/tutorial.html and https://rmlint.readthedocs.io/en/master/faq.html, not at https://rmlint.readthedocs.io/en/master/rmlint.1.html 😅

Perhaps https://rmlint.readthedocs.io/en/master/ should list the reference docs / man page as an entry under https://rmlint.readthedocs.io/en/master/#user-manual instead of https://rmlint.readthedocs.io/en/master/#informative-reference ?

But I think you're right, rsync or something similar that compares paths first would be better suited.

Thanks for your help! 🙌🏻