jvirkki / dupd

CLI utility to find duplicate files
http://www.virkki.com/dupd
GNU General Public License v3.0

Feature request - duplicate directories #42

Open trushworth opened 2 years ago

trushworth commented 2 years ago

A directory is a duplicate if everything in it (regular files, hidden files, and sub-directories) is identical to the contents of some other directory. This would be easy to do by defining the hash of a directory as the hash of the sorted list of hashes of all of its contents. The list needs to be sorted by hash value in case the files or sub-directories have different names. The database could simply treat directories as another kind of file, although I expect it might be better to have a way to distinguish them, if only for listing.
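For what it's worth, here is a minimal sketch of that idea in C (dupd's implementation language). Everything in it is illustrative rather than dupd code: FNV-1a stands in for whatever content hash dupd actually uses, and `file_hash`/`dir_hash` are hypothetical names, not dupd internals.

```c
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define FNV_OFFSET 0xcbf29ce484222325ULL
#define FNV_PRIME  0x100000001b3ULL

/* 64-bit FNV-1a, used here as a stand-in for the real content hash. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len) {
  const unsigned char *p = data;
  while (len--) { h ^= *p++; h *= FNV_PRIME; }
  return h;
}

/* Hash the contents of a regular file. */
static uint64_t file_hash(const char *path) {
  unsigned char buf[65536];
  uint64_t h = FNV_OFFSET;
  size_t n;
  FILE *f = fopen(path, "rb");
  if (!f) return 0;
  while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
    h = fnv1a(h, buf, n);
  fclose(f);
  return h;
}

static int cmp_u64(const void *a, const void *b) {
  uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
  return (x > y) - (x < y);
}

/* Directory hash = hash of the sorted list of child hashes, so the
 * result does not depend on entry names or readdir() order. */
static uint64_t dir_hash(const char *path) {
  uint64_t *hashes = NULL;
  size_t count = 0, cap = 0;
  struct dirent *e;
  uint64_t h = FNV_OFFSET;
  DIR *d = opendir(path);
  if (!d) return 0;
  while ((e = readdir(d)) != NULL) {
    char child[4096];
    struct stat st;
    uint64_t ch;
    if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, "..")) continue;
    snprintf(child, sizeof(child), "%s/%s", path, e->d_name);
    if (lstat(child, &st) != 0) continue;
    if (S_ISDIR(st.st_mode)) ch = dir_hash(child);
    else if (S_ISREG(st.st_mode)) ch = file_hash(child);
    else continue;                      /* skip symlinks, devices, ... */
    if (count == cap) {
      cap = cap ? cap * 2 : 16;
      hashes = realloc(hashes, cap * sizeof(*hashes));
      if (!hashes) { closedir(d); return 0; }
    }
    hashes[count++] = ch;
  }
  closedir(d);
  if (count) {
    qsort(hashes, count, sizeof(*hashes), cmp_u64);
    h = fnv1a(h, hashes, count * sizeof(*hashes));
  }
  free(hashes);
  return h;                             /* empty dirs all hash alike */
}

int main(int argc, char **argv) {
  for (int i = 1; i < argc; i++)
    printf("%016llx  %s\n", (unsigned long long)dir_hash(argv[i]), argv[i]);
  return 0;
}
```

A nice property of this formulation is that it composes bottom-up: once the leaf directories are hashed, each parent's hash only needs its children's already-computed hashes, so the whole tree costs one pass.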

A simple flag like "--directories" could be used to enable this during scan, and possibly to show only directories for a list command. I haven't really thought through the flags and how best they could be made consistent with the existing flags.

This would be interesting for cases like source code trees that have been copied, picture collections, or any fairly large directory tree. If the user knows that entire directory trees are duplicates, a whole duplicate tree can be removed with "rm -r" instead of working through the files one by one and then the empty directories. The workflow might end up something like:

1) find and remove duplicate directory trees (i.e. groups of files at once)
2) find and remove individual duplicate files (as we do now)

And by the way, thanks for a useful tool! I've stalled out on duplicate removal many times just because the other tools don't manage incremental work all that well.