sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

Feature suggestion: Filter by duplicate group size #544

Open Claes1981 opened 2 years ago

Claes1981 commented 2 years ago

Feature suggestion: an option to filter results by total duplicate group size. You could then get the results for many small duplicate files or folders that together add up to large duplicate groups, while small files with only one or a few duplicates each could be skipped, hopefully speeding up the search. If I understand correctly, there is already some sorting by size in the preprocessing step. Are the files there sorted by total duplicate group size (as with the "Sort by size of group" output option, --sort-by=s), rather than by individual file or directory size? If by total group size, could you then simply skip further processing of all files and directories in groups below a chosen size threshold?
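
As a rough illustration of the requested behavior, here is a post-processing sketch over the array emitted by rmlint's json formatter (e.g. `rmlint /data -o json:out.json`): it groups duplicate entries by checksum and drops every group whose combined size is under a threshold. Note this only trims the report after the fact, it does not speed up the scan itself, which is part of the motivation above. The field names ("type", "checksum", "size", "is_original", "path") match what recent rmlint versions emit, but treat them as assumptions and verify against your own output; the script itself (`filter_groups.py` below) is hypothetical and not part of rmlint.

```python
#!/usr/bin/env python3
"""Hypothetical post-filter for rmlint's JSON output: keep only duplicate
groups whose combined size meets a threshold.

Assumptions (check against your own rmlint output): the top-level JSON is
an array whose first/last elements are header/footer objects, and each
duplicate entry carries "type" == "duplicate_file" plus "checksum",
"size", "is_original" and "path" fields.
"""
import json
import sys
from collections import defaultdict


def filter_groups(entries, min_group_size):
    """Group duplicate_file entries by checksum; yield groups whose
    combined size (all copies, original included) >= min_group_size."""
    groups = defaultdict(list)
    for e in entries:
        if e.get("type") == "duplicate_file" and "checksum" in e:
            groups[e["checksum"]].append(e)
    for checksum, files in groups.items():
        total = sum(f.get("size", 0) for f in files)
        if total >= min_group_size:
            yield checksum, total, files


def main():
    path = sys.argv[1]           # JSON file written by rmlint -o json:...
    min_size = int(sys.argv[2])  # group-size threshold in bytes
    with open(path) as fh:
        entries = json.load(fh)
    # Largest groups first, mirroring --sort-by=s.
    for checksum, total, files in sorted(
            filter_groups(entries, min_size), key=lambda g: -g[1]):
        print(f"group {checksum[:12]}: total {total} bytes, "
              f"{len(files)} copies")
        for f in files:
            marker = "original " if f.get("is_original") else "duplicate"
            print(f"  [{marker}] {f['path']}")


if __name__ == "__main__":
    main()
```

For example, `python3 filter_groups.py out.json 1000000000` would list only groups totalling at least 1 GB, largest first.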

I sometimes run rmlint on my whole multi-terabyte magnetic hard drives when I am cleaning and organizing files, but it can take a week or more. I am mostly interested in the biggest duplicate groups of files and directories, though, not in the millions of small files with only one duplicate each. The currently implemented file-size filter (--size=) also filters out duplicate groups of smaller files that have many duplicates each and would add up to large total group sizes.
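
To make that gap concrete with made-up numbers: a group of 10,000 duplicated 1 MiB files represents roughly 10 GiB of reclaimable space, yet a per-file filter such as --size=+100M would exclude every file in it, while a group-total threshold (as in the sketch above) would keep it.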