sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

Implement new options to fully hash unique files #479

Closed: SeeSpotRun closed this 3 years ago

SeeSpotRun commented 3 years ago

Addresses #462

The new option --hash-uniques means that even unique files get fully hashed. These will be output to the JSON report by default. If --xattr-write is also specified, the checksums will be written to the files' extended attributes.

SeeSpotRun commented 3 years ago

@sahib I wonder whether --write-unfinished can / should be deprecated with this in place? Unfinished checksums seem a bit flaky. They're only really useful if another same-length file has a different unfinished checksum after exactly the same number of bytes. I can't see where we store the number of bytes hashed in the JSON file for --replay, so I'm not sure how it's really supposed to work. While --hash-uniques has a bigger overhead on the first run, it's arguably more useful and robust than partial checksums.
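To illustrate the point about byte counts, here is a minimal Python sketch (not rmlint's C code). The helper names `partial_digest` and `proves_different` are hypothetical, and SHA-1 is an arbitrary stand-in for whatever checksum is actually configured. It shows that a partial digest only carries information when it is stored together with the number of bytes it covers, and that equal partial digests prove nothing.

```python
import hashlib
from typing import Optional, Tuple

# (bytes_hashed, hex_digest): the byte count must travel with the digest.
PartialDigest = Tuple[int, str]


def partial_digest(data: bytes, nbytes: int) -> PartialDigest:
    """Hash only the first nbytes of data (hypothetical helper)."""
    hashed = min(nbytes, len(data))
    return hashed, hashlib.sha1(data[:hashed]).hexdigest()


def proves_different(a: PartialDigest, b: PartialDigest) -> Optional[bool]:
    """Return True if the partial digests prove the files differ.

    Returns False if the digests match (which is inconclusive, since the
    unhashed tails may still differ), and None if the digests cover
    different byte counts and so cannot be compared at all.
    """
    (len_a, dig_a), (len_b, dig_b) = a, b
    if len_a != len_b:
        return None
    return dig_a != dig_b
```

Without the stored byte count, the `None` case cannot be detected, which is the gap described above for the replay case.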

sahib commented 3 years ago

I can't find the ticket right now, but there were some issues with --write-unfinished anyway (at least in that most users assumed it does what --hash-uniques does), especially since it got implicitly enabled with the xattr feature. I would vote to remove it altogether.

SeeSpotRun commented 3 years ago

OK, I have removed it and also added --hash-unmatched, which is like --hash-uniques but only hashes files that have one or more size twins. This will be much more efficient in most use cases.
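The difference between the two selection rules can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual implementation: the function names and the `only_size_twins` parameter are hypothetical, and BLAKE2b is an arbitrary choice of hash.

```python
import hashlib
import os
from collections import defaultdict
from typing import Dict, Iterable, List


def full_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the whole file in chunks so large files are not loaded at once."""
    h = hashlib.blake2b()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def select_files_to_hash(paths: Iterable[str], only_size_twins: bool) -> List[str]:
    """Pick the files that get a full hash.

    only_size_twins=False mimics the idea of --hash-uniques (hash everything);
    only_size_twins=True mimics the idea of --hash-unmatched (skip files whose
    size is unique, since they cannot have a duplicate).
    """
    by_size: Dict[int, List[str]] = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    selected: List[str] = []
    for group in by_size.values():
        if len(group) > 1 or not only_size_twins:
            selected.extend(group)
    return sorted(selected)


def hash_report(paths: Iterable[str], only_size_twins: bool) -> Dict[str, str]:
    """Map each selected path to its full checksum."""
    return {p: full_digest(p) for p in select_files_to_hash(paths, only_size_twins)}
```

A file whose size is unique is skipped in the second mode without being read, which is where the efficiency gain comes from.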