markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

fdupes support question #101

Closed by clara-j 7 years ago

clara-j commented 8 years ago

I am working on a fork of fdupes to add some features to help with duperemove.

So far I have added the ability to limit the files fdupes considers based on a minimum or maximum size. One other thing I was going to add is the ability to disable the byte-for-byte comparison after a checksum match. This would obviously add the possibility of a false match on a checksum collision (very rare, since file size is checked too). But I was wondering: will duperemove or btrfs itself also validate that the data being deduped is a match?
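To illustrate the trade-off being asked about, here is a minimal Python sketch of the general approach: filter by size, group by size and checksum, and optionally confirm with a byte-for-byte comparison. It is not fdupes code. The function and parameter names (`duplicate_groups`, `min_size`, `max_size`, `verify`) are hypothetical, and SHA-256 is used only as a stand-in checksum.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def split_by_content(paths):
    """Split a candidate group into subgroups whose bytes really match."""
    subgroups = []
    for path in paths:
        for sub in subgroups:
            # shallow=False forces a full content comparison.
            if filecmp.cmp(sub[0], path, shallow=False):
                sub.append(path)
                break
        else:
            subgroups.append([path])
    return subgroups


def duplicate_groups(paths, min_size=0, max_size=None, verify=True):
    """Return sorted groups of paths believed to hold identical data.

    min_size / max_size are bounds in bytes (hypothetical names).
    verify=False trusts size + digest and skips the byte-for-byte pass.
    """
    by_size = defaultdict(list)
    for path in paths:
        size = os.path.getsize(path)
        if size < min_size or (max_size is not None and size > max_size):
            continue  # filtered out before any file data is read
        by_size[size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[file_digest(path)].append(path)
        for candidates in by_digest.values():
            if verify:
                groups.extend(split_by_content(candidates))
            else:
                groups.append(candidates)
    return sorted(sorted(g) for g in groups if len(g) > 1)
```

With `verify=False` every candidate file is read once for hashing and never again, which is where the speed-up would come from. The cost is that a digest collision would be reported as a duplicate unless a later stage checks the data.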

lpirl commented 8 years ago

Did you read about the "Deduplication phase" in the wiki? Duperemove 'just' submits candidates to the kernel. The kernel then does a full comparison (so yes, as far as I understand, the btrfs file system driver does this) and possibly dedupes the candidates.

clara-j commented 8 years ago

Thanks. No, I had missed that page. That is good to hear, since I think skipping the comparison will speed up the fdupes process. I will need to run some tests to verify, but at least it won't impact data integrity.

clara-j commented 8 years ago

I re-opened this issue to see if there is anything else people would like to see changed in fdupes that would make it work better for duperemove.

I have now pushed a version to my fork that has three new arguments:

-b limits files based on minimum size
-B limits files based on maximum size
-e skips the final byte-for-byte verification

When I have some time I am going to see if there is anything else I can do to speed up the process, but I was wondering if there are other features people would like added that I can look into.

noradtux commented 8 years ago

Having an option to make fdupes stay within a single filesystem (not crossing mounts) would be nice too.

markfasheh commented 8 years ago

An --excludes option would be excellent. Oh also, thanks for doing this!

clara-j commented 8 years ago

Having it stay within a given device seems like it will be a fairly easy addition.
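As a rough illustration of what staying within a given device involves, the Python sketch below compares each entry's `st_dev` with that of the starting directory and prunes anything on a different device. This is not the fdupes implementation; `walk_one_filesystem` is a hypothetical name.

```python
import os
import stat


def walk_one_filesystem(root):
    """Yield regular files under root that live on the same device as root."""
    root_dev = os.stat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into
        # directories that sit on another device (for example, mount points).
        dirnames[:] = [
            d for d in dirnames
            if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
        ]
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and st.st_dev == root_dev:
                yield path
```

Pruning at the directory level means nothing below a foreign mount point is ever listed, let alone read.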

For the exclude setting, what behavior are you looking for? It would help if you could point me to another program that behaves in a similar way.

markfasheh commented 8 years ago

I'm thinking of something like rsync, so you can do --exclude= and provide a comma-separated list of paths to skip. If you want, I can give you a specific example.

juliantaylor commented 8 years ago

Couldn't you just use grep for that in this case?

markfasheh commented 8 years ago

Sure, but then fdupes is still reading those files and comparing them, whereas with an --exclude option you wouldn't even touch them.
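The difference between filtering output afterwards and excluding up front can be sketched in Python as follows. This is only an illustration under stated assumptions: the exclude value is taken to be a comma-separated list of paths relative to the scan root, matched exactly rather than as rsync-style patterns, and the names `parse_excludes` and `walk_with_excludes` are hypothetical.

```python
import os


def parse_excludes(value):
    """Turn 'a,b/c' into a set of normalized relative paths."""
    return {os.path.normpath(p.strip()) for p in value.split(",") if p.strip()}


def walk_with_excludes(root, exclude_arg):
    """Yield file paths under root, never entering excluded directories."""
    excluded = parse_excludes(exclude_arg)
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)

        def rel(name):
            # Path of an entry relative to root, e.g. 'cache' or 'a/b.txt'.
            return os.path.normpath(os.path.join(rel_dir, name))

        # Prune excluded directories so their contents are never listed or read.
        dirnames[:] = [d for d in dirnames if rel(d) not in excluded]
        for name in filenames:
            if rel(name) not in excluded:
                yield os.path.join(dirpath, name)
```

For example, `walk_with_excludes("/data", "cache,tmp/build")` would skip `/data/cache` and `/data/tmp/build` entirely, so no time is spent hashing or comparing anything inside them.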

clara-j commented 8 years ago

The xdevice argument should be done by the weekend. I then plan to add the ability to use K, M, or G suffixes on the size limit arguments to make them easier to use.

The exclude option will be more complicated and will take me some time, but yes, I understand how rsync does it and I think I should be able to emulate that. I will probably have to refactor the original code a bit too to get it to work.

PS. Mark, no, thank you for all the great work on this and other BTRFS stuff.

clara-j commented 8 years ago

Pushed the code. There is now an x argument that prevents crossing devices, and I also added the ability to add K, M, or G when specifying the size limits.
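A size argument with a K, M, or G suffix could be parsed along these lines. This Python sketch assumes binary multiples (1K = 1024 bytes), which the thread does not specify, and `parse_size` is a hypothetical name rather than anything taken from the fork.

```python
def parse_size(text):
    """Convert '512', '10K', '2M' or '1G' to a byte count.

    Assumes binary multiples (1K = 1024 bytes); suffixes are case-insensitive.
    """
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = text.strip()
    if not text:
        raise ValueError("empty size")
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)  # no suffix: plain byte count
```

Malformed input such as `"K"` or `"10X"` raises `ValueError` from `int()`, so a caller can report a bad argument instead of silently using a wrong limit.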

noradtux commented 8 years ago

On 18.09.2015 18:00, clara-j wrote:

Pushed the code. There is now an x argument that prevents crossing devices, and I also added the ability to add K, M, or G when specifying the size limits.

Great! Thank you :)