markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

dedup whole files #43

Closed pwr22 closed 9 years ago

pwr22 commented 9 years ago

The readme mentions that hashing is done per 128KB block by default. Is there any way to force hashing and dedup of whole files only?

markfasheh commented 9 years ago

No, but this is something that will happen implicitly anyway as duperemove discovers that the majority of a file is the same.

pwr22 commented 9 years ago

There are two issues still worth thinking about

There is another project, bedup, but it's unmaintained and currently broken for new installs.

markfasheh commented 9 years ago

Yes, though if it's a big deal you can mitigate this somewhat by using a smaller blocksize.

Ok, so duperemove is a lot more concerned with individual block dedupe than you want. Internally we're classifying duplicates on a block-by-block basis; optionally switching to whole files is a possibility, but it goes against the grain of everything else we're doing in duperemove, so it could get a bit messy adding it in there. Well, we could probably hack it by saying "only dedupe extents which cover the entire file".

A couple questions:

dioni21 commented 9 years ago

My answers to those questions, although I am not the original requester:

  1. Speed?
  2. Most distros have fdupes: http://en.wikipedia.org/wiki/Fdupes. Maybe an option to receive input from fdupes and call ioctls to dedup whole files?
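
For context on the ioctl suggestion: the kernel's dedupe interface (the btrfs-specific BTRFS_IOC_FILE_EXTENT_SAME at the time, and the generic FIDEDUPERANGE from linux/fs.h on Linux 4.5+) verifies that the given ranges are byte-identical before sharing their extents. Below is a minimal sketch of whole-file dedupe built directly on that ioctl; it is illustrative only, not duperemove's code, and assumes the newer FIDEDUPERANGE interface. Some kernels cap the bytes handled per call (bytes_deduped reports the actual amount), so very large files may need a loop.

```c
/*
 * Illustrative sketch, not duperemove's implementation: dedupe one
 * whole file against a known-identical file via FIDEDUPERANGE.
 * The kernel rejects the request if the ranges are not identical.
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dest>\n", argv[0]);
        return 1;
    }

    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_RDWR);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    struct stat st;
    if (fstat(src, &st) < 0) {
        perror("fstat");
        return 1;
    }

    /* One source range covering the entire file, one destination file. */
    struct file_dedupe_range *range =
        calloc(1, sizeof(*range) + sizeof(struct file_dedupe_range_info));
    range->src_offset = 0;
    range->src_length = st.st_size;
    range->dest_count = 1;
    range->info[0].dest_fd = dst;
    range->info[0].dest_offset = 0;

    if (ioctl(src, FIDEDUPERANGE, range) < 0) {
        perror("FIDEDUPERANGE");
        return 1;
    }

    if (range->info[0].status == FILE_DEDUPE_RANGE_SAME)
        printf("deduped %llu bytes\n",
               (unsigned long long)range->info[0].bytes_deduped);
    else
        printf("contents differ, nothing deduped\n");

    free(range);
    close(src);
    close(dst);
    return 0;
}
```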

pwr22 commented 9 years ago
  1. Speed is a major concern, as is memory usage for storing the results of hashing large amounts of data, and my particular use case often has many duplicate files being written. I'm running lxc containers on top of btrfs.
  2. Was going to say the same thing as @dioni21

I've not checked how often duplicate blocks are found in identical vs non-identical files, but a cursory glance at the output makes me think most of them (for me) are in the identical ones.

Originally identical files which have since diverged could well benefit from the incremental approach, and I guess that was one of the motivations behind it? Though unfortunately only approximations of large duplicate regions are going to end up matched, due to block alignment and such.

I tried running with a 4k block size but had to stop before hashing completed because I was at 14/16 GB of memory. After running across my system at the default values, about 600MB of RAM was eaten due to the bug I've seen mentioned.

markfasheh commented 9 years ago

So if we want speed we can't hack this into the current extent search / checksum stage - that code is intentionally chopping up the checksums into blocks.

Taking our input from fdupes is doable, though it would be a special mode where we skip the file scan and checksum stages and proceed directly to dedupe. The downside would be that most other features of duperemove aren't available in this mode (but that doesn't seem to be a big deal for your use case).

I should add, the one thing I want to avoid is having duperemove do yet another kind of file checksum.
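
To make the proposed fdupes mode concrete, here is a minimal hypothetical sketch (not duperemove's implementation) of consuming fdupes' default output, where each set of duplicate paths is printed one per line and sets are separated by a blank line. dedupe_group() below is a placeholder for the step that would issue the whole-file dedupe, e.g. the ioctl sketched earlier; no checksumming of our own is done.

```c
/*
 * Hypothetical sketch: read duplicate sets from fdupes on stdin and
 * hand each complete set off for whole-file dedupe.
 */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_GROUP 1024

/* Placeholder: dedupe every file in the group against the first one. */
static void dedupe_group(char **files, int count)
{
    if (count < 2)
        return;
    for (int i = 1; i < count; i++)
        printf("would dedupe %s -> %s\n", files[i], files[0]);
}

int main(void)
{
    char *group[MAX_GROUP];
    int count = 0;
    char line[4096];

    while (fgets(line, sizeof(line), stdin)) {
        line[strcspn(line, "\n")] = '\0';

        if (line[0] == '\0') {
            /* Blank line: the current duplicate set is complete. */
            dedupe_group(group, count);
            for (int i = 0; i < count; i++)
                free(group[i]);
            count = 0;
            continue;
        }

        if (count < MAX_GROUP)
            group[count++] = strdup(line);
    }

    /* Final set if the input did not end with a blank line. */
    dedupe_group(group, count);
    for (int i = 0; i < count; i++)
        free(group[i]);

    return 0;
}
```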

pwr22 commented 9 years ago

Perhaps it would be possible to build a separate binary called "duperemovefiles" which takes a list of known-to-be-identical files? Hopefully this would be able to share some deduping code without messing with what "duperemove" does, which as you say is to concern itself with blocks. This would keep the checksumming completely separate, or it could be dropped from "duperemovefiles" entirely.

markfasheh commented 9 years ago

Either way (different binary, or special mode of duperemove) it's the same amount of work. Taking a file list from fdupes for whole-file dedupe seems useful; I've added it to the list of development tasks:

https://github.com/markfasheh/duperemove/wiki/Development-Tasks

Let me know if you agree/disagree with that writeup, otherwise I'll close this issue for now.

pwr22 commented 9 years ago

Looks good to me

markfasheh commented 9 years ago

FYI we now have an --fdupes option which will take output from fdupes and run whole-file dedupe on the results. Please give it a shot (example: `fdupes . | duperemove --fdupes`) and file any bugs you might find. Thanks again for the suggestion!

pwr22 commented 8 years ago

Sorry for not getting back to you sooner, just wanted to say thanks so much for implementing this. I'm trying it now.

tlhonmey commented 8 years ago

Thanks for adding this. It gives me a way to dedupe large quantities of small files safely without having to set the block size awfully low.

Floyddotnet commented 8 years ago

Is there any option to skip files that are already deduped? If you run `fdupes . | duperemove --fdupes` twice, the second run still spends a lot of time trying to dedupe files that were already deduped in the previous run.