markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0
816 stars, 81 forks

Shift resistant block partitioning #110

Closed: the8472 closed this issue 8 years ago

the8472 commented 8 years ago

Instead of using fixed-size blocks, it might make sense to use a data-dependent, variable-size chunking algorithm that makes block boundaries resistant to data shifts. That way, files that have been concatenated or rewritten to modify their headers could still be deduped.

https://en.wikipedia.org/wiki/Rolling_hash#Content_based_slicing_using_Rabin-Karp_hash
https://en.wikipedia.org/wiki/Rabin_fingerprint
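
To make the suggestion concrete, here is a rough sketch of content-defined chunking with a Rabin-Karp style rolling hash. It is not duperemove code. The function names (`split_chunks`, `split_fixed`, `shared_chunks`) and every constant (window size, mask, minimum and maximum chunk size) are assumptions chosen for illustration. It uses a plain polynomial hash, not a true Rabin fingerprint.

```python
import hashlib
import random
from typing import List

# All constants are illustrative assumptions, not duperemove settings.
WINDOW = 48             # bytes covered by the rolling hash
BASE = 257              # polynomial base
MOD = (1 << 61) - 1     # large prime modulus
MASK = (1 << 11) - 1    # cut when the low 11 bits are zero (about 2 KiB average)
MIN_CHUNK = 256         # suppress very small chunks
MAX_CHUNK = 16384       # force a cut if no boundary shows up


def split_chunks(data: bytes) -> List[bytes]:
    """Split data at positions chosen by a rolling hash of the last WINDOW bytes."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that slides out of the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        length = i + 1 - start
        at_boundary = i + 1 >= WINDOW and (h & MASK) == 0
        if length >= MAX_CHUNK or (length >= MIN_CHUNK and at_boundary):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def split_fixed(data: bytes, size: int = 4096) -> List[bytes]:
    """Fixed-size blocks, for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_chunks(a: List[bytes], b: List[bytes]) -> int:
    """Count chunks of b whose content also occurs in a."""
    seen = {hashlib.sha256(c).digest() for c in a}
    return sum(hashlib.sha256(c).digest() in seen for c in b)


if __name__ == "__main__":
    rng = random.Random(0)
    original = bytes(rng.getrandbits(8) for _ in range(128 * 1024))
    shifted = b"HEADER!" + original  # same payload, shifted by 7 bytes

    cdc_a, cdc_b = split_chunks(original), split_chunks(shifted)
    print("content-defined:", shared_chunks(cdc_a, cdc_b), "of", len(cdc_b), "chunks shared")

    fix_a, fix_b = split_fixed(original), split_fixed(shifted)
    print("fixed-size:     ", shared_chunks(fix_a, fix_b), "of", len(fix_b), "blocks shared")
```

Because each cut depends only on the last `WINDOW` bytes, the boundaries should line up again shortly after the inserted header, whereas every fixed-size block changes. Note that the cut points land at arbitrary byte offsets, not at filesystem block boundaries.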

CyberShadow commented 8 years ago

Sorry, mostly curious - but given that btrfs supports only block-based deduplication, what use would such a feature have?

the8472 commented 8 years ago

My understanding was that files can consist of multiple extents that aren't necessarily aligned to block boundaries. If that's wrong, this would indeed be irrelevant.

CyberShadow commented 8 years ago

Fairly sure btrfs deduplication is restricted to block boundaries, yeah.