the8472 closed this 8 years ago
Sorry, mostly curious - but given that btrfs supports only block-based deduplication, what use would such a feature have?
My understanding was that files can consist of multiple extents that aren't necessarily aligned to block boundaries. If that's wrong, this would indeed be irrelevant.
Fairly sure btrfs deduplication is restricted to block boundaries, yeah.
Instead of using fixed-size blocks, it might make sense to use a data-dependent, variable-size chunking algorithm whose chunk boundaries are resistant to data shifts. That way, files that have been concatenated or rewritten to modify their headers could still be deduped.
https://en.wikipedia.org/wiki/Rolling_hash#Content_based_slicing_using_Rabin-Karp_hash
https://en.wikipedia.org/wiki/Rabin_fingerprint
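To make the idea concrete, here is a rough sketch of content-defined chunking with a Rabin-Karp style polynomial rolling hash. It is only an illustration in user space: the function names and all constants (`WINDOW`, `MASK`, `MIN_CHUNK`, `MAX_CHUNK`) are my own assumptions, not anything from an existing tool, and it says nothing about whether btrfs could actually share the resulting chunks.

```python
import hashlib

# All constants below are illustrative assumptions, not tuned values.
WINDOW = 48             # number of trailing bytes the rolling hash covers
BASE = 257              # polynomial base
MOD = (1 << 61) - 1     # large prime modulus
MASK = (1 << 11) - 1    # cut when the low 11 bits are zero (~2 KiB between cuts)
MIN_CHUNK = 256         # never cut a chunk shorter than this
MAX_CHUNK = 16384       # always cut once a chunk reaches this length


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks of `data`."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    start = 0
    h = 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that falls out of the window.
            h = (h - data[i - WINDOW] * top) % MOD
        # Shift the window and add the new byte.
        h = (h * BASE + b) % MOD
        length = i + 1 - start
        # The hash depends only on the last WINDOW bytes, so the same content
        # proposes the same cut points regardless of its offset in the file.
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)  # trailing partial chunk


def chunk_digests(data: bytes) -> set:
    """Return the set of SHA-256 digests of the chunks of `data`."""
    return {hashlib.sha256(data[s:e]).digest() for s, e in chunk_boundaries(data)}
```

Comparing `chunk_digests(original)` with `chunk_digests(header + original)` should show most digests in common once the chunker has resynchronized after the inserted header, whereas slicing both inputs into fixed 4096-byte blocks would typically leave no identical blocks at all.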