markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

Rolling window hash algorithm supported? #185

Open HaleTom opened 7 years ago

HaleTom commented 7 years ago

Are any rolling hash algorithms supported (e.g. like the one used in rsync)?

E.g., if I have two files A and B which are identical except that one byte is inserted at the beginning of B, will the hashing algorithm still find that the majority of the data is the same?
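
For context, here is a minimal sketch of the kind of rolling checksum rsync uses (a simplified Adler-style weak sum, written in C purely for illustration; it is not rsync's exact algorithm and not something duperemove implements):

/* Simplified rsync-style rolling checksum over a fixed window.
 * Illustrative sketch of the concept the question asks about. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WINDOW 16 /* tiny window for the example; rsync uses the block size */

static uint32_t weak_sum(const uint8_t *p, size_t len, uint32_t *a, uint32_t *b)
{
    *a = *b = 0;
    for (size_t i = 0; i < len; i++) {
        *a += p[i];
        *b += (uint32_t)(len - i) * p[i];
    }
    return (*a & 0xffff) | (*b << 16);
}

/* Slide the window one byte: drop `out`, append `in`, in O(1). */
static uint32_t roll(uint32_t *a, uint32_t *b, uint8_t out, uint8_t in)
{
    *a = *a - out + in;
    *b = *b - (uint32_t)WINDOW * out + *a;
    return (*a & 0xffff) | (*b << 16);
}

int main(void)
{
    const uint8_t *data = (const uint8_t *)"xhello world, hello dedup!";
    uint32_t a, b;

    /* Checksum of the first window, then roll forward one byte at a time. */
    uint32_t sum = weak_sum(data, WINDOW, &a, &b);
    printf("offset 0: %08x\n", sum);
    for (size_t i = 1; i + WINDOW <= strlen((const char *)data); i++) {
        sum = roll(&a, &b, data[i - 1], data[i - 1 + WINDOW]);
        printf("offset %zu: %08x\n", i, sum);
    }
    return 0;
}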

nefelim4ag commented 7 years ago

Nope, AFAIK duperemove just calculates hashes for fixed-size block chunks, and btrfs will not allow deduplication of unaligned data.

Ferroin commented 7 years ago

It could still in theory handle a rolling hash that operates at the FS block size (16k by default), but I don't see that being very useful to most people, considering that it would increase the hashing time enormously.

mgorny commented 7 years ago

@Ferroin, what advantage would such a hash have over hashing the blocks separately?

Ferroin commented 7 years ago

As far as I know, the current implementation hashes blocks of the specified size (I think it's 1M by default, but in all the places I use it I explicitly set the block size, so I don't remember the default), and it doesn't overlap any blocks.

So, if you have two 16M files and use a 1M hash block size, duperemove will hash 32 blocks total. In that case, if there is a 1M region starting at 512k into each file and extending to 1536k that happens to be identical, it won't get deduplicated, because the rest of the two blocks it straddles in each file (0-512k and 1536-2048k) doesn't match.

If instead you were to hash every 1M-long range in the file at each possible filesystem block offset, you would end up deduplicating that region, but it would require a total of 1966082 blocks being hashed (983041 for each file), which means the worst case would take a little over 61 thousand times longer to do the hashing, and that's not even accounting for the matching phase, which takes just as long as the hashing if not longer. Even with optimizations, this would still take significantly longer than the current implementation, and it would not benefit most people.
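
To put concrete numbers on that comparison, a quick back-of-the-envelope sketch (the 16-byte step is an assumption chosen because it reproduces the totals quoted above; a coarser step would shrink the count but not change the conclusion):

/* Non-overlapping blocks vs. overlapping sliding windows for two 16M files. */
#include <stdio.h>

int main(void)
{
    const unsigned long long file_size = 16ULL << 20; /* each of the two 16M files */
    const unsigned long long window    = 1ULL << 20;  /* 1M hash block */
    const unsigned long long step      = 16;          /* assumed offset granularity */

    unsigned long long fixed   = 2 * (file_size / window);              /* non-overlapping blocks */
    unsigned long long sliding = 2 * ((file_size - window) / step + 1); /* overlapping windows */

    printf("fixed blocks:    %llu\n", fixed);                     /* 32 */
    printf("sliding windows: %llu\n", sliding);                   /* 1966082 */
    printf("slowdown factor: %.1fx\n", (double)sliding / fixed);  /* ~61440x */
    return 0;
}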

This is a large part of the reason why I regularly advocate that people write their own deduplication tools to handle their specific data sets: quite often there is some structure to a file beyond just the raw data, and working around that structure can lead to savings that wouldn't otherwise be possible.

nefelim4ag commented 7 years ago

For that case it's much simpler to just hash fs-sector-size blocks; yes, that will create many hashes, but it will find ALL possible duplicated blocks. (duperemove uses 128k by default because of the max compressed extent size.)

Ferroin commented 7 years ago

Except it may not behave as you think. I'm pretty sure that BTRFS (at least) won't merge multiple sequential reflinks pointing to a sequential range elsewhere into a single reflink, which means you'll have performance issues doing that with any reasonably sized file. That would in turn add even more processing to duperemove (trying to coalesce duplicate blocks into extent ranges that are as large as possible).
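
As an illustration of what such a coalescing step could look like, here is a small sketch (the match structure and 4k block size are assumptions for the example, not duperemove's actual internals):

/* Coalesce per-block matches into larger ranges before issuing dedupe
 * requests.  Types and block size are illustrative assumptions. */
#include <stdio.h>

#define BLOCK 4096ULL

struct match { unsigned long long src_off, dst_off; };

/* Merge consecutive block matches whose offsets both advance by BLOCK. */
static void coalesce(const struct match *m, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned long long len = BLOCK;
        size_t j = i + 1;
        while (j < n &&
               m[j].src_off == m[i].src_off + len &&
               m[j].dst_off == m[i].dst_off + len) {
            len += BLOCK;
            j++;
        }
        printf("dedupe src@%llu dst@%llu len=%llu\n",
               m[i].src_off, m[i].dst_off, len);
        i = j;
    }
}

int main(void)
{
    /* Three adjacent duplicate blocks followed by one isolated block. */
    struct match m[] = {
        { 0, 8192 }, { 4096, 12288 }, { 8192, 16384 }, { 65536, 131072 },
    };
    coalesce(m, sizeof(m) / sizeof(m[0]));
    return 0;
}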

nefelim4ag commented 7 years ago

@Ferroin, yes, but the old behaviour of duperemove was to merge that logically in userspace =), until @markfasheh decided to disable it by default. And maybe btrfs should recognize several sequential extents and try to merge them by itself.

➜  ~ dd if=/dev/zero of=./zero bs=4k count=128 oflag=sync
128+0 records in
128+0 records out
524288 bytes (524 kB, 512 KiB) copied, 0.278872 s, 1.9 MB/s
➜  ~ filefrag -v zero
Filesystem type is: 9123683e
File size of zero is 524288 (128 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    4095:          0..      4095:   4096:             not_aligned,inline
   1:        1..       8:   99039716..  99039723:      8:          1:
   2:        9..      26:   99039885..  99039902:     18:   99039724:
   3:       27..      49:   99040064..  99040086:     23:   99039903:
   4:       50..      66:   99040299..  99040315:     17:   99040087:
   5:       67..      94:   99040337..  99040364:     28:   99040316:
   6:       95..     102:   99040441..  99040448:      8:   99040365:
   7:      103..     111:   99040802..  99040810:      9:   99040449:
   8:      112..     125:   99040813..  99040826:     14:   99040811:
   9:      126..     127:   99040873..  99040874:      2:   99040827: last,eof
zero: 10 extents found
➜  ~ dd if=/dev/zero of=./zero bs=4k count=128;sync      
128+0 records in
128+0 records out
524288 bytes (524 kB, 512 KiB) copied, 0.0012163 s, 431 MB/s
➜  ~ filefrag -v zero                              
Filesystem type is: 9123683e
File size of zero is 524288 (128 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:   99040982..  99041013:     32:             encoded
   1:       32..      63:   99040983..  99041014:     32:   99041014: encoded
   2:       64..      95:   99040984..  99041015:     32:   99041015: encoded
   3:       96..     127:   99040985..  99041016:     32:   99041016: last,encoded,eof

So btrfs does do that for buffered writes; do I understand it all correctly?

Ferroin commented 7 years ago

Doing it for buffered writes isn't the same as doing it for reflinks, and that behavior is implicit in the fact that the writes are buffered (it's kind of a defining aspect of filesystems that use extents). The EXTENT_SAME ioctl (which is what's usually used for deduplication) triggers a byte-wise comparison in the kernel (which is why hash collisions don't matter much), and then updates the secondary regions to be reflinks to the first. I don't think it tries to merge reflinks at all, and I know for a fact defrag won't merge them (it will in fact break the reflinks and force a CoW of the files).
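
For reference, a minimal sketch of a single request through that interface, using FIDEDUPERANGE (the generic name the btrfs EXTENT_SAME ioctl later got when it was lifted into the VFS); the file names, offsets, and length below are placeholders:

/* Submit one dedupe request with the FIDEDUPERANGE ioctl.
 * Error handling is reduced to the basics for brevity. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int src = open("file_a", O_RDONLY);
    int dst = open("file_b", O_RDWR);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    /* One request header plus one destination range. */
    struct file_dedupe_range *req =
        calloc(1, sizeof(*req) + sizeof(struct file_dedupe_range_info));
    req->src_offset = 0;
    req->src_length = 128 * 1024;   /* e.g. one 128k duperemove block */
    req->dest_count = 1;
    req->info[0].dest_fd = dst;
    req->info[0].dest_offset = 0;

    /* The kernel does a byte-wise comparison of the two ranges; only if
     * they really are identical does it share the extents. */
    if (ioctl(src, FIDEDUPERANGE, req) < 0) {
        perror("FIDEDUPERANGE");
        return 1;
    }
    if (req->info[0].status == FILE_DEDUPE_RANGE_SAME)
        printf("deduped %llu bytes\n",
               (unsigned long long)req->info[0].bytes_deduped);
    else
        printf("ranges differ or dedupe skipped (status %d)\n",
               req->info[0].status);

    free(req);
    close(src);
    close(dst);
    return 0;
}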