markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0
816 stars 81 forks

[RFC] Just block deduplication #107

Closed nefelim4ag closed 8 years ago

nefelim4ag commented 9 years ago

It's not the 'proper' way, but since the computing phase is so slow, you could add a "stupid mode" for plain block deduplication, without extent optimization.

JackSlateur commented 9 years ago

Well, the truth is that block dedup can be far slower than extent dedup. I ran some tests: two identical files, 4k blocks, very slow deduplication, because we issue an ioctl for each block, one by one.

The patch to disable extent optimization is pretty simple (a break in walk_dupe_block() and disabling remove_overlapping_extents() in main()), if I am not fooling myself.
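For illustration, here is a minimal sketch of what that block-by-block submission looks like at the ioctl level. It is not duperemove's actual code: it assumes the generic FIDEDUPERANGE ioctl from linux/fs.h (the successor to the BTRFS_IOC_FILE_EXTENT_SAME call duperemove used at the time), hard-codes 4 KiB blocks, and keeps error handling minimal. The point is that 4k granularity means one system-call round trip per block.

```c
/*
 * Illustrative sketch only, not duperemove code: dedupe two identical
 * files block by block, issuing one FIDEDUPERANGE ioctl per 4 KiB block.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>

#define BLOCK_SIZE 4096

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dest>\n", argv[0]);
		return 1;
	}

	int src = open(argv[1], O_RDONLY);
	int dst = open(argv[2], O_RDWR);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	struct stat st;
	fstat(src, &st);

	/* One destination range per call; the kernel reports per-range status. */
	struct file_dedupe_range *range =
		calloc(1, sizeof(*range) + sizeof(range->info[0]));
	range->dest_count = 1;
	range->info[0].dest_fd = dst;

	/* One system call per 4 KiB block: this per-block round trip is
	 * exactly the cost being discussed above. */
	for (off_t off = 0; off + BLOCK_SIZE <= st.st_size; off += BLOCK_SIZE) {
		range->src_offset = off;
		range->src_length = BLOCK_SIZE;
		range->info[0].dest_offset = off;
		range->info[0].bytes_deduped = 0;
		range->info[0].status = 0;

		if (ioctl(src, FIDEDUPERANGE, range) < 0) {
			perror("FIDEDUPERANGE");
			break;
		}
	}

	free(range);
	return 0;
}
```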

nefelim4ag commented 9 years ago

@clobrother, try deduping more data, e.g.: truncate -s 4G ./{1..3}, both with optimization and block by block. The compute time for the optimization is too big. I.e. yes, it's not pretty, but I'm just trying to find a workaround for the long optimization phase:

Hashing completed. Calculating duplicate extents - this may take some time. [% ]

And then I have several TB of duplicated data and I just want to dedupe it; the optimization means nothing to me. In the best case, with simple block-by-block deduplication, it could happen while scanning continues in another thread.

Ferroin commented 9 years ago

Running dedupe on files created with truncate is not in any way a good method to show how fast anything is. Truncate does not allocate any space on disk at all, so scanning the extents is really fast in kernel space because the kernel sees that there are no actual on-disk blocks.

If you really want to test this, you will need to use at least fallocate to create the files, not truncate.
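To make that concrete, here is a small standalone sketch (file names and size are arbitrary) that creates one file with ftruncate() and one with posix_fallocate(), then compares st_blocks. Only the fallocated file actually has on-disk blocks for a dedupe benchmark to work against.

```c
/*
 * Sketch: a truncated file is a hole with no allocated blocks, while
 * fallocate really reserves space on disk.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

static void report(const char *path)
{
	struct stat st;
	if (stat(path, &st) == 0)
		printf("%s: size=%lld bytes, allocated=%lld bytes\n",
		       path, (long long)st.st_size,
		       (long long)st.st_blocks * 512LL);
}

int main(void)
{
	const long long size = 1LL << 30; /* 1 GiB logical size */

	int a = open("truncated.img", O_CREAT | O_RDWR, 0644);
	ftruncate(a, size);          /* sparse: no data extents on disk */
	close(a);

	int b = open("fallocated.img", O_CREAT | O_RDWR, 0644);
	posix_fallocate(b, 0, size); /* space actually reserved on disk */
	close(b);

	report("truncated.img");     /* allocated ~0 */
	report("fallocated.img");    /* allocated ~1 GiB */
	return 0;
}
```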

nefelim4ag commented 9 years ago

@Ferroin, the problem is not read speed; the problem is the algorithm, which computes "optimal" extents before deduplication and therefore can't dedupe a 1 GB zeroed file down to a single 4k block. It just dedupes one 4k extent against another 4k extent in the 1 GB zeroed file, i.e. the profit is +4k of free space instead of +1 GB.
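As a hedged sketch of the end state being argued for (again using the generic FIDEDUPERANGE interface, not duperemove's internals): the ioctl accepts many destination ranges per call, so every 4k block of a zero-filled file can be pointed back at its own first block, leaving roughly one physical block allocated. The batch size and the assumption that same-file, non-overlapping ranges are accepted are illustrative.

```c
/*
 * Sketch only: collapse a zero-filled file onto its own first 4 KiB block
 * by deduping every other block against block 0, batching many
 * destination ranges into each FIDEDUPERANGE call.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>

#define BLOCK 4096
#define BATCH 127 /* destinations per ioctl; an arbitrary batch size */

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct stat st;
	fstat(fd, &st);

	struct file_dedupe_range *r =
		calloc(1, sizeof(*r) + BATCH * sizeof(r->info[0]));

	/* The source is always block 0; destinations are all later blocks. */
	off_t off = BLOCK;
	while (off + BLOCK <= st.st_size) {
		unsigned short n = 0;
		while (n < BATCH && off + BLOCK <= st.st_size) {
			r->info[n].dest_fd = fd;
			r->info[n].dest_offset = off;
			r->info[n].bytes_deduped = 0;
			r->info[n].status = 0;
			n++;
			off += BLOCK;
		}
		r->src_offset = 0;
		r->src_length = BLOCK;
		r->dest_count = n;

		if (ioctl(fd, FIDEDUPERANGE, r) < 0) {
			perror("FIDEDUPERANGE");
			break;
		}
	}

	free(r);
	return 0;
}
```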

Ferroin commented 9 years ago

Except that if you just use truncate, there are no blocks to de-duplicate. Truncate allocates nothing, which means you're not getting a good estimate of how much difference this actually makes for any real-world usage. I'm not arguing that this shouldn't be added; I'm arguing that you need a realistic benchmark to demonstrate its benefit, although I do think this is very much a niche feature.

Also, keep in mind that BTRFS currently uses 16k blocks by default except on small file systems (I think the threshold is 16G), and in some cases uses even larger blocks.

markfasheh commented 8 years ago

OK, so your problem is that the find-extents phase takes too long. Indeed, it is very CPU intensive. I've done quite a few passes at optimizing it, but as you point out, there may still be cases where it takes up too much time.

Check out the following commits (at least) for some history on my work with that code:

dad9711 Large hash buckets (in the 10's of thousands range) are costly to walk in find_all_dups.
5497f7e Keep file blocks on per blocklist/filerec list

So there was some temporary code that tried to categorize larger buckets and run them in 'block only' mode. The commit removing that code (5497f7e) describes why I didn't like it. I thought I had made it perform well enough to no longer need such a hack.

I'm not opposed to a patch that adds an option to skip the optimization phase. We can always remove it before the next major release if it winds up no longer being needed. We need to document it in a way that helps the user understand when that option might help them, so please include that documentation along with the patch.

nefelim4ag commented 8 years ago

It's not the 'proper' way, but since the computing phase is so slow

So I will close this request, because the "slow" problem is solved in the current trunk (https://github.com/markfasheh/duperemove/commit/20348c4064e9c00e37c9c944e869661c6f3c1f7b). Thanks @markfasheh!

Block deduplication: https://github.com/markfasheh/duperemove/commit/f242479a950709e61ebce4957bf8546e137b09fd