markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

Documentation Enhancement: Large file deduplication examples in man page #278

Closed farblos closed 11 months ago

farblos commented 2 years ago

I have been struggling with the deduplication of large files on btrfs (see also #276). My requirement was to dedupe as much as possible, not caring about defragmentation at all. Using fdupes and some du arithmetic I determined the maximum possible deduplication and tried to reach that with duperemove.
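
A rough sketch of what I mean by "du arithmetic" (the /data path is just a placeholder; this keeps one file per duplicate group reported by fdupes and sums the on-disk size of the remaining copies, so it only gives an upper bound if some extents are already shared):

    # rough upper bound on reclaimable space: keep the first file of every
    # duplicate group that fdupes reports and sum the sizes of all the others
    fdupes -r /data |
    awk 'BEGIN { keep = 1 }
         NF == 0 { keep = 1; next }   # a blank line separates duplicate groups
         keep    { keep = 0; next }   # do not count the first file of each group
         { print }                    # every further copy could be reclaimed
        ' |
    tr '\n' '\0' |
    du -ch --files0-from=- | tail -n 1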

My findings so far:

  • Regular extent-based deduplication (duperemove -d -r -h -v --hashfile ...) deduplicates rather poorly and leaves a lot of content un-deduplicated. However, thanks to the hash file, a subsequent run on the same data completes quickly.
  • Deduplication in fdupes mode (fdupes ... | duperemove -h -v --fdupes) comes close to the maximum possible deduplication. However, a subsequent run takes as long as the first one, since already deduplicated files do not seem to be skipped (see fdupes improvement #160).
  • Finally, there is block-based deduplication, if that is the right term (duperemove --lookup-extents=no -d -r -h -v --hashfile ...). That seems to combine the advantages of the former two approaches: it deduplicates close to the possible maximum and a second run completes quickly.

However, I haven't found any hints on that final approach in the documentation, only the mention of option --lookup-extents=no in issue #276. So here are my questions and proposals:

  • Does --lookup-extents=no have any disadvantages I have not yet discovered?
  • The man page mentions default value no for --lookup-extents, which is not correct.
  • Maybe you could add an example using --lookup-extents=no to the man page.

Thanks!

lorddoskias commented 2 years ago

I have been struggling with the deduplication of large files on btrfs (see also #276). My requirement was to dedupe as much as possible, not caring about defragmentation at all. Using fdupes and some du arithmetic I determined the maximum possible deduplication and tried to reach that with duperemove.

My findings so far:

  • Regular extent-based deduplication (duperemove -d -r -h -v --hashfile ...) deduplicates rather poorly and leaves a lot of content un-deduplicated. However, thanks to the hash file, a subsequent run on the same data completes quickly.

This means you have files with identical content but a different on-disk layout, i.e. their extents differ, so dedupe cannot happen. This behavior is described in the FAQ section of the man page.
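
You can see this for yourself with filefrag from e2fsprogs (the file names below are just examples). If the extent counts and boundaries of two files with identical content differ, the extent-based mode treats them as different:

    # compare the extent maps of two files with identical content
    filefrag -v /data/copy-a.img
    filefrag -v /data/copy-b.img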

  • Deduplication in fdupes mode (fdupes ... | duperemove -h -v --fdupes) comes close to the maximum possible deduplication. However, a subsequent run takes as long as the first one, since already deduplicated files do not seem to be skipped (see fdupes improvement #160).

This is the case because in fdupes mode duperemove simply takes the list of files to dedupe from stdin, i.e. whatever fdupes outputs. However, it does not create a local database of the hashes of the files' blocks/extents which could subsequently be used to skip files that haven't changed since the last time they were deduped. So running in fdupes mode really boils down to always going through every file and scanning it.
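
In other words, a typical invocation looks something like this (the /data path is a placeholder), and every run re-reads and re-hashes the complete file list:

    # fdupes finds whole-file duplicates, duperemove submits them for dedupe;
    # no hashfile is written, so the next run starts from scratch
    fdupes -r /data | duperemove -h -v --fdupes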

  • Finally, there is block-based deduplication, if that is the right term (duperemove --lookup-extents=no -d -r -h -v --hashfile ...). That seems to combine the advantages of the former two approaches: it deduplicates close to the possible maximum and a second run completes quickly.

However, I haven't found any hints on that final approach in the documentation, only the mention of option --lookup-extents=no in issue #276. So here are my questions and proposals:

There is no "final approach". Duperemove currently offers 3 distinct modes of operation, each with its own strengths and weaknesses. I admit that the extent-based approach, which is indeed the current "default", doesn't work for the majority of people due to differing extent layouts, and I was even considering deprecating or removing it completely, despite it being considered "newer" than the block-based one.
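
To recap, the three modes roughly correspond to invocations like these (paths and hashfile names are placeholders; the -v flag from the examples above is omitted):

    # 1. extent-based dedupe (the current default), results cached in a hashfile
    duperemove -d -r -h --hashfile=/var/tmp/extents.db /data

    # 2. fdupes mode: whole-file dedupe of the list read from stdin, no hashfile
    fdupes -r /data | duperemove -h --fdupes

    # 3. block-based dedupe: hash fixed-size blocks instead of extents
    duperemove --lookup-extents=no -d -r -h --hashfile=/var/tmp/blocks.db /data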

  • Does --lookup-extents=no have any disadvantages I have not yet discovered?

The disadvantage is that you need to hold metadata for every block of a file. For a 10 GiB file, if your block size is 128 KiB (this is configurable), you'd have to store metadata for 81920 blocks, which of course takes additional space on disk. Naturally, with more blocks the operation of finding duplicates is going to be slower, because there is more data to process.
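
The arithmetic behind that number, plus the -b option that trades metadata size against matching granularity, in a quick sketch (the 1M value is only an example; which block sizes are accepted depends on the duperemove version):

    # 10 GiB at a 128 KiB block size:
    echo $(( 10 * 1024 * 1024 * 1024 / (128 * 1024) ))    # -> 81920 blocks

    # fewer, larger blocks mean a smaller hashfile and less work,
    # but smaller duplicated regions may go undetected
    duperemove -b 1M --lookup-extents=no -d -r -h --hashfile=/var/tmp/blocks.db /data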

  • The man page mentions default value no for --lookup-extents, which is not correct.
  • Maybe you could add an example using --lookup-extents=no to the man page.

Thanks!

farblos commented 2 years ago

Thanks for the explanation. I might try converting your replies (and my questions, as far as needed) into a PR for the man page.

The main obstacle here is that the man page currently uses "extent" quite frequently as the main unit of deduplication:

\fBduperemove\fR is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it will

Any objections if I use the neutral word "region" for that, like this:

\fBduperemove\fR is a simple tool for finding duplicated regions and
submitting them for deduplication. When given a list of files it will

and somewhere explain that a region can be an extent or a block?

JackSlateur commented 11 months ago

Hello @farblos, thank you for your suggestion. The documentation has been updated accordingly.

As an extra note, some behavior has changed since this issue was opened. Feel free to check the latest version!