markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

Cannot dedupe tiny files. #108

Closed tlhonmey closed 7 years ago

tlhonmey commented 8 years ago

Currently, even reading from fdupes, files smaller than the defined blocksize are skipped. Is the kernel incapable of deduplicating tiny files? Or is this just an arbitrary leftover from duperemove's primary algorithm that should be turned off for fdupes mode?

Note: I tried reducing the MIN_BLOCKSIZE to 1, and testing with decreasing block sizes.

It grabs a bunch of them, but there comes a point where I get: [0x8065000] Dedupe for file "testfile1" had status (0) "[unknown status]".

It doesn't seem to cause any corruption, but it doesn't seem to deduplicate anything either.

I'm guessing that files small enough to get stored in-line can't be deduped. I don't know if that size threshold could be detected and set as the minimum, but that might be the best thing to do for fdupes mode if it can be. Either that or just a prettier error message.

notslang commented 8 years ago

I've noticed the same issue. I'm working with about 1TB of highly duplicated JS source files, so being able to dedupe them would be really nice. Also, when I pass a directory that doesn't have any big files in it, duperemove prints out the usage page:

$ duperemove -rhd /mnt/accord
duperemove v0.10
Find duplicate extents and print them to stdout

Usage: duperemove [-r] [-d] [-h] [--debug] [--hashfile=hashfile] OBJECTS

"OBJECTS" is a list of files (or directories) which we
want to find duplicate extents in. If a directory is 
specified, all regular files inside of it will be scanned.

    <switches>
    -r      Enable recursive dir traversal.
    -d      De-dupe the results - only works on btrfs.
    -h      Print numbers in human-readable format.
    --hashfile=FILE Use a file instead of memory for storing hashes.
    --help      Prints this help text.

Please see the duperemove(8) manpage for more options.

Adding -v shows the "Skipping small file" messages, but without verbose, it makes it look like a syntax issue.

markfasheh commented 8 years ago

Yeah, the kernel won't allow dedupe of small files that have their data inlined. If I remember correctly, that'll be anything less than 4096 bytes.

So I think the solution is to check the number of files provided on the command line before filtering, and print a better message if the user did the right thing but their files cannot be deduped.
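
For illustration only, the check described above might look roughly like the following Python sketch. This is not duperemove's code (duperemove is written in C); the names `ASSUMED_MIN_DEDUPE_SIZE`, `file_sizes`, and `explain_scan` are hypothetical, and the 4096-byte cutoff is just the estimate from this comment. The idea is to tell "no files were given" apart from "files were given, but every one was too small".

```python
import os

# Assumed cutoff, taken from the "anything less than 4096 bytes" estimate above.
ASSUMED_MIN_DEDUPE_SIZE = 4096


def file_sizes(paths):
    """Map each path to its size in bytes (hypothetical helper)."""
    return {path: os.path.getsize(path) for path in paths}


def explain_scan(sizes, min_size=ASSUMED_MIN_DEDUPE_SIZE):
    """Return (candidates, message).

    message is None when at least one file is large enough to consider;
    otherwise it says why nothing will be deduped.
    """
    if not sizes:
        # Nothing was passed at all: this is the real usage error.
        return [], "No files were given; print the usage text."
    candidates = sorted(p for p, s in sizes.items() if s >= min_size)
    if candidates:
        return candidates, None
    # The user did the right thing, but every file was filtered out.
    return [], (
        f"All {len(sizes)} file(s) given are smaller than {min_size} bytes "
        "and cannot be deduped."
    )
```

Called as `explain_scan(file_sizes(paths))`, the second case gets its own message instead of falling through to the usage page.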

Ferroin commented 8 years ago

The exact inline value is variable based on mount options and block size. For 4k blocks with no mount option to say otherwise, it's roughly 3900 bytes. Current versions of mkfs.btrfs default to 16k blocks on reasonably large storage devices, though, which means it's probably about 16200 bytes there (I'm not certain about this though).

markfasheh commented 7 years ago

Closing for now, @slang800 - that bug with respect to no dedupe candidates found should be fixed in master branch.