markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

How to read output of duperemove? #90

Closed lapsio closed 9 years ago

lapsio commented 9 years ago

I've scanned my VMs directory as a test (in print-only mode), because there are just a few files, and...

http://pastebin.com/iTG3qi13

I can't see anything? What's going on here? It's just like... 20 files or so, and tons of log output. Is this thing actually trying to merge tiny pieces of my VMs' disks?... That's kind of a terrible idea, isn't it?... Maybe there should be a flag to enable only full-file deduplication? I thought it worked on a per-file basis. With sector-based deduplication I'm not sure what to think... is it even safe? How badly would it affect performance? Does ZFS also perform sector-based dedup?

I'm using btrfs RAID6 with snapshots and lzo compression, everything on top of dm-cache on dm-crypt.

markfasheh commented 9 years ago

Hi, so easy questions first: for full-file dedupe you can use --fdupes mode. If you don't want to pipe output from fdupes, the stream format is extremely trivial to fake.

As far as safety goes, duperemove passes blocks to the btrfs extent-same ioctl, which locks the metadata and does another check of the inode data before allowing a dedupe. So by design it should be safe, though that doesn't exclude bugs, of course.
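For the curious, here's a rough sketch of what a call into that ioctl looks like - not duperemove's actual code, just a minimal example assuming a kernel whose linux/btrfs.h exposes BTRFS_IOC_FILE_EXTENT_SAME, with placeholder paths and offsets:

```c
/* Minimal sketch: ask the kernel to dedupe one range of src against dst.
 * The kernel re-compares the ranges byte-for-byte under its own locks and
 * only shares extents if they are still identical. Paths are placeholders. */
#include <fcntl.h>
#include <linux/btrfs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int src = open("/vms/a.img", O_RDONLY);
	int dst = open("/vms/b.img", O_RDWR); /* dest must be writable */
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	struct btrfs_ioctl_same_args *args =
		calloc(1, sizeof(*args) + sizeof(struct btrfs_ioctl_same_extent_info));
	if (!args)
		return 1;

	args->logical_offset = 0;          /* offset in src */
	args->length = 128 * 1024;         /* one 128K range, for example */
	args->dest_count = 1;
	args->info[0].fd = dst;
	args->info[0].logical_offset = 0;  /* offset in dst */

	if (ioctl(src, BTRFS_IOC_FILE_EXTENT_SAME, args) < 0) {
		perror("extent-same");
		return 1;
	}

	/* status 0 = deduped; BTRFS_SAME_DATA_DIFFERS = data changed under us */
	printf("status=%d bytes_deduped=%llu\n", args->info[0].status,
	       (unsigned long long)args->info[0].bytes_deduped);

	free(args);
	close(src);
	close(dst);
	return 0;
}
```

The important point for safety is that last comparison: even if the file changed between duperemove's scan and the ioctl, the kernel refuses the dedupe rather than corrupting data.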

ZFS: I really don't know.

The output is showing you where the duplicated data in each file is (in extent format). The reason you have a lot of output is that you have a lot of small bits of duplicate data:

"Simple read and compare of file data found 505 instances of extents that might benefit from deduplication."

"I can't see anything?": I'm not sure what you mean by this - it looks to me like you've got plenty of logs to see ;)

"Is this a good idea": So dedupe has advantages and disadvantages, this is why we don't dedupe by default in duperemove - the user can look over what changes might be made and decide based on their use case what they want to optimize for.

In your case I see a few files with many small duplicated extents (relative to file size). That said, there are a bunch of extents larger than 1 MB that might be nice to dedupe. If you're worried about fragmentation (it seemed like you were), then deduping the small ones is not a good idea. On the other hand, if you'd rather trade that for space savings, go ahead.

I hope this answers your questions.

lapsio commented 9 years ago

uh... why isn't --fdupes mentioned in the help message? Are there any other "hidden" flags?

What do you mean by "If you don't want to pipe output from fdupes, the stream format is extremely trivial to fake."?

markfasheh commented 9 years ago

--fdupes was added recently and I haven't gotten around to updating the man page. Feel free to send me a patch ;) Otherwise I'll put it on my list, as I agree it should get documented before release.

https://github.com/adrianlopezroche/fdupes

Check out fdupes. It will find identical files and print their names to stdout. You can then pipe this output to 'duperemove --fdupes' and duperemove will submit the entire files for dedupe (instead of finding extents on its own).
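For example, something like 'fdupes -r /vms | duperemove --fdupes' (the path there is just an example, not anything special).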

The format that fdupes uses to write duplicated filenames to stdout is trivial and can be deduced by running the program yourself. Hence I was saying that if you wanted full-file dedupe without even running fdupes (say you know the files are already dupes), you could run 'duperemove --fdupes' and type the input yourself.
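To illustrate, fdupes prints each group of identical files one path per line, with a blank line separating groups - the paths below are made up, but the shape is what you'd feed to 'duperemove --fdupes':

```
/vms/debian-base.img
/vms/debian-base-copy.img

/vms/win7.img
/vms/win7-backup.img
```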

markfasheh commented 9 years ago

Closed, btw, since I think I answered everything; feel free to reopen if you've got more questions.