markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

What's in the hashfile? #223

Closed businessBoris closed 5 years ago

businessBoris commented 5 years ago

I'm expecting that duperemove doesn't have any interest in filenames, other than as a means of locating data (vs. unused sectors).

If I mount (the real root, subvolid=5, of my) btrfs on a directory created with `mktemp -d` and then run:

`duperemove -r etc --hashfile=blah 'mktemp -d'`

on the mounted fs, I'm not expecting the hashfile to be recreated from scratch every time, because whilst the path name will have changed (through `mktemp -d`), the data will (almost entirely) not have.

What I actually find is that duperemove seems to recreate the hashfile by re-checksumming the files all over again.

Hopefully I am very wrong. Please could you confirm.

I'm expecting the hashfile to be a list of csums and disk locations only. If the VFS API which decides to dedupe (or not) needs a file pathname, then duperemove should work one out to submit to the API. It shouldn't matter what the pathname of the file containing the duplicate block is.

markfasheh commented 5 years ago

Hi, yes the hashfile is intended to help you avoid rescanning files. I'm on a short vacation now and don't have a test box in front of me to check your use case but it could be some interaction with the VFS/btrfs or simply a bug. My guess is that duperemove might be seeing the remount as a different file system. I can take a closer look later this week.
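To make the guess above concrete: one plausible way a scanner could decide whether a file is "the same one" across runs is to key it by device + inode (this is an illustrative scheme, not necessarily what duperemove actually stores). Since `st_dev` is assigned at mount time, a remount can change it and make every file look new even though the data hasn't changed. A minimal Python sketch:

```python
import os
import tempfile

def file_identity(path):
    """An identity key a scanner might use: (device, inode).
    The inode is stable for a given file, but st_dev is assigned
    at mount time and may differ after an unmount/remount cycle,
    so a key like this can invalidate a cache across remounts."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

# Within a single mount the key is stable:
with tempfile.NamedTemporaryFile() as f:
    assert file_identity(f.name) == file_identity(f.name)
```

If the cache were keyed on something mount-independent (e.g. a filesystem UUID plus inode), the remount case described in the report would survive.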

markfasheh commented 5 years ago

Oh, to answer your question: the hashfile is a sqlite3 db with four tables, including a files table, an extent table, and a block-hash table. You can load it up in the `sqlite3` command and take a look for yourself if you are so inclined.
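For anyone who wants to poke at the hashfile programmatically rather than in the `sqlite3` shell, the table names and schemas can be read from sqlite's built-in `sqlite_master` catalog. A small sketch using Python's standard `sqlite3` module (the path `"blah"` matches the hashfile name used in the command above; substitute your own):

```python
import sqlite3

def list_tables(db_path):
    """Return (name, CREATE statement) pairs for every table in a
    sqlite3 database, read from the built-in sqlite_master catalog."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
        ).fetchall()

# e.g. for the hashfile created earlier:
#   for name, sql in list_tables("blah"):
#       print(name)
#       print(sql)
```

This is equivalent to `.tables` and `.schema` in the interactive `sqlite3` shell.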

businessBoris commented 5 years ago

[Feels so good not to be writing this on a phone this time!] Thank you for your answer. It looks like at least one of my points is wrong! duperemove doesn't seem to be rescanning the same files repeatedly without a reason.

I think the hashfile I was using when this happened was somehow locked or something. This issue should be closed as invalid/wrong. Many apologies and thank you for replying on your holiday. I hope you are having a blast!

I'm off to have a look at the tables to see if I can make use of them in the way that I hoped. Thank you again for humouring me.