Feature: Don't invalidate a hashfile just because it's name or location changes.

jim-collier commented 5 years ago

If I remember correctly, in the hashfile you are storing:

filename
ino
subvol
size
blocks
mtime
dedupe_seq

Also you're looking up an entry by (real) fully-qualified filespec.

I don't know what subvol is, but a consequence of looking up by filespec, is that if that path or name changes, then on subsequent runs, the file will be considered new, and the old entry deleted.

This could make moving or renaming a top-level directly extremely expensive for the next run with the --hashfile flag, turning a potentially minutes-long run into a potentially days-long run. And makes it potentially brittle if not even useless for some use-cases.

What if, instead, you also stored Btrfs csum - which, while itself prone to collisions, combined with other metadata would make a fairly bulletproof index key.

Rather than looking up a hash based on filename field, you'd use some indexed concatination of ino[de number?], size, mtime, and csum. If the path didn't match, just update the record. (Or maybe not even store it at all.)

There would be no risk of data loss with this approach, and a very low risk of other negative effects, with the added payoff of significantly less fragility and potentially significantly faster scans depending on use case.

A false positive would mean not deduplicating a file that was in fact a true candidate. A false negative would mean unnecessarily hashing it again. But the odds of either false outcome happening would be exceedingly low, possibly lower than your 128-bit checksum and certainly lower than 64-bit. (Though in-head estimation of the entropy of combining those values is fraught with peril considering their data ranges don't span their available widths in real-world use.)

And as you know, the risk of data loss posed by sending files to the kernel for deduplication that aren't actually duplicates, is almost nonexistent since the kernel double-checks that blocks are binary duplicates, before deduplicating them as instructed.

I'm just spitballing here, but the database then might have only two fields, like:

inode_size_mtime_csum [index]
checksum

jim-collier commented 5 years ago

So, "csum" isn't actually a thing, it's "crc32". And calculated & stored for blocks, not files. So not really relevant.

The 'index' might be better thought of as a 'fingerprint', and could be a simple string concatenation of "filesize_mtime_inode". A string might be more efficient (and debuggable), since a blob containing a binary concatenation of fixed-length 'fields' for each element, large enough to store the largest possible values for each, would result in a very large binary number of mostly zeros.

markfasheh commented 5 years ago

We subvol is btrfs subvolume or 'device' on every other filesystem. Between those two we already have a unique ID for the file which if I recall we are using. Where are you seeing this behavior? If it's happening even when you move a file on the same btrfs subvol or fs device then we likely have a bug.

biggestsonicfan commented 1 year ago

Piggybacking off this similar issue, but if you've run duperemove with a hash, then rename a folder you've deduped, can you manually edit the entries with an sqlite browser to avoid invalidating the entire hash for that folder?

EDIT: Running duperemove on that folder specifically and using the hash file will change the entries, so there's no issue if you don't forget to update your hash file like that before you proceed with further deduping.

JackSlateur commented 11 months ago

Hello @jim-collier @biggestsonicfan

This patch would do what you want, could you check it please ?

Edit: it brokes hardlinks detection :/

biggestsonicfan commented 11 months ago

Ooooh! Very nice! I've just compiled master branch (a few commits after said patch) already and will give it a try soon. Has the hash file itself changed or can I use my old one to test for now?

JackSlateur commented 11 months ago

Ooooh! Very nice! I've just compiled master branch (a few commits after said patch) already and will give it a try soon. Has the hash file itself changed or can I use my old one to test for now?

Sadly, the hashfile changed at every releases since at least v0.11

biggestsonicfan commented 11 months ago

Sadly, the hashfile changed at every releases since at least v0.11

Just gonna rip the bandaid off and delete the 7 GiB hash file I have now, lol.

JackSlateur commented 11 months ago

How large is your dataset ?

biggestsonicfan commented 11 months ago

I try not to think about it, it keeps me up at night...

JackSlateur commented 11 months ago

Hello, This feature is now implemented

Thank you for your contribution !

markfasheh / duperemove

Feature: Don't invalidate a hashfile just because it's name or location changes. #214