sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

Feature request: Btrfs dedup superflag #338

Open jim-collier opened 5 years ago

jim-collier commented 5 years ago

You're probably aware that offline Btrfs deduplication is getting to be an important thing people do.

I wound up having to write a fairly complex bash script to wrap rmlint's deduplication, that required nontrivial time to read everything, script, research, and test--that basically just does:

It would be really amazing if there was a superflag or metaflag, that told rmlint to dedup Btrfs that required only one pass, and by default set it up for faster incremental future passes (which I'm guessing is the most common use case for deduping). For example:

rmlint --btrfs-dedup /folder

Which more or less would do:

(That is, if my understanding of --xattr-* is correct, e.g. previous issue.)

This would make rmlint the go-to solution for user-friendly Btrfs deduplication. Because there is no other deduplication project out there that does all of:

But as it is, rmlint is daunting. Hugely powerful and flexible, but prohibitively daunting.

Your closest competition for this use case, duperemove, is actually right on your heels in most of those categories, but with the following added benefits:

There don't really seem to be any other projects out there with those features and benefits, besides rmlint and duperemove. (I've tried or looked into all of them.) The one thing duperemove can't currently do is store hashes in the file xattrs themselves, which for my use case is a huge win. (My ~10 TB volume doesn't change much, but files move around often and get distributed with rsync -X.)

To be more platform-agnostic, you could also include superflags for other use-cases, such as deleting duplicates in one easy pass.

sahib commented 5 years ago

> It would be really amazing if there was a superflag or metaflag, that told rmlint to dedup Btrfs that required only one pass, and by default set it up for faster incremental future passes (which I'm guessing is the most common use case for deduping). For example:

As mentioned in the other issue (#336): You don't need two passes.

> Which more or less would do:
>
> rmlint --xattr-write --write-unfinished /folder
> tempScript=$(mktemp).sh
> rmlint --xattr-read --types="duplicates" --no-followlinks --hidden --output= --config=$tempScript:handler=clone,reflink /folder
> $tempScript
> rm $tempScript

In theory we could add an option --xattr (let's call it that for now) which is the same as --xattr-read --xattr-write --write-unfinished. Then your example would become:

# Your example above probably has a typo in the --output / --config part:
$ rmlint --xattr --output=out.sh --config handler=clone,reflink --hidden -F  /folder
# Newer versions of rmlint will produce scripts that delete themselves by default:
$ ./out.sh

You could easily join those two commands to one via &&. See also below.
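The joined one-pass form could be wrapped roughly as follows. This is only a sketch: the function name `btrfs_dedup`, the `DRY_RUN` variable, and the default script path are made up for illustration, and the `-o sh:FILE` / `-c sh:handler=...` spellings are assumed from rmlint's usual formatter syntax rather than taken from the commands above. On versions that have `--xattr`, the three xattr flags could presumably be replaced by it.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of rmlint: scan once, then run the
# generated script only if the scan succeeded.
btrfs_dedup() {
    target=$1
    script=${2:-./rmlint-dedup.sh}   # illustrative default path

    # Assumed flag spellings: read and write cached hashes in xattrs
    # (also for files without a match), and emit a shell script that
    # uses the clone/reflink handlers.
    set -- rmlint --xattr-read --xattr-write --write-unfinished \
        --types=duplicates --hidden --no-followlinks \
        -o "sh:$script" -c sh:handler=clone,reflink "$target"

    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Print what would be executed instead of running it.
        echo "$*"
        echo "$script -d"
        return 0
    fi

    # -d tells the generated script not to ask before running.
    "$@" && "$script" -d
}
```

Running it with `DRY_RUN=1` first prints the two commands without touching the filesystem, which keeps the "second step to reflect" that a fully automatic pass would otherwise lose.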

> Smart/fast duplicate identification (by size, then first bytes, then last bytes, then hashfile).

Actually, rmlint does something that could better be called incremental hashing, i.e. it does not seek to the end of the file. The benefit of doing so is surprisingly low.

> But as it is, rmlint is daunting. Hugely powerful and flexible, but prohibitively daunting.

That is useful to hear.

> To be more platform-agnostic, you could also include superflags for other use-cases, such as deleting duplicates in one easy pass.

Well, probably. A Swiss Army knife is always a little more complex than a specialized one-button tool. But I get your point.

> To be more platform-agnostic, you could also include superflags for other use-cases, such as deleting duplicates in one easy pass.

One-pass deletion keeps popping up quite often (the latest was #330), but I have objections in this specific case (especially since you can easily do it with rmlint && ./rmlint.sh -d). Do you have any other such use cases in mind that seem hard at the moment? After all, rmlint's feature set grew a lot over the last few years, so it was not always developed with specific use cases in mind.

jim-collier commented 5 years ago

I can see how this could be a legitimate philosophical issue. The trusty Linux principle of "do one thing and do it well" isn't even much help here, as rmlint does many things well. Or maybe more accurately, the narrow problem it addresses has a (probably) necessarily huge combination of valid input configurations.

Maybe it's like rsync vs XCOPY. XCOPY with just two flags, /S and /D, covers a common (if not by far the most common) use-case for copying multiple things. To get roughly analogous functionality out of rsync (or any linux copy utility) from scratch, requires a fair amount of research, man page reading, and awareness of subtle gotchas such as an ending slash or not. Tweaking rsync to do something slightly different, even if you're pretty familiar with it, is difficult and risky, and is often best done with a wrapper script for safety and reproducibility.

But the concept of superflag/metaflag/macros or whatever you'd like to call them, is one way to not compromise on the basic philosophy of a tool, while also offering some compromise to higher-level usage.

So imagine if rsync had a superflag called something like "--macro xcopy-sd" (or something more modern that doesn't evoke DOS but covers the same common use case). Internally it might just replace that flag with native equivalents, plus other overrides to otherwise default behaviors that might make more sense in that context.
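Applied to rmlint, that kind of internal replacement could look something like the sketch below. Everything here is hypothetical: `expand_macros`, the `--macro=NAME` syntax, and the macro name `btrfs-dedup` are illustrative only, and the expansion simply reuses flags mentioned elsewhere in this thread (with the `-c sh:handler=...` spelling assumed).

```shell
#!/bin/sh
# Hypothetical sketch: print the argument list, one argument per line,
# with each --macro=NAME replaced by the native flags it stands for.
expand_macros() {
    for arg in "$@"; do
        case $arg in
            --macro=btrfs-dedup)
                # Illustrative expansion built from flags discussed in this thread.
                printf '%s\n' --xattr-read --xattr-write --write-unfinished \
                    --types=duplicates -c sh:handler=clone,reflink
                ;;
            --macro=*)
                # Unknown macro names are rejected rather than passed through.
                echo "unknown macro: ${arg#--macro=}" >&2
                return 1
                ;;
            *)
                # Every other argument is passed through unchanged.
                printf '%s\n' "$arg"
                ;;
        esac
    done
}
```

For example, `expand_macros --macro=btrfs-dedup /folder` prints the native flags followed by `/folder`. Because the macro is expanded in place, the core option parser never needs to know macros exist.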

Rmlint has so many flags and options, many of them mutually exclusive in a complex matrix, that some set of superflags/metaflags/macroflags could be really useful. Maybe something along the lines of:

Hope that helps. Just some thoughts to ponder. I appreciate your position on the issue.

sahib commented 5 years ago

> I can see how this could be a legitimate philosophical issue. The trusty Linux principle of "do one thing and do it well" isn't even much help here, as rmlint does many things well. Or maybe more accurately, the narrow problem it addresses has a (probably) necessarily huge combination of valid input configurations.

Well, I think that it is still only doing one thing ("Help free my filesystem from waste"). It's just that filesystems are horribly complex and most people do not even realize that. I'm not even sure if that is a good thing or not.

> Maybe it's like rsync vs XCOPY. XCOPY with just two flags, /S and /D, covers a common (if not by far the most common) use-case for copying multiple things. To get roughly analogous functionality out of rsync (or any linux copy utility) from scratch, requires a fair amount of research, man page reading, and awareness of subtle gotchas such as an ending slash or not. Tweaking rsync to do something slightly different, even if you're pretty familiar with it, is difficult and risky, and is often best done with a wrapper script for safety and reproducibility.

...and yet it's the first time I hear of XCOPY. Maybe I'm too young; Google says it's something from DOS. :smirk:

> --macro=incremental-btrfs-dedup

Good idea. So maybe a --xattr and a --btrfs option that sets both up. I'll work something up (together with the documentation fixes in the next few days).

> --macro=list-duplicate-directories

-D is actually already a "super option" for that, since it sets sensible defaults for searching duplicate directories. Source code for reference. Or is there anything else that list-duplicate-directories would do?

jim-collier commented 5 years ago

> ...and yet it's the first time I hear of XCOPY. Maybe I'm too young; Google says it's something from DOS.

Ha, yeah I'm old enough to remember being a DOS user. But XCOPY is actually quite handy for Windows CMD scripting. Especially since every Windows install is guaranteed to have it.

> -D is actually already a "super option" for that, since it sets sensible defaults for searching duplicate directories. Source code for reference. Or is there anything else that list-duplicate-directories would do?

My impression from reading the docs several times through was that by default, -D deletes the duplicate directories? (Or scripts them to be deleted...) Maybe I just assumed that and it was a stubborn assumption that the docs couldn't shake? My personal use-case is to log them in an easy-to-human-read way, then manually inspect them. There's definitely no algorithmic way to determine which are the right ones to delete. Some might be higher in the directory tree, some lower, some longer paths, some shorter, some older, some newer, etc. (Although those types of options are extremely useful for making things easier.) I've been playing around with options to find the best human-readable output, so I'm not sure which might be the "best".

sahib commented 5 years ago

> My impression from reading the docs several times through was that by default, -D deletes the duplicate directories? (Or scripts them to be deleted...)

It only puts an entry in the script (like all other dupes). rmlint tries to stay away from deleting stuff directly, since having a second step gives you time to reflect on what you're doing.

> My personal use-case is to log them in an easy-to-human-read way, then manually inspect them.

Remember that rmlint writes an rmlint.json which includes this kind of information. You could easily get all the paths of duplicate directories and print them with an expression like this:

$ cat rmlint.json | jq '.[1:-1][] | .path'

More complex queries could be done too, of course. For example grouping those paths.
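As a sketch of such queries, the two helpers below narrow the output to directories and then group them. They rest on assumptions about the JSON layout that are not stated in this thread: that directory entries carry `"type": "duplicate_dir"`, and that members of the same duplicate set share a `checksum` field. The function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers around rmlint.json; field names are assumptions.

# Print only the paths of entries typed as duplicate directories.
list_dupe_dirs() {
    jq -r '.[1:-1][] | select(.type == "duplicate_dir") | .path' \
        "${1:-rmlint.json}"
}

# Print one line per duplicate set, with its paths separated by tabs.
group_dupe_dirs() {
    jq -r '[.[1:-1][] | select(.type == "duplicate_dir")]
           | group_by(.checksum)[]
           | map(.path)
           | join("\t")' "${1:-rmlint.json}"
}
```

Both take the JSON file as an optional first argument and default to rmlint.json in the current directory. The `.[1:-1]` slice is kept from the expression above and skips the first and last array elements.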

nealmcb commented 3 years ago

I'm glad to see that an --xattr option was added in recent versions.

SeeSpotRun commented 3 years ago

Given that FIDEDUPERANGE is [supposedly] inherently non-destructive, I'm open to the idea of a fidedupe formatter which does the deduping while rmlint runs.

So for btrfs filesystems I would run something like:

$ rmlint --types df --size 4k --xattr --hash-unmatched -o fidedupe /mnt/btrfsroot

And this could certainly be rolled up into a super-option which would set the above options

$ rmlint --btrfs-dedupe /mnt/btrfsroot 

or, to include read-only snapshots:

$ sudo rmlint --btrfs-dedupe --dedupe-readonly /mnt/btrfsroot 

(although --xattr won't work on read-only snapshots)