markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

Option to exclude paths or subvolumes #289

Closed: mvglasow closed this issue 11 months ago

mvglasow commented 1 year ago

I have btrfs set up in the following way:

Most of the data is fairly static: only a small subset gets added, changed or deleted in a given period. Still, some operations may introduce duplication (in my case, restoring files from a corrupted volume on a clone I made earlier).

The directory structure was created so that I can export each of the top-level volumes via Samba and have access to snapshot data through that share, much like commercial network file servers do.

However, the directory structure also means that running duperemove on any top-level dir will index all 24 snapshots in addition to the current version of the data. For my purposes, it would be sufficient to scan just the current version plus its latest snapshot, i.e. 2 versions of the data instead of all 25. (This comes at the cost of not catching cases where a file was reverted from the latest snapshot to an earlier one, but that should be rare and the space savings negligible.)

Right now, scanning the entire top-level subvolume with only the latest snapshot included is a tedious chore. I would have to do a non-recursive run on the dir (catching all top-level files), then a recursive run on each of its subdirs plus the latest snapshot. That means lots of manual work and typing (or scripting).

I see three ways this could be made easier, so that I could achieve what I need with a simple command:

markfasheh commented 1 year ago

Hi, we have an --exclude= option already - is there something about the existing implementation that doesn't work for your setup?
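
For readers with a similar layout, here is a minimal Python sketch of how the existing `--exclude=` option could cover this use case. It only builds the command line and does not run it. Several things are assumptions, not details from this thread: the snapshot directory name `.snapshots`, snapshot names that sort chronologically, and the helper name `build_duperemove_argv`. How duperemove matches a pattern against a directory is also assumed rather than verified, so the sketch emits two patterns per snapshot (the directory itself and everything under it). Check the man page before relying on either form.

```python
from pathlib import Path
import shlex


def build_duperemove_argv(top, snapshot_dir_name=".snapshots", keep=1):
    """Build a duperemove command that scans `top` recursively while
    excluding every snapshot except the `keep` newest ones.

    Hypothetical helper: the layout <top>/<snapshot_dir_name>/<snapshot>
    is an assumption, not something taken from the original report.
    """
    top = Path(top)
    snap_root = top / snapshot_dir_name

    # Assumption: snapshot names sort chronologically (e.g. ISO dates).
    snapshots = sorted(p for p in snap_root.iterdir() if p.is_dir()) \
        if snap_root.is_dir() else []

    # Everything except the `keep` newest snapshots gets excluded.
    stale = snapshots[:max(len(snapshots) - keep, 0)]

    # -d performs the dedupe, -r recurses into directories.
    argv = ["duperemove", "-d", "-r"]
    for snap in stale:
        # Two patterns per snapshot, since the exact matching rules
        # for directories are an assumption here.
        argv.append(f"--exclude={snap}")
        argv.append(f"--exclude={snap}/*")
    argv.append(str(top))
    return argv


if __name__ == "__main__":
    # Print the command for inspection instead of executing it.
    print(shlex.join(build_duperemove_argv("/srv/share")))
```

Building the argument list in Python avoids the shell expanding the glob patterns before duperemove sees them. When typing the command by hand, the patterns need to be quoted for the same reason.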

mvglasow commented 1 year ago

Now that you mention it, I found it in the man page, and it looks like pretty much what I would have needed. Didn’t see it at first because it is not mentioned in the docs – you might want to add it there (or rather, keep the docs in sync with the man page, e.g. by running some man2html tool as part of your CI pipeline).

JackSlateur commented 11 months ago

@mvglasow Hello, I believe this issue is now fixed

The online documentation and the man pages are now generated from the same markdown files, so they are kept in sync.

Thank you for your report