markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

duperemove on xfs doesn't recurse #225

Closed enok71 closed 4 years ago

enok71 commented 4 years ago

When running duperemove using an existing hash-file on an xfs filesystem, duperemove seems to see subdirectories as filesystem boundaries, and skips them. I successfully ran duperemove a year ago on the same filesystem, creating the hashfile which I now want to reuse after adding lots of files and subdirectories to the filesystem.

I use OpenSUSE 15.1 rpm duperemove-0.11.beta4-lp151.2.4.x86_64

# duperemove --version
duperemove v0.11.beta4

Running the command line:

# duperemove -dr --debug --hashfile=/root/hashfile.hash /mnt > /root/duperemove.out 2>&1

In /root/duperemove.out there is now a line:

# grep Skipping duperemove.out
Skipping file /mnt because of -x

True enough, /mnt is my mounted xfs filesystem. But even if I run

# duperemove -dr --debug --hashfile=/root/hashfile.hash /mnt/* > /root/duperemove.out 2>&1

The output now contains

# grep Skipping duperemove.out
Skipping small file /mnt/backup.sh
Skipping file /mnt/Shared because of -x
Skipping file /mnt/Logs because of -x
Skipping file /mnt/Images because of -x
Skipping file /mnt/Backups because of -x

i.e. all subdirectories are "skipped because of -x", which suggests that duperemove wrongly classifies them as crossing filesystem boundaries (?)

In any case, the process ends very quickly and no extents have been deduplicated.
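One way to check whether these directories really sit on a different filesystem is to compare their device IDs (`st_dev`) with that of the starting point, which is the usual basis for a one-file-system check. The following is a minimal sketch, assuming Python 3 is available on the machine; `subdir_devices` is a hypothetical helper for diagnosis and is not part of duperemove.

```python
import os

def subdir_devices(root):
    """Return (root_dev, {subdir_path: st_dev}) for the immediate subdirectories of root."""
    root_dev = os.stat(root).st_dev
    devices = {}
    with os.scandir(root) as entries:
        for entry in entries:
            # Do not follow symlinks: a link may point into another filesystem.
            if entry.is_dir(follow_symlinks=False):
                devices[entry.path] = os.lstat(entry.path).st_dev
    return root_dev, devices
```

Calling `subdir_devices("/mnt")` and comparing each value with the returned root device ID would show whether /mnt/Shared, /mnt/Logs and the others are separate mounts. If every ID equals the root's, the directories are on the same filesystem and the "-x" skip is not explained by a real filesystem boundary.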

BTW, compiling and running the latest duperemove from github renders the output:

bfile_del_orphans()/9429: Database error 22 while preparing hashes statement: unknown error
Warning: The hash file format in Duperemove master branch is under development and may change.
If the changes are not backwards compatible, you will have to re-create your hash file.
Gathering file list...
Adding files from database for hashing.

dominikholler commented 4 years ago

I also got 'Skipping file /mnt because of -x' for some subdirectories, while others worked as expected. Because I could not see how "-x" can be disabled on the command line, I set 'one_file_system = 0' in file_scan.c.

lorddoskias commented 4 years ago

Dedupe is not supported across different mount points, so it makes no sense to support this in duperemove. As a matter of correctness, I think -x should be removed entirely and treated as the default behavior. As such, I'm closing this issue.

enok71 commented 4 years ago

Just note that it didn't recurse into subdirectories which were NOT mount points.

enok71 commented 4 years ago

Besides - how should one run the command on a mounted filesystem if not by giving the mount point as the starting point?

lorddoskias commented 4 years ago

> Just note that it didn't recurse into subdirectories which were NOT mount points.

This is a bug then, can you reproduce with current master?

> Besides - how should one run the command on a mounted filesystem if not by giving the mount point as the starting point?

Yes, that's how it's supposed to be run. My point was that if you have /mnt/foo/bar and /mnt/foo/foo3, where bar is a mount point, then duperemove is supposed to ignore bar but scan every file under foo3, provided foo3 is just a subdirectory.
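The intended behavior described here can be sketched as a directory walk that prunes any subdirectory whose device ID differs from that of the starting point. This is only an illustration under that assumption; `walk_one_filesystem` is a hypothetical name, and the code is not taken from duperemove's file_scan.c.

```python
import os

def walk_one_filesystem(root):
    """Yield file paths under root, skipping directories on a different device."""
    # The reference device is that of the starting point itself, so a mount
    # point passed as root is scanned; only mounts below it are skipped.
    root_dev = os.stat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into other filesystems.
        dirnames[:] = [
            d for d in dirnames
            if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
        ]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Because the comparison is made against the root given by the caller, plain subdirectories are always descended into, and a directory is dropped only when something else is mounted on it.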

enok71 commented 4 years ago

I agree that recursion should skip mounted filesystems (except filesystems mounted on mount points given on the command line).

I don't have the system available now, but as I recall it some directories were skipped and others not, like @dominikholler also reports. I couldn't see any reason/pattern.

As I recall it the error message I quoted from running the latest version from github indicated that the database format had changed. When I removed the old hash file it worked and did not skip any directories.

Perhaps hash file incompatibility was what triggered the bug, and the v0.11.beta4 did not check the format but instead failed in a strange way? (The hash file was created using some even older version.)
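One way to test this hypothesis is to look at what an old hashfile actually contains. The sketch below assumes the hashfile is an SQLite database, as the duperemove man page describes; it makes no assumption about table names and simply lists whatever schema the file holds. `hashfile_schema` is a hypothetical helper.

```python
import pathlib
import sqlite3

def hashfile_schema(path):
    """Return {table_name: create_sql} for an SQLite file, opened read-only."""
    # Build a file: URI so the database can be opened with mode=ro and
    # the hashfile is never modified by the inspection.
    uri = pathlib.Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)
```

Comparing the output for the old hashfile with that of a freshly created one would show whether the table layout differs between the two versions.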

lorddoskias commented 4 years ago

So the problem was likely that duperemove loaded the device ID from the database file, it didn't match the device where the folder currently resides, and so it failed prematurely. I've just tested master on btrfs with subdirectories that are not mount points, and it worked as expected.
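The failure mode suggested here can be illustrated with a small sketch: if the device ID used for the comparison comes from an old record rather than from the current mount, every directory looks foreign, even though none of them is a mount point. This is a hypothetical illustration, not duperemove's code; `recorded_dev` stands in for whatever value the hashfile stored.

```python
import os

def skipped_with_recorded_dev(paths, recorded_dev):
    """Return the paths whose current device ID differs from recorded_dev."""
    # A one-file-system check that trusts a stale recorded ID would skip
    # every path returned here.
    return [p for p in paths if os.stat(p).st_dev != recorded_dev]
```

With `recorded_dev` equal to the current `os.stat("/mnt").st_dev`, the list is empty for ordinary subdirectories; with a stale value, it contains all of them, which matches the "Skipping file ... because of -x" output seen for every directory.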