doesn't find nearly all duplicate extents anymore, not even identical files

fallenguru commented 1 year ago

I've been using duperemove with btrfs successfully for years, though mostly on Ubuntu 18.04 [v. 0.11] and previous incarnations of Debian stable [same].

Now I've reinstalled my main desktop with Ubuntu 22.04, which has 0.11.2—and the detection of duplicate extents is suddenly severely lacking.

For example, I tend to have multiple copies of games around for testing. These are created using cp --reflink, but while moving files around between old and new disks, I managed to break a whole bunch of links, ending up with multiple full copies. No problem, I thought, lets run duperemove over it. To my surprise, there was barely any change, say between 1 and 2 GB for two directories of 9-something GB each which I knew to be duplicates except for some config files and such. (The output of diff --recursive --quiet --no dereference confirmed that the two directories were in fact virtually identical.)

Now for the fun bit: When I ran fdupes (which dutifully found all duplicates) and piped its output to duperemove, they got deduped just fine. My two directories were back down to occupying the space of one copy plus change.

That's just one example out of many. The only other thing that changed, besides the Distro and duperemove version, is that the disk in question has compression on, which should be transparent; and anyway, the results on the other disk are also suspiciously bad.

Any ideas what could be going on here?

fallenguru commented 1 year ago

Digging into this a bit deeper I gather that there was a massive change in duperemove's behaviour soon after the release of 0.11, and duperemove (now) not finding duplicate files (any more) is, in a way, by design. Ordinarily I'd hang my head in shame now, apologise, and close the bug myself; but:

Respectfully, I don't think such a big change should have been introduced in a point release. H—, I'd have put a big " Note that the way duperemove searches for duplicate data by default has changed since v. 0.11, which may lead to unexpected results. " banner in the startup message.
I would appreciate an explanation for the rationale behind this change. If I understand correctly, duperemove will only try to find duplicate extents now, such that two files whose extent layout differs will not be (fully) deduped. The problem is, I don't see the real world scenario in which duplicate data would also share the same extent layout in the first place?

Certainly not on my backup server, which aggregates (incremental btrfs) backups from multiple boxes; they run similar OSes and their users have many overlapping use cases, so there's a lot of duplicate data, but the same extent layouts, that would be coincidence. Even on the same box, say you have file A, which is created, modified a bunch, then copied (without reflink, into file B), possibly copied around multiple times, with a different disk or so in the mix, meanwhile both are modified again. Every edit would fragment the file, but every full copy operation would effectively defragment it—again, the extent layouts being similar would be coincidence. Obviously you have to chunk the data in some form, but I just don't see how picking what amounts to a variable chunk size is going to give very good results re. finding duplicate data.

I tried the original accidental test case again, i.e. copy a directory with a couple of GBs' worth of files in it, large and small (without reflink), modify a bunch of files in both, but only a couple of 100 KB's worth, then dedupe. Almost no space savings, maybe 20 % [verified with btrfs fi du]. Next attempt, same procedure, except this time I ran btrfs defrag over the two directories before duperemove. Not even that did much.

Third, as I said, I tried to dig into this, but I came away utterly confused re. what options there are, what they actually do, and how in combination, they affect duperemove's search strategy. I.e. the man page linked on GitHub doesn't mention --dedupe-options=[no]block any more, "my" [v. 11.2, via Ubuntu 22.04] man page says it's "deprecated", but not whether it still works in this version; the latter has --dedupe-options=[no]partial, but doesn't go into any detail re. what it does, but the online man page does not mention it. Maybe turning --lookup-extents off again would help? I've no idea ...
And while I'm complaining about the man page—is -A still default for root? The older man pages explicitly said so, the current ones don't state a default.
Lastly, is there any way now to restore the legacy behaviour, or something close to it? Efficient or not, fragmentation or not, it did what I expect a dedup tool to do, find (partially) duplicate files.

In conclusion and IMHO, it runs contrary to a user's expectations that dupremove doesn't find duplicate files any more, especially considering that it used to do so just fine. Please consider different default settings as well as documenting the behaviour better.

fallenguru commented 1 year ago

A quick and dirty test script says that on 0.11.2, for my freshly copied directory test case at least:

with the default options, no deduplication happens [837.16MiB nominal / 837.16MiB exclusive / 0.00B shared on the copy according to btrfs fi du]
the "net change in shared extents" figure is utterly worthless. Depending on the settings, it will return anything from 200 MiB-ish to 580 GiB-ish ...
[no]block doesn't make any difference re. deduplication result
--lookup-extents=no gives the best result [837.16MiB / 276.00KiB / 831.33MiB]; in that case, partial doesn't matter
alternatively, --lookup-extents=yes --dedupe-options=partial is very good as well [837.16MiB / 1.79MiB / 829.75MiB].

Both good variants deliver the expected result as far as deduplication goes, which is "the copied directory should take barely any space", runtime-wise they're indistinguishable (on this small dataset).

Repeated with a similar dataset, only 9.5 GiB total, the default settings yielded 9.54GiB / 8.49GiB / 695.45MiB --lookup-extents=no gave [9.54GiB / 108.86MiB / 8.20GiB] and --lookup-extents=yes --dedupe-options=partial just hung after the hashing stage until I killed it after five minutes ...

JackSlateur commented 1 year ago

Hello @fallenguru

Thank you for your valued feedback

The behavior changed indeed since pre-v0.11 release Before, duperemove used to process blocks and nothing else After, it acquired the ability to process extents, which is really great from a scalability point of view and, while it decreases the overall efficency (as in deduplicated data), it can still yield significant results I have 10-15% efficency on my real world dataset

Yet, because extents-based dedupe is not be efficient for all use cases, the block-based was reintroduced soon after its removal

Anyway, here are the behavior mapping:

By default: extent-based, no block-based. This it the recommended usage if it suits your use case.
If not, then enable block-based along with extent-based: duperemove --hashfile=... --dedupe-options=partial

That is all. As stated in the man page, extent-based can be disabled with some options. There is no reason do to so, unless you are running some old version of btrfs, which version is not known to me.

Also, to answer your other question, files are opened in read-write mode by default

The documentation and man pages should be updated a bit more, perhaps I should drop a bit of the legacy parts or at least put them away

fallenguru commented 1 year ago

The behavior changed indeed since pre-v0.11 release

That's weird, because Debian/Ubuntu's 0.11 definitely have the old behaviour. It's not like them to ship "pre-someversion" as "version", usually. No matter.

the ability to process extents [...] is really great from a scalability point of view

Err, you did see that --lookup-extents=yes --dedupe-options=partial hangs on a mere 9.5 GB? I used to run it on a couple of 100 GB at a time (sometimes I'd forget to exclude the snapshot directory, then it was many times that ^^).

while it decreases the overall efficency (as in deduplicated data), it can still yield significant results I have 10-15% efficency on my real world dataset

What do you mean by 10–15 % efficiency? It deduplicates 10–15 % of what it could be deduplicating? If so, then that sounds ... really bad?

But more importantly, I thought my use cases of "consolidating backups and/or live home directories from multiple machines and/or users", and "converting conventional copies of what may or may not have been the same data once into reflink copies", and "saving as much space and/or IO as possible on machines with small and/or slow storage" were plenty real world. To give an additional, specific, example, I use a lot of WINE prefixes. Like, tens of them, even a hundred, across a couple of versions of WINE. Deduped, they'd take up basically no space. The new defaults break all of my use cases.

That's why I asked—what is the workload is towards which the new defaults are geared? What is the new "extents-based" mode meant to solve?

Because, as stated, I cannot even think of a usage pattern that would result in the extent layouts between duplicates being largely the same, and I'd really like to understand the rationale behind this.

If it's avoiding fragmentation, surely gathering duplicate chunks found in block mode into contiguous ranges, then dropping ranges that are below a configurable limit from consideration, i.e. "don't dedupe if doing so would create a fragment smaller than the given size" is the better option? Tuning the dedup stage, not nerfing the search stage? Could even defragment files a bit while you're at it.

extents-based dedupe is not be efficient for all use cases

The question is, should it be the default, and why? What is the expected behaviour? IMHO it is reasonable for a user to expect a deduplication tool to

find all chunks that are duplicated between files by default and
in particular find duplicate files (because intuitively, that's the simplest case).
It's also reasonable to expect optimisations to prioritise (potentially) large gains. By which I mean, no-one cares if it misses a couple of small duplicate files/regions, especially not if that has some other benefit like being faster, less fragmentation, what have you—but if it misses files ranging from a couple of 100 MB to a couple of GB that are largely identical, it's torches and pitchforks time.

As far as I can tell, "new" duperemove meets none of these expectations. And it's not just me. See #292, #282, #278, #276, #267, #239, #224, #218, #216. I dare say that's a big chunk of all the issues that were reported here since the change.

As stated in the man page, extent-based can be disabled with some options.

Respectfully, it is not stated in the man page. The man page doesn't even mention that there are two (well, three) entirely different modes, let alone how to switch between them. Assuming --lookup-extents is one of those options, that just says "allows duperemove to skip checksumming some blocks by checking their extent state"—that sounds like an optimisation that makes the process faster, not something that affects the end result. Assuming [no]fiemap is one of them, the man page makes it sound like a debug/workaround switch.

There is no reason do to [disable extents-based], unless you are running some old version of btrfs

Except for the fact that it barely does anything, you mean? ^^

Of the "good" options, what's the advantage of --lookup-extents=yes --dedupe-options=partial over --lookup-extents=no? Assuming it doesn't hang?

For the record, the filesystem I'm currently testing on was created with btrfs-progs v5.16.2 on kernel 5.19, but I'm on kernel 6.2 now.

I don't know who the lead maintainers are currently, but I'd really appreciate it if you could get together and reevaluate this "extents-based" change. Because if the version of duperemove that ships with Ubuntu 22.04 isn't actually buggy, if this is by design, then it seems to me to be a very poor design.

Push comes to shove, I can just downgrade to 0.11 or even 0.10, but that's hardly ideal. I suppose I could write a wrapper script that always runs it in two passes, --fdupes first, then a native run, but that would mean hashing everything twice = every run taking twice as long.

Or go look for another tool entirely, but duperemove has served me well for many years, and I'd really like it to continue to do so, to improve, not become next to useless because of some design choice.

JackSlateur commented 1 year ago

Yes my bad, the behavior changed here: https://github.com/markfasheh/duperemove/commit/6883b81576b4646afac4b4ed1b813504a0990264

I did read the fact that you hang on a very small dataset I am aware of some issues with the block-based deduplication, work is required there When you say it hangs: is your CPU working as hell ? In what step does it hang, exactly ? Could you show me the stdout ?

By 10-15% efficency, I mean it was able to reduce the dataset by 10-15%, which is quite good in my opinion But again, it depends on your dataset

The extent-same code does actually work, at least in some cases For instance, with newly create wine prefixes (mkdir prefix{1,2}; WINEPREFIX=... wine winecfg):

0.37 [jack:/tmp/test/plop/wine/mnt] du -sh *
1.6G    prefix1
1.6G    prefix2
0.34 [jack:/tmp/test/plop/wine/mnt] df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      5.0G  3.3G  1.8G  66% /tmp/test/plop/wine/mnt
1.13 [jack:/tmp/test/plop/wine/mnt] duperemove -rhd --quiet .
Found 9292 identical extents.
Simple read and compare of file data found 3199 instances of extents that might benefit from deduplication.
Comparison of extent info shows a net change in shared extents of: 3.1GB
Total files scanned:  9298
2.73 [jack:/tmp/test/plop/wine/mnt] df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      5.0G  1.5G  3.5G  30% /tmp/test/plop/wine/mnt

The default are aimed to provide a compromise between data deduplication and process speed / consumption

The man explains how to disable fiemap (extent-based operations): https://github.com/markfasheh/duperemove/blob/master/duperemove.8#L188

Except for the fact that it barely does anything, you mean? ^^ Of the "good" options, what's the advantage of --lookup-extents=yes --dedupe-options=partial over --lookup-extents=no? Assuming it doesn't hang?

Extents operation has no real drawback: on the worst case, you will spend a little more CPU time doing syscalls, for no gain The --lookup-extents=no was added in 2016 because of some issues with btrfs. I do not face those issues anymore, so perhaps all that code could be removed

I have been (re)working on duperemove in the last couple of month, with one main purpose: make duperemove more accessible and easier to use for common use cases Some changes has been merged and are living behind to 0.12 release

My todolist is currently focused on the code behind the hash phase and the block-based lookup, both of which are not the easiest to grasp In the hash phase, I'd like to batch hash insertion (greatly reduce memory usage for block scans on large files). Always doing whole-file checksum is a good idea and has been added In the block-based lookup, I'd like to fully implement the --batchsize stuff: currently, if you have a large dataset, the process collapses on itself due to suboptimal cpu operations (this prevents me from using the dedupe-options=partial on my personnal server, what a shame)

As I said, thank you for your feedback, I will definitely take it into account In the mean time, feel free to test the 0.12 version (or even better, the code from master) to see if it is somehow better

fallenguru commented 1 year ago

Some more testing, including the larger dataset [9.5 GB, bottom]:

duperemove-tests.tar.gz results

Conclusions:

The defaults are unsuitable.
The defaults plus --dedupe-options=partial should not be used, either (for the small dataset, all other options are strictly better; with the larger one it consistently hangs).
--lookup-extents=no isn't ideal, either, because --lookup-extents=yes --dedupe-options=nofiemap consistently beats it in all aspects. This means there's no space savings estimate at the end, but the number that reports is pure fiction anyway.
Counter-intuitively, --lookup-extents=yes --dedupe-options=nofiemap,partial actually helps with fragmentation, but is slower, at least on the larger set.

low fragmentation > dedupe speed? ⇒ --dedupe-options=nofiemap,partial dedupe speed > low fragmentation [or if the above hangs] => --dedupe-options=nofiemap

fallenguru commented 1 year ago

When you say it hangs: is your CPU working as hell ? In what step does it hang, exactly ? Could you show me the stdout ?

CPU load: hang

The stdout is in the log in the tar.gz in the previous post.

For instance, with newly create wine prefixes

I haven't actually tested WINE prefixes yet, just directories with WINE prefixes in them. With horrible results. Those have a lot more game data than actual prefix data, though, it's well possible the prefixes themselves dedupe fine. I'll check.

...... ...

No, exact same result with just a random pfx. Having ´fiemap´ on effectively kills my deduplication rate. Goes from 4 % exclusive data in the copy to to 89 %.

I'm reasonably happy with turning off fiemap for now; but if you think there's a bug here after all, I can try 0.12 if you like. It's just, if you really think 89 % duplicated data vs 4 % [pfx] respectively 92 % duplicated data vs 1 % [large dataset] is working as intended, I'd rather bow out now.

JackSlateur commented 1 year ago

Are you recreating the files between each tests ? fiemap excludes data that are already shared (whose physical offsets are the same)

fallenguru commented 1 year ago

Are you recreating the files between each tests ?

Yes. See test.sh in the tar.gz for details. It's really not a script meant for anyone to see, but I suppose it does document the testing process.

Re. the hanging issue: On the data I actually wanted to dedupe (~650 GB), duperemove hangs at file [055153/134315] (41.06%) in, no matter whether partial is enabled or not. The only difference is that nopartial doesn't tax the CPU as much. The file seems to be part of a journald journal.

RustyNova016 commented 1 year ago

I got three questions comming out of this thread:

What is the command to do the most dedupe in 0.12? I suppose it's --dedupe-option=same,partial, but this message (https://github.com/markfasheh/duperemove/issues/301#issuecomment-1667906107) confused me a bit.
I cannot install the 0.11.1 version, as using make gives errors about multiple definition of 'alloc__mutex'. Is it a big difference between 0.11.1 and 0.12 to warrant debuging/trying to find an older build?
Is is just better to jump to something like Bees? I came to Duperemove because it was way easier to install and configure, but now I'm wondering if it's worth it.

fallenguru commented 1 year ago

I cannot install the 0.11.1 version [...]

Cherry-picking commit 58dd49fb429339b7104c23224f45aa99c5d160a0 (Fix declare_alloc_tracking macro) should fix the FTBFS error you're seeing.

JackSlateur commented 1 year ago

Hello @fallenguru

I just merged one last bit of code about this issue, regarding identical files

It would be nice if you could check out the changes and share some feedback with me

JackSlateur commented 1 year ago

What is the command to do the most dedupe in 0.12? I suppose it's --dedupe-option=same,partial, but this message (doesn't find nearly all duplicate extents anymore, not even identical files #301 (comment)) confused me a bit.

I believe the two options (from the current master, not from the v0.12 released tag) are: duperemove -rhd /data Or: duperemove -rhd /data --dedupe-options=partial

Is is just better to jump to something like Bees? I came to Duperemove because it was way easier to install and configure, but now I'm wondering if it's worth it.

Bees and duperemove are not the same:

bees only works on btrfs while duperemove works at the VFS layer (eg works with btrfs but also xfs, probably bcachefs etc)
bees works at the filesystem layer while duperemove works at the file-layer: you have a better granularity

At the end, this is all about your usecases

JackSlateur commented 1 year ago

Hello @fallenguru

I believe the issues you reported are fixed in the latest release Feel free to reopen this if you still have an issue

Thank you for your report

markfasheh / duperemove

doesn't find nearly all duplicate extents anymore, not even identical files #301