Closed fallenguru closed 7 months ago
Digging into this a bit deeper I gather that there was a massive change in duperemove's behaviour soon after the release of 0.11, and duperemove (now) not finding duplicate files (any more) is, in a way, by design. Ordinarily I'd hang my head in shame now, apologise, and close the bug myself; but:
Certainly not on my backup server, which aggregates (incremental btrfs) backups from multiple boxes; they run similar OSes and their users have many overlapping use cases, so there's a lot of duplicate data, but identical extent layouts would be a coincidence. Even on the same box: say you have file A, which is created, modified a bunch, then copied (without reflink) into file B, possibly copied around multiple times, perhaps with a different disk in the mix, while both are modified further. Every edit would fragment the file, but every full copy operation would effectively defragment it—again, the extent layouts being similar would be coincidence. Obviously you have to chunk the data in some form, but I just don't see how picking what amounts to a variable chunk size is going to give very good results re. finding duplicate data.
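To make the chunking point concrete, here is a rough sketch in plain Python (nothing to do with duperemove's actual implementation): hashing fixed-size blocks matches duplicate content no matter how the filesystem happens to have laid out the extents, which is exactly what variable, extent-sized chunks cannot guarantee.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative; duperemove's own default block size differs

def block_hashes(data: bytes, block_size: int = BLOCK_SIZE):
    """Hash fixed-size blocks; identical content yields identical hashes,
    independent of on-disk extent layout."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

# Two "files" with identical content: the block hashes match exactly,
# even though a real copy may have a completely different extent layout.
original = bytes(range(256)) * 64   # 16 KiB of data
full_copy = bytes(original)         # a non-reflink copy

assert block_hashes(original) == block_hashes(full_copy)
```

An extent-based scan, by contrast, only gets a match when the chunk boundaries of both copies happen to coincide.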
I tried the original accidental test case again, i.e. copy a directory with a couple of GBs' worth of files in it, large and small (without reflink), modify a bunch of files in both, but only a few hundred KB's worth, then dedupe. Almost no space savings, maybe 20 % [verified with btrfs fi du]. Next attempt, same procedure, except this time I ran btrfs defrag over the two directories before duperemove. Not even that did much.
The online man page doesn't mention `--dedupe-options=[no]block` any more; "my" man page [v. 0.11.2, via Ubuntu 22.04] says it's "deprecated", but not whether it still works in this version. The latter has `--dedupe-options=[no]partial`, but doesn't go into any detail re. what it does, and the online man page does not mention it at all. Maybe turning `--lookup-extents` off again would help? I've no idea ... Is `-A` still the default for root? The older man pages explicitly said so, the current ones don't state a default.

In conclusion, and IMHO, it runs contrary to a user's expectations that duperemove doesn't find duplicate files any more, especially considering that it used to do so just fine. Please consider different default settings as well as documenting the behaviour better.
A quick and dirty test script says that on 0.11.2, for my freshly copied directory test case at least [all numbers via `btrfs fi du`]:

- `[no]block` doesn't make any difference re. the deduplication result.
- `--lookup-extents=no` gives the best result [837.16MiB / 276.00KiB / 831.33MiB]; in that case, `partial` doesn't matter.
- `--lookup-extents=yes --dedupe-options=partial` is very good as well [837.16MiB / 1.79MiB / 829.75MiB].

Both good variants deliver the expected result as far as deduplication goes, which is "the copied directory should take barely any space"; runtime-wise they're indistinguishable (on this small dataset).
Repeated with a similar dataset, only 9.5 GiB total: the default settings yielded [9.54GiB / 8.49GiB / 695.45MiB], `--lookup-extents=no` gave [9.54GiB / 108.86MiB / 8.20GiB], and `--lookup-extents=yes --dedupe-options=partial` just hung after the hashing stage until I killed it after five minutes ...
Hello @fallenguru
Thank you for your valued feedback
The behavior changed indeed since pre-v0.11 release
Before, `duperemove` used to process blocks and nothing else
After, it acquired the ability to process extents, which is really great from a scalability point of view and, while it decreases the overall efficiency (as in deduplicated data), it can still yield significant results
I have 10-15% efficiency on my real world dataset
Yet, because extent-based dedupe is not efficient for all use cases, block-based dedupe was reintroduced soon after its removal
Anyway, here is the behavior mapping:
duperemove --hashfile=... --dedupe-options=partial
That is all. As stated in the man page, extent-based dedupe can be disabled with some options. There is no reason to do so, unless you are running some old version of btrfs, though I do not know exactly which versions are affected.
Also, to answer your other question: files are opened in `read-write` mode by default
The documentation and man pages should be updated a bit more; perhaps I should drop some of the legacy parts, or at least move them aside
The behavior changed indeed since pre-v0.11 release
That's weird, because Debian/Ubuntu's 0.11 definitely has the old behaviour. It's not like them to ship "pre-someversion" as "version", usually. No matter.
the ability to process extents [...] is really great from a scalability point of view
Err, you did see that `--lookup-extents=yes --dedupe-options=partial` hangs on a mere 9.5 GB? I used to run it on a couple of hundred GB at a time (sometimes I'd forget to exclude the snapshot directory, then it was many times that ^^).
while it decreases the overall efficency (as in deduplicated data), it can still yield significant results I have 10-15% efficency on my real world dataset
What do you mean by 10–15 % efficiency? It deduplicates 10–15 % of what it could be deduplicating? If so, then that sounds ... really bad?
But more importantly, I thought my use cases of "consolidating backups and/or live home directories from multiple machines and/or users", and "converting conventional copies of what may or may not have been the same data once into reflink copies", and "saving as much space and/or IO as possible on machines with small and/or slow storage" were plenty real world. To give an additional, specific, example, I use a lot of WINE prefixes. Like, tens of them, even a hundred, across a couple of versions of WINE. Deduped, they'd take up basically no space. The new defaults break all of my use cases.
That's why I asked—what is the workload towards which the new defaults are geared? What is the new "extents-based" mode meant to solve?
Because, as stated, I cannot even think of a usage pattern that would result in the extent layouts between duplicates being largely the same, and I'd really like to understand the rationale behind this.
If it's avoiding fragmentation, surely gathering the duplicate chunks found in block mode into contiguous ranges, then dropping ranges that are below a configurable limit from consideration, i.e. "don't dedupe if doing so would create a fragment smaller than the given size", is the better option? Tuning the dedupe stage, not nerfing the search stage? It could even defragment files a bit while it's at it.
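A sketch of what I mean, in Python (a hypothetical helper, not duperemove code): coalesce the duplicate blocks found in block mode into contiguous ranges, then drop any range shorter than a configurable minimum, so that deduplication never creates a fragment below that size.

```python
def coalesce_ranges(dup_blocks, min_len=4):
    """Merge duplicate block indices into contiguous [start, end) ranges,
    then drop ranges shorter than min_len blocks."""
    ranges = []
    for b in sorted(dup_blocks):
        if ranges and b == ranges[-1][1]:
            ranges[-1][1] = b + 1      # extend the current range
        else:
            ranges.append([b, b + 1])  # start a new range
    # "don't dedupe if doing so would create a fragment smaller than min_len"
    return [(s, e) for s, e in ranges if e - s >= min_len]

# blocks 0-5 and 10-11 are duplicates; with min_len=4 only 0-5 survives
print(coalesce_ranges([0, 1, 2, 3, 4, 5, 10, 11], min_len=4))  # [(0, 6)]
```

The search stage still sees every duplicate block; only the submission stage filters, which is the tuning knob I'm arguing for.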
extent-based dedupe is not efficient for all use cases
The question is, should it be the default, and why? What is the expected behaviour? IMHO it is reasonable for a user to expect a deduplication tool to
As far as I can tell, "new" duperemove meets none of these expectations. And it's not just me. See #292, #282, #278, #276, #267, #239, #224, #218, #216. I dare say that's a big chunk of all the issues that were reported here since the change.
As stated in the man page, extent-based can be disabled with some options.
Respectfully, it is not stated in the man page. The man page doesn't even mention that there are two (well, three) entirely different modes, let alone how to switch between them. Assuming `--lookup-extents` is one of those options, it just says "allows duperemove to skip checksumming some blocks by checking their extent state"—that sounds like an optimisation that makes the process faster, not something that affects the end result. Assuming `[no]fiemap` is one of them, the man page makes it sound like a debug/workaround switch.
There is no reason to do so [disable extents-based], unless you are running some old version of btrfs
Except for the fact that it barely does anything, you mean? ^^
Of the "good" options, what's the advantage of --lookup-extents=yes --dedupe-options=partial
over --lookup-extents=no
? Assuming it doesn't hang?
For the record, the filesystem I'm currently testing on was created with btrfs-progs v5.16.2 on kernel 5.19, but I'm on kernel 6.2 now.
I don't know who the lead maintainers are currently, but I'd really appreciate it if you could get together and reevaluate this "extents-based" change. Because if the version of duperemove that ships with Ubuntu 22.04 isn't actually buggy, if this is by design, then it seems to me to be a very poor design.
Push comes to shove, I can just downgrade to 0.11 or even 0.10, but that's hardly ideal. I suppose I could write a wrapper script that always runs it in two passes, `--fdupes` first, then a native run, but that would mean hashing everything twice, i.e. every run taking twice as long.
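For what it's worth, the first pass of such a wrapper could be prototyped roughly like this (a sketch only; the grouping mirrors what fdupes does, and the output format, one path per line with groups separated by blank lines, is what I believe `duperemove --fdupes` expects on stdin):

```python
import hashlib, os, sys
from collections import defaultdict

def file_hash(path, chunk=1 << 20):
    """Whole-file SHA-256, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def fdupes_groups(paths):
    """Yield lists of paths with identical content (group by size, then hash)."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[file_hash(p)].append(p)
        yield from (g for g in by_hash.values() if len(g) > 1)

if __name__ == "__main__":
    # fdupes-style output: one path per line, blank line between groups
    for group in fdupes_groups(sys.argv[1:]):
        print("\n".join(group), end="\n\n")
```

Piping that into `duperemove --fdupes` would handle the identical-file case, but as noted it means hashing everything a second time for the native run.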
Or go look for another tool entirely, but duperemove has served me well for many years, and I'd really like it to continue to do so, to improve, not become next to useless because of some design choice.
Yes my bad, the behavior changed here: https://github.com/markfasheh/duperemove/commit/6883b81576b4646afac4b4ed1b813504a0990264
I did note that it hangs for you on a very small dataset. I am aware of some issues with the block-based deduplication; work is required there. When you say it hangs: is your CPU working like hell? In what step does it hang, exactly? Could you show me the stdout?
By 10-15% efficiency, I mean it was able to reduce the dataset by 10-15%, which is quite good in my opinion. But again, it depends on your dataset.
The extent-same code does actually work, at least in some cases
For instance, with newly created wine prefixes (`mkdir prefix{1,2}; WINEPREFIX=... wine winecfg`):
0.37 [jack:/tmp/test/plop/wine/mnt] du -sh *
1.6G prefix1
1.6G prefix2
0.34 [jack:/tmp/test/plop/wine/mnt] df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/loop0 5.0G 3.3G 1.8G 66% /tmp/test/plop/wine/mnt
1.13 [jack:/tmp/test/plop/wine/mnt] duperemove -rhd --quiet .
Found 9292 identical extents.
Simple read and compare of file data found 3199 instances of extents that might benefit from deduplication.
Comparison of extent info shows a net change in shared extents of: 3.1GB
Total files scanned: 9298
2.73 [jack:/tmp/test/plop/wine/mnt] df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/loop0 5.0G 1.5G 3.5G 30% /tmp/test/plop/wine/mnt
The defaults are aimed at providing a compromise between data deduplication and process speed / resource consumption
The man explains how to disable fiemap (extent-based operations): https://github.com/markfasheh/duperemove/blob/master/duperemove.8#L188
Except for the fact that it barely does anything, you mean? ^^ Of the "good" options, what's the advantage of --lookup-extents=yes --dedupe-options=partial over --lookup-extents=no? Assuming it doesn't hang?
Extent operations have no real drawback: in the worst case, you will spend a little more CPU time doing syscalls, for no gain
The `--lookup-extents=no` option was added in 2016 because of some issues with btrfs. I do not face those issues anymore, so perhaps all that code could be removed
I have been (re)working on duperemove over the last couple of months, with one main purpose: make duperemove more accessible and easier to use for common use cases. Some changes have been merged and currently live in master, beyond the 0.12 release
My todolist is currently focused on the code behind the hash phase and the block-based lookup, both of which are not the easiest to grasp
In the hash phase, I'd like to batch hash insertion (to greatly reduce memory usage for block scans on large files). Always doing a whole-file checksum is a good idea and has been added
In the block-based lookup, I'd like to fully implement the `--batchsize` stuff: currently, if you have a large dataset, the process collapses on itself due to suboptimal CPU operations (this prevents me from using `--dedupe-options=partial` on my personal server, what a shame)
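Batched insertion could look something like this sketch (illustrative only: duperemove stores hashes in an SQLite hashfile, but the schema and batching policy below are made up): accumulate rows in memory and flush them one transaction per batch, instead of one insert per block.

```python
import sqlite3

BATCH = 1024  # flush after this many pending rows (tunable, like --batchsize)

class HashWriter:
    """Accumulate (file_id, block_no, digest) rows; flush in batches."""
    def __init__(self, conn):
        self.conn = conn
        self.pending = []
        conn.execute("CREATE TABLE IF NOT EXISTS hashes "
                     "(file_id INT, block_no INT, digest TEXT)")

    def add(self, file_id, block_no, digest):
        self.pending.append((file_id, block_no, digest))
        if len(self.pending) >= BATCH:
            self.flush()

    def flush(self):
        if self.pending:
            with self.conn:  # one transaction per batch, not per row
                self.conn.executemany(
                    "INSERT INTO hashes VALUES (?, ?, ?)", self.pending)
            self.pending.clear()

conn = sqlite3.connect(":memory:")
w = HashWriter(conn)
for block_no in range(2500):
    w.add(1, block_no, f"digest-{block_no}")
w.flush()  # flush the final partial batch
print(conn.execute("SELECT COUNT(*) FROM hashes").fetchone()[0])  # 2500
```

The memory win comes from `pending` never growing past `BATCH` rows, regardless of file size.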
As I said, thank you for your feedback, I will definitely take it into account. In the meantime, feel free to test the 0.12 version (or even better, the code from master) to see if it is somehow better
Some more testing, including the larger dataset [9.5 GB, bottom]:
Conclusions:

- `--dedupe-options=partial` should not be used, either (for the small dataset, all other options are strictly better; with the larger one it consistently hangs).
- `--lookup-extents=no` isn't ideal, either, because `--lookup-extents=yes --dedupe-options=nofiemap` consistently beats it in all respects. This means there's no space savings estimate at the end, but the number that it reports is pure fiction anyway.
- `--lookup-extents=yes --dedupe-options=nofiemap,partial` actually helps with fragmentation, but is slower, at least on the larger set.

In short:

- low fragmentation > dedupe speed? ⇒ `--dedupe-options=nofiemap,partial`
- dedupe speed > low fragmentation [or if the above hangs]? ⇒ `--dedupe-options=nofiemap`
When you say it hangs: is your CPU working as hell ? In what step does it hang, exactly ? Could you show me the stdout ?
CPU load:
The stdout is in the log in the tar.gz in the previous post.
For instance, with newly created wine prefixes
I haven't actually tested WINE prefixes yet, just directories with WINE prefixes in them, with horrible results. Those have a lot more game data than actual prefix data, though; it's quite possible the prefixes themselves dedupe fine. I'll check.
No, exact same result with just a random pfx. Having `fiemap` on effectively kills my deduplication rate: it goes from 4 % exclusive data in the copy to 89 %.
I'm reasonably happy with turning off `fiemap` for now; but if you think there's a bug here after all, I can try 0.12 if you like.
It's just, if you really think 89 % duplicated data vs 4 % [pfx], and 92 % duplicated data vs 1 % [large dataset], is working as intended, I'd rather bow out now.
Are you recreating the files between each tests ?
`fiemap` excludes data that is already shared (whose physical offsets are the same)
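A toy illustration of that check (hypothetical data structures, not duperemove internals): if two files' extents over the same logical range already share a physical offset, they are already reflinked and there is nothing left to deduplicate.

```python
# Each extent: (logical_offset, physical_offset, length), as FIEMAP reports.
def dedupe_candidates(extents_a, extents_b):
    """Return logical (offset, length) ranges where the two files occupy
    *different* physical blocks, i.e. ranges that still need deduplication."""
    candidates = []
    for (log_a, phys_a, len_a), (log_b, phys_b, len_b) in zip(extents_a, extents_b):
        assert log_a == log_b and len_a == len_b  # toy case: identical layouts
        if phys_a != phys_b:            # not already shared -> candidate
            candidates.append((log_a, len_a))
    return candidates

# First extent is already reflinked (same physical offset), second is not.
a = [(0, 4096, 4096), (4096, 20480, 4096)]
b = [(0, 4096, 4096), (4096, 65536, 4096)]
print(dedupe_candidates(a, b))  # [(4096, 4096)]
```

So between test runs, files have to be recreated (not reflink-copied), or the already-shared extents make the numbers look better than the tool's actual detection rate.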
Are you recreating the files between each tests ?
Yes. See `test.sh` in the tar.gz for details. It's really not a script meant for anyone to see, but I suppose it does document the testing process.
Re. the hanging issue: On the data I actually wanted to dedupe (~650 GB), duperemove hangs at `file [055153/134315] (41.06%)`, no matter whether `partial` is enabled or not. The only difference is that `nopartial` doesn't tax the CPU as much. The file seems to be part of a journald journal.
I've got three questions coming out of this thread:
What is the command to do the most dedupe in 0.12? I suppose it's `--dedupe-option=same,partial`, but this message (https://github.com/markfasheh/duperemove/issues/301#issuecomment-1667906107) confused me a bit.
I cannot install the 0.11.1 version, as using `make` gives errors about `multiple definition of 'alloc__mutex'`. Is the difference between 0.11.1 and 0.12 big enough to warrant debugging / trying to find an older build?
Is it just better to jump to something like Bees? I came to Duperemove because it was way easier to install and configure, but now I'm wondering if it's worth it.
I cannot install the 0.11.1 version [...]
Cherry-picking commit 58dd49fb429339b7104c23224f45aa99c5d160a0 (Fix declare_alloc_tracking macro) should fix the FTBFS error you're seeing.
Hello @fallenguru
I just merged one last bit of code about this issue, regarding identical files
It would be nice if you could check out the changes and share some feedback with me
- What is the command to do the most dedupe in 0.12? I suppose it's `--dedupe-option=same,partial`, but this message (doesn't find nearly all duplicate extents anymore, not even identical files #301 (comment)) confused me a bit.
I believe the two options (from the current master, not from the v0.12 released tag) are:
duperemove -rhd /data
Or:
duperemove -rhd /data --dedupe-options=partial
- Is it just better to jump to something like Bees? I came to Duperemove because it was way easier to install and configure, but now I'm wondering if it's worth it.
`Bees` and `duperemove` are not the same:
In the end, this is all about your use cases
Hello @fallenguru
I believe the issues you reported are fixed in the latest release. Feel free to reopen this if you still have an issue
Thank you for your report
I've been using duperemove with btrfs successfully for years, though mostly on Ubuntu 18.04 [v. 0.11] and previous incarnations of Debian stable [same].
Now I've reinstalled my main desktop with Ubuntu 22.04, which has 0.11.2—and the detection of duplicate extents is suddenly severely lacking.
For example, I tend to have multiple copies of games around for testing. These are created using `cp --reflink`, but while moving files around between old and new disks, I managed to break a whole bunch of links, ending up with multiple full copies. No problem, I thought, let's run duperemove over it. To my surprise, there was barely any change, say between 1 and 2 GB for two directories of 9-something GB each which I knew to be duplicates except for some config files and such. (The output of `diff --recursive --quiet --no-dereference` confirmed that the two directories were in fact virtually identical.)

Now for the fun bit: When I ran fdupes (which dutifully found all duplicates) and piped its output to duperemove, they got deduped just fine. My two directories were back down to occupying the space of one copy plus change.
That's just one example out of many. The only other thing that changed, besides the distro and duperemove version, is that the disk in question has compression on, which should be transparent; and anyway, the results on the other disk are also suspiciously bad.
Any ideas what could be going on here?