markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

SIGABRT in dedupe_extent_list: dext->de_num_dupes < 2 #249

Closed ericzinnikas closed 3 years ago

ericzinnikas commented 3 years ago

Using the latest from master (as of a few days ago), I'm consistently running into the following backtrace:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff715c859 in __GI_abort () at abort.c:79
#2  0x0000555555584356 in dedupe_extent_list (dext=0x60b0001b1f50, fiemap_bytes=0x7ffff02f7cf0, kern_bytes=0x7ffff02f7d10, passno=802)
    at run_dedupe.c:308
#3  0x000055555558537f in extent_dedupe_worker (dext=0x60b0001b1f50, fiemap_bytes=0x7ffff02f7cf0, kern_bytes=0x7ffff02f7d10)
    at run_dedupe.c:509
#4  0x00005555555854fa in dedupe_worker (priv=0x60b0001b1f50, counts=0x7fffffffe330) at run_dedupe.c:532

The offending line in run_dedupe.c:

abort_on(dext->de_num_dupes < 2);
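
For context, abort_on is duperemove's assertion-style helper; the ERROR: run_dedupe.c:308 line later in this thread is its output just before it calls abort(), which is where the SIGABRT comes from. A rough sketch of the pattern (not the verbatim definition from duperemove's headers):

#include <stdio.h>   /* fprintf */
#include <stdlib.h> /* abort */

#define abort_on(cond) do {                         \
        if (cond) {                                 \
                fprintf(stderr, "ERROR: %s:%d\n",   \
                        __FILE__, __LINE__);        \
                abort();                            \
        }                                           \
} while (0)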

I've tried digging into the function and doing some debugging myself, but I don't understand the internal structure well enough. It reproduces 100% of the time, always on the same file(s) -- if I skip that file, it will happen a few more times throughout the dedupe process.

filefrag output for two files I see this with:

$ filefrag -v repro_file
Filesystem type is: 58465342
File size of repro_file is 159744 (39 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:       32..      38: 7081836420..7081836426:      7:         32: last,eof
repro_file: 1 extent found

$ filefrag -v repro_file_2
Filesystem type is: 58465342
File size of repro_file_2 is 972596 (238 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      95:  320672384.. 320672479:     96:
   1:      224..     237:  320672608.. 320672621:     14:             last,eof
repro_file_2: 2 extents found

Running with the following args:

--hashfile=hashes.dat --lookup-extents=yes --dedupe-options=noblock --skip-zeroes --debug -hdr

lorddoskias commented 3 years ago

What is the exact command line you are running? This says you have a structure describing a duplicated extent, but with only a single extent. That is something of an oxymoron, since duplication implies at least 2 extents. Also, I see that the two files don't have extents of overlapping size.

ericzinnikas commented 3 years ago

./duperemove --hashfile=hashes.dat --lookup-extents=yes --dedupe-options=noblock --skip-zeroes --debug -hdr /external/dir1/ /external/dir2/folder/ /external/dir3/ -- this is what I'm running, just with the dir names changed. I also enabled DEBUG in the Makefile, but that only helped get the full backtrace in gdb.

There are several thousand files being processed, and the two it trips on are not necessarily related to or sequential with each other.

I know we ran into that overflow issue before, due to extent size -- could there be another instance of that somewhere in the codebase? Though I guess these extents are not particularly large.

lorddoskias commented 3 years ago

Ok, here are a few things to try.

  1. Try to reproduce by running ./duperemove --hashfile=newdb.dat --lookup-extents=yes --dedupe-options=noblock --skip-zeroes --debug -hdr /path/to/file1 /path/to/file2

  2. If it reproduces, then do ./duperemove --io-threads=1 --cpu-threads=1 --hashfile=newdb2.dat --lookup-extents=yes --dedupe-options=noblock --skip-zeroes --debug -hdr /path/to/file1 /path/to/file2 - the idea here is to see whether it still reproduces without any internal concurrency, just to exclude that as a possible cause.

  3. Then, if you could provide a database file containing the 2 files which cause the SIGABRT, it would help immensely in debugging this locally. I suspect you might be encountering a collision between 2 different extents of different sizes, which causes duperemove to misbehave because it's hitting a bug (see the sketch after this list). Providing me the database would let me see exactly what is going on; bear in mind the 'files' table will contain full paths to the files, so if you want to censor them, do so before sending me the db file.
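
To illustrate the suspicion in point 3: if duplicate candidates are keyed by digest alone, two extents whose hashes collide but whose lengths differ get grouped together as "duplicates" even though they can never be deduped against each other -- and a group like that can end up with fewer than 2 usable members, tripping the de_num_dupes < 2 assertion. A hypothetical sketch (struct extent_rec, DIGEST_LEN and the function name are invented for illustration, not duperemove's actual types):

#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 16                    /* hypothetical digest size */

struct extent_rec {
        unsigned char digest[DIGEST_LEN];
        uint64_t len;                    /* extent length in bytes */
};

/* Digest-only match: a hash collision between extents of different
 * length lands both in the same "duplicate" group. */
static int same_dupe_group(const struct extent_rec *a,
                           const struct extent_rec *b)
{
        return memcmp(a->digest, b->digest, DIGEST_LEN) == 0;
}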

ericzinnikas commented 3 years ago

Yep, it seems that there are 2 extents with the same hash but different lengths:

sqlite> select hex(digest), len from extents;
94457523E1FA1F44D3E931AA80892419|393216
23851BFA7DA72AF0B9CB11DA106601D1|55092
23851BFA7DA72AF0B9CB11DA106601D1|28672

and duperemove output:

Total files:  2
Total extent hashes: 3
Loading only duplicated hashes from hashfile.
Found 2 identical extents.
Simple read and compare of file data found 2 instances of extents that might benefit from deduplication.
Showing 1 identical extents of length 28.0K with id 23851bfa
Start           Filename
128.0K  "/data/test-duperemove/repro_file"
Showing 1 identical extents of length 53.8K with id 23851bfa
Start           Filename
896.0K  "/data/test-duperemove/repro_file_2"
Using 1 threads for dedupe phase
ERROR: run_dedupe.c:308

Here is the DB: [edit: removed, should be available in your email notifications. Else email me <firstname>@<lastname>.org and I'll send it.]

lorddoskias commented 3 years ago

This issue should be fixed by https://github.com/lorddoskias/duperemove/commit/5b94a81d4ada99dfebfd497e5b36aac6774259a1

Care to test it? Running it locally on your database file, it doesn't detect any duplicates (which is correct, since the len parameter differs between the two extents). Once you report it's okay, I will merge the fix.
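
The gist of the fix, continuing the hypothetical sketch from earlier in the thread (the commit above is the authoritative version): require the length to match as well as the digest, so the colliding 28.0K and 53.8K extents no longer form a duplicate group.

/* Digest *and* length must match before two extents are treated
 * as duplicates of each other. */
static int same_dupe_group_fixed(const struct extent_rec *a,
                                 const struct extent_rec *b)
{
        return a->len == b->len &&
               memcmp(a->digest, b->digest, DIGEST_LEN) == 0;
}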

ericzinnikas commented 3 years ago

Thanks, I can confirm this works!

lorddoskias commented 3 years ago

Fix merged in #252, so closing.