markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0
805 stars 80 forks source link

`duperemove-0.14` triggers asserts in `dedupe_extent_list()` at `run_dedupe.c:448` #329

Closed trofi closed 10 months ago

trofi commented 10 months ago

Ran duperemove from current master at 296d43a208be82e3ae2473ba51d15f6e025f0c80 Merge pull request #326 from autumnontape/cfi-icall, fails as:

$ gdb --args ./duperemove -rd  --batchsize=0 --dedupe-options=partial,same /nix/store/.links/
...
Thread 27 "pool" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff48a76c0 (LWP 783012)]
0x00007ffff7ae666c in __pthread_kill_implementation ()
   from /nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6
...
(gdb) bt
#0  0x00007ffff7ae666c in __pthread_kill_implementation ()
   from /nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6
#1  0x00007ffff7a91bf6 in raise () from /nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6
#2  0x00007ffff7a7a8f8 in abort () from /nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6
#3  0x000000000040d849 in dedupe_extent_list (passno=<optimized out>, kern_bytes=0x7ffff48a6cb0,
    fiemap_bytes=<synthetic pointer>, dext=<optimized out>) at run_dedupe.c:448
#4  extent_dedupe_worker (kern_bytes=0x7ffff48a6cb0, fiemap_bytes=<synthetic pointer>, dext=<optimized out>)
    at run_dedupe.c:496
#5  dedupe_worker (priv=<optimized out>, counts=0x7fffffffac50) at run_dedupe.c:539
#6  0x00007ffff7f04d72 in g_thread_pool_thread_proxy ()
   from /nix/store/6bpc4sc5apc2ryxhjyij43n3wi169hfn-glib-2.76.4/lib/libglib-2.0.so.0
#7  0x00007ffff7f0441d in g_thread_proxy ()
   from /nix/store/6bpc4sc5apc2ryxhjyij43n3wi169hfn-glib-2.76.4/lib/libglib-2.0.so.0
#8  0x00007ffff7ae48b1 in start_thread () from /nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6
#9  0x00007ffff7b68e6c in clone3 () from /nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6
trofi commented 10 months ago

If I run it in a single thread with --io-threads=1 --cpu-threads=1 I get ther assert failure in run_dedupe.c:287:

...
[0x7ffff0000b90] (0016/1626) Try to dedupe extents with id 0f49c355
[0x7ffff0000b90] Dedupe 1 extents (id: 0f49c355) with target: (0, 3243), "/nix/store/.links/12fmw1zahji9vkqy1b0mkwgad8s37lhryhd5f7z9gm2j114pxwra"
[0x7ffff0000b90] (0017/1626) Try to dedupe extents with id 568eb2f1
[0x7ffff0000b90] Dedupe 1 extents (id: 568eb2f1) with target: (0, 3245), "/nix/store/.links/0nwf3ywxdmn9c16s3y2djb52wr7qz55hmvnxa2q9kvgzgad262xd"
ERROR: run_dedupe.c:287
[stack trace follows]
/mnt/archive/big/git/duperemove/duperemove(print_stack_trace+0x29) [0x409b39]
/mnt/archive/big/git/duperemove/duperemove() [0x40d844]
/nix/store/6bpc4sc5apc2ryxhjyij43n3wi169hfn-glib-2.76.4/lib/libglib-2.0.so.0(+0x8ad72) [0x7ffff7f04d72]
/nix/store/6bpc4sc5apc2ryxhjyij43n3wi169hfn-glib-2.76.4/lib/libglib-2.0.so.0(+0x8a41d) [0x7ffff7f0441d]
/nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6(+0x908b1) [0x7ffff7ae48b1]
/nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6(+0x114e6c) [0x7ffff7b68e6c]
trofi commented 10 months ago

I wrote a duperemove-fuzz.bash fuzzer that crashes duperemove within 10 seconds for me:

#!/usr/bin/env bash

duperemove_binary=$1
target_dir=$2

shift; shift

if [[ -z $duperemove_binary ]] || [[ -z $target_dir ]]; then
    echo "Usage: $0 </abs/path/to/duperemove> <directory> [duperemove opts]"
    exit 1
fi

# fail on any error
set -e

mkdir "$target_dir"
cd "$target_dir"

shopt -s nullglob

while :; do
    sync
    files=(*)
    f_count=${#files[@]}
    dst=$f_count

    case $((RANDOM % 4)) in
        0)  # copy existing file
            [[ $f_count -eq 0 ]] && continue

            cp_arg=""
            case $((RANDOM % 2)) in
                0) cp_arg=--reflink=always;;
                1) cp_arg=--reflink=never;;
            esac
            src=$((RANDOM % f_count))
            cp -v "$cp_arg" "$src" "$dst"
            ;;
        1) # create new file of 4x4KB distinct blocks
            printf "0%*d" 4095 "$dst"  > "$dst"
            printf "1%*d" 4095 "$dst" >> "$dst"
            printf "2%*d" 4095 "$dst" >> "$dst"
            printf "3%*d" 4095 "$dst" >> "$dst"
            ;;
        2) # run duperemove
            "$duperemove_binary" "$@" -rd -b 4096 "$target_dir"
            ;;
        3) # dd 4KB of one file into another
            [[ $f_count -eq 0 ]] && continue

            src=$((RANDOM % f_count))
            dst=$((RANDOM % f_count))
            [[ $src = $dst ]] && continue

            src_block=$((RANDOM % 3))
            dst_block=$((RANDOM % 3))
            dd "if=$src" "iseek=$src_block" "of=$dst" "oseek=$dst_block" bs=4096 count=1
            ;;
    esac
done

Using:

$ time { rm -rfv ~/tmp/dr/ && bash ./duperemove-fuzz.bash $PWD/duperemove/duperemove ~/tmp/dr -q; }
...
4096 bytes (4.1 kB, 4.0 KiB) copied, 8.7406e-05 s, 46.9 MB/s
'90' -> '189'
Simple read and compare of file data found 29 instances of files that might benefit from deduplication.
ERROR: run_dedupe.c:287
[stack trace follows]
/home/slyfox/dev/git/duperemove/duperemove(print_stack_trace+0x29) [0x409b39]
/home/slyfox/dev/git/duperemove/duperemove() [0x40d844]
/nix/store/6bpc4sc5apc2ryxhjyij43n3wi169hfn-glib-2.76.4/lib/libglib-2.0.so.0(+0x8ad72) [0x7f0def29ad72]
/nix/store/6bpc4sc5apc2ryxhjyij43n3wi169hfn-glib-2.76.4/lib/libglib-2.0.so.0(+0x8a41d) [0x7f0def29a41d]
/nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6(+0x908b1) [0x7f0deee7a8b1]
/nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6(+0x114e6c) [0x7f0deeefee6c]
./duperemove-fuzz.bash: line 27: 2598070 Aborted                 (core dumped) "$duperemove_binary" "$@" -rd -b 4096 "$target_dir"

real    0m5.790s
user    0m0.993s
sys     0m2.831s
JackSlateur commented 10 months ago

Hello @trofi I cannot reproduce this behavior, either from my live systems or using your script

Could you give you your kernel version and btrfs-utils version used to create the filesystem ?

Thank you

trofi commented 10 months ago

I wonder if it's related to the compression settings I use. The following seems to reproduce failure for me:

$ uname -a
Linux nz 6.6.2 #1-NixOS SMP PREEMPT_DYNAMIC Mon Nov 20 10:59:39 UTC 2023 x86_64 GNU/Linux

$ btrfs --version
btrfs-progs v6.6.2
$ cd /tmp/
$ fallocate -l 10G btrfs.img
$ mkfs.btrfs btrfs.img
btrfs-progs v6.6.2
See https://btrfs.readthedocs.io for more information.

NOTE: several default settings have changed in version 5.15, please make sure
      this does not affect your deployments:
      - DUP for metadata (-m dup)
      - enabled no-holes (-O no-holes)
      - enabled free-space-tree (-R free-space-tree)

Label:              (null)
UUID:               a54fa315-268b-4820-bd9d-52eb263d23b6
Node size:          16384
Sector size:        4096
Filesystem size:    10.00GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         DUP             256.00MiB
  System:           DUP               8.00MiB
SSD detected:       no
Zoned device:       no
Incompat features:  extref, skinny-metadata, no-holes, free-space-tree
Runtime features:   free-space-tree
Checksum:           crc32c
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1    10.00GiB  btrfs.img

$ sudo mount -onoatime,compress=zstd btrfs.img m/
$ sudo chown $(whoami) m/

$ time ./duperemove-fuzz.bash ~/dev/git/duperemove/duperemove $PWD/m/dr -q
...
Simple read and compare of file data found 34 instances of files that might benefit from deduplication.
ERROR: run_dedupe.c:287
[stack trace follows]
/home/slyfox/dev/git/duperemove/duperemove(print_stack_trace+0x29) [0x409b39]
/home/slyfox/dev/git/duperemove/duperemove() [0x40d844]
/nix/store/6bpc4sc5apc2ryxhjyij43n3wi169hfn-glib-2.76.4/lib/libglib-2.0.so.0(+0x8ad72) [0x7f769cc36d72]
/nix/store/6bpc4sc5apc2ryxhjyij43n3wi169hfn-glib-2.76.4/lib/libglib-2.0.so.0(+0x8a41d) [0x7f769cc3641d]
/nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6(+0x908b1) [0x7f769c8168b1]
/nix/store/znf2bj54q3qj4pyk0xbp7pk12xbxq07b-glibc-2.38-27/lib/libc.so.6(+0x114e6c) [0x7f769c89ae6c]
./duperemove-fuzz.bash: line 27: 3607680 Aborted                 (core dumped) "$duperemove_binary" "$@" -rd -b 4096 "$target_dir"

real    0m2.220s
user    0m0.872s
sys     0m1.107s
JackSlateur commented 10 months ago

With compression enabled, I did reproduce This commit would fix the issue

Could you check ? Thank you

trofi commented 10 months ago

The change fixes all the assert() related crashes for me. Thank you!

JackSlateur commented 10 months ago

Good news ! I have merged to commit, thank you for your report