markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

duperemove quadratic slowdown on incremental runs #303

Closed trofi closed 1 year ago

trofi commented 1 year ago

I use a btrfs filesystem (compression enabled). I have about 30M files in 75K directories. It's a growing cache of build artifacts: it grows a directory at a time and has quite a lot of duplicates (50% to 70%). Most of the similar files are only a partial match: ~20 bytes out of ~1MB differ (different identifiers embedded).

I wondered if I could use duperemove incrementally, one directory at a time. But it looks like, at least by default, duperemove does a lot of work when asked to deduplicate a single file: it reads (and maybe extracts extents for?) all known files.

The following seems to be enough to illustrate the slowdown: we create N files and add them one by one into the database (real use would be a directory at a time):

$ rm -rf t hf.db; time { mkdir t; for i in `seq 1 1000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i; done; }

I'll increase the 1000 linearly to see how long processing takes:

# 1k
real    0m8,429s
# 2k
real    0m21,543s (+13s)
# 3k
real    0m39,847s (+18s)
# 4k
real    1m4,735s (+25s)
# 5k
real    1m34,389s (+30s)

It's not a perfectly clean test, as it also measures the time to create the files. File creation is minor though: it takes 9ms to create 5k files.
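For a cleaner measurement one could pre-create the files and time only the duperemove loop; a minimal sketch along the same lines as the command above, using no flags beyond those already shown:

$ rm -rf t hf.db; mkdir t
$ for i in `seq 1 1000`; do echo $i > t/$i; done    # file creation, not timed
$ time for i in `seq 1 1000`; do duperemove --hashfile=hf.db t/$i >/dev/null; done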

I think we see a quadratic slowdown here: the per-1k increments grow roughly linearly (+13s, +18s, +25s, +30s), so the total time grows quadratically. Is this intended (required by design) behaviour? Or is it possibly an omission somewhere, and incremental mode could be faster?

Thanks!

JackSlateur commented 1 year ago

Hello @trofi

When you run a command like this: duperemove -hr --hashfile=/tmp/hashes some-directory, duperemove will list and process the contents of some-directory, but it will also re-check every file already recorded in the hashfile (does it still exist, has it changed, etc.).

For your use case, which is, as I understand it, some sort of append-only filesystem, all that work is useless and expensive

Could you check the quick code I pushed in this branch? It adds an --ignore-existing-files option for you to try
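A minimal sketch of how that would be used on an append-only tree (same flags as above; --ignore-existing-files is the new option from that branch):

$ duperemove -hr --hashfile=/tmp/hashes some-directory                           # initial scan populates the hashfile
$ duperemove -hr --hashfile=/tmp/hashes --ignore-existing-files some-directory   # later runs skip re-checking files already in the hashfile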

As a test, I created a directory that contains 100 sub-directories with 1000 files each (100000 files in total). The hashfile has been built:

0.49 [jack:/tmp/mnt] hashstats /tmp/hashes
Raw header info for "/tmp/hashes":
  version: 3.0  block_size: 131072
  num_files: 100000 num_hashes: 0
Print top 10 hashes (this may take some time)

Running it from master, with just one directory as a target:

0.42 [jack:/tmp/mnt] /usr/bin/time -v duperemove -hr --hashfile=/tmp/hashes 1 > /dev/null
    Command being timed: "duperemove -hr --hashfile=/tmp/hashes 1"
    User time (seconds): 0.22
    System time (seconds): 0.78
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.02
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 40576
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 9575
    Voluntary context switches: 8
    Involuntary context switches: 151
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
0.57 [jack:/tmp/mnt] /usr/bin/time -v strace -fty duperemove -hr --hashfile=/tmp/hashes 1 |& wc -l
508305

And with the option:

0.24 [jack:/tmp/mnt] /usr/bin/time -v duperemove -hr --hashfile=/tmp/hashes 1 --ignore-existing-files >/dev/null
    Command being timed: "duperemove -hr --hashfile=/tmp/hashes 1 --ignore-existing-files"
    User time (seconds): 0.08
    System time (seconds): 0.03
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.12
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 22656
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 5086
    Voluntary context switches: 8
    Involuntary context switches: 14
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

0.77 [jack:/tmp/mnt] /usr/bin/time -v strace -fty duperemove -hr --hashfile=/tmp/hashes 1 --ignore-existing-files |& wc -l
13165
JackSlateur commented 1 year ago

Running your test on my system:

From master:

0.02 [jack:/tmp/test] 130 rm -rf t hf.db; time { mkdir t; for i in `seq 1 1000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i >/dev/null; done; }

real    0m23.545s
user    0m6.188s
sys 0m17.464s
0.42 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 2000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i  >/dev/null; done; }

real    1m0.945s
user    0m17.297s
sys 0m43.896s

1.06 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 3000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i >/dev/null; done; }

real    1m48.824s
user    0m32.310s
sys 1m16.854s

With the option:

1.39 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 1000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i --ignore-existing-files >/dev/null; done; }

real    0m16.674s
user    0m5.141s
sys 0m11.602s
1.11 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 2000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i --ignore-existing-files >/dev/null; done; }

real    0m37.582s
user    0m14.209s
sys 0m23.538s
1.25 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 3000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i --ignore-existing-files >/dev/null; done; }

real    1m2.863s
user    0m26.583s
sys 0m36.609s
trofi commented 1 year ago

Thank you for such a quick response! --ignore-existing-files does indeed speed up incremental runs with a few caveats.

You are correct, it's mostly append-only storage: less-used directories are garbage-collected once every few months and new directories are added every few minutes.

I tried to run the change on real storage (it's the /nix/store from the nix package manager) as:

$ cat ingest.bash
#!/usr/bin/env bash

find /nix/store -maxdepth 1 -type d |
while read d; do
    case $d in
        /nix/store) continue;;
        /nix/store/.links) continue;;
    esac

   echo "Ingesting $d"
    ./duperemove -r --hashfile=h.db --ignore-existing-files $d "$@"
done

Visually scanning the output, it looks like some files keep being reported as candidates on every run. If I pick 2 tiny directories and add them manually, it reports seemingly unrelated deduplication attempts:

# it has 4 small files
$ ./duperemove -r --hashfile=h.db --ignore-existing-files /nix/store/6ghc5jz83f7r4cd4cnpzagbk6jn1w92y-lv-4.51
...
15728640        "/nix/store/pbay9rq3hjapfix5d51acxkplyhrnil4-blas-3/lib/libblas.so.3"
15728640        "/nix/store/pbay9rq3hjapfix5d51acxkplyhrnil4-blas-3/lib/libcblas.so.3"
15728640        "/nix/store/7d0pxnfwsjk9wynnyg0bajscx6n17gmy-lapack-3/lib/liblapacke.so.3"
Showing 1 identical extents of length 1173158 with id 9ed5bac9
Start           Filename
0       "/nix/store/lmq1lqzh91knsc2i4a4yf3f58k1qg3s6-libmysofa-1.3.1/share/libmysofa/MIT_KEMAR_normal_pinna.sofa"
Showing 1 identical extents of length 1232695 with id 29000b05
Start           Filename
0       "/nix/store/wp6wyf46p68j7c7iy7i0l187bgkr727s-python3-3.10.12/lib/python3.10/ensurepip/_bundled/setuptools-65.5.0-py3-none-any.whl"
Showing 2 identical extents of length 1441792 with id bf04ff05
Start           Filename
10616832        "/nix/store/w26n65yi1a2n9ziwna6a5dknanc8xfnc-icu4c-73.2/lib/libicudata.so.73.2"
10616832        "/nix/store/7wl5vy21r1r0kh5mhliqwj1rplzaf8kl-icu4c-73.2/lib/libicudata.so.73.2"
Showing 1 identical extents of length 2055563 with id 80a4ef42
Start           Filename
0       "/nix/store/wp6wyf46p68j7c7iy7i0l187bgkr727s-python3-3.10.12/lib/python3.10/ensurepip/_bundled/pip-23.0.1-py3-none-any.whl"
Showing 2 identical extents of length 2490368 with id a3daa133
Start           Filename
655360  "/nix/store/w26n65yi1a2n9ziwna6a5dknanc8xfnc-icu4c-73.2/lib/libicudata.so.73.2"
655360  "/nix/store/7wl5vy21r1r0kh5mhliqwj1rplzaf8kl-icu4c-73.2/lib/libicudata.so.73.2"
Total files scanned:  2
Total extent hashes scanned: 3

# also has 4 small files, built against older libc version
$ ./duperemove -r --hashfile=h.db --ignore-existing-files /nix/store/7bzkrs0vgvr4fim4ayy0i8zqrf0z84zm-lv-4.51
...
15728640        "/nix/store/pbay9rq3hjapfix5d51acxkplyhrnil4-blas-3/lib/libblas.so.3"
15728640        "/nix/store/pbay9rq3hjapfix5d51acxkplyhrnil4-blas-3/lib/libcblas.so.3"
15728640        "/nix/store/7d0pxnfwsjk9wynnyg0bajscx6n17gmy-lapack-3/lib/liblapacke.so.3"
Showing 1 identical extents of length 1173158 with id 9ed5bac9
Start           Filename
0       "/nix/store/lmq1lqzh91knsc2i4a4yf3f58k1qg3s6-libmysofa-1.3.1/share/libmysofa/MIT_KEMAR_normal_pinna.sofa"
Showing 1 identical extents of length 1232695 with id 29000b05
Start           Filename
0       "/nix/store/wp6wyf46p68j7c7iy7i0l187bgkr727s-python3-3.10.12/lib/python3.10/ensurepip/_bundled/setuptools-65.5.0-py3-none-any.whl"
Showing 2 identical extents of length 1441792 with id bf04ff05
Start           Filename
10616832        "/nix/store/w26n65yi1a2n9ziwna6a5dknanc8xfnc-icu4c-73.2/lib/libicudata.so.73.2"
10616832        "/nix/store/7wl5vy21r1r0kh5mhliqwj1rplzaf8kl-icu4c-73.2/lib/libicudata.so.73.2"
Showing 1 identical extents of length 2055563 with id 80a4ef42
Start           Filename
0       "/nix/store/wp6wyf46p68j7c7iy7i0l187bgkr727s-python3-3.10.12/lib/python3.10/ensurepip/_bundled/pip-23.0.1-py3-none-any.whl"
Showing 2 identical extents of length 2490368 with id a3daa133
Start           Filename
655360  "/nix/store/w26n65yi1a2n9ziwna6a5dknanc8xfnc-icu4c-73.2/lib/libicudata.so.73.2"
655360  "/nix/store/7wl5vy21r1r0kh5mhliqwj1rplzaf8kl-icu4c-73.2/lib/libicudata.so.73.2"
Total files scanned:  2
Total extent hashes scanned: 3

It looks like libicudata.so.73.2 (and many others) gets brought up every time a new run is started.

Do I understand correctly that duperemove keeps re-considering the same extents for deduplication, even though they are unrelated to the incremental input?

trofi commented 1 year ago

If I run it with --quiet and remove the header, here is how the progress looks:

Ingesting /nix/store/hxnwgjqv2ghx2ms9mqz4cz5w4mhgbc6y-file-5.45-dev
Found 18649 identical extents.
Simple read and compare of file data found 16938 instances of extents that might benefit from deduplication.
Total files scanned:  3

Ingesting /nix/store/7pynr1g9si15n72l0wzp8nz4qf5xb36x-libdbusmenu-qt
Found 18649 identical extents.
Simple read and compare of file data found 16938 instances of extents that might benefit from deduplication.
Total files scanned:  67

Ingesting /nix/store/n604q5jmihjc3qfhgbfwknrlsqbwj19x-libkate-0.4.1
Found 18523 identical extents.
Simple read and compare of file data found 16812 instances of extents that might benefit from deduplication.
Total files scanned:  154

Ingesting /nix/store/bdds9wax3qsnhjxfkinkxzdav3fdxriq-python3.10-pyudev-0.24.1
Found 18523 identical extents.
Simple read and compare of file data found 16812 instances of extents that might benefit from deduplication.
Total files scanned:  57

I suspect this means duperemove keeps reconsidering deduping roughly the same ~18K extents on each run.

JackSlateur commented 1 year ago

Hello. You are correct, existing files are always considered for deduplication. To be more specific, any new extent is matched against the whole dataset.

There is a remaining optimisation: deduplicating the new extent against a single old extent (and not the whole list). All old extents are already deduplicated, so they live at the same physical position.

That is a more difficult change, I'll think about it
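As a tangential illustration of extents sharing a physical position: filefrag -v (from e2fsprogs) prints each extent's physical offset and flags shared extents, so already-deduplicated files can be spotted that way; the paths below are placeholders:

$ filefrag -v /path/to/file-a /path/to/file-b    # deduplicated ranges show the same physical_offset and the "shared" flag in both files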

JackSlateur commented 1 year ago

Hello @trofi

First of all, I have merged the previous code into master (https://github.com/markfasheh/duperemove/commit/fa63f6ee1b26feb543f314451e7cc063c3ff854b). The option has been renamed: duperemove [..] --dedupe-option=norescan_files

Also, I have pushed a whole bunch of changes that, I hope, will fix your issue. Could you check this branch?

trofi commented 1 year ago

--dedupe-option=norescan_files test: all good

Tested on a real system as:

$ find /nix/store -maxdepth 1 -type d | fgrep -v '/nix/store/.links' | grep -Pv /nix/store$ | xargs -t -n 1000 ./duperemove -q -r --hashfile=h.db --dedupe-option=norescan_files

The drop-v2 branch behaves quite nicely! Thank you!

Running it directory-at-a-time is a bit slow (each directory takes at least about one second to process), but batching directories 100-at-a-time or 1000-at-a-time (like the command above) makes it very snappy!

My final database is 3.5G (looks decent).

--dedupe-option=norescan_files,partial test: seems to hang

Possibly related question: incremental mode is also useful for seeing whether duperemove gets stuck or is able to make progress when run against a small directory at a time.

I tried to increase deduplication precision by throwing --dedupe-option=norescan_files,partial at my 3.5G database:

$ find /nix/store -maxdepth 1 -type d | fgrep -v '/nix/store/.links' | grep -Pv /nix/store$ | xargs -t -n 1 ./duperemove -q -r --hashfile=h.db --dedupe-option=norescan_files,partial
./duperemove -q -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/57904rflsfh46y7fxi62rzc7sidc9vmq-user-environment
No dedupe candidates found.
./duperemove -q -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125
Found 1552412 identical extents.

And the output is stuck on it (30+ minutes without progress). The directory does not look big:

$ du -hs /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125
1,2M    /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125
$ find /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125 | wc -l
14

Did duperemove get stuck indefinitely? Or is it just super slow?

JackSlateur commented 1 year ago

@trofi Could you check out that branch again? I have just pushed a bit more code

JackSlateur commented 1 year ago

Also note that you will need to recreate the hashfile. The partial mode is only used for new (or modified) files.

trofi commented 1 year ago

Deleted the hash file and checked out the latest branch state. Getting the following crash:

$ find /nix/store -maxdepth 1 -type d | fgrep -v '/nix/store/.links' | grep -Pv /nix/store$ | LANG=C xargs -t -n 1 ./duperemove -r --hashfile=h.db --dedupe-option=norescan_files,partial
./duperemove -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/57904rflsfh46y7fxi62rzc7sidc9vmq-user-environment
Gathering file list...
Adding files from database for hashing.
No dedupe candidates found.
./duperemove -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125
Gathering file list...
Adding files from database for hashing.
Loading only duplicated hashes from hashfile.
Found 1 identical extents.
ERROR: hash-tree.c:68
[stack trace follows]
./duperemove(print_stack_trace+0x2e) [0x40a48e]
./duperemove(insert_hashed_block+0x3f1) [0x406fe1]
./duperemove(dbfile_load_block_hashes+0x86) [0x410e06]
./duperemove(process_duplicates+0x1a7) [0x406777]
./duperemove(populate_tree+0x25a) [0x40c32a]
./duperemove(main+0x8bc) [0x405f7c]
/nix/store/905gkx2q1pswixwmi1qfhfl6mik3f22l-glibc-2.37-8/lib/libc.so.6(+0x23ace) [0x7ffff7b63ace]
/nix/store/905gkx2q1pswixwmi1qfhfl6mik3f22l-glibc-2.37-8/lib/libc.so.6(__libc_start_main+0x89) [0x7ffff7b63b89]
./duperemove(_start+0x25) [0x406505]
xargs: ./duperemove: terminated by signal 6

gdb's backtrace if it's of any help:

Program received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44      pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff7bc7af3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007ffff7b78c86 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff7b628ba in __GI_abort () at abort.c:79
#4  0x0000000000406fe6 in insert_file_hash_head (head=<optimized out>, dups=<optimized out>) at hash-tree.c:128
#5  add_file_hash_head (block=<optimized out>, dups=<optimized out>) at hash-tree.c:151
#6  insert_hashed_block (tree=tree@entry=0x7fffffffaa30, digest=digest@entry=0x430348 "[\203\371\210\272\344UV+@G\035\022\344:\240", file=0x6206e0, loff=loff@entry=0, flags=flags@entry=8) at hash-tree.c:290
#7  0x0000000000410e06 in dbfile_load_block_hashes (hash_tree=hash_tree@entry=0x7fffffffaa30) at dbfile.c:1357
#8  0x0000000000406777 in process_duplicates () at duperemove.c:578
#9  0x000000000040c32a in populate_tree (cfg=cfg@entry=0x41b300 <dbfile_cfg>, batch_size=0, callback=callback@entry=0x4065d0 <process_duplicates>, scanned_filerec=scanned_filerec@entry=0x41b340 <scanned_filerec>)
    at file_scan.c:1034
#10 0x0000000000405f7c in create_update_hashfile (filelist_idx=<optimized out>, argv=<optimized out>, argc=<optimized out>) at duperemove.c:652
#11 main (argc=<optimized out>, argv=<optimized out>) at duperemove.c:741
JackSlateur commented 1 year ago

@trofi I have struggled to reproduce your issue :/ On the other hand, after checking a bit deeper, that commit is buggy, so I have reworked it

trofi commented 1 year ago

Tried commit 4a48838941d591c55119e9613a29fbe54e97aad8. It is able to run against a few directories successfully and then fails:

$ rm h.db
$ find /nix/store -maxdepth 1 -type d | fgrep -v '/nix/store/.links' | grep -Pv /nix/store$ | LANG=C xargs -t -n 1 ./duperemove -r --hashfile=h.db --dedupe-option=norescan_files,partial
...
[19/19] (100.00%) csum: /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
Loading only duplicated hashes from hashfile.
Found 1 identical extents.
ERROR: hash-tree.c:68
[stack trace follows]
./duperemove(print_stack_trace+0x2e) [0x40a40e]
./duperemove(insert_hashed_block+0x3f1) [0x406f61]
./duperemove(dbfile_load_block_hashes+0x86) [0x410d76]
./duperemove(process_duplicates+0x138) [0x4066f8]
./duperemove(populate_tree+0x1b6) [0x40c206]
./duperemove(main+0x8b4) [0x405f74]
/nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6(+0x23ace) [0x7ffff7b63ace]
/nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6(__libc_start_main+0x89) [0x7ffff7b63b89]
./duperemove(_start+0x25) [0x4064f5]
xargs: ./duperemove: terminated by signal 6

gdb backtrace:

Program received signal SIGABRT, Aborted.
0x00007ffff7bc7a8c in __pthread_kill_implementation () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
(gdb) bt
#0  0x00007ffff7bc7a8c in __pthread_kill_implementation () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
#1  0x00007ffff7b78c86 in raise () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
#2  0x00007ffff7b628ba in abort () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
#3  0x0000000000406f66 in insert_file_hash_head (head=<optimized out>, dups=<optimized out>) at hash-tree.c:128
#4  add_file_hash_head (block=<optimized out>, dups=<optimized out>) at hash-tree.c:151
#5  insert_hashed_block (tree=tree@entry=0x7fffffffaaa0, digest=digest@entry=0x4307c8 "[\203\371\210\272\344UV+@G\035\022\344:\240", file=0x45b4e0,
    loff=loff@entry=0, flags=flags@entry=8) at hash-tree.c:290
#6  0x0000000000410d76 in dbfile_load_block_hashes (hash_tree=hash_tree@entry=0x7fffffffaaa0) at dbfile.c:1357
#7  0x00000000004066f8 in process_duplicates () at duperemove.c:575
#8  0x000000000040c2d6 in populate_tree (cfg=cfg@entry=0x41b300 <dbfile_cfg>, batch_size=0, callback=callback@entry=0x4065c0 <process_duplicates>)
    at file_scan.c:1039
#9  0x0000000000405f74 in create_update_hashfile (filelist_idx=<optimized out>, argv=<optimized out>, argc=<optimized out>) at duperemove.c:644
#10 main (argc=<optimized out>, argv=<optimized out>) at duperemove.c:732
JackSlateur commented 1 year ago

There is another bug report that seems similar: #285

JackSlateur commented 1 year ago

@trofi Could you send me your hashfile (the one that produces the crash)?

trofi commented 1 year ago

Sure! https://slyfox.uni.cx/b/2023-09-09-h.db is a 3MB h.db I got. Should not contain any sensitive data.

JackSlateur commented 1 year ago

Thank you!

sqlite> SELECT * FROM hashes JOIN (SELECT DISTINCT digest FROM hashes WHERE digest IN (SELECT DISTINCT digest FROM hashes WHERE ino IN (SELECT DISTINCT ino FROM files WHERE dedupe_seq > (SELECT keyval FROM config WHERE keyname = 'dedupe_sequence'))) GROUP BY digest HAVING count(*) > 1) AS duplicate_hashes on hashes.digest = duplicate_hashes.digest;
[���UV+@G�:�|3730010316|848|0|8|[���UV+@G�:�
[���UV+@G�:�|3730010316|848|0|8|[���UV+@G�:�

So there are indeed duplicate records in the hashfile. I have added unique constraints to prevent such records from being inserted.
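For illustration, a uniqueness rule of roughly this shape on the hashes table would reject such rows at insert time; the exact constraint in the commit may differ, and the column names here are only inferred from the query and backtrace above:

$ sqlite3 h.db "CREATE UNIQUE INDEX IF NOT EXISTS hashes_uniq ON hashes(ino, subvol, loff);"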

This way, you will crash earlier :)

I will dig deeper

trofi commented 1 year ago

I think it detects the constraint violation now. But it looks like that does not cause duperemove to stop; instead it loops over and over:

$ ./duperemove -q -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin
Warning: The hash file format in Duperemove master branch is under development and may change.
If the changes are not backwards compatible, you will have to re-create your hash file.
dbfile_store_block_hashes()/2739581: Database error 19 while executing statement: constraint failed
Skipping file due to error 19 from function csum_by_extent (No such device), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
dbfile_begin_trans()/2739578: Database error 1 while starting transaction: SQL logic error
Skipping file due to error 1 from function csum_by_extent (Operation not permitted), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/tzselect
dbfile_begin_trans()/2739586: Database error 1 while starting transaction: SQL logic error
Skipping file due to error 1 from function csum_by_extent (Operation not permitted), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/xtrace
dbfile_begin_trans()/2739579: Database error 1 while starting transaction: SQL logic error
Skipping file due to error 1 from function csum_by_extent (Operation not permitted), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/localedef
dbfile_begin_trans()/2739570: Database error 1 while starting transaction: SQL logic error
Skipping file due to error 1 from function csum_by_extent (Operation not permitted), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/nscd
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
...
trofi commented 1 year ago

Nothing fancy on the subvolumes side, but yes on hardlinks: all identical files are already hardlinked to the same inodes:

$ LANG=C stat /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/POSIX_V6_LP64_OFF64 /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
  File: /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/POSIX_V6_LP64_OFF64
  Size: 29248           Blocks: 64         IO Block: 4096   regular file
Device: 0,30    Inode: 3730010316  Links: 5
Access: (0555/-r-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-08-05 11:00:51.000000000 +0100
Modify: 1970-01-01 01:00:01.000000000 +0100
Change: 2023-08-05 11:00:52.321137476 +0100
 Birth: 2023-08-05 11:00:51.598124397 +0100
  File: /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
  Size: 29248           Blocks: 64         IO Block: 4096   regular file
Device: 0,30    Inode: 3730010316  Links: 5
Access: (0555/-r-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-08-05 11:00:51.000000000 +0100
Modify: 1970-01-01 01:00:01.000000000 +0100
Change: 2023-08-05 11:00:52.321137476 +0100
 Birth: 2023-08-05 11:00:51.598124397 +0100

I have 4 names for it:

$ find /nix/store/ -inum 3730010316
/nix/store/.links/0miarylrg34m2gn2hji7a8q6rvf96zicp93q7rf87w7dwfmc41l5
/nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/POSIX_V6_LP64_OFF64
/nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/POSIX_V7_LP64_OFF64
/nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/XBS5_LP64_OFF64
/nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
JackSlateur commented 1 year ago

Ok, more bugs were squashed

I have added more constraints to the hashfile database and, hopefully, fixed your issue (c89a771a3f5c37a81cf2f25bd0b8c6083dd9c17c)

Also, the partial lookup is still a bit slow on my system; it appears that the lookaside allocator is disabled at compile time in Debian's libsqlite3. An issue has been raised.

edit: more bugs found .. (and fixed with 1657ab8df820e27d01ffc29f967776e78acd86fe)
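As a side note, one way to check how a system's libsqlite3 was built is sqlite's standard compile_options pragma; a lookaside-related entry (e.g. OMIT_LOOKASIDE) would confirm it was disabled at build time:

$ sqlite3 ':memory:' 'PRAGMA compile_options;' | grep -i lookaside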

trofi commented 1 year ago

I am somehow seeing SIGSEGVs in multithreaded mode with ./duperemove -q -r '--hashfile=h.db' '--dedupe-options=norescan_files' -d ... (commit 5ad33619a866d2b496db278325f62219d75adfdb).

gdb backtrace:

(gdb) bt
#0  0x00007f4b16fbcded in sqlite3DbMallocRawNN ()
   from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#1  0x00007f4b16fc04fb in sqlite3DbStrDup ()
   from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#2  0x00007f4b16fce527 in sqlite3VdbeScanStatus ()
   from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#3  0x00007f4b16fce5db in sqlite3WhereAddScanStatus ()
   from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#4  0x00007f4b1703a4db in sqlite3WhereBegin ()
   from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#5  0x00007f4b17061484 in sqlite3Update.constprop.0 ()
   from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#6  0x00007f4b170445a4 in yy_reduce.constprop ()
   from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#7  0x00007f4b17045de7 in sqlite3RunParser ()
   from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#8  0x00007f4b1704672f in sqlite3Prepare ()
   from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#9  0x00007f4b17046ba3 in sqlite3LockAndPrepare ()
   from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#10 0x00007f4b17046d06 in sqlite3_prepare_v2 ()
   from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#11 0x000000000040f28e in dbfile_update_extent_poff (db=db@entry=0x11c8778,
    ino=4451283826, subvol=848, loff=3276800, poff=2009557819392) at dbfile.c:343
#12 0x000000000040d6dc in extent_dedupe_worker (kern_bytes=0x7f4b067fbd80,
    fiemap_bytes=<synthetic pointer>, dext=<optimized out>) at run_dedupe.c:518
#13 dedupe_worker (priv=<optimized out>, counts=0x7ffee50daf20) at run_dedupe.c:535
#14 0x00007f4b1716511a in g_thread_pool_thread_proxy ()
   from /nix/store/hl36mh6qdb8a7ymhql4qmfza3cdr24in-glib-2.76.4/lib/libglib-2.0.so.0
#15 0x00007f4b171647cd in g_thread_proxy ()
--Type <RET> for more, q to quit, c to continue without paging--
   from /nix/store/hl36mh6qdb8a7ymhql4qmfza3cdr24in-glib-2.76.4/lib/libglib-2.0.so.0
#16 0x00007f4b16e25dd4 in start_thread () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
#17 0x00007f4b16ea79b0 in clone3 () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6

Adding --io-threads=1 --cpu-threads=1 makes the crash go away.
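For reference, the workaround spelled out against the same command (target paths elided as above):

$ ./duperemove -q -r --hashfile=h.db --dedupe-options=norescan_files -d --io-threads=1 --cpu-threads=1 ...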

JackSlateur commented 1 year ago

Hello @trofi

All the code has been merged into master. I am currently running it against my live system; don't hesitate to report any issues you find.

Thank you

trofi commented 1 year ago

Sounds good. Let's close this issue then.