markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

slow `populate_tree`: looping around `process_duplicates` #304

Closed fogti closed 1 year ago

fogti commented 1 year ago

I ran `duperemove -Adr something` and it produced the following output, repeated thousands of times and scrolling by relatively quickly (another such block was added many times per second):

Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.
Using 12 threads for file hashing phase
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.
Using 12 threads for file hashing phase
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.
Using 12 threads for file hashing phase
Loading only duplicated hashes from hashfile.

perf report told me that most of the processing time is spent in populate_tree. But the code seems a bit odd: it appears to repeatedly create and destroy a thread pool, which defeats the usual benefit of a thread pool (namely caching the threads so that they don't need to be created and destroyed over and over again, which is slow when done many times per second).

fogti commented 1 year ago

e.g. patch https://gist.github.com/fogti/8a81f56c6cac7d1b33e3fe44b497b28e

but after applying that I run into a seemingly random SIGSEGV deep inside sqlite; idk what's going on there (ok, found out that this introduces a race condition between the callback and the csumming of items)

fogti commented 1 year ago

ok so the nature of the bug appears to be quite different from what I first expected: somehow, even with a batch size of 0, only a small number of items are processed per iteration...

JackSlateur commented 1 year ago

Hello @fogti

Thank you for reporting this issue. Sadly, I am unable to reproduce this behavior.

To give you a bit more insight: populate_tree is the function that scans the file contents on disk and creates the hashes. Also, when batching is disabled (which is the default), a single thread pool is used, which processes all entries. You can check out the code here

Could you tell me more about your dataset (number of files, dataset size, etc.)? Also, could you give me debugging output, from duperemove -Adr something --debug?

Best regards,

fogti commented 1 year ago

A problem I have is that I can't reproduce it deterministically; it just happens sometimes. (So I'll run it the next few times with the --debug flag and record the output, waiting for it to reoccur.)

JackSlateur commented 1 year ago

Hello @fogti

I may have hit your bug: files that failed to be processed were causing an endless loop.

a2a6273eaad6fbfd4d7e05f2e6b8d64f0a83d8f5 will prevent that. Admittedly, this will not fix the processing of that particular file, but you will see the error now.

fogti commented 1 year ago

I encountered this issue again, even with the patch applied

fogti commented 1 year ago

Problem: csum_whole_file never gets called in these loops. FILEREC_NEEDS_SCAN isn't set for any file in the list.

fogti commented 1 year ago

next try:

diff --git a/file_scan.c b/file_scan.c
index af977f9..ea12839 100644
--- a/file_scan.c
+++ b/file_scan.c
@@ -553,7 +553,7 @@ static GThreadPool *setup_pool(void *arg, void *function)
 }

 static void run_pool(GThreadPool *pool, struct filerec *file,
-       unsigned int batch_size)
+       unsigned int batch_size, unsigned long long *anti_counter)
 {
    GError *err = NULL;
    struct filerec *tmp;
@@ -580,6 +580,8 @@ static void run_pool(GThreadPool *pool, struct filerec *file,
                    break;
                }
            }
+       } else {
+           *anti_counter += 1;
        }
    }

@@ -1144,14 +1146,17 @@ int populate_tree(struct dbfile_config *cfg, unsigned int batch_size,
    leading_spaces = num_digits(files_to_scan);

    if (files_to_scan) {
-       while (params.num_files != files_to_scan) {
+       while (params.num_files < files_to_scan) {
            pool = setup_pool(&params, csum_whole_file);
            if (!pool) {
                ret = ENOMEM;
                goto out;
            }

-           run_pool(pool, file, batch_size);
+           unsigned long long anti_counter = 0;
+           run_pool(pool, file, batch_size, &anti_counter);
+           printf("Files skipped: %llu vs. total: %llu\n", anti_counter, files_to_scan);
+           params.num_files += anti_counter;
            callback();
        }

this one fixes it (somewhat).

Using 12 threads for file hashing phase
Files skipped: 687107 vs. total: 687108
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.
Total files scanned:  1374214
Total extent hashes scanned: 681283

A single file appears to fail (but I don't know which one).

JackSlateur commented 1 year ago

files_to_scan is the variable used to count the files that have the FILEREC_NEEDS_SCAN flag set. Both values are linked.

fogti commented 1 year ago

ok, so the problem is that a single file gets marked FILEREC_NEEDS_SCAN but is never successfully processed... by which I mean that it never appears to "arrive" in csum_whole_file, which is weird, because we explicitly block on the thread pool to finish...

fogti commented 1 year ago

ok, I think I might've hit it properly... (private stuff redacted)

Error 2: No such file or directory while opening "/home/.../.mozilla/firefox/.../storage/default/.../cache/context_open.marker" (write=0)

https://github.com/markfasheh/duperemove/blob/912fcd79a2c0a16ca8cc05ffaa2e62c2bb6e52d9/file_scan.c#L911-L913 gets hit, goes to err_noclose, skipping the above mentioned fix...

I also get

Skipping file due to error 2 from function (unknown) (No such file or directory), /home/.../.cache/mozilla/firefox/.../cache2/entries/...

later...

So it probably doesn't handle files properly that get deleted between creation of the list and the run_pool invocation (or even between run_pool and the actual scheduling of that entry).

JackSlateur commented 1 year ago

@fogti Thank you for the analysis ! I was able to reproduce and wrote a patch that should fix the issue in the issue-301 branch

fogti commented 1 year ago

hmm the issue-301 branch coredumps for me...

[...] pool[29041]: segfault at 7fe5bd9ff8a0 ip 00007fe4c71e5ff7 sp 00007fe4c4fcde00 error 4 in libc.so.6[7fe4c7170000+159000] likely on CPU 3 (core 3, socket 0)
[...] Code: 74 71 48 8d 9d 80 00 00 00 4c 8d a5 80 02 00 00 64 48 c7 00 00 00 00 00 eb 1b 66 0f 1f 84 00 00 00 00 00 48 89 f8 48 c1 e8 0c <48> 33 07 48 89 03 e8 8e fe ff ff 48 8b 3b 48 85 ff 74 16 40 f6 c7
Thread 10378 "pool" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff59836c0 (LWP 40546)]
0x00007ffff7b96e37 in ?? () from /usr/lib64/libc.so.6
(gdb) bt
#0  0x00007ffff7b96e37 in ?? () from /usr/lib64/libc.so.6
#1  0x00007ffff7b98143 in ?? () from /usr/lib64/libc.so.6
#2  0x00007ffff7b9aeff in free () from /usr/lib64/libc.so.6
#3  0x000055555555d08f in free_filerec (t=0x55555f3a1c80) at filerec.c:54
#4  free_filerec (t=0x55555f3a1c80) at filerec.c:54
#5  filerec_free (file=file@entry=0x55555f3a1c80) at filerec.c:391
#6  0x000055555555d85a in filerec_free (file=file@entry=0x55555f3a1c80) at filerec.c:376
#7  0x000055555555f089 in csum_whole_file (file=<optimized out>, params=0x7fffffffbbe0) at file_scan.c:1009
#8  0x00007ffff7ee5012 in g_thread_pool_thread_proxy (data=<optimized out>) at ../glib-2.76.4/glib/gthreadpool.c:350
#9  0x00007ffff7ee466d in g_thread_proxy (data=0x7ffff0001740) at ../glib-2.76.4/glib/gthread.c:831
#10 0x00007ffff7b8a90c in ?? () from /usr/lib64/libc.so.6
#11 0x00007ffff7c0e06c in ?? () from /usr/lib64/libc.so.6
(gdb) info registers
rax            0x55545f3a1990      93820863256976
rbx            0x55555f3a1b00      93825158224640
rcx            0x171               369
rdx            0x7ffff7cd5ce0      140737350819040
rsi            0x0                 0
rdi            0x55555f3a1b00      93825158224640
rbp            0x7ffff5982c20      0x7ffff5982c20
rsp            0x7ffff5982bb0      0x7ffff5982bb0
r8             0x7                 7
r9             0x55555f3a1c80      93825158225024
r10            0xdabf15896e2573da  -2684403172886875174
r11            0x293               659
r12            0x7ffff7cd5c80      140737350818944
r13            0x230               560
r14            0xb0                176
r15            0x55555f3a1d30      93825158225200
rip            0x7ffff7b96e37      0x7ffff7b96e37
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x0                 0
k1             0x0                 0
k2             0x0                 0
k3             0x0                 0
k4             0x0                 0
k5             0x0                 0
k6             0x0                 0
k7             0x0                 0

other pool threads have the following backtrace:

Thread 10377 (Thread 0x7ffff49816c0 (LWP 40545) "pool"):
#0  0x00007ffff7c06289 in syscall () from /usr/lib64/libc.so.6
#1  0x00007ffff7f123c1 in g_mutex_lock_slowpath (mutex=0x7fffffffbbe0) at ../glib-2.76.4/glib/gthread-posix.c:1388
#2  0x00007ffff7f12d23 in g_mutex_lock (mutex=mutex@entry=0x7fffffffbbe0) at ../glib-2.76.4/glib/gthread-posix.c:1417
#3  0x000055555555f013 in csum_whole_file (file=<optimized out>, params=0x7fffffffbbe0) at file_scan.c:984
#4  0x00007ffff7ee5012 in g_thread_pool_thread_proxy (data=<optimized out>) at ../glib-2.76.4/glib/gthreadpool.c:350
#5  0x00007ffff7ee466d in g_thread_proxy (data=0x7ffff00015b0) at ../glib-2.76.4/glib/gthread.c:831
#6  0x00007ffff7b8a90c in ?? () from /usr/lib64/libc.so.6
#7  0x00007ffff7c0e06c in ?? () from /usr/lib64/libc.so.6

and main threads:

Thread 2 (Thread 0x7ffff79876c0 (LWP 29347) "pool-spawner"):
#0  0x00007ffff7c06289 in syscall () from /usr/lib64/libc.so.6
#1  0x00007ffff7f12e60 in g_cond_wait (cond=cond@entry=0x5555627532e8, mutex=mutex@entry=0x5555627532e0) at ../glib-2.76.4/glib/gthread-posix.c:1475
#2  0x00007ffff7e7f31b in g_async_queue_pop_intern_unlocked (queue=0x5555627532e0, wait=1, end_time=-1) at ../glib-2.76.4/glib/gasyncqueue.c:425
#3  0x00007ffff7ee4d02 in g_thread_pool_spawn_thread (data=<optimized out>) at ../glib-2.76.4/glib/gthreadpool.c:311
#4  0x00007ffff7ee466d in g_thread_proxy (data=0x555562753330) at ../glib-2.76.4/glib/gthread.c:831
#5  0x00007ffff7b8a90c in ?? () from /usr/lib64/libc.so.6
#6  0x00007ffff7c0e06c in ?? () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x7ffff79893c0 (LWP 29317) "duperemove"):
#0  0x00007ffff7c06289 in syscall () from /usr/lib64/libc.so.6
#1  0x00007ffff7f12e60 in g_cond_wait (cond=cond@entry=0x5555626f2100, mutex=0x555562754bf0) at ../glib-2.76.4/glib/gthread-posix.c:1475
#2  0x00007ffff7ee5b1c in g_thread_pool_free (pool=0x5555626f20e0, immediate=0, wait_=<optimized out>) at ../glib-2.76.4/glib/gthreadpool.c:924
#3  0x000055555556044f in run_pool (file=<optimized out>, pool=0x5555626f20e0) at file_scan.c:578
#4  populate_tree (cfg=cfg@entry=0x555555571660 <dbfile_cfg>, callback=callback@entry=0x55555555a340 <process_duplicates>) at file_scan.c:1037
#5  0x000055555555a091 in main (argc=<optimized out>, argv=0x7fffffffddd8) at duperemove.c:670
fogti commented 1 year ago

running it with -fsanitize=address doesn't even get that far...

ASAN summary + backtrace:

```
==46427==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x55978e955a17 bp 0x7f12e2ef2980 sp 0x7f12e2ef26c0 T4)
==46427==The signal is caused by a READ memory access.
==46427==Hint: this fault was caused by a dereference of a high value address (see register values below). Disassemble the provided pc to learn which register was used.
AddressSanitizer:DEADLYSIGNAL
    #0 0x55978e955a17 in XXH3_accumulate /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/xxhash.h:4366
    #1 0x55978e955a17 in XXH3_hashLong_internal_loop /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/xxhash.h:4399
    #2 0x55978e955a17 in XXH3_hashLong_128b_internal /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/xxhash.h:5233
    #3 0x55978e955a17 in XXH3_hashLong_128b_withSeed_internal /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/xxhash.h:5285
    #4 0x55978e955a17 in XXH3_hashLong_128b_withSeed /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/xxhash.h:5303
    #5 0x55978e956a58 in XXH3_128bits_internal /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/xxhash.h:5328
    #6 0x55978e956a58 in XXH_INLINE_XXH3_128bits_withSeed /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/xxhash.h:5355
    #7 0x55978e9578a5 in XXH_INLINE_XXH128 /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/xxhash.h:5373
    #8 0x55978e9578a5 in xxhash_checksum_block /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/csum-xxhash.c:33
    #9 0x55978e942d9e in csum_blocks /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/file_scan.c:647
    #10 0x55978e942d9e in csum_extent /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/file_scan.c:709
    #11 0x55978e942d9e in csum_by_extent /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/file_scan.c:839
    #12 0x55978e942d9e in csum_whole_file /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/file_scan.c:916
    #13 0x7f12e8b53011 in g_thread_pool_thread_proxy ../glib-2.76.4/glib/gthreadpool.c:350
    #14 0x7f12e8b5266c in g_thread_proxy ../glib-2.76.4/glib/gthread.c:831
    #15 0x7f12e87f890b (/usr/lib64/libc.so.6+0x8790b)
    #16 0x7f12e887c06b (/usr/lib64/libc.so.6+0x10b06b)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /usr/src/debug/sys-fs/duperemove-0.12_p304/duperemove-c38f546d8e04833ca298fd600d62b9949023e25c/xxhash.h:4366 in XXH3_accumulate
Thread T4 created by T1 here:
    #0 0x7f12e8c59821 in pthread_create (/usr/lib/gcc/x86_64-pc-linux-gnu/13/libasan.so.8+0x48821)
    #1 0x7f12e8b80ae1 in g_system_thread_new ../glib-2.76.4/glib/gthread-posix.c:1221
    #2 0x7f12e8b52d15 in g_thread_pool_spawn_thread ../glib-2.76.4/glib/gthreadpool.c:312
Thread T1 created by T0 here:
    #0 0x7f12e8c59821 in pthread_create (/usr/lib/gcc/x86_64-pc-linux-gnu/13/libasan.so.8+0x48821)
    #1 0x7f12e8b80ae1 in g_system_thread_new ../glib-2.76.4/glib/gthread-posix.c:1221
    #2 0x7f12e8b529c6 in g_thread_new_internal ../glib-2.76.4/glib/gthread.c:935
    #3 0x7f12e8b529c6 in g_thread_new ../glib-2.76.4/glib/gthread.c:888
==46427==ABORTING
```

In case someone else stumbles across this: XXHash seems to greatly dislike ASAN by default; I needed to pass

CFLAGS="${CFLAGS} -ggdb3 -fsanitize=address -fno-omit-frame-pointer -DXXH_VECTOR=2 -DXXH3_STREAM_USE_STACK=0"
CXXFLAGS="${CXXFLAGS} -ggdb3 -fsanitize=address -fno-omit-frame-pointer -DXXH_VECTOR=2 -DXXH3_STREAM_USE_STACK=0"

to get it to run further than that...

fogti commented 1 year ago

ASAN report: https://gist.github.com/fogti/158a670885dd05888ce98c6a9bbe094b

probable culprit for the double-free: https://github.com/markfasheh/duperemove/blob/c38f546d8e04833ca298fd600d62b9949023e25c/file_scan.c#L899 https://github.com/markfasheh/duperemove/blob/c38f546d8e04833ca298fd600d62b9949023e25c/file_scan.c#L1004-L1009

JackSlateur commented 1 year ago

@fogti Try this on top of master:

diff --git a/file_scan.c b/file_scan.c
index e6b0a65..0ffdf09 100644
--- a/file_scan.c
+++ b/file_scan.c
@@ -983,12 +983,12 @@ static void csum_whole_file(struct filerec *file,
        return;

 err:
+       filerec_close(file);
+err_noclose:
        g_mutex_lock(&params->mutex);
        params->num_files++;
        g_mutex_unlock(&params->mutex);

-       filerec_close(file);
-err_noclose:
        free(csum_ctxt.buf);
        if (extent_hashes)
                free(extent_hashes);

I'll commit after work

fogti commented 1 year ago

this fails to apply on top of 912fcd79a2c0a16ca8cc05ffaa2e62c2bb6e52d9