Closed — fogti closed this issue 1 year ago
e.g. patch https://gist.github.com/fogti/8a81f56c6cac7d1b33e3fe44b497b28e
but after applying that I run into a seemingly random SIGSEGV deep inside sqlite. At first I had no idea what was going on there (I later found out that this introduces a race condition between the callback and the csumming of items)
ok so the nature of the bug appears to be quite different from what I first expected: somehow, even with a batch size of 0, only a small number of items are processed per iteration...
Hello @fogti
Thank you for reporting this issue. Sadly, I am unable to reproduce this behavior.
To give you a bit more insight: populate_tree is the function that scans the file contents on disk and creates the hashes.
Also, when batching is disabled (which is the default), a single thread pool is used, which processes all entries
You can check out the code here
Could you tell me more about your dataset (number of files, dataset size, etc.)?
Also, could you give me a debugging output, from duperemove -Adr something --debug?
Best regards,
A problem I have is that I can't reproduce it deterministically; it just happens sometimes. (So I'll run it the next few times with the --debug flag and record the output, waiting for it to reoccur.)
Hello @fogti
I may have hit your bug. Files that failed to be processed were causing an endless loop.
a2a6273eaad6fbfd4d7e05f2e6b8d64f0a83d8f5 will prevent that. Indeed, this will not fix the processing of that particular file, but you will see the error now.
I encountered this issue again, even with the patch applied
Problem: csum_whole_file never gets called in these loops. FILEREC_NEEDS_SCAN isn't set for any file in the list.
next try:
diff --git a/file_scan.c b/file_scan.c
index af977f9..ea12839 100644
--- a/file_scan.c
+++ b/file_scan.c
@@ -553,7 +553,7 @@ static GThreadPool *setup_pool(void *arg, void *function)
}
static void run_pool(GThreadPool *pool, struct filerec *file,
- unsigned int batch_size)
+ unsigned int batch_size, unsigned long long *anti_counter)
{
GError *err = NULL;
struct filerec *tmp;
@@ -580,6 +580,8 @@ static void run_pool(GThreadPool *pool, struct filerec *file,
break;
}
}
+ } else {
+ *anti_counter += 1;
}
}
@@ -1144,14 +1146,17 @@ int populate_tree(struct dbfile_config *cfg, unsigned int batch_size,
leading_spaces = num_digits(files_to_scan);
if (files_to_scan) {
- while (params.num_files != files_to_scan) {
+ while (params.num_files < files_to_scan) {
pool = setup_pool(¶ms, csum_whole_file);
if (!pool) {
ret = ENOMEM;
goto out;
}
- run_pool(pool, file, batch_size);
+ unsigned long long anti_counter = 0;
+ run_pool(pool, file, batch_size, &anti_counter);
+ printf("Files skipped: %llu vs. total: %llu\n", anti_counter, files_to_scan);
+ params.num_files += anti_counter;
callback();
}
this one fixes it (somewhat).
Using 12 threads for file hashing phase
Files skipped: 687107 vs. total: 687108
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.
Total files scanned: 1374214
Total extent hashes scanned: 681283
A single file appears to fail (but I don't know which one).
files_to_scan is the variable used to count the files that have the FILEREC_NEEDS_SCAN flag set. Both values are linked.
ok, so the problem is that a single file gets marked FILEREC_NEEDS_SCAN but is never successfully processed... by which I mean that it never appears to "arrive" in csum_whole_file, which is weird, because we explicitly block on the thread pool to finish...
ok, I think I might've hit it properly... (private stuff redacted)
Error 2: No such file or directory while opening "/home/.../.mozilla/firefox/.../storage/default/.../cache/context_open.marker" (write=0)
https://github.com/markfasheh/duperemove/blob/912fcd79a2c0a16ca8cc05ffaa2e62c2bb6e52d9/file_scan.c#L911-L913
gets hit and goes to err_noclose, skipping the above-mentioned fix...
I also get
Skipping file due to error 2 from function (unknown) (No such file or directory), /home/.../.cache/mozilla/firefox/.../cache2/entries/...
later...
So it probably doesn't properly handle files that get deleted between the creation of the list and the run_pool invocation (or even between run_pool and the actual scheduling of that entry).
@fogti Thank you for the analysis ! I was able to reproduce and wrote a patch that should fix the issue in the issue-301 branch
hmm the issue-301 branch coredumps for me...
[...] pool[29041]: segfault at 7fe5bd9ff8a0 ip 00007fe4c71e5ff7 sp 00007fe4c4fcde00 error 4 in libc.so.6[7fe4c7170000+159000] likely on CPU 3 (core 3, socket 0)
[...] Code: 74 71 48 8d 9d 80 00 00 00 4c 8d a5 80 02 00 00 64 48 c7 00 00 00 00 00 eb 1b 66 0f 1f 84 00 00 00 00 00 48 89 f8 48 c1 e8 0c <48> 33 07 48 89 03 e8 8e fe ff ff 48 8b 3b 48 85 ff 74 16 40 f6 c7
Thread 10378 "pool" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff59836c0 (LWP 40546)]
0x00007ffff7b96e37 in ?? () from /usr/lib64/libc.so.6
(gdb) bt
#0 0x00007ffff7b96e37 in ?? () from /usr/lib64/libc.so.6
#1 0x00007ffff7b98143 in ?? () from /usr/lib64/libc.so.6
#2 0x00007ffff7b9aeff in free () from /usr/lib64/libc.so.6
#3 0x000055555555d08f in free_filerec (t=0x55555f3a1c80) at filerec.c:54
#4 free_filerec (t=0x55555f3a1c80) at filerec.c:54
#5 filerec_free (file=file@entry=0x55555f3a1c80) at filerec.c:391
#6 0x000055555555d85a in filerec_free (file=file@entry=0x55555f3a1c80) at filerec.c:376
#7 0x000055555555f089 in csum_whole_file (file=<optimized out>, params=0x7fffffffbbe0) at file_scan.c:1009
#8 0x00007ffff7ee5012 in g_thread_pool_thread_proxy (data=<optimized out>) at ../glib-2.76.4/glib/gthreadpool.c:350
#9 0x00007ffff7ee466d in g_thread_proxy (data=0x7ffff0001740) at ../glib-2.76.4/glib/gthread.c:831
#10 0x00007ffff7b8a90c in ?? () from /usr/lib64/libc.so.6
#11 0x00007ffff7c0e06c in ?? () from /usr/lib64/libc.so.6
(gdb) info registers
rax 0x55545f3a1990 93820863256976
rbx 0x55555f3a1b00 93825158224640
rcx 0x171 369
rdx 0x7ffff7cd5ce0 140737350819040
rsi 0x0 0
rdi 0x55555f3a1b00 93825158224640
rbp 0x7ffff5982c20 0x7ffff5982c20
rsp 0x7ffff5982bb0 0x7ffff5982bb0
r8 0x7 7
r9 0x55555f3a1c80 93825158225024
r10 0xdabf15896e2573da -2684403172886875174
r11 0x293 659
r12 0x7ffff7cd5c80 140737350818944
r13 0x230 560
r14 0xb0 176
r15 0x55555f3a1d30 93825158225200
rip 0x7ffff7b96e37 0x7ffff7b96e37
eflags 0x10246 [ PF ZF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
k0 0x0 0
k1 0x0 0
k2 0x0 0
k3 0x0 0
k4 0x0 0
k5 0x0 0
k6 0x0 0
k7 0x0 0
other pool threads have the following backtrace:
Thread 10377 (Thread 0x7ffff49816c0 (LWP 40545) "pool"):
#0 0x00007ffff7c06289 in syscall () from /usr/lib64/libc.so.6
#1 0x00007ffff7f123c1 in g_mutex_lock_slowpath (mutex=0x7fffffffbbe0) at ../glib-2.76.4/glib/gthread-posix.c:1388
#2 0x00007ffff7f12d23 in g_mutex_lock (mutex=mutex@entry=0x7fffffffbbe0) at ../glib-2.76.4/glib/gthread-posix.c:1417
#3 0x000055555555f013 in csum_whole_file (file=<optimized out>, params=0x7fffffffbbe0) at file_scan.c:984
#4 0x00007ffff7ee5012 in g_thread_pool_thread_proxy (data=<optimized out>) at ../glib-2.76.4/glib/gthreadpool.c:350
#5 0x00007ffff7ee466d in g_thread_proxy (data=0x7ffff00015b0) at ../glib-2.76.4/glib/gthread.c:831
#6 0x00007ffff7b8a90c in ?? () from /usr/lib64/libc.so.6
#7 0x00007ffff7c0e06c in ?? () from /usr/lib64/libc.so.6
and main threads:
Thread 2 (Thread 0x7ffff79876c0 (LWP 29347) "pool-spawner"):
#0 0x00007ffff7c06289 in syscall () from /usr/lib64/libc.so.6
#1 0x00007ffff7f12e60 in g_cond_wait (cond=cond@entry=0x5555627532e8, mutex=mutex@entry=0x5555627532e0) at ../glib-2.76.4/glib/gthread-posix.c:1475
#2 0x00007ffff7e7f31b in g_async_queue_pop_intern_unlocked (queue=0x5555627532e0, wait=1, end_time=-1) at ../glib-2.76.4/glib/gasyncqueue.c:425
#3 0x00007ffff7ee4d02 in g_thread_pool_spawn_thread (data=<optimized out>) at ../glib-2.76.4/glib/gthreadpool.c:311
#4 0x00007ffff7ee466d in g_thread_proxy (data=0x555562753330) at ../glib-2.76.4/glib/gthread.c:831
#5 0x00007ffff7b8a90c in ?? () from /usr/lib64/libc.so.6
#6 0x00007ffff7c0e06c in ?? () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ffff79893c0 (LWP 29317) "duperemove"):
#0 0x00007ffff7c06289 in syscall () from /usr/lib64/libc.so.6
#1 0x00007ffff7f12e60 in g_cond_wait (cond=cond@entry=0x5555626f2100, mutex=0x555562754bf0) at ../glib-2.76.4/glib/gthread-posix.c:1475
#2 0x00007ffff7ee5b1c in g_thread_pool_free (pool=0x5555626f20e0, immediate=0, wait_=<optimized out>) at ../glib-2.76.4/glib/gthreadpool.c:924
#3 0x000055555556044f in run_pool (file=<optimized out>, pool=0x5555626f20e0) at file_scan.c:578
#4 populate_tree (cfg=cfg@entry=0x555555571660 <dbfile_cfg>, callback=callback@entry=0x55555555a340 <process_duplicates>) at file_scan.c:1037
#5 0x000055555555a091 in main (argc=<optimized out>, argv=0x7fffffffddd8) at duperemove.c:670
running it with -fsanitize=address doesn't even get that far...
In case someone else stumbles across this, XXHash seems to greatly dislike ASAN by default, I needed to pass
CFLAGS="${CFLAGS} -ggdb3 -fsanitize=address -fno-omit-frame-pointer -DXXH_VECTOR=2 -DXXH3_STREAM_USE_STACK=0"
CXXFLAGS="${CXXFLAGS} -ggdb3 -fsanitize=address -fno-omit-frame-pointer -DXXH_VECTOR=2 -DXXH3_STREAM_USE_STACK=0"
to get it to run further than that...
ASAN report: https://gist.github.com/fogti/158a670885dd05888ce98c6a9bbe094b
probable culprit for the double-free: https://github.com/markfasheh/duperemove/blob/c38f546d8e04833ca298fd600d62b9949023e25c/file_scan.c#L899 https://github.com/markfasheh/duperemove/blob/c38f546d8e04833ca298fd600d62b9949023e25c/file_scan.c#L1004-L1009
@fogti Try this on top of master:
diff --git a/file_scan.c b/file_scan.c
index e6b0a65..0ffdf09 100644
--- a/file_scan.c
+++ b/file_scan.c
@@ -983,12 +983,12 @@ static void csum_whole_file(struct filerec *file,
return;
err:
+ filerec_close(file);
+err_noclose:
g_mutex_lock(¶ms->mutex);
params->num_files++;
g_mutex_unlock(¶ms->mutex);
- filerec_close(file);
-err_noclose:
free(csum_ctxt.buf);
if (extent_hashes)
free(extent_hashes);
I'll commit after work
this fails to apply on top of 912fcd79a2c0a16ca8cc05ffaa2e62c2bb6e52d9
I ran
duperemove -Adr something
and it produced the following output, repeated thousands of times, scrolling by relatively quickly (another such block was added many times per second). perf report told me that most of the processing time is spent in populate_tree. But the code seems a bit weird, as it appears to repeatedly create and destroy a thread pool, which unnecessarily pessimizes the normal benefit of thread pools (namely caching the threads so that they don't need to be created and destroyed over and over again, which is slow when done many times per second).