Hang in traverse - how to debug?

james-cook commented 2 years ago

Running master 2.10.1 and develop plus a few patches and I am getting a hang during initial traverse with both versions.

Traversing (316260 usable files / 0 + 0 ignored files / folders) Always at the same file count. After this, the drives spin down and nothing appears to be happening. The process continues to run, waiting for something. CPU usage is low.

I checked RAM use - no problems there. 4GB RAM, ca. 2GB used. (Raspberry Pi 4B)

With -vvv set, the last message I (always) see is: WARNING: Added big file /srv/path/Scans BU Dec 2014/Africa_Turmoil_WEB.pdf Would it be this file causing the problem? (probably more likely something afterwards)

Running gdb (develop version compiled with DEBUG=1) when I get to the "same place" I do a CTRL-C (not sure if this is useful)

INFO: Added big file /srv/path/Scans BU Dec 2014/Africa_Turmoil_WEB.pdf
^C
Thread 1 "rmlint" received signal SIGINT, Interrupt.
syscall () at ../sysdeps/unix/sysv/linux/arm/syscall.S:37
37      ../sysdeps/unix/sysv/linux/arm/syscall.S: No such file or directory.
(gdb) bt
#0  0xb6af5aec in syscall () at ../sysdeps/unix/sysv/linux/arm/syscall.S:37
#1  0xb6ce2908 in g_cond_wait () at /usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0
#2  0x0001f6b8 in rm_mds_finish ()
#3  0x0002a4f0 in rm_traverse_tree ()
#4  0x0001adf4 in rm_cmd_main ()
#5  0x00015b18 in main ()
(gdb) thread apply all bt

Thread 3 (Thread 0xb5c513a0 (LWP 19170)):
#0  0xb6af5aec in syscall () at ../sysdeps/unix/sysv/linux/arm/syscall.S:37
#1  0xb6ce2a80 in g_cond_wait_until () at /usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0
#2  0x0001f81c in rm_mds_factory ()

Thread 2 (Thread 0xb64523a0 (LWP 19169)):
#0  0xb6b81c58 in __libc_open (file=0xb64503b8 "\230\305\002", oflag=131072)
    at ../sysdeps/unix/sysv/linux/open.c:44
#1  0x0002c61c in rm_sys_open ()
#2  0x0002dba4 in rm_util_is_nonstripped ()
#3  0x00028834 in rm_traverse_file ()
#4  0x00029900 in rm_traverse_directory ()
#5  0x0001f868 in rm_mds_factory ()

Thread 1 (Thread 0xb6829010 (LWP 19164)):
#0  0xb6af5aec in syscall () at ../sysdeps/unix/sysv/linux/arm/syscall.S:37
#1  0xb6ce2908 in g_cond_wait () at /usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0
#2  0x0001f6b8 in rm_mds_finish ()
#3  0x0002a4f0 in rm_traverse_tree ()
#4  0x0001adf4 in rm_cmd_main ()
#5  0x00015b18 in main ()
(gdb)

I am not sure if these backtraces help - I am VERY rusty at debugging. Any hints on how to hunt this problem down appreciated :)

cebtenzzre commented 2 years ago

For that backtrace, only thread 2 is doing anything (the others are blocked in g_cond_wait), so the next steps would be thread 2, fin and see if the open ever finishes. If it doesn't, the kernel is stuck on something (dmesg would help). If it does finish, rmlint may be stuck in an infinite loop in traverse.c. The filename "\230\305\002" is odd, it may indicate memory corruption - ASAN or valgrind might find a root cause.

james-cook commented 2 years ago

Just did...

(gdb) thread 2
[Switching to thread 2 (Thread 0xb64523a0 (LWP 19169))]
#0  0xb6b81c58 in __libc_open (file=0xb64503b8 "\230\305\002", oflag=131072)
    at ../sysdeps/unix/sysv/linux/open.c:44
44      ../sysdeps/unix/sysv/linux/open.c: No such file or directory.
(gdb) fin
Run till exit from #0  0xb6b81c58 in __libc_open (file=0xb64503b8 "\230\305\002", oflag=131072)
    at ../sysdeps/unix/sysv/linux/open.c:44

and now waiting(!)

cebtenzzre commented 2 years ago

info proc gives you the PID, which you can use like ls -l /proc/45834/cwd to get the current directory of rmlint. If there is actually a FIFO called \230\305\002 in that directory (stat /proc/PID/cwd/$'\230\305\002'), then rmlint needs to better avoid FIFOs with stat() or O_NONBLOCK. Otherwise there's an issue in the kernel and dmesg can help.

james-cook commented 2 years ago

I pressed CTRL-C in the fin (above).

info proc shows:

(gdb) info proc
process 19164
cmdline = '/home/pi/source/rmlint/rmlint/rmlint -T all -emptyfiles -emptydirs -vvv -S mda -s -1TB /srv/dev-disk-by-label-OMV2/shd2/from.ext.hdd.M-family.TOSH/'
cwd = '/srv/dev-disk-by-label-OMV2/shd2/rmlints/redo-ALL.TOSH/with.new.rmlint/with.gdb'
exe = '/home/pi/source/rmlint/rmlint/rmlint'

 ls -l /proc/19164/cwd
lrwxrwxrwx 1 pi pi 0 Feb  8 22:59 /proc/19164/cwd -> /srv/dev-disk-by-label-OMV2/shd2/rmlints/redo-ALL.TOSH/with.new.rmlint/with.gdb

That's "just" the directory where I call rmlint from (cwd of the call to rmlint), not the path of the file in question (?)

stat /proc/19164/cwd/$'\230\305\002'
stat: cannot stat '/proc/19164/cwd/'$'\230\305\002': No such file or directory

cebtenzzre commented 2 years ago

Then either GDB is confused or the path is corrupt. If it is definitely blocked in the kernel (something like swi 0x0 just above the pointed to line in disas) then dmesg is the best way to know what could be wrong. A filesystem with a missing source (unplugged disk or NFS timeout) is usually the only reason that would happen.

james-cook commented 2 years ago

I have to humbly ask for a translation if possible ;)

(something like swi 0x0 just above the pointed to line in disas)

cwd - gdb does report where I call the program from. Would cwd be expected to change to each directory as rmlint traverses?

The original files were copied from a failing disk. This could be(?) a root of the problem perhaps(?)

cebtenzzre commented 2 years ago

In GDB, run disas and scroll up to the pointed to location. The surrounding assembly instructions will tell you if the process has switched to kernel mode.

james-cook commented 2 years ago

Is this the relevant pointer?:

(gdb) disas
Dump of assembler code for function __libc_open:
   0xb6b81b8c <+0>:     push    {r1, r2, r3}
   0xb6b81b90 <+4>:     mov     r1, r0
   0xb6b81b94 <+8>:     ldr     r3, [pc, #276]  ; 0xb6b81cb0 <__libc_open+292>
   0xb6b81b98 <+12>:    ldr     r12, [pc, #276] ; 0xb6b81cb4 <__libc_open+296>
   0xb6b81b9c <+16>:    add     r3, pc, r3
   0xb6b81ba0 <+20>:    push    {r4, r5, r6, r7, lr}
   0xb6b81ba4 <+24>:    sub     sp, sp, #16
   0xb6b81ba8 <+28>:    ldr     r5, [r3, r12]
   0xb6b81bac <+32>:    ldr     r2, [sp, #36]   ; 0x24
   0xb6b81bb0 <+36>:    mov     r0, r3
   0xb6b81bb4 <+40>:    ldr     r0, [r5]
   0xb6b81bb8 <+44>:    ands    r3, r2, #64     ; 0x40
   0xb6b81bbc <+48>:    str     r0, [sp, #12]
   0xb6b81bc0 <+52>:    bne     0xb6b81c20 <__libc_open+148>
   0xb6b81bc4 <+56>:    ldr     r0, [pc, #236]  ; 0xb6b81cb8 <__libc_open+300>
   0xb6b81bc8 <+60>:    bics    r0, r0, r2
   0xb6b81bcc <+64>:    beq     0xb6b81c20 <__libc_open+148>
   0xb6b81bd0 <+68>:    ldr     r0, [pc, #228]  ; 0xb6b81cbc <__libc_open+304>
   0xb6b81bd4 <+72>:    add     r0, pc, r0
   0xb6b81bd8 <+76>:    ldr     r0, [r0]
   0xb6b81bdc <+80>:    cmp     r0, #0
   0xb6b81be0 <+84>:    bne     0xb6b81c30 <__libc_open+164>
   0xb6b81be4 <+88>:    mvn     r0, #99 ; 0x63
   0xb6b81be8 <+92>:    ldr     r7, [pc, #208]  ; 0xb6b81cc0 <__libc_open+308>
   0xb6b81bec <+96>:    svc     0x00000000
   0xb6b81bf0 <+100>:   cmn     r0, #4096       ; 0x1000
   0xb6b81bf4 <+104>:   mov     r4, r0
   0xb6b81bf8 <+108>:   bhi     0xb6b81c74 <__libc_open+232>
   0xb6b81bfc <+112>:   ldr     r2, [sp, #12]
   0xb6b81c00 <+116>:   ldr     r3, [r5]
   0xb6b81c04 <+120>:   mov     r0, r4
   0xb6b81c08 <+124>:   cmp     r2, r3
   0xb6b81c0c <+128>:   bne     0xb6b81cac <__libc_open+288>
   0xb6b81c10 <+132>:   add     sp, sp, #16
--Type <RET> for more, q to quit, c to continue without paging--
   0xb6b81c14 <+136>:   pop     {r4, r5, r6, r7, lr}
   0xb6b81c18 <+140>:   add     sp, sp, #12
   0xb6b81c1c <+144>:   bx      lr
   0xb6b81c20 <+148>:   add     r0, sp, #40     ; 0x28
   0xb6b81c24 <+152>:   ldr     r3, [sp, #40]   ; 0x28
   0xb6b81c28 <+156>:   str     r0, [sp, #8]
   0xb6b81c2c <+160>:   b       0xb6b81bd0 <__libc_open+68>
   0xb6b81c30 <+164>:   str     r2, [sp, #36]   ; 0x24
   0xb6b81c34 <+168>:   str     r1, [sp, #4]
   0xb6b81c38 <+172>:   str     r3, [sp]
   0xb6b81c3c <+176>:   bl      0xb6b80c8c <__pthread_enable_asynccancel>
   0xb6b81c40 <+180>:   ldr     r7, [pc, #120]  ; 0xb6b81cc0 <__libc_open+308>
   0xb6b81c44 <+184>:   ldr     r1, [sp, #4]
   0xb6b81c48 <+188>:   ldr     r2, [sp, #36]   ; 0x24
   0xb6b81c4c <+192>:   ldr     r3, [sp]
   0xb6b81c50 <+196>:   mov     r6, r0
   0xb6b81c54 <+200>:   mvn     r0, #99 ; 0x63
=> 0xb6b81c58 <+204>:   svc     0x00000000
   0xb6b81c5c <+208>:   cmn     r0, #4096       ; 0x1000
   0xb6b81c60 <+212>:   mov     r4, r0
   0xb6b81c64 <+216>:   bhi     0xb6b81c90 <__libc_open+260>
   0xb6b81c68 <+220>:   mov     r0, r6
   0xb6b81c6c <+224>:   bl      0xb6b80d2c <__pthread_disable_asynccancel>
   0xb6b81c70 <+228>:   b       0xb6b81bfc <__libc_open+112>
   0xb6b81c74 <+232>:   ldr     r3, [pc, #72]   ; 0xb6b81cc4 <__libc_open+312>
   0xb6b81c78 <+236>:   rsb     r2, r0, #0
   0xb6b81c7c <+240>:   mvn     r4, #0
   0xb6b81c80 <+244>:   ldr     r3, [pc, r3]
   0xb6b81c84 <+248>:   bl      0xb6b844e0 <__aeabi_read_tp>
   0xb6b81c88 <+252>:   str     r2, [r0, r3]
   0xb6b81c8c <+256>:   b       0xb6b81bfc <__libc_open+112>
   0xb6b81c90 <+260>:   ldr     r3, [pc, #48]   ; 0xb6b81cc8 <__libc_open+316>
   0xb6b81c94 <+264>:   rsb     r2, r0, #0
   0xb6b81c98 <+268>:   mvn     r4, #0
   0xb6b81c9c <+272>:   ldr     r3, [pc, r3]
--Type <RET> for more, q to quit, c to continue without paging--
   0xb6b81ca0 <+276>:   bl      0xb6b844e0 <__aeabi_read_tp>
   0xb6b81ca4 <+280>:   str     r2, [r0, r3]
   0xb6b81ca8 <+284>:   b       0xb6b81c68 <__libc_open+220>
   0xb6b81cac <+288>:   bl      0xb6b74490 <__stack_chk_fail@plt>
   0xb6b81cb0 <+292>:   andeq   r5, r1, r12, asr r4
   0xb6b81cb4 <+296>:   andeq   r0, r0, r4, ror r1
   0xb6b81cb8 <+300>:   subeq   r4, r0, r0
   0xb6b81cbc <+304>:   andeq   r7, r1, r4, asr r6
   0xb6b81cc0 <+308>:   andeq   r0, r0, r2, asr #2
   0xb6b81cc4 <+312>:   andeq   r5, r1, r8, ror #9
   0xb6b81cc8 <+316>:   andeq   r5, r1, r12, asr #9
End of assembler dump.
(gdb)

cebtenzzre commented 2 years ago

Yep, it's definitely in a kernel syscall. Running ni in GDB probably hangs too. You can get the kernel stacks with find /proc/19164/task -name "stack" | sudo xargs -I{} sh -c 'echo {}; cat {}'. I also have mentioned dmesg several times because that's where the kernel logs if it something gets stuck.

james-cook commented 2 years ago

Thanks for all of your help :)

Stack

find /proc/19164/task -name "stack" | sudo xargs -I{} sh -c 'echo {}; cat {}'

/proc/19164/task/19164/stack
/proc/19164/task/19169/stack
/proc/19164/task/19170/stack
$ sudo cat /proc/19164/task/19164/stack
$ sudo cat /proc/19164/task/19169/stack
$ sudo cat /proc/19164/task/19170/stack

i.e. the files are empty...(?)

dmesg

dmesg -Tk shows:

00 00
[Tue Feb  8 22:21:36 2022] sd 3:0:0:0: [sdd] tag#12 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD
[Tue Feb  8 22:21:36 2022] sd 3:0:0:0: [sdd] tag#12 CDB: opcode=0x88 88 00 00 00 00 01 ec e8 01 00 00 00 01 00
00 00
[Tue Feb  8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#19 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
[Tue Feb  8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#19 CDB: opcode=0x8a 8a 00 00 00 00 02 51 40 09 78 00 00 00 08
00 00
[Tue Feb  8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#18 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT
[Tue Feb  8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#18 CDB: opcode=0x8a 8a 00 00 00 00 02 6d c0 09 d8 00 00 00 20
00 00
[T

Does this look like issue? I do see the same message sporadically in the past week. Looking at these entries last night I just thought a uas error would be "more general* and not just be happening "on the same file" - (same rmlint command).

I have had uas problems (even leading to device offlining) with other raspberrypis. It is a known problem area. Downgrading to usb2(!) can help, as can simply rebooting and unplugging/replugging the drives(!) and kernel updates help too.

I checked the kernel version on this machine and see it is very old. So, today I'll back-up the sdcard(os) and update the kernel.

gdb

ni does hang as you mentioned

cebtenzzre commented 2 years ago

Dunno why there are no kernel stacks in /proc (maybe it's an optional feature or your kernel has limited stack unwinding?) but I would definitely attribute the hang to those kernel messages. Not an rmlint issue.

james-cook commented 2 years ago

Will close this as it really appears to be my system - thanks 😊 for the great input and help

james-cook commented 2 years ago

Investigating this:

I switched the harddrive to be directly connected to the raspberrypi and no longer via a hub
I then used a find <the directory from the rmlint command> -exec dd -if <the file> -of </dev/null> kind of command overnight.
This command completed and there were no kernel errors/warnings.
I ran rmlint (develop + patches) to run and it again failed just after 300000 files in traverse
I ran the same command again in gdb but could not get a useful stack at the point of failure (as before)

So I went back to "old school" debugging and placed a logging output in traverse.c:

386d385
<         rm_log_info_line(_("Current file is %s"), p->fts_path);

And now I see that rmlint chokes on AppContainerUserCertRead:

INFO: Current file is /srv/dev-disk-by-label-OMV2/shd2/from.ext.hdd.M-family.TOSH/2020-01-05-p13/Users/XXXXX/AppData/Roaming/Microsoft/SystemCertificates/My/CRLs
INFO: Current file is /srv/dev-disk-by-label-OMV2/shd2/from.ext.hdd.M-family.TOSH/2020-01-05-p13/Users/XXXXX/AppData/Roaming/Microsoft/SystemCertificates/My/AppContainerUserCertRead

This is a pipe/FIFO:

from.ext.hdd.M-family.TOSH/2020-01-05-p13/Users/XXXXX/AppData/Roaming/Microsoft/SystemCertificates/My $ ls -al
total 16
drwxrwxrwx 4 root root 4096 Sep 13 17:11 .
drwxrwxrwx 4 root root 4096 Jan  2  2020 ..
prwxrwxrwx 1 root root    0 Nov 28  2019 AppContainerUserCertRead
drwxrwxrwx 2 root root 4096 Nov 28  2019 CRLs
drwxrwxrwx 2 root root 4096 Nov 28  2019 CTLs

You mentioned excluding such files from rmlint: info proc gives you the PID, which you can use like ls -l /proc/45834/cwd to get the current directory of rmlint. If there is actually a FIFO called \230\305\002 in that directory (stat /proc/PID/cwd/$'\230\305\002'), then rmlint needs to better avoid FIFOs with stat() or O_NONBLOCK. Otherwise there's an issue in the kernel and dmesg can help.

Originally posted by @Cebtenzzre in https://github.com/sahib/rmlint/issues/555#issuecomment-1033137600

cebtenzzre commented 2 years ago

I am able to reproduce the issue like this:

$ mkfifo foo
$ chmod +x foo
$ rmlint -T ns .
(rmlint hangs)

I am not surprised that you are one of the first people to hit this, because it requires that the non-default unstripped binary search is enabled, and that it encounters a FIFO, and that the FIFO is marked executable (apparently because it came from a Windows system). Here's the fix:

 lib/utilities.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/utilities.c b/lib/utilities.c
index 30d82607..00299728 100644
--- a/lib/utilities.c
+++ b/lib/utilities.c
@@ -467,7 +467,7 @@ bool rm_util_is_nonstripped(_UNUSED const char *path, _UNUSED RmStat *statp) {
 #if HAVE_LIBELF
     g_return_val_if_fail(path, false);

-    if(statp && (statp->st_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) == 0) {
+    if(!S_ISREG(statp->st_mode) || (statp->st_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) == 0) {
         return false;
     }

james-cook commented 2 years ago

Your very fast patch fixed the issue for me :) So I'll close again. Thanks

sahib / rmlint

Hang in traverse - how to debug? #555

Stack

dmesg

gdb