For that backtrace, only thread 2 is doing anything (the others are blocked in g_cond_wait), so the next steps would be `thread 2`, then `fin`, and seeing whether the open ever finishes. If it doesn't, the kernel is stuck on something (dmesg would help). If it does finish, rmlint may be stuck in an infinite loop in traverse.c.
The filename "\230\305\002" is odd; it may indicate memory corruption - ASAN or valgrind might find a root cause.
Just did...
```
(gdb) thread 2
[Switching to thread 2 (Thread 0xb64523a0 (LWP 19169))]
#0  0xb6b81c58 in __libc_open (file=0xb64503b8 "\230\305\002", oflag=131072)
    at ../sysdeps/unix/sysv/linux/open.c:44
44      ../sysdeps/unix/sysv/linux/open.c: No such file or directory.
(gdb) fin
Run till exit from #0  0xb6b81c58 in __libc_open (file=0xb64503b8 "\230\305\002", oflag=131072)
    at ../sysdeps/unix/sysv/linux/open.c:44
```
and now waiting(!)
`info proc` gives you the PID, which you can use like `ls -l /proc/45834/cwd` to get the current directory of rmlint. If there is actually a FIFO called `\230\305\002` in that directory (`stat /proc/PID/cwd/$'\230\305\002'`), then rmlint needs to better avoid FIFOs with stat() or O_NONBLOCK. Otherwise there's an issue in the kernel and dmesg can help.
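Roughly, those two avoidance strategies would look like this (a minimal sketch of the idea, not rmlint's actual code):

```c
/* A minimal sketch (not rmlint's actual code) of the two suggested
 * ways to avoid hanging on a FIFO. */
#include <fcntl.h>
#include <sys/stat.h>

int open_file_safely(const char *path) {
    struct stat st;

    /* Option 1: lstat() first and skip anything that is not a regular
     * file - an O_RDONLY open of a FIFO blocks until a writer appears. */
    if(lstat(path, &st) != 0 || !S_ISREG(st.st_mode)) {
        return -1;
    }

    /* Option 2 (belt and braces): open with O_NONBLOCK so that even a
     * FIFO cannot block, then clear the flag for normal blocking I/O. */
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if(fd >= 0) {
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) & ~O_NONBLOCK);
    }
    return fd;
}
```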
I pressed CTRL-C in the `fin` (above). `info proc` shows:
```
(gdb) info proc
process 19164
cmdline = '/home/pi/source/rmlint/rmlint/rmlint -T all -emptyfiles -emptydirs -vvv -S mda -s -1TB /srv/dev-disk-by-label-OMV2/shd2/from.ext.hdd.M-family.TOSH/'
cwd = '/srv/dev-disk-by-label-OMV2/shd2/rmlints/redo-ALL.TOSH/with.new.rmlint/with.gdb'
exe = '/home/pi/source/rmlint/rmlint/rmlint'
```
```
$ ls -l /proc/19164/cwd
lrwxrwxrwx 1 pi pi 0 Feb 8 22:59 /proc/19164/cwd -> /srv/dev-disk-by-label-OMV2/shd2/rmlints/redo-ALL.TOSH/with.new.rmlint/with.gdb
```
That's "just" the directory where I call rmlint from (the cwd of the rmlint invocation), not the path of the file in question (?)
```
$ stat /proc/19164/cwd/$'\230\305\002'
stat: cannot stat '/proc/19164/cwd/'$'\230\305\002': No such file or directory
```
Then either GDB is confused or the path is corrupt. If it is definitely blocked in the kernel (something like `swi 0x0` just above the pointed-to line in `disas`), then dmesg is the best way to find out what could be wrong. A filesystem whose backing device has gone away (an unplugged disk or an NFS timeout) is usually the only reason that would happen.
I have to humbly ask for a translation, if possible ;)

> something like `swi 0x0` just above the pointed-to line in `disas`
The `cwd` that gdb reports is where I call the program from. Would `cwd` be expected to change to each directory as rmlint traverses?
The original files were copied from a failing disk. Could this perhaps be a root of the problem?
In GDB, run `disas` and scroll up to the pointed-to location. The surrounding assembly instructions will tell you whether the process has switched to kernel mode (on ARM, a syscall shows up as an `svc`/`swi` instruction, which traps into the kernel).
Is this the relevant pointer?
```
(gdb) disas
Dump of assembler code for function __libc_open:
0xb6b81b8c <+0>: push {r1, r2, r3}
0xb6b81b90 <+4>: mov r1, r0
0xb6b81b94 <+8>: ldr r3, [pc, #276] ; 0xb6b81cb0 <__libc_open+292>
0xb6b81b98 <+12>: ldr r12, [pc, #276] ; 0xb6b81cb4 <__libc_open+296>
0xb6b81b9c <+16>: add r3, pc, r3
0xb6b81ba0 <+20>: push {r4, r5, r6, r7, lr}
0xb6b81ba4 <+24>: sub sp, sp, #16
0xb6b81ba8 <+28>: ldr r5, [r3, r12]
0xb6b81bac <+32>: ldr r2, [sp, #36] ; 0x24
0xb6b81bb0 <+36>: mov r0, r3
0xb6b81bb4 <+40>: ldr r0, [r5]
0xb6b81bb8 <+44>: ands r3, r2, #64 ; 0x40
0xb6b81bbc <+48>: str r0, [sp, #12]
0xb6b81bc0 <+52>: bne 0xb6b81c20 <__libc_open+148>
0xb6b81bc4 <+56>: ldr r0, [pc, #236] ; 0xb6b81cb8 <__libc_open+300>
0xb6b81bc8 <+60>: bics r0, r0, r2
0xb6b81bcc <+64>: beq 0xb6b81c20 <__libc_open+148>
0xb6b81bd0 <+68>: ldr r0, [pc, #228] ; 0xb6b81cbc <__libc_open+304>
0xb6b81bd4 <+72>: add r0, pc, r0
0xb6b81bd8 <+76>: ldr r0, [r0]
0xb6b81bdc <+80>: cmp r0, #0
0xb6b81be0 <+84>: bne 0xb6b81c30 <__libc_open+164>
0xb6b81be4 <+88>: mvn r0, #99 ; 0x63
0xb6b81be8 <+92>: ldr r7, [pc, #208] ; 0xb6b81cc0 <__libc_open+308>
0xb6b81bec <+96>: svc 0x00000000
0xb6b81bf0 <+100>: cmn r0, #4096 ; 0x1000
0xb6b81bf4 <+104>: mov r4, r0
0xb6b81bf8 <+108>: bhi 0xb6b81c74 <__libc_open+232>
0xb6b81bfc <+112>: ldr r2, [sp, #12]
0xb6b81c00 <+116>: ldr r3, [r5]
0xb6b81c04 <+120>: mov r0, r4
0xb6b81c08 <+124>: cmp r2, r3
0xb6b81c0c <+128>: bne 0xb6b81cac <__libc_open+288>
0xb6b81c10 <+132>: add sp, sp, #16
0xb6b81c14 <+136>: pop {r4, r5, r6, r7, lr}
0xb6b81c18 <+140>: add sp, sp, #12
0xb6b81c1c <+144>: bx lr
0xb6b81c20 <+148>: add r0, sp, #40 ; 0x28
0xb6b81c24 <+152>: ldr r3, [sp, #40] ; 0x28
0xb6b81c28 <+156>: str r0, [sp, #8]
0xb6b81c2c <+160>: b 0xb6b81bd0 <__libc_open+68>
0xb6b81c30 <+164>: str r2, [sp, #36] ; 0x24
0xb6b81c34 <+168>: str r1, [sp, #4]
0xb6b81c38 <+172>: str r3, [sp]
0xb6b81c3c <+176>: bl 0xb6b80c8c <__pthread_enable_asynccancel>
0xb6b81c40 <+180>: ldr r7, [pc, #120] ; 0xb6b81cc0 <__libc_open+308>
0xb6b81c44 <+184>: ldr r1, [sp, #4]
0xb6b81c48 <+188>: ldr r2, [sp, #36] ; 0x24
0xb6b81c4c <+192>: ldr r3, [sp]
0xb6b81c50 <+196>: mov r6, r0
0xb6b81c54 <+200>: mvn r0, #99 ; 0x63
=> 0xb6b81c58 <+204>: svc 0x00000000
0xb6b81c5c <+208>: cmn r0, #4096 ; 0x1000
0xb6b81c60 <+212>: mov r4, r0
0xb6b81c64 <+216>: bhi 0xb6b81c90 <__libc_open+260>
0xb6b81c68 <+220>: mov r0, r6
0xb6b81c6c <+224>: bl 0xb6b80d2c <__pthread_disable_asynccancel>
0xb6b81c70 <+228>: b 0xb6b81bfc <__libc_open+112>
0xb6b81c74 <+232>: ldr r3, [pc, #72] ; 0xb6b81cc4 <__libc_open+312>
0xb6b81c78 <+236>: rsb r2, r0, #0
0xb6b81c7c <+240>: mvn r4, #0
0xb6b81c80 <+244>: ldr r3, [pc, r3]
0xb6b81c84 <+248>: bl 0xb6b844e0 <__aeabi_read_tp>
0xb6b81c88 <+252>: str r2, [r0, r3]
0xb6b81c8c <+256>: b 0xb6b81bfc <__libc_open+112>
0xb6b81c90 <+260>: ldr r3, [pc, #48] ; 0xb6b81cc8 <__libc_open+316>
0xb6b81c94 <+264>: rsb r2, r0, #0
0xb6b81c98 <+268>: mvn r4, #0
0xb6b81c9c <+272>: ldr r3, [pc, r3]
0xb6b81ca0 <+276>: bl 0xb6b844e0 <__aeabi_read_tp>
0xb6b81ca4 <+280>: str r2, [r0, r3]
0xb6b81ca8 <+284>: b 0xb6b81c68 <__libc_open+220>
0xb6b81cac <+288>: bl 0xb6b74490 <__stack_chk_fail@plt>
0xb6b81cb0 <+292>: andeq r5, r1, r12, asr r4
0xb6b81cb4 <+296>: andeq r0, r0, r4, ror r1
0xb6b81cb8 <+300>: subeq r4, r0, r0
0xb6b81cbc <+304>: andeq r7, r1, r4, asr r6
0xb6b81cc0 <+308>: andeq r0, r0, r2, asr #2
0xb6b81cc4 <+312>: andeq r5, r1, r8, ror #9
0xb6b81cc8 <+316>: andeq r5, r1, r12, asr #9
End of assembler dump.
(gdb)
```
Yep, it's definitely in a kernel syscall. Running `ni` in GDB probably hangs too. You can get the kernel stacks with `find /proc/19164/task -name "stack" | sudo xargs -I{} sh -c 'echo {}; cat {}'`. I have also mentioned dmesg several times because that's where the kernel logs if something gets stuck.
Thanks for all of your help :)
```
$ find /proc/19164/task -name "stack" | sudo xargs -I{} sh -c 'echo {}; cat {}'
/proc/19164/task/19164/stack
/proc/19164/task/19169/stack
/proc/19164/task/19170/stack
$ sudo cat /proc/19164/task/19164/stack
$ sudo cat /proc/19164/task/19169/stack
$ sudo cat /proc/19164/task/19170/stack
```
i.e. the files are empty...(?)
`dmesg -Tk` shows:
```
[Tue Feb 8 22:21:36 2022] sd 3:0:0:0: [sdd] tag#12 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD
[Tue Feb 8 22:21:36 2022] sd 3:0:0:0: [sdd] tag#12 CDB: opcode=0x88 88 00 00 00 00 01 ec e8 01 00 00 00 01 00 00 00
[Tue Feb 8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#19 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
[Tue Feb 8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#19 CDB: opcode=0x8a 8a 00 00 00 00 02 51 40 09 78 00 00 00 08 00 00
[Tue Feb 8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#18 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT
[Tue Feb 8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#18 CDB: opcode=0x8a 8a 00 00 00 00 02 6d c0 09 d8 00 00 00 20 00 00
```
Does this look like the issue? I do see the same message sporadically over the past week. Looking at these entries last night, I just thought a UAS error would be "more general" and not just happen "on the same file" (same rmlint command).
I have had UAS problems (even leading to devices being taken offline) with other Raspberry Pis. It is a known problem area. Downgrading to USB 2(!) can help, as can simply rebooting and unplugging/replugging the drives(!), and kernel updates help too.
I checked the kernel version on this machine and see it is very old. So today I'll back up the SD card (OS) and update the kernel.
`ni` does hang, as you mentioned.
Dunno why there are no kernel stacks in /proc (maybe it's an optional feature, or your kernel has limited stack unwinding?), but I would definitely attribute the hang to those kernel messages. Not an rmlint issue.
Will close this as it really appears to be my system - thanks 😊 for the great input and help
I investigated this overnight with a `find <the directory from the rmlint command> -exec dd if=<the file> of=/dev/null \;` kind of command. So I went back to "old school" debugging and placed a logging output in traverse.c:
```
386d385
< rm_log_info_line(_("Current file is %s"), p->fts_path);
```
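For reference, `p` here is the `FTSENT *` from rmlint's fts(3) traversal. A standalone sketch of that kind of walk (flags are assumed, and plain fprintf stands in for rm_log_info_line) looks like this:

```c
/* Standalone sketch (assumed flags; fprintf stands in for rmlint's
 * rm_log_info_line) of the fts(3) walk that traverse.c performs. */
#include <fts.h>
#include <stdio.h>

int main(int argc, char **argv) {
    char *paths[] = {argc > 1 ? argv[1] : ".", NULL};
    FTS *fts = fts_open(paths, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
    if(fts == NULL) {
        return 1;
    }
    FTSENT *p;
    while((p = fts_read(fts)) != NULL) {
        fprintf(stderr, "INFO: Current file is %s\n", p->fts_path);
    }
    fts_close(fts);
    return 0;
}
```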
And now I see that rmlint chokes on `AppContainerUserCertRead`:
```
INFO: Current file is /srv/dev-disk-by-label-OMV2/shd2/from.ext.hdd.M-family.TOSH/2020-01-05-p13/Users/XXXXX/AppData/Roaming/Microsoft/SystemCertificates/My/CRLs
INFO: Current file is /srv/dev-disk-by-label-OMV2/shd2/from.ext.hdd.M-family.TOSH/2020-01-05-p13/Users/XXXXX/AppData/Roaming/Microsoft/SystemCertificates/My/AppContainerUserCertRead
```
This is a pipe/FIFO:
```
from.ext.hdd.M-family.TOSH/2020-01-05-p13/Users/XXXXX/AppData/Roaming/Microsoft/SystemCertificates/My $ ls -al
total 16
drwxrwxrwx 4 root root 4096 Sep 13 17:11 .
drwxrwxrwx 4 root root 4096 Jan  2  2020 ..
prwxrwxrwx 1 root root    0 Nov 28  2019 AppContainerUserCertRead
drwxrwxrwx 2 root root 4096 Nov 28  2019 CRLs
drwxrwxrwx 2 root root 4096 Nov 28  2019 CTLs
```
You mentioned excluding such files from rmlint:
> `info proc` gives you the PID, which you can use like `ls -l /proc/45834/cwd` to get the current directory of rmlint. If there is actually a FIFO called `\230\305\002` in that directory (`stat /proc/PID/cwd/$'\230\305\002'`), then rmlint needs to better avoid FIFOs with stat() or O_NONBLOCK. Otherwise there's an issue in the kernel and dmesg can help.
Originally posted by @Cebtenzzre in https://github.com/sahib/rmlint/issues/555#issuecomment-1033137600
I am able to reproduce the issue like this:
```
$ mkfifo foo
$ chmod +x foo
$ rmlint -T ns .
(rmlint hangs)
```
I am not surprised that you are one of the first people to hit this, because it requires that the non-default nonstripped-binary search is enabled, that it encounters a FIFO, and that the FIFO is marked executable (apparently because it came from a Windows system). Here's the fix:
```diff
 lib/utilities.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/utilities.c b/lib/utilities.c
index 30d82607..00299728 100644
--- a/lib/utilities.c
+++ b/lib/utilities.c
@@ -467,7 +467,7 @@ bool rm_util_is_nonstripped(_UNUSED const char *path, _UNUSED RmStat *statp) {
 #if HAVE_LIBELF
     g_return_val_if_fail(path, false);
-    if(statp && (statp->st_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) == 0) {
+    if(!S_ISREG(statp->st_mode) || (statp->st_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) == 0) {
         return false;
     }
```
Your very fast patch fixed the issue for me :) So I'll close again. Thanks
Running master 2.10.1 and develop plus a few patches, I am getting a hang during the initial traverse with both versions.
```
Traversing (316260 usable files / 0 + 0 ignored files / folders)
```
Always at the same file count. After this, the drives spin down and nothing appears to be happening. The process continues to run, waiting for something. CPU usage is low. I checked RAM use - no problems there. 4GB RAM, ca. 2GB used. (Raspberry Pi 4B)
With `-vvv` set, the last message I (always) see is:
```
WARNING: Added big file /srv/path/Scans BU Dec 2014/Africa_Turmoil_WEB.pdf
```
Would it be this file causing the problem? (Probably more likely something afterwards.) Running gdb (develop version compiled with DEBUG=1), when I get to the "same place" I do a CTRL-C (not sure if this is useful).
I am not sure if these backtraces help - I am VERY rusty at debugging. Any hints on how to hunt this problem down are appreciated :)