Open ParkHanbum opened 5 years ago
Thanks for the report. I'll take a look later..
I also see this problem. It's get crashed running with -vv
.
# gdb --args /home/root/uftrace record -vv -D 3 ./node.arm.pg -e ''
...
$0m[90mmcount: [mod:$76fff958, idx: 189]$enter 17fcf80:
$0m[90mmcount: [mod$ 76fff958, idx: 188$ enter 17fcf80:
$0m[9$m[90mmcount: [mod: $6fff958, idx: 190] $nter 17fcf80:
D0m[90mmcount:$[mod: 76fff958, idx$ 14] enter 17fcf80:$
Thread 2.2 "node.arm.pg" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 5369]
0x76b46118 in ?? () from /lib/libc.so.6
(gdb) bt
#0 0x76b46118 in ?? () from /lib/libc.so.6
#1 0x76b43b64 in vfprintf () from /lib/libc.so.6
#2 0x76fb297a in __pr_dbg (fmt=0x76fc2d8c "mcount: preparing shmem buffers: tid = %d\n") at /usr/src/debug/uftrace/0.9.2-r0/git/utils/debug.c:154
#3 0x76fb08d0 in prepare_shmem_buffer (mtdp=mtdp@entry=0x76fa6888) at /usr/src/debug/uftrace/0.9.2-r0/git/libmcount/record.c:71
#4 0x76fad40a in mcount_prepare () at /usr/src/debug/uftrace/0.9.2-r0/git/libmcount/mcount.c:764
#5 0x76fadc8a in mcount_entry (parent_loc=0x76fa5dbc, child=8378799, regs=0x76fa5da8) at /usr/src/debug/uftrace/0.9.2-r0/git/libmcount/mcount.c:1191
#6 0x76fc03cc in __gnu_mcount_nc () at /usr/src/debug/uftrace/0.9.2-r0/git/arch/arm/mcount.S:57
#7 0x007fd9ae in node::DebugSignalThreadMain (unused=0x0) at /usr/src/debug/nodejs/6.11.2-r3.1/node-v6.11.2/src/node.cc:4091
#8 0x76c45ef4 in ?? () from /lib/libpthread.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
The problem shows when -vv
is used with --logfile
. I made the problem as simple as possible and found the strange problem in the below pr_dbg2
statement:
diff --git a/libmcount/plthook.c b/libmcount/plthook.c
index c77c4df7..7c65c4b6 100644
--- a/libmcount/plthook.c
+++ b/libmcount/plthook.c
@@ -747,7 +747,7 @@ unsigned long plthook_entry(unsigned long *ret_addr, unsigned long child_idx,
if (likely(child_idx < pd->dsymtab.nr_sym)) {
sym = &pd->dsymtab.sym[child_idx];
- pr_dbg3("[idx: %4d] enter %lx: %s (mod: %lx)\n",
+ pr_dbg2("[idx: %4d] enter %lx: %s (mod: %lx)\n", /* <- segfault here */
child_idx, sym->addr, sym->name, module_id);
}
else {
I actually changed pr_dbg3
to pr_dbg2
for testing and found the problem here. If I remove this line, then it works fine. After testing, I made the problem same as the original pr_dbg2
statement.
diff --git a/libmcount/plthook.c b/libmcount/plthook.c
index c77c4df7..4c271159 100644
--- a/libmcount/plthook.c
+++ b/libmcount/plthook.c
@@ -747,8 +771,7 @@ unsigned long plthook_entry(unsigned long *ret_addr, unsigned long child_idx,
if (likely(child_idx < pd->dsymtab.nr_sym)) {
sym = &pd->dsymtab.sym[child_idx];
- pr_dbg3("[idx: %4d] enter %lx: %s (mod: %lx)\n",
- child_idx, sym->addr, sym->name, module_id);
+ fprintf(logfp, "12345678");
}
else {
sym = NULL;
The strange thing is that if I change fprintf(logfp, "12345678")
to fprintf(logfp, "1234567")
, then the problem is gone. It seems that I cannot print 8 characters in this place because it was fine in other places.
I also often see that the pr_dbg
statement shows sym->name
as (null)
most of the cases. I'm not sure if it's related to this problem.
As shown in the above backtrace log I posted, it might be related to print internals for logfp
.
(gdb) bt
#0 0x76b46118 in ?? () from /lib/libc.so.6
#1 0x76b43b64 in vfprintf () from /lib/libc.so.6
#2 0x76fb297a in __pr_dbg (fmt=0x76fc2d8c "mcount: preparing shmem buffers: tid = %d\n") at /usr/src/debug/uftrace/0.9.2-r0/git/utils/debug.c:154
#3 0x76fb08d0 in prepare_shmem_buffer (mtdp=mtdp@entry=0x76fa6888) at /usr/src/debug/uftrace/0.9.2-r0/git/libmcount/record.c:71
#4 0x76fad40a in mcount_prepare () at /usr/src/debug/uftrace/0.9.2-r0/git/libmcount/mcount.c:764
#5 0x76fadc8a in mcount_entry (parent_loc=0x76fa5dbc, child=8378799, regs=0x76fa5da8) at /usr/src/debug/uftrace/0.9.2-r0/git/libmcount/mcount.c:1191
#6 0x76fc03cc in __gnu_mcount_nc () at /usr/src/debug/uftrace/0.9.2-r0/git/arch/arm/mcount.S:57
#7 0x007fd9ae in node::DebugSignalThreadMain (unused=0x0) at /usr/src/debug/nodejs/6.11.2-r3.1/node-v6.11.2/src/node.cc:4091
#8 0x76c45ef4 in ?? () from /lib/libpthread.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Otherwise, there might be some nasty bug around this place of code that triggers the segfault here.
@honggyukim good job. I think it is better to reduce that can affect target program.
I have no clue for this problem as of now. Maybe it's related to the size of the message printed since it's ok with -v
. It seems to need to check the internals of printf()
anyway.
It'd be useful if we can reproduce with a very simple test example but I'm not sure if it's possible.
update.
problem occurred at here:
#0 buffered_vfprintf (s=s@entry=0x7ffff6c22680 <_IO_2_1_stderr_>, format=format@entry=0x7ffff7bc1ae8 "mcount: preparing shmem buffers: tid = %d\n", args=args@entry=0x7ffff7fd2a80)
at vfprintf.c:2308
#1 0x00007ffff6891726 in _IO_vfprintf_internal (s=0x7ffff6c22680 <_IO_2_1_stderr_>, format=0x7ffff7bc1ae8 "mcount: preparing shmem buffers: tid = %d\n", ap=0x7ffff7fd2a80)
at vfprintf.c:1301
#2 0x00007ffff7ba52e0 in __pr_dbg (fmt=0x7ffff7bc1ae8 "mcount: preparing shmem buffers: tid = %d\n") at /home/m/git/uftrace/utils/debug.c:156
#3 0x00007ffff7b9ff2b in prepare_shmem_buffer (mtdp=0x7ffff7fd3620) at /home/m/git/uftrace/libmcount/record.c:71
#4 0x00007ffff7b998cd in mcount_prepare () at /home/m/git/uftrace/libmcount/mcount.c:780
#5 0x00007ffff7b9a5cb in __mcount_entry (parent_loc=0x7ffff7fd2df8, child=93824999852295, regs=0x7ffff7fd2d78) at /home/m/git/uftrace/libmcount/mcount.c:1207
#6 0x00007ffff7b9a8f0 in mcount_entry (parent_loc=0x7ffff7fd2df8, child=93824999852295, regs=0x7ffff7fd2d78) at /home/m/git/uftrace/libmcount/mcount.c:1289
#7 0x00007ffff7bbda07 in mcount () at /home/m/git/uftrace/arch/x86_64/mcount.S:61
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Dump of assembler code for function buffered_vfprintf:
0x00007ffff68945d0 <+0>: push %r14
0x00007ffff68945d2 <+2>: push %r13
0x00007ffff68945d4 <+4>: push %r12
0x00007ffff68945d6 <+6>: push %rbp
0x00007ffff68945d7 <+7>: push %rbx
0x00007ffff68945d8 <+8>: mov %rdi,%rbx
0x00007ffff68945db <+11>: sub $0x2140,%rsp
0x00007ffff68945e2 <+18>: mov %fs:0x28,%rax
0x00007ffff68945eb <+27>: mov %rax,0x2138(%rsp)
0x00007ffff68945f3 <+35>: xor %eax,%eax
0x00007ffff68945f5 <+37>: mov 0xc0(%rdi),%eax
0x00007ffff68945fb <+43>: test %eax,%eax
0x00007ffff68945fd <+45>: jne 0x7ffff6894770 <buffered_vfprintf+416>
0x00007ffff6894603 <+51>: movl $0xffffffff,0xc0(%rdi)
0x00007ffff689460d <+61>: lea 0x130(%rsp),%rax
0x00007ffff6894615 <+69>: lea 0x30(%rsp),%rdi
=> 0x00007ffff689461a <+74>: mov %rbx,0x110(%rsp)
I can't understand why buffered_vfprintf require huge size of stack.
0x7ffff7fd0000 0x7ffff7fd1000 0x1000 0x0
0x7ffff7fd1000 0x7ffff7fd5000 0x4000 0x0
(gdb) info reg $rsp
rsp 0x7ffff7fd0380 0x7ffff7fd0380
(gdb) info reg $rbp
rbp 0x7ffff7fd2a60 0x7ffff7fd2a60
rsp register's value point to another memory region. I think that 0x7ffff7fd0000 - 0x7ffff7fd1000 memory region was not writable.
(gdb) frame 6
#6 0x00007ffff7b9a8f0 in mcount_entry (parent_loc=0x7ffff7fd2df8, child=93824999852295, regs=0x7ffff7fd2d78) at /home/m/git/uftrace/libmcount/mcount.c:1289
1289 int ret = __mcount_entry(parent_loc, child, regs);
(gdb) x/i child
0x555555c98907 <node::inspector::(anonymous namespace)::StartIoThreadMain(void*)+23>:
lea 0x1c52772(%rip),%rbx # 0x5555578eb080 <_ZN4node9inspector12_GLOBAL__N_1L21start_io_thread_asyncE>
(gdb) x/i *parent_loc
0x7ffff6c2e6db <start_thread+219>: mov %rax,%fs:0x630
I'm not sure but I think stack overflow is the cause.
(gdb) frame
#6 0x00007ffff7b9a8f0 in mcount_entry (parent_loc=0x7ffff7fd2df8, child=93824999852295, regs=0x7ffff7fd2d78) at /home/m/git/uftrace/libmcount/mcount.c:1289
1289 int ret = __mcount_entry(parent_loc, child, regs);
(gdb) info reg $rbp
rbp 0x7ffff7fd2d50 0x7ffff7fd2d50
(gdb) info reg $rsp
rsp 0x7ffff7fd2d00 0x7ffff7fd2d00
Hmm.. the stack should be increased automatically. Maybe some task has set a resource limit for stack ulimit -s
.
Anyway does following patch make any difference?
diff --git a/libmcount/record.c b/libmcount/record.c
index fccd103..b19f19e 100644
--- a/libmcount/record.c
+++ b/libmcount/record.c
@@ -63,7 +63,7 @@ static struct mcount_shmem_buffer *allocate_shmem_buffer(char *sess_id, size_t s
void prepare_shmem_buffer(struct mcount_thread_data *mtdp)
{
- char buf[128];
+ char buf[64];
int idx;
int tid = mcount_gettid(mtdp);
struct mcount_shmem *shmem = &mtdp->shmem;
@namhyung unfortunately, it was not affect.
$ ulimit -s
16000
0x7ffff7fd0000 0x7ffff7fd1000 0x1000 0x0
0x7ffff7fd1000 0x7ffff7fd5000 0x4000 0x0
0x7ffff7fd5000 0x7ffff7ff5000 0x20000 0x0 /dev/shm/uftrace-b2dadc4392e4f366-1466-000
0x7ffff7ff5000 0x7ffff7ff7000 0x2000 0x0
0x7ffff7ff7000 0x7ffff7ffa000 0x3000 0x0 [vvar]
0x7ffff7ffa000 0x7ffff7ffc000 0x2000 0x0 [vdso]
0x7ffff7ffc000 0x7ffff7ffd000 0x1000 0x27000 /lib/x86_64-linux-gnu/ld-2.27.so
0x7ffff7ffd000 0x7ffff7ffe000 0x1000 0x28000 /lib/x86_64-linux-gnu/ld-2.27.so
0x7ffff7ffe000 0x7ffff7fff000 0x1000 0x0
0x7ffffffc4000 0x7ffffffff000 0x3b000 0x0 [stack]
0xffffffffff600000 0xffffffffff601000 0x1000 0x0 [vsyscall]
(gdb) info reg $rsp
rsp 0x7ffff7fd03e0 0x7ffff7fd03e0
(gdb) info reg $rbp
rbp 0x7ffff7fd2ac0 0x7ffff7fd2ac0
maybe increased stack size by using ulimit was not affect to forked process.
Yeah, process itself can change its rlimit. Maybe some process/thread set it to a very small number and doesn't expect to call printf()
or so...
here is the backtrace from coredump.