namhyung / uftrace

Function graph tracer for C/C++/Rust/Python
https://uftrace.github.io/slide/
GNU General Public License v2.0
3.06k stars 474 forks source link

s-nested.c with -fpatchable-function-entry gets crashed #1630

Closed honggyukim closed 1 year ago

honggyukim commented 1 year ago

The s-nested.c can be compiled as follows.

$ gcc -fpatchable-function-entry=5 s-nested.c

But this gets crashed when recording with -P.

$ uftrace -P. a.out 
WARN: Segmentation fault: invalid permission (addr: 0x7ffc7c59a080)
WARN:  if this happens only with uftrace, please consider -e/--estimate-return option.

WARN: Backtrace from uftrace v0.13.1-24-g55da ( x86_64 dwarf python3 luajit tui perf sched dynamic )
WARN: =====================================
WARN: [2] (foo_internal.0[5641d530e18e] <= foo[5641d530e1de])
WARN: [1] (foo[5641d530e1ab] <= main[5641d530e342])
WARN: [0] (main[5641d530e2ff] <= <7fbe80e56d90>[7fbe80e56d90])

Please report this bug to https://github.com/namhyung/uftrace/issues.

WARN: child terminated by signal: 11: Segmentation fault
# DURATION     TID     FUNCTION
            [125510] | main() {
            [125510] |   foo() {
            [125510] |     foo_internal.0() {

uftrace stopped tracing with remaining functions
================================================
task: 125510
[2] foo_internal.0
[1] foo
[0] main

This can simply be reproduced with -P foo_internal.0.

$ uftrace record -P foo_internal.0 --debug-domain dynamic:3 a.out 
dynamic: dynamic patch type: a.out: 5 (fpatchable)
dynamic: update 0x560ebe2c8189 for 'foo_internal.0' function dynamically to call __fentry__
dynamic: patched all (1) functions in 'a.out'
WARN: Segmentation fault: invalid permission (addr: 0x7fff263b3080)
WARN:  if this happens only with uftrace, please consider -e/--estimate-return option.

WARN: Backtrace from uftrace v0.13.1-24-g55da ( x86_64 dwarf python3 luajit tui perf sched dynamic )
WARN: =====================================
WARN: [0] (foo_internal.0[560ebe2c818e] <= foo[560ebe2c81de])

Please report this bug to https://github.com/namhyung/uftrace/issues.

WARN: child terminated by signal: 11: Segmentation fault

It looks fine in foo_internal.0 with objdump -d.

0000000000001189 <foo_internal.0>:
    1189:       90                      nop
    118a:       90                      nop
    118b:       90                      nop
    118c:       90                      nop
    118d:       90                      nop
    118e:       55                      push   %rbp
    118f:       48 89 e5                mov    %rsp,%rbp
    1192:       4c 89 d2                mov    %r10,%rdx
    1195:       4c 89 55 f8             mov    %r10,-0x8(%rbp)
    1199:       8b 02                   mov    (%rdx),%eax
    119b:       8d 48 01                lea    0x1(%rax),%ecx
    119e:       89 0a                   mov    %ecx,(%rdx)
    11a0:       5d                      pop    %rbp
    11a1:       c3                      ret

But this gets segfaulted as follows.

$ cgdb --args uftrace record -P foo_internal.0 a.out
        ...
 1│ Dump of assembler code for function foo_internal.0:
 2│    0x0000555555555189 <+0>:     call   0x555555555ff0
 3│    0x000055555555518e <+5>:     push   rbp
 4│    0x000055555555518f <+6>:     mov    rbp,rsp
 5│    0x0000555555555192 <+9>:     mov    rdx,r10
 6│    0x0000555555555195 <+12>:    mov    QWORD PTR [rbp-0x8],r10
 7│    0x0000555555555199 <+16>:    mov    eax,DWORD PTR [rdx]
 8│    0x000055555555519b <+18>:    lea    ecx,[rax+0x1]
 9├──> 0x000055555555519e <+21>:    mov    DWORD PTR [rdx],ecx
10│    0x00005555555551a0 <+23>:    pop    rbp
11│    0x00005555555551a1 <+24>:    ret
12│ End of assembler dump.
** Dump of assembler code for function foo_internal.0: (555555555189 - 5555555551a1) **

Thread 2.1 "a.out" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff728e840 (LWP 125622)]
0x000055555555519e in foo_internal ()
honggyukim commented 1 year ago

This problem is found from runtest result.

$ ./runtest.py 052
Start 1 tests without worker pool
Compiler                  gcc                                           clang                                       
Test case                 pg             finstrument-fu fpatchable-fun  pg             finstrument-fu fpatchable-fun
------------------------: O0 O1 O2 O3 Os O0 O1 O2 O3 Os O0 O1 O2 O3 Os  O0 O1 O2 O3 Os O0 O1 O2 O3 Os O0 O1 O2 O3 Os
052 nested_func         : OK OK OK OK OK OK OK OK OK OK SG SG SG SG SG  SK SK SK SK SK SK SK SK SK SK SK SK SK SK SK
$ ./runtest.py -vde -O0 -c gcc 052
Start 1 tests without worker pool
Compiler                  gc
Test case                 fp
------------------------: O0
build command: gcc -o t-nested -fno-inline -fno-builtin -fno-ipa-cp -fno-omit-frame-pointer -D_FORTIFY_SOURCE=0  -fpatchable-function-entry=5 -O0  s-nested.c   
test command: /home/honggyu/work/uftrace/uftrace live --no-pager --no-event --libmcount-path=/home/honggyu/work/uftrace  -P . t-nested
WARN: Segmentation fault: invalid permission (addr: 0x7fffe25b4080)
WARN:  if this happens only with uftrace, please consider -e/--estimate-return option.

WARN: Backtrace from uftrace v0.13.1-24-g55da ( x86_64 dwarf python3 luajit tui perf sched dynamic )
WARN: =====================================
WARN: [2] (foo_internal.0[55d053dea18e] <= foo[55d053dea1de])
WARN: [1] (foo[55d053dea1ab] <= main[55d053dea342])
WARN: [0] (main[55d053dea2ff] <= <7f1562473d90>[7f1562473d90])
WARN: child terminated by signal: 11: Segmentation fault
052 nested_func         : SG
honggyukim commented 1 year ago

I see that r10 register also has to be pushed onto stack and this change fixes the problem.

diff --git a/arch/x86_64/fentry.S b/arch/x86_64/fentry.S
index c25bbec6..76455589 100644
--- a/arch/x86_64/fentry.S
+++ b/arch/x86_64/fentry.S
@@ -28,10 +28,11 @@

 GLOBAL(__fentry__)
        .cfi_startproc
-       sub $48, %rsp
-       .cfi_adjust_cfa_offset 48
+       sub $56, %rsp
+       .cfi_adjust_cfa_offset 56

        /* save register arguments in mcount_args */
+       movq %r10, 48(%rsp)
        movq %rdi, 40(%rsp)
        movq %rsi, 32(%rsp)
        movq %rdx, 24(%rsp)
@@ -40,10 +41,10 @@ GLOBAL(__fentry__)
        movq %r9,   0(%rsp)

        /* child addr */
-       movq 48(%rsp), %rsi
+       movq 56(%rsp), %rsi

        /* parent location */
-       lea 56(%rsp), %rdi
+       lea 64(%rsp), %rdi

        /* mcount_args */
        movq %rsp, %rdx
@@ -72,9 +73,10 @@ GLOBAL(__fentry__)
        movq 24(%rsp), %rdx
        movq 32(%rsp), %rsi
        movq 40(%rsp), %rdi
+       movq 48(%rsp), %r10

-       add $48, %rsp
-       .cfi_adjust_cfa_offset -48
+       add $56, %rsp
+       .cfi_adjust_cfa_offset -56
        retq
        .cfi_endproc
 END(__fentry__)

It works fine as follows.

$ uftrace -P. a.out 
# DURATION     TID     FUNCTION
            [167273] | main() {
            [167273] |   foo() {
   0.065 us [167273] |     foo_internal.0();
   0.723 us [167273] |   } /* foo */
            [167273] |   bar() {
            [167273] |     qsort() {
   0.088 us [167273] |       compar.1();
   0.086 us [167273] |       compar.1();
   0.086 us [167273] |       compar.1();
   1.269 us [167273] |     } /* qsort */
   1.730 us [167273] |   } /* bar */
   3.087 us [167273] | } /* main */

But I need to study what r10 register does. If r10 really have to be preserved then the same is needed in the normal mcount entry.

I also have to adjust the above diff by relocate r10 push to the below of r9 push.

honggyukim commented 1 year ago

It seems that r10 is used to store the outter function’s stack pointer.

So, the trampoline loads the outer function's stack pointer into %r10 and jumps to the nested function's body. … As you can see, the nested function uses %r10 to access the outer function's variables.

https://stackoverflow.com/questions/8179521/implementation-of-nested-functions

honggyukim commented 1 year ago

https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions says

R10 is used as a static chain pointer in case of nested functions[28]: 21

There is a related commit at https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/adc986909f2c5aad7aeedbcc05b5f449a2eabfbf that is applied to the original abi.pdf,

honggyukim commented 1 year ago

If r10 really have to be preserved then the same is needed in the normal mcount entry.

Unlike dynamic tracing, r10 is preserved before calling mcount when -pg option is used.

$ gcc -pg -O2 -o t-nested s-nested.c
$ objdump -d t-nested
      ...
00000000000012e0 <foo_internal.0>:
    12e0:       55                      push   %rbp
    12e1:       48 89 e5                mov    %rsp,%rbp
    12e4:       41 52                   push   %r10
    12e6:       ff 15 fc 2c 00 00       call   *0x2cfc(%rip)        # 3fe8 <mcount@GLIBC_2.2.5>
    12ec:       41 5a                   pop    %r10
    12ee:       41 8b 02                mov    (%r10),%eax
    12f1:       8d 50 01                lea    0x1(%rax),%edx
    12f4:       41 89 12                mov    %edx,(%r10)
    12f7:       5d                      pop    %rbp
    12f8:       c3                      ret
    12f9:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

So we don't have to push r10 register in mcount entry and r10 push is needed both in __fentry__ and __dentry__.

honggyukim commented 1 year ago

r10 push is needed both in __fentry__ and __dentry__.

I need to check if it's needed in __dentry__ again because it just works fine.

$ gcc s-nested.c 

$ uftrace -P. a.out 
# DURATION     TID     FUNCTION
            [966532] | main() {
            [966532] |   foo() {
   0.140 us [966532] |     foo_internal.0();
   1.219 us [966532] |   } /* foo */
            [966532] |   bar() {
            [966532] |     qsort() {
   0.173 us [966532] |       compar.1();
   0.075 us [966532] |       compar.1();
   0.070 us [966532] |       compar.1();
   1.476 us [966532] |     } /* qsort */
   1.995 us [966532] |   } /* bar */
   4.244 us [966532] | } /* main */
namhyung commented 1 year ago

The safer option would be to save r10 anyway. But I'd like to check it with misc/bench.sh how much it'd affect.

Also I'm thinking of SSE registers too as float type argument is broken some cases. Also need to check the performance impact.

namhyung commented 1 year ago

I need to check if it's needed in dentry again because it just works fine.

It may depend on the compiler version, I got a crash with the dynamic tracing on my system.

namhyung commented 1 year ago

Hmm.. I found __dentry__ already saves r10. Then I need to investigate why it got the crash. :(

namhyung commented 1 year ago

Ok, it was a permission problem. It required -Wl,-z,execstack on my system. Without that, it crashed even not with uftrace.

honggyukim commented 1 year ago

https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions says

This also says r11 and rax also have to be saved along with r10, but not sure if it applies in Linux as well because the reference is from microsoft.

The registers RAX, RCX, RDX, R8, R9, R10, R11 are considered volatile (caller-saved).[25]

As you mentioned __dentry__ already saves r10 and I see that it also saves rax and r11 as follows.

GLOBAL(__dentry__)
                ...
        /* save rax (implicit argument for variadic functions) */
        push %rax

        /* save scratch registers due to -fipa-ra */
        push %r10
        push %r11

        call mcount_entry
                ...

Maybe shouldn't we save rax, r10 and r11 in __fentry__ as well? I will have to run misc/bench.sh, but I guess having 3 more instructions doesn't look like a serious overhead.

honggyukim commented 1 year ago

Also I'm thinking of SSE registers too as float type argument is broken some cases.

Yeah, we also need to check it because the wiki describes it as follows.

In x86-64, Visual Studio 2008 stores floating point numbers in XMM6 and XMM7 (as well as XMM8 through XMM15); consequently, for x86-64, user-written assembly language routines must preserve XMM6 and XMM7 (as compared to x86 wherein user-written assembly language routines did not need to preserve XMM6 and XMM7). In other words, user-written assembly language routines must be updated to save/restore XMM6 and XMM7 before/after the function when being ported from x86 to x86-64.

namhyung commented 1 year ago

Sure, please save them in __fentry__, mcount and plt_hooker. We should have the same logic in those functions - but maybe plt_hooker can be little different due to the r11. I've checked the overhead with the bench test and it was negligible.

When it comes to SSE registers, I tried it with saving the xmm0 and xmm1 but the result was still not good. :(

honggyukim commented 1 year ago

Sure, please save them in fentry, mcount and plt_hooker. We should have the same logic in those functions - but maybe plt_hooker can be little different due to the r11. I've checked the overhead with the bench test and it was negligible.

I would do that later when I have time.

When it comes to SSE registers, I tried it with saving the xmm0 and xmm1 but the result was still not good. :(

That's not related to this issue. Please see #1631.