honggyukim closed 1 year ago
This problem was found in the runtest result.
$ ./runtest.py 052
Start 1 tests without worker pool
Compiler gcc clang
Test case pg finstrument-fu fpatchable-fun pg finstrument-fu fpatchable-fun
------------------------: O0 O1 O2 O3 Os O0 O1 O2 O3 Os O0 O1 O2 O3 Os O0 O1 O2 O3 Os O0 O1 O2 O3 Os O0 O1 O2 O3 Os
052 nested_func : OK OK OK OK OK OK OK OK OK OK SG SG SG SG SG SK SK SK SK SK SK SK SK SK SK SK SK SK SK SK
$ ./runtest.py -vde -O0 -c gcc 052
Start 1 tests without worker pool
Compiler gc
Test case fp
------------------------: O0
build command: gcc -o t-nested -fno-inline -fno-builtin -fno-ipa-cp -fno-omit-frame-pointer -D_FORTIFY_SOURCE=0 -fpatchable-function-entry=5 -O0 s-nested.c
test command: /home/honggyu/work/uftrace/uftrace live --no-pager --no-event --libmcount-path=/home/honggyu/work/uftrace -P . t-nested
WARN: Segmentation fault: invalid permission (addr: 0x7fffe25b4080)
WARN: if this happens only with uftrace, please consider -e/--estimate-return option.
WARN: Backtrace from uftrace v0.13.1-24-g55da ( x86_64 dwarf python3 luajit tui perf sched dynamic )
WARN: =====================================
WARN: [2] (foo_internal.0[55d053dea18e] <= foo[55d053dea1de])
WARN: [1] (foo[55d053dea1ab] <= main[55d053dea342])
WARN: [0] (main[55d053dea2ff] <= <7f1562473d90>[7f1562473d90])
WARN: child terminated by signal: 11: Segmentation fault
052 nested_func : SG
I see that the r10 register also has to be pushed onto the stack, and this change fixes the problem.
diff --git a/arch/x86_64/fentry.S b/arch/x86_64/fentry.S
index c25bbec6..76455589 100644
--- a/arch/x86_64/fentry.S
+++ b/arch/x86_64/fentry.S
@@ -28,10 +28,11 @@
GLOBAL(__fentry__)
.cfi_startproc
- sub $48, %rsp
- .cfi_adjust_cfa_offset 48
+ sub $56, %rsp
+ .cfi_adjust_cfa_offset 56
/* save register arguments in mcount_args */
+ movq %r10, 48(%rsp)
movq %rdi, 40(%rsp)
movq %rsi, 32(%rsp)
movq %rdx, 24(%rsp)
@@ -40,10 +41,10 @@ GLOBAL(__fentry__)
movq %r9, 0(%rsp)
/* child addr */
- movq 48(%rsp), %rsi
+ movq 56(%rsp), %rsi
/* parent location */
- lea 56(%rsp), %rdi
+ lea 64(%rsp), %rdi
/* mcount_args */
movq %rsp, %rdx
@@ -72,9 +73,10 @@ GLOBAL(__fentry__)
movq 24(%rsp), %rdx
movq 32(%rsp), %rsi
movq 40(%rsp), %rdi
+ movq 48(%rsp), %r10
- add $48, %rsp
- .cfi_adjust_cfa_offset -48
+ add $56, %rsp
+ .cfi_adjust_cfa_offset -56
retq
.cfi_endproc
END(__fentry__)
It works fine as follows.
$ uftrace -P. a.out
# DURATION TID FUNCTION
[167273] | main() {
[167273] | foo() {
0.065 us [167273] | foo_internal.0();
0.723 us [167273] | } /* foo */
[167273] | bar() {
[167273] | qsort() {
0.088 us [167273] | compar.1();
0.086 us [167273] | compar.1();
0.086 us [167273] | compar.1();
1.269 us [167273] | } /* qsort */
1.730 us [167273] | } /* bar */
3.087 us [167273] | } /* main */
But I need to study what the r10 register does. If r10 really has to be preserved, then the same is needed in the normal mcount entry.
I also have to adjust the above diff by relocating the r10 push to below the r9 push.
It seems that r10 is used to store the outer function's stack pointer.
So, the trampoline loads the outer function's stack pointer into %r10 and jumps to the nested function's body. … As you can see, the nested function uses %r10 to access the outer function's variables.
https://stackoverflow.com/questions/8179521/implementation-of-nested-functions
https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions says
R10 is used as a static chain pointer in case of nested functions[28]: 21
There is a related commit at https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/adc986909f2c5aad7aeedbcc05b5f449a2eabfbf that is applied to the original abi.pdf.
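To make the role of r10 concrete, here is a minimal sketch of a GNU C nested function. It is a hypothetical reduction for illustration, not the actual s-nested.c source; only the names foo and foo_internal are borrowed from the trace output above. It needs GCC, since nested functions are a GNU extension.

```c
/* Hypothetical reduction for illustration, not the real s-nested.c. */
int foo(void)
{
    int counter = 0;

    /*
     * Nested function: it reads and writes 'counter', which lives in
     * foo's stack frame. On x86-64, GCC hands it a pointer into foo's
     * frame (the static chain) in %r10 at the call.
     */
    void foo_internal(void)
    {
        counter++;
    }

    /* Direct calls: the caller sets %r10 itself, no trampoline involved. */
    foo_internal();
    foo_internal();

    return counter;
}
```

If an entry hook such as __fentry__ runs between the call and the first use of %r10 and overwrites the register, the nested function dereferences a bogus pointer, which is consistent with the segfault in foo_internal.0 shown above. Here foo() returns 2 when %r10 is left intact.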
If r10 really has to be preserved, then the same is needed in the normal mcount entry.
Unlike with dynamic tracing, r10 is preserved before calling mcount when the -pg option is used.
$ gcc -pg -O2 -o t-nested s-nested.c
$ objdump -d t-nested
...
00000000000012e0 <foo_internal.0>:
12e0: 55 push %rbp
12e1: 48 89 e5 mov %rsp,%rbp
12e4: 41 52 push %r10
12e6: ff 15 fc 2c 00 00 call *0x2cfc(%rip) # 3fe8 <mcount@GLIBC_2.2.5>
12ec: 41 5a pop %r10
12ee: 41 8b 02 mov (%r10),%eax
12f1: 8d 50 01 lea 0x1(%rax),%edx
12f4: 41 89 12 mov %edx,(%r10)
12f7: 5d pop %rbp
12f8: c3 ret
12f9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
So we don't have to push the r10 register in the mcount entry, and the r10 push is needed both in __fentry__ and __dentry__.
r10 push is needed both in __fentry__ and __dentry__.
I need to check if it's needed in __dentry__ again because it just works fine.
$ gcc s-nested.c
$ uftrace -P. a.out
# DURATION TID FUNCTION
[966532] | main() {
[966532] | foo() {
0.140 us [966532] | foo_internal.0();
1.219 us [966532] | } /* foo */
[966532] | bar() {
[966532] | qsort() {
0.173 us [966532] | compar.1();
0.075 us [966532] | compar.1();
0.070 us [966532] | compar.1();
1.476 us [966532] | } /* qsort */
1.995 us [966532] | } /* bar */
4.244 us [966532] | } /* main */
The safer option would be to save r10 anyway. But I'd like to check with misc/bench.sh how much it would affect performance.
I'm also thinking of SSE registers, as float type arguments are broken in some cases. I also need to check the performance impact of that.
I need to check if it's needed in __dentry__ again because it just works fine.
It may depend on the compiler version; I got a crash with dynamic tracing on my system.
Hmm.. I found that __dentry__ already saves r10. Then I need to investigate why I got the crash. :(
Ok, it was a permission problem. It required -Wl,-z,execstack on my system. Without that, it crashed even without uftrace.
https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions says
This also says r11 and rax have to be saved along with r10, but I'm not sure if it applies on Linux as well because the reference is from Microsoft.
The registers RAX, RCX, RDX, R8, R9, R10, R11 are considered volatile (caller-saved).[25]
As you mentioned, __dentry__ already saves r10, and I see that it also saves rax and r11 as follows.
GLOBAL(__dentry__)
...
/* save rax (implicit argument for variadic functions) */
push %rax
/* save scratch registers due to -fipa-ra */
push %r10
push %r11
call mcount_entry
...
Shouldn't we save rax, r10 and r11 in __fentry__ as well? I will have to run misc/bench.sh, but I guess 3 more instructions won't be a serious overhead.
I'm also thinking of SSE registers, as float type arguments are broken in some cases.
Yeah, we also need to check it because the wiki describes it as follows.
In x86-64, Visual Studio 2008 stores floating point numbers in XMM6 and XMM7 (as well as XMM8 through XMM15); consequently, for x86-64, user-written assembly language routines must preserve XMM6 and XMM7 (as compared to x86 wherein user-written assembly language routines did not need to preserve XMM6 and XMM7). In other words, user-written assembly language routines must be updated to save/restore XMM6 and XMM7 before/after the function when being ported from x86 to x86-64.
Sure, please save them in __fentry__, mcount and plt_hooker. We should have the same logic in those functions, but maybe plt_hooker can be a little different due to r11. I've checked the overhead with the bench test and it was negligible.
When it comes to SSE registers, I tried saving xmm0 and xmm1 but the result was still not good. :(
Sure, please save them in __fentry__, mcount and plt_hooker. We should have the same logic in those functions, but maybe plt_hooker can be a little different due to r11. I've checked the overhead with the bench test and it was negligible.
I will do that later when I have time.
When it comes to SSE registers, I tried saving xmm0 and xmm1 but the result was still not good. :(
That's not related to this issue. Please see #1631.
The s-nested.c can be compiled as follows.
But this crashes when recording with -P.
This can simply be reproduced with -P foo_internal.0.
It looks fine in foo_internal.0 with objdump -d.
But this segfaults as follows.