namhyung / uftrace

Function graph tracer for C/C++/Rust/Python
https://uftrace.github.io/slide/
GNU General Public License v2.0
3.04k stars 473 forks source link

unexpected segfault in clang compiled binary in aarch64 #1280

Open honggyukim opened 3 years ago

honggyukim commented 3 years ago

Here is an example.

$ cat float.c
float gf1;

__attribute__((noinline))
void float_add(float a, float b)
{
        gf1 = a + b;
}

int main()
{
        float_add(-0.2, 0.2);

        return gf1 > 0;
}

We can compile this with clang.

$ clang -pg float.c

Then we can trace it with uftrace.

$ uftrace a.out
# DURATION     TID     FUNCTION
   0.458 us [1259295] | __monstartup();
   0.166 us [1259295] | __cxa_atexit();
            [1259295] | main() {
   0.083 us [1259295] |   float_add();
   0.209 us [1259295] | } /* main */

But if this program is compiled with optimization then it crashes.

$ clang -pg -O1 float.c

$ uftrace --no-sched a.out
WARN: Segmentation fault: address not mapped (addr: (nil))
WARN:  if this happens only with uftrace, please consider -e/--estimate-return option.

WARN: Backtrace from uftrace v0.10-8-g4bfe0 ( aarch64 dwarf python luajit tui perf sched dynamic )
WARN: =====================================
WARN: [0] (main[4006e4] <= __libc_start_main[ffff9f396984])

Please report this bug to https://github.com/namhyung/uftrace/issues.

WARN: child terminated by signal: 11: Segmentation fault
# DURATION     TID     FUNCTION
   0.334 us [1260505] | __monstartup();
   0.167 us [1260505] | __cxa_atexit();
            [1260505] | main() {
   0.042 us [1260505] |   float_add();
            [1260505] |   /* linux:task-exit */

uftrace stopped tracing with remaining functions
================================================
task: 1260505
[0] main

It doesn't have this problem when compiled by gcc with the same optimization.

$ gcc -pg -O1 float.c

$ uftrace a.out
# DURATION     TID     FUNCTION
   0.375 us [1260813] | __monstartup();
   0.250 us [1260813] | __cxa_atexit();
            [1260813] | main() {
   0.041 us [1260813] |   float_add();
   0.208 us [1260813] | } /* main */
honggyukim commented 3 years ago

It might be helpful to compare the binaries for both compilers.

$ gcc -pg -O1 -o float.gcc float.c
$ clang -pg -O1 -o float.clang float.c
$ clang -pg -O0 -o float.O0.clang float.c

Here is the float_add function of gcc compiled binary(float.gcc).

00000000004006a8 <float_add>:
  4006a8:       a9be7bfd        stp     x29, x30, [sp, #-32]!
  4006ac:       910003fd        mov     x29, sp
  4006b0:       6d0127e8        stp     d8, d9, [sp, #16]
  4006b4:       1e204008        fmov    s8, s0
  4006b8:       1e204029        fmov    s9, s1
  4006bc:       d50320ff        xpaclri
  4006c0:       aa1e03e0        mov     x0, x30
  4006c4:       97ffff97        bl      400520 <_mcount@plt>
  4006c8:       1e292908        fadd    s8, s8, s9
  4006cc:       90000100        adrp    x0, 420000 <__cxa_atexit@GLIBC_2.17>
  4006d0:       bd003408        str     s8, [x0, #52]
  4006d4:       6d4127e8        ldp     d8, d9, [sp, #16]
  4006d8:       a8c27bfd        ldp     x29, x30, [sp], #32
  4006dc:       d65f03c0        ret

Here is the clang compiled float_add binary(float.clang), which gets segfault.

00000000004006a8 <float_add>:
  4006a8:       6dbe23e9        stp     d9, d8, [sp, #-32]!
  4006ac:       a9017bfd        stp     x29, x30, [sp, #16]
  4006b0:       910003fd        mov     x29, sp
  4006b4:       4ea11c28        mov     v8.16b, v1.16b
  4006b8:       4ea01c09        mov     v9.16b, v0.16b
  4006bc:       97ffff99        bl      400520 <_mcount@plt>
  4006c0:       a9417bfd        ldp     x29, x30, [sp, #16]
  4006c4:       1e282920        fadd    s0, s9, s8
  4006c8:       90000108        adrp    x8, 420000 <__cxa_atexit@GLIBC_2.17>
  4006cc:       bd003500        str     s0, [x8, #52]
  4006d0:       6cc223e9        ldp     d9, d8, [sp], #32
  4006d4:       d65f03c0        ret

The clang compiled float_add without optimization(float.O0.clang) is as follows.

00000000004006a8 <float_add>:
  4006a8:       d10083ff        sub     sp, sp, #0x20
  4006ac:       a9017bfd        stp     x29, x30, [sp, #16]
  4006b0:       910043fd        add     x29, sp, #0x10
  4006b4:       bd0007e0        str     s0, [sp, #4]
  4006b8:       bd0003e1        str     s1, [sp]
  4006bc:       97ffff99        bl      400520 <_mcount@plt>
  4006c0:       bd4007e0        ldr     s0, [sp, #4]
  4006c4:       bc1fc3a0        stur    s0, [x29, #-4]
  4006c8:       bd4003e1        ldr     s1, [sp]
  4006cc:       bd000be1        str     s1, [sp, #8]
  4006d0:       bc5fc3a2        ldur    s2, [x29, #-4]
  4006d4:       bd400be3        ldr     s3, [sp, #8]
  4006d8:       1e232842        fadd    s2, s2, s3
  4006dc:       90000108        adrp    x8, 420000 <__cxa_atexit@GLIBC_2.17>
  4006e0:       bd003502        str     s2, [x8, #52]
  4006e4:       a9417bfd        ldp     x29, x30, [sp, #16]
  4006e8:       910083ff        add     sp, sp, #0x20
  4006ec:       d65f03c0        ret

As I observed this more, it seems that uftrace can't capture the function exit of float_add in clang compiled binary.

honggyukim commented 3 years ago

The test environment is as follows.

$ uftrace info
# system information
# ==================
# program version     : v0.10-16-ge532 ( aarch64 dwarf python luajit tui perf sched dynamic )
# recorded on         : Sat Jun 26 18:53:49 2021
# cmdline             : uftrace record --no-sched a.out
# cpu info            : ARM64 (v8)
# number of cpus      : 2 / 2 (online / possible)
# memory info         : 0.5 / 1.8 GB (free / total)
# system load         : 1.13 / 1.12 / 1.09 (1 / 5 / 15 min)
# kernel version      : Linux 5.10.21-200.fc33.aarch64
# hostname            : fedora
# distro              : "Fedora 33 (Workstation Edition)"
#
# process information
# ===================
# number of tasks     : 1
# task list           : 1256929(a.out)
# exe image           : /home/parallels/work/uftrace/tests/a.out
# build id            : e7ca28ead08be9684c9204081277928d21186878
# pattern             : regex
# exit status         : terminated by signal: 11 (Segmentation fault)
# elapsed time        : 0.085246028 sec
# cpu time            : 0.035 / 0.000 sec (sys / user)
# context switch      : 5 / 34 (voluntary / involuntary)
# max rss             : 13348 KB
# page fault          : 1 / 1216 (major / minor)
# disk iops           : 0 / 16 (read / write)
honggyukim commented 3 years ago

Here is the clang version info.

$ clang --version
clang version 11.0.0 (Fedora 11.0.0-3.fc33)
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
honggyukim commented 3 years ago

We can compare the assembly output at https://godbolt.org/z/jT7G9fPsP.

honggyukim commented 3 years ago

I'm just wondering what the following instructions are in the broken binary.

  4006b4:       4ea11c28        mov     v8.16b, v1.16b
  4006b8:       4ea01c09        mov     v9.16b, v0.16b
honggyukim commented 3 years ago

After digging into the problem with gdb, I see that the last br x16 goes to NULL when returning from main function.

ENTRY(mcount_return)
        /* setup frame pointer */
        stp     x29, x30, [sp, #-16]!

        /* save return values */
        stp     x0, x1, [sp, #-16]!
        str     q0, [sp, #-16]!

        /*
         * save indirect result location register
         * used in C++ for returning non-trivial objects
         */
        stp     x8, x18, [sp, #-16]!

        add     x0, sp, #32

        bl      mcount_exit
        mov     x16, x0          /* FIXME: here x16 becomes 0 */

        /* restore indirect result location register */
        ldp     x8, x18, [sp], #16

        /* restore return values */
        ldr     q0, [sp], #16
        ldp     x0, x1, [sp], #16

        /* restore frame pointer */
        ldp     x29, x30, [sp], #16

        br      x16
END(mcount_return)
namhyung commented 3 years ago

I suspect it's a clang's fault to set the frame pointer to an invalid value. In the _mcount, we expect x29 points to a stack frame containing 'x29' and 'x30' of the previous frame. But in the float.clang binary, it's saved in x29 + 0x10 so uftrace will use an invalid value (in d8 probably) as a parent location and it'll lead to a crash.