koute / not-perf

A sampling CPU profiler for Linux
Apache License 2.0

abort() called from nwind_on_exception_through_trampoline on AArch64 #14

Closed: stoperro closed this issue 4 years ago

stoperro commented 4 years ago

We have an occasional issue where using not-perf crashes the profiled process on AArch64 hardware. We don't have much data yet, but the call stack is as follows:

  - tid: 23863 # --------------------------------------------------
    proc_dump: ~
    user_time: 4.170000
    system_time: 2.450000
    registers: [
      0x0000000000000000, 0x0000007ff4cdfdf0, 0x0000000000000000, 0x0000000000000008,
      0x0000000000000000, 0x0000007ff4cdfdf0, 0xffffffffffffffff, 0xffffffffffffffff,
      0x0000000000000087, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff,
      0xffffffffffffffff, 0xffffffffffffffff, 0x0000000000000000, 0x0000000000000035,
      0x0000007f7d2b2a60, 0x0000007f78d08ec8, 0x0000007f78e3a7f4, 0x0000000000000006,
      0x0000007f783b2020, 0x0000007f783b2720, 0x0000000000000100, 0x0000007f77f66028,
      0x0000007ff4ce0618, 0x0000007ff4ce06b0, 0x0000007ff4ce0638, 0x0000000000000000,
      0x0000000000000000, 0x0000007ff4cdfdd0, 0x0000007f78d07f0c, 0x0000007ff4cdfdd0,
      0x0000007f78d07f0c, 0x0000000000000000 ]
    backtrace: [
      { a: 0000007f78d07f0c, s: gsignal,              o:  0x9c, l:  0xcc, e: 0, S: 0, f: "/usr/lib64/libc-2.28.so" },
      { a: 0000007f78d09000, s: abort,                o: 0x138, l: 0x22c, e: 0, S: 0, f: "/usr/lib64/libc-2.28.so" },
      { a: 0000007f7d1b5e1c, s: _ZN5nwind15local_unwinding5abort17h255a5769eb294e0dE,                        o:  0xcc, l:  0xd0, e: 0, S: 0, f: "/opt/memprof/aarch64/libmemory_profiler.so" },
      { a: 0000007f7d1b6a4c, s: nwind_on_exception_through_trampoline,                        o: 0x444, l: 0x448, e: 0, S: 0, f: "/opt/memprof/aarch64/libmemory_profiler.so" },
      { a: 0000007f7d27c1f8, s: nwind_ret_trampoline, o:  0x44, l:  0x50, e: 1, S: 0, f: "/opt/memprof/aarch64/libmemory_profiler.so" } ]

Unfortunately, I don't know yet whether it's the first or the second abort() in that function :(

Has this happened before? Could it be an issue in the application (e.g. memory corruption), or some corner-case bug in not-perf, since this seems to happen during exception handling (?).

koute commented 4 years ago

If you can grab the logs, they should tell you which one was triggered.

Anyhow, hmm.... if I had to guess, it most likely triggered the second abort (the first one is very unlikely, and something would have to go extremely wrong for it to trigger).

Now that I think about it, that check might not actually be correct and doesn't really make sense: when control is given to the landing pad, the value of the stack pointer will be higher than the address of the slot where the return address was stored, so we're basically checking whether we've clobbered the stack ourselves. I'll remove it, and hopefully you won't see this crash anymore.
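To picture the reasoning above, here is a toy model of a downward-growing stack. It is only a sketch and not nwind's actual code: `ToyStack`, `TRAMPOLINE_MARKER` and `slot_survives_unwinding` are hypothetical names made up for illustration, and the real check may have looked quite different. The point it tries to show is that once the stack pointer has moved above the return-address slot, whatever runs next is free to overwrite that slot.

```rust
/// Hypothetical marker standing in for the trampoline's return address.
const TRAMPOLINE_MARKER: u64 = 0xDEAD_BEEF;

/// Toy stack that grows toward lower indices; `sp` is the lowest live word.
struct ToyStack {
    words: Vec<u64>,
    sp: usize,
}

impl ToyStack {
    fn new(size: usize) -> Self {
        ToyStack { words: vec![0; size], sp: size }
    }

    /// Pushes a word and returns the index (address) it was stored at.
    fn push(&mut self, value: u64) -> usize {
        self.sp -= 1;
        self.words[self.sp] = value;
        self.sp
    }

    /// Models unwinding: the stack pointer jumps back up to a caller's frame.
    fn unwind_to(&mut self, sp: usize) {
        self.sp = sp;
    }

    /// A slot is only protected while it sits at or above the stack pointer.
    fn is_live(&self, slot: usize) -> bool {
        slot >= self.sp
    }
}

/// Returns (slot still live at the landing pad, slot still holds the marker).
fn slot_survives_unwinding() -> (bool, bool) {
    let mut stack = ToyStack::new(16);
    let caller_sp = stack.push(1); // some word in the caller's frame
    let slot = stack.push(TRAMPOLINE_MARKER); // return-address slot
    stack.push(2); // a local of the callee

    // An exception unwinds to a landing pad in the caller.
    stack.unwind_to(caller_sp);
    let live = stack.is_live(slot);

    // The handler's own frame now reuses the space below the stack pointer.
    stack.push(3);
    let intact = stack.words[slot] == TRAMPOLINE_MARKER;

    (live, intact)
}
```

In this model the slot is neither live nor intact by the time anything inspects it, which is why a check on its contents would mostly detect the handler's own stack usage.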

koute commented 4 years ago

I've pushed the fix to master; can you check if it works fine now?

stoperro commented 4 years ago

Thanks, we will check and update :)

stoperro commented 4 years ago

Almost tested; I had a hard time compiling from scratch. The key for me was running rustup target add mips64-unknown-linux-gnuabi64 to install the toolchain (yes, a different arch). I had previously tried some other Rust commands and failed... maybe this is obvious to Rust developers, though.

Also, readme.md says "Install at least Rust 1.31", so I literally used 1.31.0, wary of what newer versions might bring, and it fails with:

error[E0658]: `Self` struct constructors are unstable (see issue #51994)
   --> /home/stoper/.cargo/registry/src/github.com-1ecc6299db9ec823/rgb-0.8.14/src/alt.rs:111:9
    |
111 |         Self(self.0, a)
    |         ^^^^

   Compiling proc-maps v0.1.0 (/mnt/c/buildy/not-perf/proc-maps)
error: aborting due to previous error

It seems to work on the current stable Rust toolchain, though.
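For context on the E0658 error above: as far as I know, using `Self(...)` to construct a tuple struct was stabilized in Rust 1.32, so a dependency that uses it cannot build on 1.31.0. A minimal sketch of the two spellings follows; `GrayAlpha` and its methods are hypothetical names for illustration, not the rgb crate's actual types.

```rust
/// Hypothetical tuple struct, loosely modelled on a pixel with an alpha channel.
#[derive(Debug, Clone, Copy, PartialEq)]
struct GrayAlpha(u8, u8);

impl GrayAlpha {
    /// Uses `Self` as a tuple-struct constructor.
    /// This is the form that Rust 1.31 rejects as unstable.
    fn with_alpha(self, a: u8) -> Self {
        Self(self.0, a)
    }

    /// Equivalent spelling that names the type explicitly,
    /// which older compilers also accept.
    fn with_alpha_compat(self, a: u8) -> Self {
        GrayAlpha(self.0, a)
    }
}
```

Both methods produce the same value; only the first depends on the newer language feature.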

koute commented 4 years ago

Yeah, I need to update the version mentioned in the readme (or probably just remove it outright) and probably improve the cross-compiling instructions.

In general you should always use the newest available stable version.

koute commented 4 years ago

@stoperro By the way, didn't I leave you with a script to compile all of this automatically? You might want to look around for it. (:

stoperro commented 4 years ago

Funny thing, I learned about that script an hour ago :) That would certainly have been the better approach. Still, I've already compiled it manually on my PC with our SDK, and the script most likely works fine too.

stoperro commented 4 years ago

Took a while... but we finally were able to test it and the crash we had doesn't occur anymore after the fix 👍