tikv / pprof-rs

A Rust CPU profiler implemented with the help of backtrace-rs
Apache License 2.0

Unwind through frame pointer #116

Closed · YangKeao closed this 2 years ago

YangKeao commented 2 years ago

Add a feature to unwind through the frame pointer in pprof-rs.

Frame-pointer unwinding has much better performance and stability. The only problem is that we cannot ensure that every component has a correct frame pointer. During my testing, I ran into several problems:

  1. The Rust std binary doesn't have a correct frame pointer for every function. Asking cargo to rebuild the std library from source solves this.
  2. The libstdc++ on my PC wasn't built with the frame pointer enabled (but I use Gentoo, so recompiling it was simple).

Compiling with `CFLAGS="-fno-omit-frame-pointer" CXXFLAGS="-fno-omit-frame-pointer" RUSTFLAGS="-Cforce-frame-pointers=yes" cargo +nightly build -Z build-std` yields correct behavior.
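
For reference, frame-pointer unwinding is conceptually just walking the chain of saved frame pointers. Below is a minimal, hypothetical x86_64 sketch of that loop (not the actual pprof-rs implementation; it assumes every frame really saves rbp, and a real profiler must validate each pointer before dereferencing it, which is exactly the problem discussed in this thread):

```rust
/// Minimal frame-pointer unwinder for x86_64 (illustrative only).
/// Assumes every frame was compiled with -Cforce-frame-pointers=yes /
/// -fno-omit-frame-pointer, so [rbp] holds the caller's saved rbp and
/// [rbp + 8] holds the return address. The starting rbp would come
/// from the signal handler's ucontext in a real profiler.
#[cfg(target_arch = "x86_64")]
unsafe fn backtrace_fp(mut rbp: *const u64, pcs: &mut Vec<u64>) {
    const MAX_DEPTH: usize = 128;
    while !rbp.is_null() && pcs.len() < MAX_DEPTH {
        // The return address is saved just above the frame pointer.
        let ret = *rbp.add(1);
        if ret == 0 {
            break;
        }
        pcs.push(ret);
        // Follow the chain to the caller's frame. Dereferencing an
        // invalid rbp here is exactly the crash this issue is about.
        rbp = *rbp as *const u64;
    }
}
```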

sticnarf commented 2 years ago

If the Rust std library or other C/C++ dependencies are not compiled with frame pointers, what will the result look like? Will it panic or just drop some information?

YangKeao commented 2 years ago

@sticnarf Panic, as I cannot give a proper upper bound of the stack :cry:. (And that's exactly how I found those problems.)

I remember @mornyx was working on getting the stack boundary, but I don't know the progress. I'm still somewhat confused about how @mornyx will implement the boundary check, as the stack for a non-main thread is allocated dynamically.

mornyx commented 2 years ago

> @sticnarf Panic, as I cannot give a proper upper bound of the stack 😢. (And that's exactly how I found those problems.)
>
> I remember @mornyx was working on getting the stack boundary, but I don't know the progress. I'm still somewhat confused about how @mornyx will implement the boundary check, as the stack for a non-main thread is allocated dynamically.

I use thread_local! here so that each thread initializes its own set of address ranges (via /proc/<PID>/task/<TID>/maps).

https://github.com/mornyx/unwind/blob/master/src/utils/maps.rs

(I just check whether the address is readable, without distinguishing between stack addresses and instruction addresses.)

The only thing that needs to be confirmed is whether a thread's address ranges can change once allocated. If so, I think we should remove thread_local! and re-parse each time instead, but that has a performance overhead.
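
A minimal sketch of that thread_local approach (hypothetical names; assumes Linux and the libc crate, and caches only the readable mappings):

```rust
use std::fs;

/// One readable mapping parsed from /proc/self/task/<tid>/maps.
struct Region {
    start: u64,
    end: u64,
}

thread_local! {
    // Parsed once per thread on first access; whether this cache can
    // go stale is exactly the question raised above.
    static READABLE: Vec<Region> = parse_readable_regions();
}

fn parse_readable_regions() -> Vec<Region> {
    let tid = unsafe { libc::gettid() };
    let maps = fs::read_to_string(format!("/proc/self/task/{tid}/maps"))
        .unwrap_or_default();
    maps.lines()
        .filter_map(|line| {
            // Each line looks like: "start-end perms offset dev inode [path]"
            let mut fields = line.split_whitespace();
            let range = fields.next()?;
            let perms = fields.next()?;
            if !perms.starts_with('r') {
                return None; // keep only readable mappings
            }
            let (start, end) = range.split_once('-')?;
            Some(Region {
                start: u64::from_str_radix(start, 16).ok()?,
                end: u64::from_str_radix(end, 16).ok()?,
            })
        })
        .collect()
}

fn address_is_readable(addr: u64) -> bool {
    READABLE.with(|regions| regions.iter().any(|r| addr >= r.start && addr < r.end))
}
```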

YangKeao commented 2 years ago

> The only thing that needs to be confirmed is whether a thread's address ranges can change once allocated. If so, I think we should remove thread_local! and re-parse each time instead, but that has a performance overhead.

@mornyx It will, but I think that doesn't matter (at least for unwinding): although new maps are created through the mmap syscall (by the allocator), the stack range does not change, and while unwinding we only care about the stack.

mornyx commented 2 years ago

> The only thing that needs to be confirmed is whether a thread's address ranges can change once allocated. If so, I think we should remove thread_local! and re-parse each time instead, but that has a performance overhead.
>
> @mornyx It will, but I think that doesn't matter (at least for unwinding): although new maps are created through the mmap syscall (by the allocator), the stack range does not change, and while unwinding we only care about the stack.

When I write the following test in Rust:

```rust
#[test]
fn test_address_is_readable() {
    let v = 0;
    assert!(address_is_readable(&v as *const i32 as u64));
}
```

I found that the address of the variable v is not in the [stack] range, but in an anonymous range in maps. I guess I still need to figure this out to be sure it works correctly.

YangKeao commented 2 years ago

> I found that the address of the variable v is not in the [stack] range, but in an anonymous range in maps. I guess I still need to figure this out to be sure it works correctly.

@mornyx Sure. The [stack] region is allocated by the system for the main thread's execution. Later threads are created by pthread_create (the corresponding clone system call), and their stacks are allocated by glibc (through mmap, or by reusing a previously allocated cached stack). The kernel cannot distinguish "stack" memory from "heap" memory, so there is no special mark in the maps file.

As glibc doesn't provide any AS-safe API to get the stack boundary, I think using the whole accessible address space is fine. We have three different ranges here:

  1. The "stack" memory
  2. The accessible memory space, recorded when we first call the with
  3. The real whole accessible memory space.

We can assume that 1 is a subset of the intersection of 2 and 3.

(This is only about the glibc stack implementation. Theoretically, a userspace program can use userfaultfd to handle page faults and grow the stack dynamically, but as far as I know, no well-known implementation does this.)
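
(As an aside: glibc does provide pthread_getattr_np, which is not AS-safe, but in principle it could be called once per thread outside the signal handler and the result cached. A hypothetical sketch, assuming the libc crate:)

```rust
use std::mem::MaybeUninit;

/// Query the current thread's stack bounds. pthread_getattr_np is NOT
/// async-signal-safe, so this must run outside the signal handler
/// (e.g. at thread registration) and the result cached for unwinding.
fn current_stack_bounds() -> Option<(u64, u64)> {
    unsafe {
        let mut attr = MaybeUninit::<libc::pthread_attr_t>::uninit();
        if libc::pthread_getattr_np(libc::pthread_self(), attr.as_mut_ptr()) != 0 {
            return None;
        }
        let mut addr: *mut libc::c_void = std::ptr::null_mut();
        let mut size: libc::size_t = 0;
        let ret = libc::pthread_attr_getstack(attr.as_ptr(), &mut addr, &mut size);
        libc::pthread_attr_destroy(attr.as_mut_ptr());
        if ret != 0 {
            return None;
        }
        let start = addr as u64;
        Some((start, start + size as u64))
    }
}
```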

YangKeao commented 2 years ago

@mornyx Oops, there is still a problem: 2 and 3 don't have any other meaningful relationship, which means memory accessible in 2 is not necessarily accessible in 3 (because an mmap can be unmapped). It's still not 100% safe to read maps only once per thread.

mornyx commented 2 years ago

> 2 and 3 don't have any other meaningful relationship, which means memory accessible in 2 is not necessarily accessible in 3 (because an mmap can be unmapped). It's still not 100% safe to read maps only once per thread.

As you said, the space for a thread's stack is allocated by glibc and passed to pthread_create, so when would munmap happen if the user doesn't explicitly call the pthread API? If the user does not actively use the pthread library, can we assume the original maps have not been modified?

YangKeao commented 2 years ago

> 2 and 3 don't have any other meaningful relationship, which means memory accessible in 2 is not necessarily accessible in 3 (because an mmap can be unmapped). It's still not 100% safe to read maps only once per thread.
>
> As you said, the space for a thread's stack is allocated by glibc and passed to pthread_create, so when would munmap happen if the user doesn't explicitly call the pthread API? If the user does not actively use the pthread library, can we assume the original maps have not been modified?

@mornyx The "2" contains not only the "stack", it also contains all heaps, mmapped files, etc. The mallocator will munmap the allocated memory (if these memory are not needed), and the storage engine may munmap unneeded files.

Take the frame pointer as an example. Consider this situation: a program (compiled with the frame pointer omitted) uses the rbp register to store an address in a file-backed mmap, and then that file is munmapped. The signal handler then fires to read the frame pointer and finds that the address in rbp appears accessible (based on the outdated "2"), but reading it will cause a panic.

We can reduce the risk by ignoring any file-backed mmap, but the allocator can still call munmap.

mornyx commented 2 years ago

@YangKeao Thanks for the detailed explanation! I think we do need to parse each time. I'm going to measure the performance overhead and decide whether we need a better idea.

YangKeao commented 2 years ago

> @YangKeao Thanks for the detailed explanation! I think we do need to parse each time. I'm going to measure the performance overhead and decide whether we need a better idea.

Another option is to write to a pipe: https://github.com/libunwind/libunwind/blob/master/src/x86_64/Ginit.c#L120 (Actually, there are a lot of options in this file: msync, mincore... but it seems they are not signal-safe.)

I will verify these options in this PR, as frame-pointer unwinding needs far fewer memory accesses.
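
A sketch of that pipe trick, modeled on libunwind's approach (hypothetical code; the pipe must be created up front, before the profiling signal handler can run, since we don't want to call pipe2 lazily from a handler):

```rust
use std::sync::OnceLock;

// The validation pipe: fds[0] is the read end, fds[1] the write end.
static PIPE: OnceLock<[libc::c_int; 2]> = OnceLock::new();

/// Create the pipe before profiling starts.
fn init_validation_pipe() {
    PIPE.get_or_init(|| {
        let mut fds = [0; 2];
        // O_NONBLOCK so a full pipe can never stall the signal handler.
        unsafe { libc::pipe2(fds.as_mut_ptr(), libc::O_CLOEXEC | libc::O_NONBLOCK) };
        fds
    });
}

/// AS-safe readability probe: `write` reads one byte from `addr` and
/// returns -1 (EFAULT) instead of faulting if `addr` is unreadable.
fn address_is_readable(addr: u64) -> bool {
    let fds = match PIPE.get() {
        Some(fds) => fds,
        None => return false, // pipe not initialized; treat as unreadable
    };
    unsafe {
        // Drain a byte left over from earlier probes, then probe.
        let mut byte = 0u8;
        libc::read(fds[0], &mut byte as *mut u8 as *mut libc::c_void, 1);
        libc::write(fds[1], addr as *const libc::c_void, 1) == 1
    }
}
```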

YangKeao commented 2 years ago

I have tested that this frame-pointer implementation works on ARM, so this can be progress toward https://github.com/tikv/pprof-rs/issues/75

YangKeao commented 2 years ago

@mornyx I benchmarked validating 100 addresses (on the stack or heap); the result is:

[benchmark results image]

I don't know whether these numbers are acceptable; maybe we need further testing on TiKV.

Assume we sample at 99 Hz and each backtrace is 20 frames deep: that is about 1,980 address validations per second, which works out to roughly 400us of validation work per second.

I used the read/write-to-a-pipe method.

mornyx commented 2 years ago

> I don't know whether these numbers are acceptable; maybe we need further testing on TiKV.
>
> Assume we sample at 99 Hz and each backtrace is 20 frames deep: that is about 1,980 address validations per second, which works out to roughly 400us of validation work per second.
>
> I used the read/write-to-a-pipe method.

This value (400 us/s) is hard to interpret intuitively; I'm also not sure whether its overhead is acceptable.

I'm testing the performance overhead of unwinding on a fully loaded TiDB cluster. I'll do an additional comparison using your branch (the pipe method).

sticnarf commented 2 years ago

400us/s sounds like an acceptable overhead. A TiKV node usually takes at least 4 cores of CPU to run, so 400us of work per 4s of CPU time adds about 1/10,000 (not counting potential cache misses this introduces).

For a real test, we can use sysbench point select as the most latency-sensitive workload.

YangKeao commented 2 years ago

Before benchmarking, I would like to ask @sticnarf:

Do you think the following result (under TPC-C) is correct? (It was compiled with this branch of pprof-rs, without recompiling std.)

[profiling result image]

This is manually collected profiling data: https://drive.google.com/file/d/1FzeLlbM8iwaxH12sKFxIXCvvBQTO9jv9/view?usp=sharing

sticnarf commented 2 years ago

> Do you think the following result (under TPC-C) is correct? (It was compiled with this branch of pprof-rs, without recompiling std.)

It looks generally correct. But it's hard to tell how accurate it is.

Another small problem is that the thread name is missing from the root. Sometimes the thread name is useful too.

YangKeao commented 2 years ago

> the thread name is missing from the root

@sticnarf That sounds like a problem in the dashboard (or in the protobuf output, I'm not sure). The flamegraph output is correct:

[flamegraph image]

mornyx commented 2 years ago

I tested it under ARM Linux, but the flame graph looks a little weird.

Results based on DWARF: [flame graph image]

Results based on frame-pointer: [flame graph image]

Also, the sysbench results seem to confirm this: I ran frame-pointer based CPU profiling at almost full profiling load (a script triggers a 120s profile every 10s), but the sysbench results did not show any performance regression.

YangKeao commented 2 years ago

@mornyx Try compiling with RUSTFLAGS="-Cforce-frame-pointers=yes"? I guess TiKV hasn't enabled the frame pointer. (Though I wonder how much the frame pointer will affect performance.)

mornyx commented 2 years ago

> @mornyx Try compiling with RUSTFLAGS="-Cforce-frame-pointers=yes"? I guess TiKV hasn't enabled the frame pointer. (Though I wonder how much the frame pointer will affect performance.)

OK, I'll try. I guess the frame pointer will perform far better than the current DWARF-based backtrace (because DWARF is so complicated...); the only uncertain overhead is the pipe.

sticnarf commented 2 years ago

> @mornyx Try compiling with RUSTFLAGS="-Cforce-frame-pointers=yes"? I guess TiKV hasn't enabled the frame pointer. (Though I wonder how much the frame pointer will affect performance.)
>
> OK, I'll try. I guess the frame pointer will perform far better than the current DWARF-based backtrace (because DWARF is so complicated...); the only uncertain overhead is the pipe.

Keao's concern might be the effect of losing one register (now reserved for the frame pointer). That affects not the backtrace performance but the overall execution performance.

mornyx commented 2 years ago

> Keao's concern might be the effect of losing one register (now reserved for the frame pointer). That affects not the backtrace performance but the overall execution performance.

Oh! Got it

YangKeao commented 2 years ago

@mornyx Ping. Are there any results from the benchmark and test?

mornyx commented 2 years ago

> @mornyx Ping. Are there any results from the benchmark and test?

I've done sysbench tests on my local cluster of VMs but have not observed a noticeable performance regression (PS: with RUSTFLAGS="-Cforce-frame-pointers=yes", the result looks correct). I guess the VM environment might be somewhat inaccurate. Since benchbot's testing resources have been occupied for a long time, I am preparing to deploy a dedicated cluster on AWS for benchmarking, which will be done today.

mornyx commented 2 years ago

In general, after enabling the frame pointer (with branch YangKeao/frame-pointer and flag -Cforce-frame-pointers=yes), performance dropped by about 1%~3% (based on sysbench and TPC-C load). Note that this is for "long-term" profiling; the overhead of continuous profiling should be lower.

My test environment is as follows:

| env | value |
| --- | --- |
| Image | Ubuntu 20.04 AArch64 (ami-02af65b2d1ebdfafc) |
| Machine | ARM64 8c16g (a1.2xlarge) |
| Topology | 3 TiDB, 1 TiKV, 1 PD |

YangKeao commented 2 years ago

Thanks to @mornyx!

@BusyJay @sticnarf I think this PR is ready for review.

mornyx commented 2 years ago

Hi @sticnarf @BusyJay, could you please review this PR? Let's bring this into TiKV v6.1 to fix the profiling crash~

mornyx commented 2 years ago

LGTM