tikv / pprof-rs

A Rust CPU profiler implemented with the help of backtrace-rs
Apache License 2.0

Enhance threads stacks collection and sampling frequency on Linux #191

Open viglia opened 1 year ago

viglia commented 1 year ago

This PR tries to address inconsistencies in how samples are collected across time and threads (somewhat related to #177).

Currently, to collect a sample (a stack trace), we rely on the SIGPROF signal, which is a standard signal. This has the following two limitations:

  1. for standard signals, the set of pending signals is only a mask, meaning that if multiple instances of a standard signal are generated while that signal is blocked, only one instance is marked as pending (and the signal will be delivered just once when it is unblocked);
  2. the SIGPROF signal is directed at the process and is then handled by one of its threads, so at any given moment we only collect the stack of whichever thread happens to be handling the signal.

The changes in this PR address both of the above limitations by making use of real-time signals and per-thread delivery.

Real-Time Signal

An important distinction here is that real-time signals are queued: if multiple instances of a real-time signal are sent to a process, the signal is delivered multiple times (hence we're not going to waste a potential "sampling round").

Per-Thread Delivery

Instead of having a timer generate a SIGPROF signal that is handled by a single, arbitrary thread, we propose having an external thread which, at a given frequency, collects the tids (kernel thread ids) of the running threads and sends the real-time signal to each one of them.

YangKeao commented 1 year ago

For the normal situation (e.g. 99 Hz frequency with a fast enough CPU), this patch doesn't differ much from the original one. Then I tried this PR in a simpler situation (the prime_number example) with a higher frequency (999), and it became worse:

What I have done

https://github.com/tikv/pprof-rs/blob/master/src/profiler.rs#L334-L336

There is already a sample time in the signal handler. I appended this sample time to /tmp/stats:

diff --git a/src/profiler.rs b/src/profiler.rs
index c78f0dc..c826310 100644
--- a/src/profiler.rs
+++ b/src/profiler.rs
@@ -1,6 +1,7 @@
 // Copyright 2019 TiKV Project Authors. Licensed under Apache-2.0.

 use std::convert::TryInto;
+use std::io::Write;
 use std::os::raw::c_int;
 use std::time::SystemTime;

@@ -350,6 +351,8 @@ extern "C" fn perf_signal_handler(
                     false
                 }
             });
+            let mut stats = std::fs::OpenOptions::new().write(true).read(true).create(true).append(true).open("/tmp/stats").unwrap();
+            write!(stats, "{:?}\n", sample_timestamp).unwrap();

             let current_thread = unsafe { libc::pthread_self() };
             let mut name = [0; MAX_THREAD_NAME];

Then I processed /tmp/stats with the following code (% 1000 because JS numbers don't have enough precision. Why didn't I use Python/Julia :facepalm:):

const fs = require("fs")

const stats = fs.readFileSync("/tmp/stats", { encoding: "utf8" })
    .split("\n")
    .filter(item => item.split(" ")[3] != undefined)
    .map(item => [item.split(" ")[3].slice(0, -1), item.split(" ")[5]])
    .map(item => [parseInt(item[0]), parseInt(item[1])])
    .map(item => item[0] % 1000 * 1000000000 + item[1])

for (let i = 0; i < stats.length; i++) {
    console.log(`${i} ${stats[i]}`)
}

Run node main.js > ./stats.dat, then plot the output with gnuplot:

plot "./stats.dat"

Result

For the master pprof-rs, this graph is "smooth" everywhere:

Screenshot from 2023-01-09 22-53-16

For pprof-rs with this PR, the line "jumps" in some places:

(zoomed)

Screenshot from 2023-01-09 22-48-45

I didn't analyze it with LSM or other statistical methods, as the difference can be seen simply by eye.

Others

I have other thoughts about this idea. As you have described, the main benefit brought by the real-time signal is that no signal is lost. Does that mean some stack traces are duplicated (because when the signal handler runs back-to-back, the invocations will all capture the same stack trace)? Duplicated and missing samples are both errors, and I don't think duplication is better :thinking:.

All in all, this PR is a really good exploration of the interaction between real-time signals and profiling. I'll also study more and run some experiments with https://github.com/thoughtspot/threadstacks to see where the problem actually lies. Thanks :beers: .

viglia commented 1 year ago

Hi @YangKeao ,

Many thanks for taking a look at this and for benchmarking it.

When you say that at the same frequency you didn't see much better results and at a higher frequency it actually got worse, do you mean in terms of samples collected across time and threads, or are you referring to the plot shown above, which plots exclusively the timestamps at which samples were collected?

There is already a sample time in the signal handler. I append this sample time into the /tmp/stats:

I'm not sure that using sample_timestamp is a "robust" method. According to the Rust docs, this is not a monotonic measurement:

Distinct from the [Instant](https://doc.rust-lang.org/std/time/struct.Instant.html) type, 
this time measurement is not monotonic. This means that you can save a file to the 
file system, then save another file to the file system, and the second file has a 
SystemTime measurement earlier than the first. In other words, an operation that 
happens after another operation in real time may have an earlier SystemTime!
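To illustrate the distinction with the standard library only: Instant is monotonic, so a later reading can never appear earlier than an earlier one, whereas SystemTime::elapsed returns a Result precisely because the wall clock may be stepped backwards between two readings.

```rust
use std::thread;
use std::time::{Duration, Instant, SystemTime};

fn main() {
    // Instant is monotonic: suitable for measuring inter-sample intervals.
    let start = Instant::now();
    thread::sleep(Duration::from_millis(10));
    let elapsed = start.elapsed();
    assert!(elapsed >= Duration::from_millis(10));
    println!("monotonic elapsed: {:?}", elapsed);

    // SystemTime is wall-clock time: elapsed() can fail if the clock
    // was adjusted backwards between the two readings.
    let t0 = SystemTime::now();
    thread::sleep(Duration::from_millis(10));
    match t0.elapsed() {
        Ok(d) => println!("wall clock advanced by {:?}", d),
        Err(e) => println!("wall clock went backwards by {:?}", e.duration()),
    }
}
```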

As you have described, the main benefit brought by the real-time signal is that no signal is lost. Does it mean some stacktrace is duplicated

In principle it shouldn't, since we fire the signal at a specific tid at a given frequency. If a thread cannot handle the signal in a timely fashion, though, the signal is queued, so two samples might be collected at a shorter interval when the signal handler then executes immediately and repeatedly for the same tid. In that case we'd have collected two samples at a more fine-grained interval. If you consider two samples that fall into the same interval (say, 10ms) to be duplicates even though they were collected at different instants, then yes, this could lead to that.
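One way to quantify this effect is to scan consecutive sample timestamps for gaps much shorter than the nominal sampling period; such "bunched" pairs would be the back-to-back deliveries from a drained signal queue. A hypothetical helper (not part of the PR, with made-up example data):

```rust
// Given monotonic sample timestamps in nanoseconds, return the indices
// of samples that arrived less than half the nominal period after their
// predecessor -- candidates for queued, back-to-back deliveries.
fn bunched(timestamps: &[u64], period_ns: u64) -> Vec<usize> {
    timestamps
        .windows(2)
        .enumerate()
        .filter(|(_, w)| w[1] - w[0] < period_ns / 2)
        .map(|(i, _)| i + 1)
        .collect()
}

fn main() {
    // Assumed example data: ~1 kHz sampling, with one sample delivered
    // only 50µs after the previous one.
    let ts = [0, 1_000_000, 2_000_000, 2_050_000, 3_000_000];
    let suspect = bunched(&ts, 1_000_000);
    println!("{:?}", suspect); // prints [3]
}
```

Whether such pairs count as duplicates depends, as discussed above, on whether you bucket samples by interval or treat each delivery instant as distinct.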

I'm available if you want to discuss it further.

Thanks again for the valuable feedback!

viglia commented 1 year ago

@YangKeao can I ask a couple more questions to contextualize the tests/experiments?

  1. how many threads did the code you were profiling launch?
  2. could you share the hardware specs of the environment it was running on?

Thanks