tikv / pprof-rs

A Rust CPU profiler implemented with the help of backtrace-rs
Apache License 2.0

pipe fd(s) opened in addr_validate::validate will not be closed for short-term threads #197

Closed · alston111111 closed this issue 1 year ago

alston111111 commented 1 year ago

The problem description

We are running tidb clusters, in which tikv plays an important role. One day, an instance in one of our tikv clusters (3 nodes, v6.1.0) crashed with a "Too many open files" error in tikv.log, and we found fd leaks in the tikv process: the leaked fds were all (anonymous) pipes.

After our troubleshooting, in short, we found that the fd leaks came from the pprof::addr_validate module in pprof-rs, which tikv-6.1.0 depends on. The leak happens only for short-term threads.

Short-term threads here refer to threads that are created to do their jobs and then exit in the middle of the owner process's lifetime.

For more details of the original case in tikv, please refer to the issue below: tikv/tikv#14224

How to reproduce?

During testing, one can change the value of T and the number of threads as desired. Note: in the "The test result for Sol-3" section below, T = 0.1 seconds.

The source code for Way-2: demo_fix.zip. Here is the pruned main() function for Way-2:

use std::thread;

fn main() {
    #![warn(unused_imports)]
    // (1) Here we should build a pprof::ProfilerGuard object and start the profiler.
    // ... profiler starting code ...
    // Since pprof-rs is not in rust.playground's Cargo.toml, we use another way to
    // simulate the caller of pprof-rs. Please see the source code in demo_fix.zip,
    // starting from the function `run_one_thread(bool /* short-term */)`.

    // watcher thread.
    let _ = thread::Builder::new()
        .name("monitor-thd".to_string())
        .spawn(move || loop {
            println!("===> open files count: {}", debug_get_open_files(my_getpid()));
            thread::sleep(SLEEP_WATCH_THD);
        });

    // (2) long-term threads
    let n_long_threads: u32 = 10;
    for i in 1..=n_long_threads {
        let thd_name: String = format!("long-thd-{:04}", i);
        let _ = thread::Builder::new()
            .name(thd_name.clone())
            .spawn(move || run_one_thread(false /* short-term */));
        println!("Thread created: {}", thd_name);
    }

    // (3) short-term thread
    let mut n_short_threads: u32 = 1;
    loop {
        let thd_name: String = format!("short-thd-{:04}", n_short_threads);
        let _ = thread::Builder::new()
            .name(thd_name.clone())
            .spawn(move || run_one_thread(true /* short-term */));
        println!("Thread created: {}", thd_name);

        n_short_threads += 1;
        thread::sleep(SLEEP_SHORT_THD);
    }
}
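
The code above calls several helpers that live in demo_fix.zip. For readers without the attachment, here is a minimal, hypothetical reconstruction: the names and constants match the call sites, but the bodies are my assumptions rather than the attachment's code (Linux-only, using the libc crate).

use std::thread;
use std::time::Duration;

// Assumed values; T = SLEEP_SHORT_THD = 0.1 s as in the test-result section.
const SLEEP_WATCH_THD: Duration = Duration::from_secs(1);
const SLEEP_SHORT_THD: Duration = Duration::from_millis(100);

fn my_getpid() -> i32 {
    // SAFETY: getpid() has no preconditions.
    unsafe { libc::getpid() }
}

// Approximate the open-fd count by counting entries under /proc/<pid>/fd.
fn debug_get_open_files(pid: i32) -> usize {
    std::fs::read_dir(format!("/proc/{}/fd", pid))
        .map(|dir| dir.count())
        .unwrap_or(0)
}

// Simulate a pprof-rs caller: open a pipe the way addr_validate does, then
// either run forever (long-term) or return without closing the fds
// (short-term), which is exactly the reported leak.
fn run_one_thread(short_term: bool) {
    let mut fds = [0i32; 2];
    // SAFETY: fds points to a valid two-element array.
    unsafe { libc::pipe(fds.as_mut_ptr()) };
    loop {
        thread::sleep(Duration::from_millis(500));
        if short_term {
            return; // exits without close(fds[0]) / close(fds[1])
        }
    }
}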

Root Cause: (my opinion)

Explanation of the root cause

In the pprof::addr_validate module, the two fds of a pipe are stored in a thread-local, so roughly 2 pipe fds are opened for each thread. For long-term threads, I mean those whose lifetime is as long as the owner process, at most 2 pipe fds stay open per thread, either until the process exits or until the close(old-fd) calls in open_pipe() replace them. Short-term threads, however, are created to execute some task and then exit. If such a thread handles the SIGPROF signal, it may open a pipe whose two fds are then never closed.
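
For reference, the pattern in question looks roughly like this (a sketch of the thread-local pipe storage; the name matches the open_pipe() excerpt quoted below, but this is not a verbatim copy of pprof-rs):

use std::cell::RefCell;

thread_local! {
    // One pipe (read fd, write fd) per thread. A RefCell<[i32; 2]> carries
    // no Drop logic that closes the fds, so when a thread exits, the raw
    // fds it holds simply leak.
    static MEM_VALIDATE_PIPE: RefCell<[i32; 2]> = RefCell::new([-1, -1]);
}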

Possible solutions:

Fast way: in the caller's code, do not enable the frame-pointer feature of pprof-rs.

Sol-1. Configure the project that uses pprof-rs so that the pprof dependency in Cargo.toml does not enable the frame-pointer feature.

Note: I once ran into the problem of not reproducing the fd leaks in my test program. If the frame-pointer feature is not set, pprof just uses the old way (in backtrace_rs.rs) to extract the call stack, which never calls addr_validate::validate() and therefore creates no pipes.
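
For example, a Cargo.toml entry along these lines (the version and feature list are illustrative and should match your project):

[dependencies]
# No "frame-pointer" feature: pprof falls back to the backtrace-rs unwinder,
# which never calls addr_validate::validate(), so no pipes are opened.
pprof = { version = "0.11", features = ["flamegraph"] }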

Source code change

Sol-2. Accurate method: hook a function at thread exit to close both the thread-local pipe fds, if possible.

(I'm not sure about the exact implementation. Maybe it requires some work at the OS-kernel level and is platform-dependent?)
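
In user space, one way to approximate such a hook without kernel support is to let the thread-local itself own the fds and close them in Drop, since Rust runs thread_local! destructors at thread exit. A minimal sketch of that idea (my assumption, not pprof-rs code; it also glosses over async-signal-safety, which matters here because the pipes are touched from the SIGPROF handler):

use std::cell::RefCell;

// A guard owning the pipe fds; its Drop runs when the owning thread exits,
// which is the "hook at thread exit" that Sol-2 asks for.
struct PipeGuard {
    fds: [i32; 2],
}

impl Drop for PipeGuard {
    fn drop(&mut self) {
        for &fd in &self.fds {
            if fd >= 0 {
                // SAFETY: fd came from a successful pipe() call.
                unsafe { libc::close(fd) };
            }
        }
    }
}

thread_local! {
    // Replacing RefCell<[i32; 2]> with RefCell<PipeGuard> means the fds are
    // closed automatically when the thread-local is destroyed at thread exit.
    static MEM_VALIDATE_PIPE: RefCell<PipeGuard> =
        RefCell::new(PipeGuard { fds: [-1, -1] });
}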

Sol-3. Workaround: remember the opened fds for each thread and asynchronously close them (i.e. GC).

Note: this is a naive way and can be optimized.

fn open_pipe() -> Result<(), ()> {
    ////////////////////////////////
    // by alston, to modify open_pipe()
    ////////////////////////////////
    // Lazily start the background GC thread (a no-op if already started).
+   try_start_pipe_fd_gc();

    // original code in open_pipe
    MEM_VALIDATE_PIPE.with(|pipes| {
        let mut pipes = pipes.borrow_mut();

        // ignore the result
        let _ = close(pipes[0]);
        let _ = close(pipes[1]);

        let (read_fd, write_fd) = create_pipe()?;

        ////////////////////////////////
        // by alston, to modify open_pipe()
        ////////////////////////////////
        // Register the new fds under the calling thread's tid so the GC can
        // close them once the thread has exited (slices avoid the needless
        // `&vec![..]` allocations).
+       record_pipe_fds(&[pipes[0], pipes[1]], &[read_fd, write_fd]);

        pipes[0] = read_fd;
        pipes[1] = write_fd;

        Ok(())
    })
}
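
try_start_pipe_fd_gc() and record_pipe_fds() come from my lost demo; a minimal reconstruction of the idea could look like the sketch below (Linux-only, hypothetical names matching the call sites above, using the libc crate; thread liveness is checked via /proc/self/task/<tid>):

use std::collections::HashMap;
use std::sync::{Mutex, Once, OnceLock};
use std::{fs, thread, time::Duration};

// tid -> the pipe fds that thread opened most recently.
static REGISTRY: OnceLock<Mutex<HashMap<i32, [i32; 2]>>> = OnceLock::new();
static GC_STARTED: Once = Once::new();

fn registry() -> &'static Mutex<HashMap<i32, [i32; 2]>> {
    REGISTRY.get_or_init(|| Mutex::new(HashMap::new()))
}

// Remember the fds owned by the calling thread. The old fds were already
// closed by open_pipe(), so only the new pair needs to be tracked.
fn record_pipe_fds(_old: &[i32], new: &[i32]) {
    // SAFETY: gettid() has no preconditions on Linux.
    let tid = unsafe { libc::gettid() } as i32;
    registry().lock().unwrap().insert(tid, [new[0], new[1]]);
}

// Spawn (once) a background thread that closes the fds of threads that have
// exited. A thread is considered alive iff /proc/self/task/<tid> exists.
fn try_start_pipe_fd_gc() {
    GC_STARTED.call_once(|| {
        let _ = thread::Builder::new()
            .name("pipe-fd-gc".to_string())
            .spawn(|| loop {
                registry().lock().unwrap().retain(|tid, fds| {
                    if fs::metadata(format!("/proc/self/task/{}", tid)).is_ok() {
                        true // thread still running; keep its entry
                    } else {
                        for &fd in fds.iter() {
                            // SAFETY: the dead thread's pipe fds are no
                            // longer used anywhere else.
                            unsafe { libc::close(fd) };
                        }
                        false // drop the entry after closing
                    }
                });
                thread::sleep(Duration::from_secs(1));
            });
    });
}

Note that this sketch has the same tid-reuse corner case discussed at the end of this issue: if a dead thread's tid is reused before the GC scan runs, the stale entry looks alive and its fds stay open.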

The test program for Sol-3 and the result

The test program for Sol-3.

(I wrote a test program and a demo fix to help prove the point, but I've lost the whole environment :(. Anyway, the idea is described as Sol-3, and I've written another program that can run on the playground. This one does not use pprof-rs, but simulates it by opening pipes, gated on target_os = "linux".) Please refer to the attachment OR the playground: demo_fix.rs in rust playground

The test result for Sol-3.

Following the reproduction steps in the test program, the count of open pipe fds while running was expected to stay between (#basic_fds + 2 × 10 long-term threads) and (#basic_fds + 20 + 2 × 5 short-term threads), i.e. roughly in the range [20+, 30+], where 0 < #basic_fds < 10.

  1. If I enabled the fd GC, then only 20+ to 30+ open pipe fds existed at any moment.
    
    Standard Error
    Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 1.26s
     Running `target/debug/playground`
    /playground/tools/entrypoint.sh: line 11:     8 Killed                  timeout --signal=KILL ${timeout} "$@"
    Standard Output
    Hello, pprof! PIPE_FD_GC_ENABLE: true

//// "open files count" occurrences:

===> open files count: 26
===> open files count: 30
===> open files count: 30
===> open files count: 30
===> open files count: 30
===> open files count: 30
===> open files count: 30
===> open files count: 30
===> open files count: 30

  2. Otherwise, the number of open fds kept increasing
     (until the process crashed on reaching the `ulimit -n` open-files limit, when run on a Linux host instead of the rust playground).

    Standard Error
    Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 3.63s
     Running `target/debug/playground`
    /playground/tools/entrypoint.sh: line 11:     7 Killed                  timeout --signal=KILL ${timeout} "$@"
    Standard Output
    Hello, pprof! PIPE_FD_GC_ENABLE: false

//// "open files count" occurrences:

===> open files count: 6
===> open files count: 34
===> open files count: 44
===> open files count: 54
===> open files count: 64
===> open files count: 74
===> open files count: 84



Potential corner-case of Sol-3
If a tid is reused very frequently, the pipe fds might never be closed.
But this is almost impossible in production environments, where `/proc/sys/kernel/pid_max` is tens of thousands (ours is 400,000+).