viperproject / prusti-dev

A static verifier for Rust, based on the Viper verification infrastructure.
http://prusti.org

Interrupting the test suite doesn't kill the server #754

Open fpoli opened 2 years ago

fpoli commented 2 years ago

I currently have 2 prusti-server and 5 prusti-server-driver processes left running unused in the background:

$ ps ax | grep [p]rusti
 5638 pts/18   Sl     0:00 ../target/debug/prusti-server --port=0
 5640 pts/18   Sl     0:40 /home/fpoli/src/prusti/prusti-dev/target/debug/prusti-server-driver --port=0
 7408 pts/18   Sl     0:00 ../target/debug/prusti-server --port=0
 7410 pts/18   Sl     0:39 /home/fpoli/src/prusti/prusti-dev/target/debug/prusti-server-driver --port=0
14870 pts/18   Sl     0:39 /home/fpoli/src/prusti/prusti-dev/target/debug/prusti-server-driver --port=0
18838 pts/18   Sl     0:39 /home/fpoli/src/prusti/prusti-dev/target/debug/prusti-server-driver --port=0
21675 pts/18   Sl     1:15 /home/fpoli/src/prusti/prusti-dev/target/debug/prusti-server-driver --port=0

I suspect that they are leaked when I Ctrl + C a test suite run.

gavinleroy commented 2 years ago

Hi @fpoli 👋 I think I have an idea as to what is happening here.

I did a few experiments and the zombie processes are also leaked in a normal test suite run (without a CTRL+C event)!

For example (on MacOS):

% ./x.py test test_prusti_rustc_with_server &> /dev/null
% ps ax | grep [p]rusti                                   
35549 s003  S      0:00.15 /Users/gavinleroy/dev/prj/prusti-dev/prusti-launch/../target/debug/prusti-server-driver --port=0

So what is happening? If you recall from MR #344, when handling a SIGINT, prusti-server kills its entire process tree. Examining the above test code, we see that a server process is spawned and wrapped in a ChildGuard so that the process is killed when the guard goes out of scope.

The interesting part, however, is how the ChildGuard kills its process. Looking at this line, we see that self.0.kill() is invoked, which actually sends a SIGKILL to the child process, thus never triggering the SIGINT handler that kills the process tree.
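For what it's worth, this is easy to observe in isolation. A minimal sketch (Unix-only, using a sleep child as a stand-in for the server) showing that std::process::Child::kill delivers SIGKILL, which cannot be caught by any handler:

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::Command;

fn main() {
    // Spawn a long-running child and terminate it via Child::kill().
    let mut child = Command::new("sleep")
        .arg("60")
        .spawn()
        .expect("spawn failed");
    child.kill().expect("kill failed");
    let status = child.wait().expect("wait failed");
    // On Unix, Child::kill() delivers SIGKILL (signal 9), which cannot be
    // caught, so the child's SIGINT handler never gets a chance to run.
    println!("terminated by signal: {:?}", status.signal());
    // → terminated by signal: Some(9)
}
```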

I experimented with the following code, which simply implements Drop to send a SIGINT instead of a SIGKILL (signal, Signal, and Pid come from the nix crate):

use nix::sys::signal::{self, Signal};
use nix::unistd::Pid;

impl Drop for ChildGuard {
    fn drop(&mut self) {
        // Send SIGINT instead of SIGKILL so the child's interrupt
        // handler gets a chance to tear down its own process tree.
        if let Err(e) = signal::kill(
            Pid::from_raw(self.0.id().try_into().unwrap()),
            Signal::SIGINT,
        ) {
            panic!("Could not kill child process: {}", e);
        }
    }
}

I find this solution rather ugly, not to mention that it is *nix specific.

I'm not sure what the best solution would be. There may also be other places in the test suite where something similar happens and a SIGKILL is sent to the server process (I can't say for sure, as I haven't looked past this one test). I thought it would be useful to log this information here.

fpoli commented 2 years ago

Wow, thanks for finding and explaining that case @gavinleroy!

I cannot find a crate that gracefully terminates a process in a platform-dependent way. What you wrote seems like the best short-term solution to me, with an if cfg!(target_family = "unix") check so that self.0.kill() is still used on Windows.
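A rough sketch of such a cross-platform guard could look as follows. To keep the sketch dependency-free it shells out to the kill command on Unix instead of calling the nix crate (an assumption for illustration, not what the real code would do), and falls back to the hard kill() elsewhere:

```rust
use std::process::{Child, Command};

/// RAII guard: terminates the wrapped child process when dropped.
struct ChildGuard(Child);

impl Drop for ChildGuard {
    fn drop(&mut self) {
        #[cfg(target_family = "unix")]
        {
            // On Unix, ask the child to shut down gracefully with SIGINT so
            // its own handler can tear down its process tree. Shelling out
            // to `kill` avoids a dependency on the `nix` crate here.
            let sent = Command::new("kill")
                .args(["-INT", &self.0.id().to_string()])
                .status()
                .map(|s| s.success())
                .unwrap_or(false);
            if sent {
                let _ = self.0.wait(); // reap the child
                return;
            }
        }
        // On Windows (or if signal delivery failed), fall back to a hard kill.
        let _ = self.0.kill();
        let _ = self.0.wait();
    }
}

fn main() {
    let child = Command::new("sleep").arg("60").spawn().expect("spawn failed");
    let pid = child.id();
    drop(ChildGuard(child));
    // `kill -0` probes whether a process with this pid still exists.
    let alive = Command::new("kill")
        .args(["-0", &pid.to_string()])
        .status()
        .map(|s| s.success())
        .unwrap_or(false);
    println!("child alive after drop: {}", alive);
}
```

Note that this only helps when Drop actually runs; it does nothing for the abort-on-SIGINT case discussed later in this thread.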

Pointerbender commented 2 years ago

I've also noticed this when pressing CTRL+C while running a cargo_verify test case:

./x.py test -p prusti-tests --test cargotest
<wait a bit>
<CTRL+C>

This leaves prusti-driver instances running in the background.

gavinleroy commented 2 years ago

That's an interesting case @Pointerbender, however, I don't seem to be able to reproduce that locally. Perhaps my waiting a bit is off. Is this easily reproducible for you? If so, maybe you could post some more information about it.

@fpoli I could allocate some time in the next day or two to find all the instances in the test suite where something similar to the above is happening (i.e. a SIGKILL is terminating a process) to see if we can apply the aforementioned short-term fix to it. However, if there are many separate cases, it would be dissatisfying to apply this patch and then have to change them later for a long-term solution (or worse have someone write a new test that again introduces the bug).

Pointerbender commented 2 years ago

Is this easily reproducible for you? If so, maybe you could post some more information about it.

It takes a few steps, but I think I was able to narrow it down :) I can currently only reliably reproduce this on the branch from PR #762 (heads up that this PR is still in flux). One way to reproduce it is to change this line:

https://github.com/viperproject/prusti-dev/blob/ad69f2dcf6e7e5c1aeb5f13cc15e88416282519c/prusti-tests/tests/cargotest.rs#L164

into:

test_builder.arg("-vv");

then run (on linux or mac, not windows):

./x.py test -p prusti-tests --test cargotest

and then hit CTRL+C while the test test_no_std is running (with the above code change, this should take a long time). To time it right, wait until you observe prusti-driver as a background process: the test first runs cargo build (which does not invoke prusti-driver) before running cargo-prusti (which invokes prusti-rustc, which in turn invokes prusti-driver). Immediately after hitting CTRL+C, you should be able to observe that prusti-driver is still running in the background.

fpoli commented 2 years ago

@fpoli I could allocate some time in the next day or two to find all the instances in the test suite where something similar to the above is happening (i.e. a SIGKILL is terminating a process) to see if we can apply the aforementioned short-term fix to it. However, if there are many separate cases, it would be dissatisfying to apply this patch and then have to change them later for a long-term solution (or worse have someone write a new test that again introduces the bug).

I didn't mean to push for the short-term solution :) So far this issue has gone mostly unnoticed, so it's fine to wait until we find a robust solution.

gavinleroy commented 2 years ago

I overestimated the amount of free time I'd have towards the end of the semester, but I had some time today to look into this again. Here's what I've found:

Originally, the concern was with zombie prusti-server and prusti-server-driver processes, probably caused by interrupting the test suite. Looking at the Python source for subprocess, we can see that a KeyboardInterrupt causes a SIGKILL to be sent to the subprocess. This was already identified as a potential issue, so on my personal fork I updated this to send a SIGINT instead. In the general case, this should help clean up processes when a test-suite run is interrupted.

Regarding the specific case I previously pointed out (found here): this is the single place in the test suite where something like this occurs, and I believe a suitable fix would be to alter the code to send a SIGINT.

With the above two changes in place, I haven't been able to leak a child process on my machine, though it might still be possible.

However, there is still a small problem with the RAII ChildGuard pattern: Rust programs do not unwind on SIGINT. Interrupting a test-suite run with Ctrl-C during this specific test would therefore still leak the processes, because the program is aborted and the ChildGuard destructor never runs.

I don't personally know the best way to deal with the last mentioned issue but if my first two changes seem reasonable I'd be happy to open a small PR for them.

fpoli commented 1 year ago

The recent Windows CI failures seem related to this issue. After the test suite successfully terminates, prusti-server-driver.exe is still running while prusti-server.exe is not.

The proper solution, used in both cargo and rustup, is the CreateJobObjectW Windows API, which can reliably kill a process tree. This file seems ready to be copy-pasted and used in every binary of prusti-launch: https://github.com/rust-lang/rustup/blob/master/src/cli/job.rs It might also make the ChildGuard solution unnecessary.

//! Job management (mostly for Windows)
//!
//! Most of the time when you're running cargo you expect Ctrl-C to actually
//! terminate the entire tree of processes in play, not just the one at the top
//! (cargo). This currently works "by default" on Unix platforms because Ctrl-C
//! actually sends a signal to the *process group* rather than the parent
//! process, so everything will get torn down. On Windows, however, this does
//! not happen and Ctrl-C just kills cargo.
//!
//! To achieve the same semantics on Windows we use Job Objects to ensure that
//! all processes die at the same time. Job objects have a mode of operation
//! where when all handles to the object are closed it causes all child
//! processes associated with the object to be terminated immediately.
//! Conveniently whenever a process in the job object spawns a new process the
//! child will be associated with the job object as well. This means if we add
//! ourselves to the job object we create then everything will get torn down!