Hi there -- thanks for the suggestion! I've spent a bit of time looking at pid namespaces, cgroups and such, and here are my general takeaways.
I think the overall scheme you describe to detect leaks will definitely work. However, the biggest stumbling block is that `unshare(CLONE_NEWPID)` requires `CAP_SYS_ADMIN`, or in other words, it requires nextest to run as a privileged process. I think this makes it a nonstarter for many use cases. (edit: I was wrong about this, see below)
Assuming that nextest is run as a privileged process, some notes:
There's an interesting question about whether each test should be in its own namespace or whether all tests should be in the same namespace.
One of the issues here is that nextest tries as hard as possible to use `posix_spawn` to execute each test, for both performance and race condition (around signals) reasons.
However, I would like to spend some time experimenting with a double-spawn approach for tests (there are some complex races that are solved that way). Pid namespace creation could presumably just fit into that.
I think if we're going through this much effort to isolate tests, overall it makes sense for each test to be in its own namespace. That would also eliminate the need for much of a communication protocol.
Forking a Rust process is super unsafe, so we'd probably spawn the current cargo-nextest exe with certain command-line args (or env vars) as a subprocess, rather than forking it. (A rough sketch of what that could look like follows these notes.)
This would be an opt-in config so existing tests don't break by default. We could even add per-test overrides to choose which tests get namespace isolation.
We only need to support Linux for this (nextest does support other Unixes but it already has plenty of platform-specific code).
Maintainability is a big concern for me. We'd have to write a bunch of tests to support this.
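For illustration, here's a minimal sketch of the re-spawn idea mentioned above. The subcommand name and env var are invented for this sketch and are not nextest's actual interface:

```rust
// Hypothetical sketch: relaunch the current executable as a subprocess
// instead of forking, passing the work to do via args and env vars.
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Path to the currently running binary.
    let exe = std::env::current_exe()?;
    let status = Command::new(exe)
        // "__run-test" and NEXTEST_ISOLATE are made-up names for
        // illustration, not real nextest flags.
        .args(["__run-test", "--", "target/debug/my-test-binary"])
        .env("NEXTEST_ISOLATE", "1")
        .status()?;
    println!("isolated runner exited with {status}");
    Ok(())
}
```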
I'd love to hear ideas about what can be done to handle the privilege issue. I'd also like to understand how potential support for cgroups would interact with this.
Oh, never mind, looks like creating a user namespace (like in https://unix.stackexchange.com/a/672462 or in your example) is enough to provide privileges to that process. OK, that makes it a lot more feasible!
(I've confirmed that `unshare --user --pid ls` works. This would presumably also cause cgroups to work...)
Hey, thanks for the detailed reply and explanation! (And apologies for my long reply!)
> However, the biggest stumbling block is that `unshare(CLONE_NEWPID)` requires `CAP_SYS_ADMIN`, or in other words, it requires nextest to run as a privileged process. I think this makes it a nonstarter for many use cases. (edit: I was wrong about this, see below) ... Oh, never mind, looks like creating a user namespace (like in https://unix.stackexchange.com/a/672462 or in your example) is enough to provide privileges to that process. OK, that makes it a lot more feasible!
Yep, my first check after thinking of pid namespaces was to verify the privileges, and I stumbled on the same post you linked. :smile: That is why the example does the user namespace unshare first and then the pid ns, so it can work without sudo. My personal view is that a test runner should not require sudo privileges unless the tests themselves need sudo, so I tried my best to make sure we won't need to run nextest with admin privileges.
> Forking a Rust process is super unsafe, so we'd probably spawn the current cargo-nextest exe with certain command-line args (or env vars) as a subprocess rather than forking it.
The main reason I used fork in the example was to get a working prototype quick and dirty (I would not use so many unwraps otherwise). The key point in ns unsharing is that the process that calls unshare does not itself become part of the new ns; its children (i.e. processes spawned after the call, as shown in the example) do become part of it. So if nextest does the unshare call and then spawns tests, the tests should become part of the new namespace, removing the need for fork. However, that might complicate a per-test opt-in, since every process spawned after the unshare will be in the new ns. Double-spawning might be a good way to avoid this. I'll also take a look at the complex races you mentioned to see how this would fit in.
EDIT: the only caveat about the spawn mentioned above is that it must be a "true" OS spawn; something like tokio's green threads would not shift the namespace.
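A tiny sketch of that unshare-then-spawn behavior, assuming the `nix` crate (this mirrors the shape of the example, not nextest code):

```rust
use std::process::Command;

use nix::sched::{unshare, CloneFlags};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("runner pid (old ns): {}", std::process::id());

    // The user ns grants the privileges needed for CLONE_NEWPID
    // without sudo. The caller itself stays in its original pid ns.
    unshare(CloneFlags::CLONE_NEWUSER | CloneFlags::CLONE_NEWPID)?;

    // This must be a true OS spawn: the first child created after the
    // unshare becomes pid 1 of the new namespace.
    let status = Command::new("sh")
        .args(["-c", "echo \"test pid (new ns): $$\""])
        .status()?;
    println!("child exited with {status}");
    Ok(())
}
```

Running this unprivileged should print the child's pid as 1, confirming that only processes spawned after the call land in the new namespace.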
> Maintainability is a big concern for me. We'd have to write a bunch of tests to support this.
Yep! Would be happy to help as much as I can!
> I'd also like to understand how potential support for cgroups would interact with this.
I don't have a very deep understanding of cgroups, but from what I know, I don't think we should use cgroups to isolate test processes. One reason: `while true; do sleep 1; done` in bash will spawn `sleep 1` processes forever. The cgroup will then contain one long-running process (the script itself) plus a stream of one-second processes. This leads to race conditions: by the time we have read which processes are in the cgroup, a process we read may have exited and a new one spawned, making them almost impossible to kill without killing the spawning process. (Sure, killing the spawning process will remove all the processes, but we would have to check repeatedly that SIGKILL has killed everything.)
All this said and done, I think creating and managing a cgroup for tests just to isolate them would be not-that-great, and would leave us with a lot of cleanup headaches.
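To make the cleanup headache concrete, here's a hedged sketch of the kill loop a runner would need, assuming a cgroup v2 hierarchy (the cgroup path is made up for illustration):

```rust
use std::fs;

use nix::sys::signal::{kill, Signal};
use nix::unistd::Pid;

fn kill_cgroup(cgroup_procs: &str) -> std::io::Result<()> {
    loop {
        // Snapshot of current members; immediately stale.
        let pids = fs::read_to_string(cgroup_procs)?;
        if pids.trim().is_empty() {
            return Ok(()); // every member is finally gone
        }
        for pid in pids.lines().filter_map(|l| l.parse::<i32>().ok()) {
            // A pid read above may already be dead, or a brand-new
            // process may have been spawned into the group; ignore
            // errors and loop until the group reads empty.
            let _ = kill(Pid::from_raw(pid), Signal::SIGKILL);
        }
    }
}

fn main() -> std::io::Result<()> {
    // Hypothetical cgroup path, for illustration only.
    kill_cgroup("/sys/fs/cgroup/nextest-demo/cgroup.procs")
}
```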
So yeah, that's what I think. Let me know what you think of this, and how we can go ahead with it. Let me know how I can help!
Thanks :)
OK, so in #627 I added the ability for nextest to double-spawn test processes. It's gated behind an environment variable, `NEXTEST_EXPERIMENTAL_DOUBLE_SPAWN=1`. Feel free to experiment with it in `cargo-nextest/src/double_spawn.rs`.
Hey, I have been trying out various approaches to this, and wanted to check some things with you:
For this to work, we will need to launch a separate process from the first spawn and then start the actual test. So is it fine if I add a cmd arg to the double spawn which tells it whether this is the first spawn or the second, so it can do setup or run the test accordingly? Is there a more elegant way to do it?
Another way is to first spawn a dummy process like `tail -f /dev/null`, then mount `/proc`, and then spawn the test; however, I don't think this is a particularly good way.
Currently I'm thinking of using the exit status of the first spawned process to signal to the main process whether any resource has leaked, as there is no other way for them to communicate right now. Is that fine, or should we use something different?
Thanks :)
@YJDoc2 I don't fully understand -- are you saying that we'll need to do a "triple spawn" to make it work? Hmm, that's a bit surprising to me.
re exit status, we probably need a communication protocol here -- the exit code likely isn't enough for a production-quality solution. I think creating and passing in a fifo (named pipe) would be the best way to do this. But it's OK to do a prototype with exit codes (and I'd strongly recommend doing and putting up a prototype first so we can see how complex this is).
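For what it's worth, a minimal sketch of the fifo handshake, assuming the `nix` crate (the env var name and report format are invented for illustration, not a real nextest interface):

```rust
use std::fs::File;
use std::io::Read;
use std::process::Command;

use nix::sys::stat::Mode;
use nix::unistd::mkfifo;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fifo = std::env::temp_dir().join("nextest-leak-fifo");
    // Errors if the path already exists; a real runner would use a
    // unique per-test path.
    mkfifo(&fifo, Mode::S_IRWXU)?;

    // Stand-in for the double-spawned process: it writes its leak
    // report into the fifo named by the (hypothetical) env var.
    let mut child = Command::new("sh")
        .args(["-c", r#"echo "leaked=0" > "$NEXTEST_LEAK_FIFO""#])
        .env("NEXTEST_LEAK_FIFO", &fifo)
        .spawn()?;

    // Opening for read blocks until the child opens the write end.
    let mut report = String::new();
    File::open(&fifo)?.read_to_string(&mut report)?;
    child.wait()?;
    println!("report: {}", report.trim());
    Ok(())
}
```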
Hey, I might have explained it a bit wrong: I still mean two spawns, not three. I think a better way would be for me to open a temp PR with what I'm thinking as a prototype, so we can discuss it in a better way. I'll try to get that done by this weekend :)
Also, I agree with the fifo, but for the prototype I'll probably use exit codes to get it working. Once we get the fifo (or equivalent) working, I think we won't need the stdio futures anymore, and all leaks can be detected by the double spawn itself.
Hey @sunshowers, so I worked on this a bit, and have reached the unfortunate conclusion that we might have to use some unsafe code in order to get pid ns isolation working. The reasoning is as follows:
nextest is invoked as `cargo-nextest nextest run/list`, which then invokes itself again through spawn, with the compiled test binary as an argument. In the non-double-spawn approach, this invocation does some setup and then execs the test binary, and the "main" process waits on it through async futures. Note that a call to execve(2) will cause a process's capabilities to be recalculated in the usual way.
There are only two ways of running something between the creation of a new process and exec-ing a new command in it: manually doing fork, running the setup, and then exec; or using a `pre_exec`-style hook, which is equivalent.
Thus, we kind of have to use some unsafe code in order to set up the `/proc` mount for deciding whether a test has leaked or not.
I agree that fork is unsafe, but we can carefully build a wrapper around it so that it is invoked properly. For example, see this impl.
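To illustrate the window being discussed, here's a hedged sketch using std's `pre_exec` (which runs in the child between fork and exec) together with the `nix` crate; this is the pattern under discussion, not anything nextest does today:

```rust
use std::os::unix::process::CommandExt;
use std::process::Command;

use nix::mount::{mount, MsFlags};
use nix::sched::{unshare, CloneFlags};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // New user + pid namespaces; children land in them.
    unshare(CloneFlags::CLONE_NEWUSER | CloneFlags::CLONE_NEWPID)?;

    let mut cmd = Command::new("ps");
    unsafe {
        // pre_exec runs in the child after fork and before exec --
        // the only place we can mount a fresh /proc. This forces the
        // slow fork/exec path instead of posix_spawn.
        cmd.pre_exec(|| {
            // Own mount ns, and keep mounts from propagating back.
            unshare(CloneFlags::CLONE_NEWNS)?;
            mount(
                None::<&str>,
                "/",
                None::<&str>,
                MsFlags::MS_REC | MsFlags::MS_PRIVATE,
                None::<&str>,
            )?;
            // Fresh procfs for the new pid ns, so /proc lists only
            // the processes inside it.
            mount(
                Some("proc"),
                "/proc",
                Some("proc"),
                MsFlags::empty(),
                None::<&str>,
            )?;
            Ok(())
        });
    }
    let status = cmd.status()?;
    println!("ps exited with {status}");
    Ok(())
}
```

With the fresh `/proc`, reading the process list (e.g. via `ps` as above) should show only what's inside the namespace, which is exactly what leak detection needs.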
Is there something I missed? Let me know your thoughts. Thanks :)
Thanks for the detailed report! I think what you described makes sense, though I haven't thought enough about it to see if there's some sort of alternate approach.
Unfortunately, fork/exec (or `pre_exec`, which is equivalent) is an approach that is:
(a) significantly slower than the `posix_spawn` fast path that nextest currently uses (up to twice as slow in real-world benchmarks)
(b) more susceptible to signal handling races than `posix_spawn`.
I've put a lot of effort into making sure we never use fork/exec (including upstream Rust contributions), and I'd hate to see nextest regress on that. I think pid namespacing might just not be possible to do, unless there's some sort of clever workaround.
Hey @sunshowers, I'm afraid that at least for now I am not able to find any workaround. It seems we will have to let go of pid namespacing for now. I'm sorry that I didn't do better research originally, and that you had to add an experimental feature to support this which ultimately could not be realized :sweat: :sweat:
Apart from this, I am curious about what you mentioned regarding fork/exec. I can understand that it might be more susceptible to signal races, but I didn't know that it is slower than spawn. Can you point me to any place where this is explained? I was under the assumption that because spawn itself uses a fork/exec-like mechanism, it should not be that different from manually doing fork/exec (though adding any logic after fork and before exec would be slower - I guess because of COW page setup and such(?))
Again, thanks for showing interest in this and helping me; apologies that I didn't do better research beforehand. You can keep this open or close it as you wish. Thanks :)
The problem here isn't even that we absolutely need the mount call ... I was using mount so that we could easily get the list of processes in the new ns. If we can get that list, even in a tedious way, we don't really need the mount call, and we can do pid isolation....
Thank you so much for doing the research! Even if we figure out that pid namespacing won't work, at least we'll have gathered a lot of useful information.
> I'm sorry that I didn't do better research originally and you had to add experimental feature to support this, which ultimately could not be realized
Oh no, I was meaning to add support for this anyway! See https://nexte.st/CHANGELOG.html#double-spawning-test-processes. You just gave me the final incentive to implement it.
> Apart from this, I am curious about the thing you mentioned about fork/exec. I can understand that it might be more susceptible to signal races, but I didn't know that it is slower than spawn. Can you point me to any place where this might be explained?
So basically, on Linux and other OSes, `posix_spawn` uses a special type of fork called a vfork. This is a super unsafe form of fork which is much faster, but is basically impossible to use correctly. `posix_spawn` is effectively a safe wrapper around vfork (you can see the implementation in this file).
Hey, I just found out about this project recently, and it seems pretty cool. I was going through the docs and your blog post about this, and it states that this currently only detects leaky tests which forward their stdio to children.
I was wondering if we could use pid namespaces on linux to find other kinds of leaks, i.e. those which do not forward the handles?
First, the question of portability: the pid namespace concept is only present on Linux-like OSes. For Windows, I think it might be possible to do something similar using Windows console process groups (?). I'm not very familiar with the Windows API, so this may or may not be possible on Windows; some more investigation is needed. For others such as the BSDs, I don't have much idea (and I'm actually not sure how much support this crate has for them either :sweat_smile:)
Getting back to Linux: I think we might be able to detect whether any processes created by the tests are still running after all tests are finished, using pid namespace unsharing. I haven't gone through the code of this crate in detail, but what I understood from your blog is that for each test to be run, the runner creates an async future, plus futures for its stdio, and then waits on them, bubbling up the result to some "handler" process which takes care of collecting the results and printing output. A leak is detected if the std handles of a task stay open for a certain time after the test exits/gets killed. If this is correct, then I suggest the following for using pid ns to detect leaks.
The idea is that even if we can only assert that some test case has leaked, the user can do a binary search on the cases until they find which one. This also has the added benefit that when the runner exits, the leaked processes will automatically be killed via SIGKILL from the OS side, so better cleanup (?)
This has some glaringly obvious problems (that I could think of):
A basic working example of the above approach (needs nix and duct; uses a lot of unwrap for demonstration purposes) might look like the sketch below.
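Something along these lines, assuming the `nix` and `duct` crates:

```rust
// Quick-and-dirty sketch: unshare the user ns first (so no sudo is
// needed), then the pid ns; anything spawned afterwards lands in the
// new namespace.
use nix::sched::{unshare, CloneFlags};

fn main() {
    unshare(CloneFlags::CLONE_NEWUSER).unwrap();
    unshare(CloneFlags::CLONE_NEWPID).unwrap();

    // The first spawned child becomes pid 1 of the new namespace; any
    // process it leaks gets re-parented to it and dies with it.
    let out = duct::cmd!("sh", "-c", "echo \"pid in new ns: $$\"")
        .read()
        .unwrap();
    println!("{out}");
}
```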
I wrote this up quickly hoping to catch you in your timezone, so of course there might be some loopholes in the above. If you want me to test something, or make a more detailed example for a specific case to check if it'll work, let me know and I'll try to get that done as well.
I'm also thinking of opening this issue/approach on duct, as this might be useful for them as well, to properly handle (?) grand-child processes.
Hope this was somewhat useful. Thanks :)