Hi there -- thanks for the suggestion! I've spent a bit of time looking at pid namespaces, cgroups and such, and here are my general takeaways.
I think the overall scheme you describe to detect leaks will definitely work. However, the biggest stumbling block is that `unshare(CLONE_NEWPID)` requires `CAP_SYS_ADMIN`, or in other words, it requires nextest to run as a privileged process. I think this makes it a nonstarter for many use cases. (edit: I was wrong about this, see below)
Assuming that nextest is run as a privileged process, some notes:
There's an interesting question about whether each test should be in its own namespace or whether all tests should be in the same namespace.
One of the issues here is that nextest tries as hard as possible to use `posix_spawn` to execute each test, for both performance and race condition (around signals) reasons.
However, I would like to spend some time experimenting with a double-spawn approach for tests (there are some complex races that are solved that way). Pid namespace creation could presumably just fit into that.
I think if we're going through this much effort to isolate tests, overall it makes sense for each test to be in its own namespace. That would also eliminate the need for much of a communication protocol.
Forking a Rust process is super unsafe, so we'd probably spawn the current cargo-nextest exe with certain command-line args (or env vars) as a subprocess, rather than forking it. (A rough sketch of what that could look like follows these notes.)
This would be an opt-in config so existing tests don't break by default. We could even add per-test overrides to choose which tests get namespace isolation.
We only need to support Linux for this (nextest does support other Unixes but it already has plenty of platform-specific code).
Maintainability is a big concern for me. We'd have to write a bunch of tests to support this.
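For illustration, here's a minimal sketch of the re-spawn idea mentioned above. The subcommand name and env var are invented for this sketch and are not nextest's actual interface:

```rust
// Hypothetical sketch: relaunch the current executable as a subprocess
// instead of forking, passing the work to do via args and env vars.
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Path to the currently running binary.
    let exe = std::env::current_exe()?;
    let status = Command::new(exe)
        // "__run-test" and NEXTEST_ISOLATE are made-up names for
        // illustration, not real nextest flags.
        .args(["__run-test", "--", "target/debug/my-test-binary"])
        .env("NEXTEST_ISOLATE", "1")
        .status()?;
    println!("isolated runner exited with {status}");
    Ok(())
}
```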
I'd love to hear ideas about what can be done to handle the privilege issue. I'd also like to understand how potential support for cgroups would interact with this.
Oh, never mind, looks like creating a user namespace (like in https://unix.stackexchange.com/a/672462 or in your example) is enough to provide privileges to that process. OK, that makes it a lot more feasible!
(I've confirmed that `unshare --user --pid ls` works. This would presumably also cause cgroups to work...)
Hey, thanks for the detailed reply and explanation! (And apologies for my long reply!)
> However, the biggest stumbling block is that `unshare(CLONE_NEWPID)` requires `CAP_SYS_ADMIN`, or in other words, it requires nextest to run as a privileged process. I think this makes it a nonstarter for many use cases. (edit: I was wrong about this, see below) ... Oh, never mind, looks like creating a user namespace (like in https://unix.stackexchange.com/a/672462 or in your example) is enough to provide privileges to that process. OK, that makes it a lot more feasible!
Yep, my first check after thinking of pid namespaces was to verify the privileges, and I stumbled on the same post you linked. :smile: That is why the example does the user namespace unshare first and then the pid ns, so it can work without sudo. My personal view is that a test runner should not require sudo privileges unless the tests themselves need sudo, so I tried my best to make sure we won't need to run nextest with admin privileges.
> Forking a Rust process is super unsafe, so we'd probably spawn the current cargo-nextest exe with certain command-line args (or env vars) as a subprocess rather than forking it.
The main reason I used fork in the example was to get a working prototype quick and dirty (I would not use so many unwraps otherwise). The key point in ns unsharing is that the process that calls unshare does not itself become part of the new ns; its children (i.e. processes spawned after the call, as shown in the example) do become part of it. So if nextest does the unshare call and then spawns tests, the tests should become part of the new namespace, removing the need for fork. However, that might complicate a per-test opt-in, since every process spawned after the unshare will be in the new ns. Double-spawning might be a good way to avoid this. I'll also take a look at the complex races you mentioned to see how this would fit in.
EDIT: the only caveat about the spawn mentioned above is that it must be a "true" OS spawn; something like tokio's green threads would not shift the namespace.
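A tiny sketch of that unshare-then-spawn behavior, assuming the `nix` crate (this mirrors the shape of the example, not nextest code):

```rust
use std::process::Command;

use nix::sched::{unshare, CloneFlags};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("runner pid (old ns): {}", std::process::id());

    // The user ns grants the privileges needed for CLONE_NEWPID
    // without sudo. The caller itself stays in its original pid ns.
    unshare(CloneFlags::CLONE_NEWUSER | CloneFlags::CLONE_NEWPID)?;

    // This must be a true OS spawn: the first child created after the
    // unshare becomes pid 1 of the new namespace.
    let status = Command::new("sh")
        .args(["-c", "echo \"test pid (new ns): $$\""])
        .status()?;
    println!("child exited with {status}");
    Ok(())
}
```

Running this unprivileged should print the child's pid as 1, confirming that only processes spawned after the call land in the new namespace.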
> Maintainability is a big concern for me. We'd have to write a bunch of tests to support this.
Yep! Would be happy to help as much as I can!
> I'd also like to understand how potential support for cgroups would interact with this.
I don't have a very deep understanding of cgroups, but from what I know, I don't think we should use cgroups to isolate test processes. One reason: `while true; do sleep 1; done` in bash will spawn `sleep 1` processes forever. The cgroup will then contain one long-running process (the script itself) plus a stream of one-second processes. This leads to race conditions: by the time we have read which processes are in the cgroup, a process we read may have exited and a new one spawned, making them almost impossible to kill without killing the spawning process. (Sure, killing the spawning process will remove all the processes, but we would have to check repeatedly that SIGKILL has killed everything.)
All this said and done, I think creating and managing a cgroup for tests just to isolate them would be not-that-great, and would leave us with a lot of cleanup headaches.
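To make the cleanup headache concrete, here's a hedged sketch of the kill loop a runner would need, assuming a cgroup v2 hierarchy (the cgroup path is made up for illustration):

```rust
use std::fs;

use nix::sys::signal::{kill, Signal};
use nix::unistd::Pid;

fn kill_cgroup(cgroup_procs: &str) -> std::io::Result<()> {
    loop {
        // Snapshot of current members; immediately stale.
        let pids = fs::read_to_string(cgroup_procs)?;
        if pids.trim().is_empty() {
            return Ok(()); // every member is finally gone
        }
        for pid in pids.lines().filter_map(|l| l.parse::<i32>().ok()) {
            // A pid read above may already be dead, or a brand-new
            // process may have been spawned into the group; ignore
            // errors and loop until the group reads empty.
            let _ = kill(Pid::from_raw(pid), Signal::SIGKILL);
        }
    }
}

fn main() -> std::io::Result<()> {
    // Hypothetical cgroup path, for illustration only.
    kill_cgroup("/sys/fs/cgroup/nextest-demo/cgroup.procs")
}
```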
So yeah, that's what I think. Let me know what you think of this, and how we can go ahead with it. Let me know how I can help!
Thanks :)
OK, so in #627 I added the ability for nextest to double-spawn test processes. It's gated behind an environment variable, `NEXTEST_EXPERIMENTAL_DOUBLE_SPAWN=1`. Feel free to experiment with it in `cargo-nextest/src/double_spawn.rs`.
Hey, I have been trying out various approaches to this, and wanted to check some things with you:
For this to work, we will need to launch a separate process from the first spawn and then start the actual test. So is it fine if I add a cmd arg to the double spawn which tells it whether this is the first spawn or the second, so it can do setup or run the test accordingly? Is there a more elegant way to do it?
Another way is to first spawn a dummy process like `tail -f /dev/null`, then mount `/proc`, and then spawn the test; however, I don't think this is a particularly good way.
Currently I'm thinking of using the exit status of the first spawned process to signal to the main process whether any resource has leaked, as there is no other way for them to communicate right now. Is that fine, or should we use something different?
Thanks :)
@YJDoc2 I don't fully understand -- are you saying that we'll need to do a "triple spawn" to make it work? Hmm, that's a bit surprising to me.
re exit status, we probably need a communication protocol here -- the exit code likely isn't enough for a production-quality solution. I think creating and passing in a fifo (named pipe) would be the best way to do this. But it's OK to do a prototype with exit codes (and I'd strongly recommend doing and putting up a prototype first so we can see how complex this is).
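For what it's worth, a minimal sketch of the fifo handshake, assuming the `nix` crate (the env var name and report format are invented for illustration, not a real nextest interface):

```rust
use std::fs::File;
use std::io::Read;
use std::process::Command;

use nix::sys::stat::Mode;
use nix::unistd::mkfifo;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fifo = std::env::temp_dir().join("nextest-leak-fifo");
    // Errors if the path already exists; a real runner would use a
    // unique per-test path.
    mkfifo(&fifo, Mode::S_IRWXU)?;

    // Stand-in for the double-spawned process: it writes its leak
    // report into the fifo named by the (hypothetical) env var.
    let mut child = Command::new("sh")
        .args(["-c", r#"echo "leaked=0" > "$NEXTEST_LEAK_FIFO""#])
        .env("NEXTEST_LEAK_FIFO", &fifo)
        .spawn()?;

    // Opening for read blocks until the child opens the write end.
    let mut report = String::new();
    File::open(&fifo)?.read_to_string(&mut report)?;
    child.wait()?;
    println!("report: {}", report.trim());
    Ok(())
}
```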
Hey, I might have explained it a bit wrong: I still mean two spawns, not three. I think a better way would be for me to open a temp PR with what I'm thinking as a prototype, so we can discuss it in a better way. I'll try to get that done by this weekend :)
Also, I agree with the fifo, but for the prototype I'll probably use exit codes to get it working. Once we get the fifo (or equivalent) working, I think we won't need the stdio futures anymore, and all leaks can be detected by the double spawn itself.
Hey @sunshowers, so I worked on this a bit, and have reached the unfortunate conclusion that we might have to use some unsafe code in order to get pid ns isolation working. The reasoning is as follows:
nextest is invoked as `cargo-nextest nextest run/list`, which then invokes itself again through spawn, with the compiled test binary as an argument. In the non-double-spawn approach, this invocation does some setup and then execs the test binary, and the "main" process waits on it through async futures. Note that a call to execve(2) will cause a process's capabilities to be recalculated in the usual way.
There are only two ways of running something between the creation of a new process and exec-ing a new command in it: manually doing fork, running the setup, and then exec; or using a `pre_exec`-style hook, which is equivalent.
Thus, we kind of have to use some unsafe code in order to set up the `/proc` mount for deciding whether a test has leaked or not.
I agree that fork is unsafe, but we can carefully build a wrapper around it so that it is invoked properly. For example, see this impl.
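To illustrate the window being discussed, here's a hedged sketch using std's `pre_exec` (which runs in the child between fork and exec) together with the `nix` crate; this is the pattern under discussion, not anything nextest does today:

```rust
use std::os::unix::process::CommandExt;
use std::process::Command;

use nix::mount::{mount, MsFlags};
use nix::sched::{unshare, CloneFlags};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // New user + pid namespaces; children land in them.
    unshare(CloneFlags::CLONE_NEWUSER | CloneFlags::CLONE_NEWPID)?;

    let mut cmd = Command::new("ps");
    unsafe {
        // pre_exec runs in the child after fork and before exec --
        // the only place we can mount a fresh /proc. This forces the
        // slow fork/exec path instead of posix_spawn.
        cmd.pre_exec(|| {
            // Own mount ns, and keep mounts from propagating back.
            unshare(CloneFlags::CLONE_NEWNS)?;
            mount(
                None::<&str>,
                "/",
                None::<&str>,
                MsFlags::MS_REC | MsFlags::MS_PRIVATE,
                None::<&str>,
            )?;
            // Fresh procfs for the new pid ns, so /proc lists only
            // the processes inside it.
            mount(
                Some("proc"),
                "/proc",
                Some("proc"),
                MsFlags::empty(),
                None::<&str>,
            )?;
            Ok(())
        });
    }
    let status = cmd.status()?;
    println!("ps exited with {status}");
    Ok(())
}
```

With the fresh `/proc`, reading the process list (e.g. via `ps` as above) should show only what's inside the namespace, which is exactly what leak detection needs.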
Is there something I missed? Let me know your thoughts. Thanks :)
Thanks for the detailed report! I think what you described makes sense, though I haven't thought enough about it to see if there's some sort of alternate approach.
Unfortunately, fork/exec (or `pre_exec`, which is equivalent) is an approach that is:
(a) significantly slower than the `posix_spawn` fast path that nextest currently uses (up to twice as slow in real-world benchmarks)
(b) more susceptible to signal handling races than `posix_spawn`.
I've put a lot of effort into making sure we never use fork/exec (including upstream Rust contributions), and I'd hate to see nextest regress on that. I think pid namespacing might just not be possible to do, unless there's some sort of clever workaround.
Hey @sunshowers, I'm afraid that at least for now I am not able to find any workaround. It seems we will have to let go of pid namespacing for now. I'm sorry that I didn't do better research originally, and that you had to add an experimental feature to support this which ultimately could not be realized :sweat: :sweat:
Apart from this, I am curious about what you mentioned regarding fork/exec. I can understand that it might be more susceptible to signal races, but I didn't know that it is slower than spawn. Can you point me to any place where this is explained? I was under the assumption that because spawn itself uses a fork/exec-like mechanism, it should not be that different from manually doing fork/exec (though adding any logic after fork and before exec would be slower - I guess because of COW page setup and such(?))
Again, thanks for showing interest in this and helping me; apologies that I didn't do better research beforehand. You can keep this open or close it as you wish. Thanks :)
The problem here isn't even that we absolutely need the mount call ... I was using mount so that we could easily get the list of processes in the new ns. If we can get that list, even in a tedious way, we don't really need the mount call, and we can do pid isolation....
Thank you so much for doing the research! Even if we figure out that pid namespacing won't work, at least we'll have gathered a lot of useful information.
> I'm sorry that I didn't do better research originally and you had to add experimental feature to support this, which ultimately could not be realized
Oh no, I was meaning to add support for this anyway! See https://nexte.st/CHANGELOG.html#double-spawning-test-processes. You just gave me the final incentive to implement it.
> Apart from this, I am curious about the thing you mentioned about fork/exec. I can understand that it might be more susceptible to signal races, but I didn't know that it is slower than spawn. Can you point me to any place where this might be explained?
So basically, on Linux and other OSes, `posix_spawn` uses a special type of fork called a vfork. This is a super unsafe form of fork which is much faster, but is basically impossible to use correctly. `posix_spawn` is effectively a safe wrapper around vfork (you can see the implementation in this file).
Hey, I just found out about this project recently, and it seems pretty cool. I was going through the docs and your blog post about this, and it states that this currently only detects leaky tests which forward their stdio to children.
I was wondering if we could use pid namespaces on linux to find other kinds of leaks, i.e. those which do not forward the handles?
First, the question of portability: the pid namespace concept is only present on Linux-like OSes. For Windows, I think it might be possible to do something similar using Windows console process groups (?). I'm not very familiar with the Windows API, so this may or may not be possible on Windows; some more investigation is needed. For others such as the BSDs, I don't have much idea (and I'm actually not sure how much support this crate has for them either :sweat_smile:)
Getting back to Linux: I think we might be able to detect whether any processes created by the tests are still running after all tests are finished, using pid namespace unsharing. I haven't gone through the code of this crate in detail, but what I understood from your blog is that for each test to be run, the runner creates an async future, plus futures for its stdio, and then waits on them, bubbling up the result to some "handler" process which takes care of collecting the results and printing output. A leak is detected if the std handles of a task stay open for a certain time after the test exits/gets killed. If this is correct, then I suggest the following for using pid ns to detect leaks.
The idea is that even if we can only assert that some test case has leaked, the user can do a binary search on the cases until they find which one. This also has the added benefit that when the runner exits, the leaked processes will automatically be killed via SIGKILL from the OS side, so better cleanup (?)
This has some glaringly obvious problems (that I could think of):
A basic working example of the above approach (needs nix and duct; uses a lot of unwrap for demonstration purposes) might look like the sketch below.
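Something along these lines, assuming the `nix` and `duct` crates:

```rust
// Quick-and-dirty sketch: unshare the user ns first (so no sudo is
// needed), then the pid ns; anything spawned afterwards lands in the
// new namespace.
use nix::sched::{unshare, CloneFlags};

fn main() {
    unshare(CloneFlags::CLONE_NEWUSER).unwrap();
    unshare(CloneFlags::CLONE_NEWPID).unwrap();

    // The first spawned child becomes pid 1 of the new namespace; any
    // process it leaks gets re-parented to it and dies with it.
    let out = duct::cmd!("sh", "-c", "echo \"pid in new ns: $$\"")
        .read()
        .unwrap();
    println!("{out}");
}
```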
I wrote this up quickly hoping to catch you in your timezone, so of course there might be some loopholes in the above. If you want me to test something, or make a more detailed example for a specific case to check if it'll work, let me know and I'll try to get that done as well.
I'm also thinking of opening this issue/approach on duct, as this might be useful for them as well, to properly handle (?) grand-child processes.
Hope this was somewhat useful. Thanks :)