thrumdev / blobs

Blobchains on Polkadot and Kusama
https://docs-site-pi.vercel.app
Apache License 2.0
64 stars 8 forks source link

xtask: sugondat-node zombie process #228

Open gabriele-0201 opened 9 months ago

gabriele-0201 commented 9 months ago

The Zombienet process does not seem to kill the Sugondat-node spawned process. Usually, when we press Ctrl+C on the Zombienet executed in the terminal, all grandchild processes are killed (by the shell I think). However, Duct behaves differently and "does not kill any grandchild processes that the children have spawned on their own." Since Zombienet apparently does nothing on its own, we need to either keep track of the 'sugondat-node' process or request Zombienet to do that for us

pepyakin commented 9 months ago

This is a highly non-trivial task actually. At least in theory.

In Linux, when a process dies all its children usually reparented to PID1 (init or systemd), not children's parent. For example, if xtask spawns zombienet and zombienet spawns polkadot, and then zombienet is killed, then polkadot will be reparented to PID1 and continue running.

It is possible to sign up for a signal when the parent dies (using PR_SET_PDEATHSIG). However, I am not sure if that makes any difference.

There is PR_SET_CHILD_SUBREAPER that designates a process to receive orphans (due to reparenting). I am not aware if there is a notification to the parent process about that happening, so not sure how to make the parent to react.

Also, the elephant in the room, is that the options above obviously won't work for macOS.

Well, to clarify, I don't mean that it's a futile goal, after all all programs having a similar functionality probably don't bother too much, and there are cases when orphans get created.

gabriele-0201 commented 9 months ago

Interesting. This is a lot trickier than I expected it to be. Here's another take on the problem: https://github.com/oconnor663/duct.py/blob/master/gotchas.md#killing-grandchild-processes.

A possible solution could be to implement a new xtask's task that will be executed by duct, spawning a new process instead of calling zombinet directly. Then, setpgid can be called to establish the process group ID, followed by execve on zombinet to spawn all the grandchild processes. The trick here is that group PIDs are inherited and zombinet does not change them (as far as I know). So, later when the main cargo xtask test command kills the processes, it will also terminate all the processes under the same group ID.

What do you think? This should be discussed before implementation because my gut feeling is that it may not be straightforward

gabriele-0201 commented 9 months ago

The problem with orphans here is that we need to manually kill all processes between each 'cargo xtask test.' Otherwise, the shim would find and immediately ready sugondat-node, which would break all the testing

pepyakin commented 9 months ago

What do you think?

I also did some research and also concluded that setpgid is the best-effort option that probably will cover 99%. However, you say:

The trick here is that group PIDs are inherited

When you create a child process, the pgid is inhereted by default. You don't have to call setpgid to inherit it.

So, later when the main cargo xtask test command kills the processes, it will also terminate all the processes under the same group ID.

For that, it would be better not to inherit the pgid, but assign a new process group to zombienet. This way xtask would be able to kill only the zombienet and its children, but xtask won't kill itself. In this case, however, relevant signals sent to xtask should be caught and forwarded to the children. Otherwise, the shell will send e.g. ^C to the pg of xtask and not the zombienet's.

The problem with orphans here is that we need to manually kill all processes between each 'cargo xtask test.' Otherwise, the shim would find and immediately ready sugondat-node, which would break all the testing

Sure, although to clarify, not between, but rather at the end. We should not try to be smart and start killing processed which we detect to be leftovers.

gabriele-0201 commented 9 months ago

For that, it would be better not to inherit the pgid, but assign a new process group to zombienet. This way xtask would be able to kill only the zombienet and its children, but xtask won't kill itself.

yess! I was saying the same thing, I just express myself really badly!

however, relevant signals sent to xtask should be caught and forwarded to the children. Otherwise, the shell will send e.g. ^C to the pg of xtask and not the zombienet's.

uhh, I didn't think about it!! yes, it must be done otherwise we would solve a problem and create another

pepyakin commented 9 months ago

you previously were confused, why I would bring up tokio, so the reason is,

Signal handling is pretty complicated. Listening for signals is pretty anoying and complicated and then delivering those signals to the required place in the program is also annoying. Using the async approach simplifies it and tokio offers tokio::signal which is a decent API.