process group support and more

Part 1 (process groups)

In zos, we sometimes need to group a set of processes into a logical group (unrelated to Linux process groups, although those may be needed for the implementation).

A process group in zos is a logical group of processes that can be monitored, started, and stopped in one go. Inside the group, processes can still have dependencies on other processes defined within the same group. This is beneficial on many levels, as explained below:

- It is sometimes required to start multiple processes that all belong to a single service. For example, a VM can have multiple virtiofs daemons running, a console, network related services, etc. Having all services look flat makes it hard to manage the VM as a single entity, hard to track the logs of individual services, and so on. We then normally delegate the process management to yet another service (like vmd in zos), which complicates things because that service now needs to monitor each individual process and take actions that can be complex (or nonexistent).
- It's hard to start, stop, or forget a service that is composed of multiple processes in one go (again, the VM example above).
- It's sometimes required to restart the entire group if one service crashes. For example, in the case of VMs, if one of the virtiofsd processes dies unexpectedly, the VM has to be restarted as well.

What do we need to do?

- Support process groups. A group is represented as a sub-directory under /etc/zinit (where the service files live), for example:
  - /etc/zinit/<group-name>. This directory can then hold multiple config files (one for each service).
- A service name is of course unique inside a group, but not across groups, so group0/vm.yaml and group1/vm.yaml can coexist without conflict.
- Once a group is configured, we can monitor the entire group in one go with zinit monitor group-name, and zinit should then take care of starting all of its processes. Note that the processes are started normally (respecting dependencies); there can't be cross-group dependencies.
- Inspection of logs should work as usual. A service inside a group should have a log prefix like [group-name/service-name], and logs can be filtered on the entire group with zinit log group-name.
- A stop command can be issued for an entire group, or for a single service inside a group using its full name group-name/service-name.
- The same applies to other commands.
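To make the naming rules above a bit more concrete, here is a rough Python sketch (not zinit code; zinit has no such functions today). The helper names `full_name`, `select`, `format_log` and `filter_logs` are made up for illustration, and letting an exact service name win over a group of the same name is my assumption, not something decided here.

```python
from pathlib import PurePosixPath


def full_name(config_path):
    """Turn a config path relative to /etc/zinit into a full service name.

    'group0/vm.yaml' -> 'group0/vm', 'sshd.yaml' -> 'sshd' (host-level service).
    """
    return PurePosixPath(config_path).with_suffix("").as_posix()


def select(names, target):
    """Resolve a command target to the services it covers.

    The target is either a full service name or a group name.
    Assumption: an exact service match takes precedence over a group match.
    """
    if target in names:
        return [target]
    prefix = target + "/"
    return sorted(n for n in names if n.startswith(prefix))


def format_log(name, message):
    """Prefix a log line with the full service name, e.g. '[group0/vm] ...'."""
    return f"[{name}] {message}"


def filter_logs(lines, target):
    """Keep only the lines of one service, or of every service in a group."""
    exact = f"[{target}] "
    group = f"[{target}/"
    return [line for line in lines if line.startswith(exact) or line.startswith(group)]
```

With this, `select(names, "group0")` returns every service under group0, while `select(names, "group0/vm")` returns just that one service, which is the behaviour the stop and monitor commands would need.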

Note: to implement this, zinit could abstract the grouping of processes and treat all services defined at the host level as members of a nameless group (the host group) that has sub-groups. In general this means groups can be nested, which should make groups easier to reason about and to implement.
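A minimal sketch of that abstraction, in Python for brevity and purely illustrative: the `Group` class and its `add`, `find` and `walk` methods are hypothetical names, and the nesting depth in the example is only there to show that the same code handles any depth.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A node in the group tree; the root (host) group has an empty name."""

    name: str = ""
    services: set = field(default_factory=set)
    groups: dict = field(default_factory=dict)

    def add(self, full_name):
        """Register a service by full name, creating sub-groups as needed."""
        *path, service = full_name.split("/")
        node = self
        for part in path:
            node = node.groups.setdefault(part, Group(part))
        node.services.add(service)

    def find(self, path):
        """Return the group at 'a/b/c', the root for '', or None if missing."""
        node = self
        for part in filter(None, path.split("/")):
            node = node.groups.get(part)
            if node is None:
                return None
        return node

    def walk(self, prefix=""):
        """Yield the full names of all services in this group and below."""
        for service in sorted(self.services):
            yield prefix + service
        for name in sorted(self.groups):
            yield from self.groups[name].walk(f"{prefix}{name}/")
```

Host-level services then live directly in the root group, and a command on a group is just a walk over the matching subtree, for example `root.find("vms/vm1").walk("vms/vm1/")`.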

Part 2 (restart on dependency death)

Right now a service dependency can only be specified with the after configuration flag. It tells zinit that service A can only be started after B, if A is configured like

after:
  - B

This is cool and all, and it covers like 95% of use cases. But once A is started, service B is never checked again, which means that if B dies, A is kept alive. It's up to A to "detect" the loss of connection and then retry, or even exit completely until B is started again.
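For reference, the start-time-only nature of after can be sketched as a plain topological sort. This is an illustration in Python using the standard library's graphlib, not how zinit implements it; `start_order` is a made-up helper.

```python
from graphlib import TopologicalSorter


def start_order(after):
    """Compute a valid start order from `after` declarations.

    `after` maps each service to the services it must start after.
    Raises graphlib.CycleError if the declarations form a cycle.
    """
    return list(TopologicalSorter(after).static_order())


# A is configured with `after: [B]`, so B is started first.
order = start_order({"A": ["B"], "B": []})
print(order)  # ['B', 'A']
```

Nothing in this model looks at B again once the order has been computed, which is exactly the gap described above.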

In some situations it's required to assume that A is now in a bad state, and that we need to automatically stop A until B is started again.

This is exactly the case with VMs and virtiofsd: if a virtiofsd dies, the VM won't die, but we will start getting IO errors inside the VM. Starting virtiofsd again won't fix the issue on its own; we also need to restart the VM.

This is why I really think we need to introduce another dependency flag that can also carry a condition. Say requires; the only condition I can think of right now is always-restart, as follows:

requires:
  always-restart:
    - B

Note: this is an initial syntax that can be changed; IMHO it's confusing.

I think this way of configuring it is a little confusing, because what it actually means is: if B dies, A (the service being configured) needs to be restarted. Note that requires implies after.
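To spell out the intended semantics, here is a small Python sketch that works on already-parsed configs (plain dicts mirroring the proposed YAML). The function names `effective_after` and `restart_set` are hypothetical, and the transitive behaviour (restarting A also hits whatever requires A) is my assumption about what we would want, not something settled.

```python
def effective_after(cfg):
    """Start-order dependencies of one service: `requires` implies `after`."""
    after = list(cfg.get("after", []))
    for dep in cfg.get("requires", {}).get("always-restart", []):
        if dep not in after:
            after.append(dep)
    return after


def restart_set(configs, dead):
    """Services that must be stopped and restarted because `dead` exited.

    Assumption: the effect is transitive, so a service that requires a
    restarted service is restarted as well.
    """
    affected = set()
    pending = [dead]
    while pending:
        current = pending.pop()
        for name, cfg in configs.items():
            deps = cfg.get("requires", {}).get("always-restart", [])
            if current in deps and name not in affected:
                affected.add(name)
                pending.append(name)
    return affected


configs = {
    "virtiofsd": {},
    "vm": {"requires": {"always-restart": ["virtiofsd"]}},
    "console": {"after": ["vm"]},  # plain `after`: not affected by restarts
}
print(restart_set(configs, "virtiofsd"))  # {'vm'}
```

Note that console, which only uses after, is left alone when virtiofsd dies, so the existing behaviour stays unchanged for services that do not opt in.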

IMHO we need to focus on the first issue first (the groups) and then see if we really need the requires flag.

threefoldtech / zinit

process group support and more #59
