hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

rawexec: Allow only running task process in cgroup v1 overrides #23951

Closed schmichael closed 2 days ago

schmichael commented 1 month ago

Use-cases

Nomad v1.8.0 (#20481) added the cgroup_v1_override and corresponding v2 parameters to allow placing task processes in specific cgroups.

The goal of this feature is to give users with precise cgroup requirements absolute control over the cgroups their tasks run in.

cgroup v2 uses the clone3(2) CLONE_INTO_CGROUP flag to spawn only the task process in the custom cgroup.

CLONE_INTO_CGROUP does not support cgroup v1, so cgroup v1 uses the traditional double-fork approach:

  1. The Nomad agent process forks (1) an intermediary process called the executor.
  2. The executor sets up (potentially custom) cgroups and forks (2) the task command.

This has the unfortunate side effect of leaving the executor process in the custom cgroups alongside the task process, which prevents users from having full control over their custom cgroups.

Proposal

The cgroup v1 behavior should match the cgroup v2 behavior: the executor should not be part of the custom cgroup.

A straightforward, but imperfect, approach would be for the executor to detach from the custom cgroups after forking the child process. Writing the executor's pid to the root cgroups after forking the task process would remove it from the task's cgroups. However, there would be a window of time in which both the task and the executor were running in the custom cgroup.

An alternative that avoids the race condition may be possible but would significantly complicate Nomad's executor: we could triple-fork, with a new intermediary process that sets up and enters the cgroups but exits after forking the user process.

The executor treats subprocesses exiting as the task exiting, so significant code changes would be required to support this new flow just for cgroup v1 override support.

It may be possible for the new intermediary process to avoid the third fork and exiting in favor of calling Exec directly to replace itself with the task's command.

schmichael commented 1 month ago

It may be possible for the new intermediary process to avoid the third fork and exiting in favor of calling Exec directly to replace itself with the task's command.

This seems preferable to triple-forking, and functionally the best approach. That being said, I think it's considerably more effort for an EOL subsystem (cgroup v1) than merely having the executor leave the custom cgroups after forking the user process.