hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

cpu oversubscription #23920

Open noffke opened 2 months ago

noffke commented 2 months ago

Proposal

Please allow opting out of setting cpu-shares on the docker driver.

Nomad uses the "cpu" resource in a job config for planning and performing allocations, and, in the case of the docker driver, the value is then also set as CPU shares (https://docs.docker.com/engine/containers/resource_constraints/#cpu) when starting a Docker container. We're running Nomad on our own hardware, so outside of general system performance considerations we don't have to worry about consuming too many CPU cycles, unlike in a paid cloud environment. As a form of low-effort, at-your-own-risk CPU overprovisioning, it would be great if there were an option to tell Nomad to simply not set any CPU shares when starting a Docker container.
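For illustration only, the kind of opt-out we have in mind might look something like a hypothetical task-level flag on the docker driver config (skip_cpu_shares is made up here, not an existing option):

```hcl
# Hypothetical sketch only: "skip_cpu_shares" is not an existing docker driver
# option; it just illustrates the requested opt-out.
task "service" {
  driver = "docker"

  config {
    image           = "example/service:latest" # placeholder image
    skip_cpu_shares = true                     # hypothetical: don't set HostConfig.CpuShares
  }

  resources {
    cpu    = 500 # still used for scheduling/placement
    memory = 256
  }
}
```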

Use-cases

We're running a CPU-intensive service via Nomad that, even though it is given most of a machine's available CPU, still gets throttled to the point of being unusable.

Attempted Solutions

Running the service via Docker on a Nomad client machine without Nomad (i.e. starting it manually on the command line) and not setting CPU shares makes the service run fine. Ideally, we'd like to keep running the service via Nomad so that we don't have to manage one service outside of it.
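For reference, the manual workaround is roughly the following (image and container names are placeholders); with no CPU flags, Docker leaves HostConfig.CpuShares at 0, i.e. "use the kernel default":

```sh
# No --cpu-shares or --cpus flags; image/name are placeholders.
$ docker run -d --name my-service example/service:latest

# CpuShares stays at 0, which means "kernel default" rather than a limit.
$ docker inspect my-service | jq '.[0].HostConfig.CpuShares'
0
```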

tgross commented 2 months ago

@noffke if Nomad doesn't set the cpu-shares, they're just going to be the default for the cgroup (1024), so I don't think that would actually solve the problem you've got. That is, so long as a host is configured with cgroups, all processes are being run in a cgroup (if only the root cgroup, but typically on Linux with systemd the system.slice).
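You can see this on a client by checking which cgroup hierarchy is mounted and where an arbitrary process lands; paths and output will vary per system, this is just an illustration:

```sh
# cgroup2fs means the unified cgroups v2 hierarchy; tmpfs means cgroups v1.
$ stat -fc %T /sys/fs/cgroup
cgroup2fs

# Every process belongs to some cgroup, even "unconstrained" ones.
$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope
```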

Have you considered using resource.cores to give the workload cpuset isolation to a subset of cores instead?
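For reference, that looks roughly like swapping cpu for cores in the task's resources block (the two are mutually exclusive):

```hcl
resources {
  cores  = 2   # reserve two whole cores for this task (cpuset isolation)
  memory = 256 # values here are illustrative
}
```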

noffke commented 2 months ago

@tgross thank you for your explanation! From looking at the running docker containers, and also from my understanding, that doesn't seem to be the case on our (plain ubuntu) systems. If I don't set any resource limits on docker run... docker won't apply any, so the containers will run like any other processes on the systems.

> Have you considered using resource.cores to give the workload cpuset isolation to a subset of cores instead?

The service in question doesn't have a constantly high CPU load but rather a medium load with bursts, so if I assign CPUs exclusively, they'd be unavailable to the other containers even when they're idle, as I understand it.

tgross commented 2 months ago

> From looking at the running docker containers, and also from my understanding, that doesn't seem to be the case on our (plain ubuntu) systems. If I don't set any resource limits on docker run... docker won't apply any, so the containers will run like any other processes on the systems.

Right, but "any other processes on the system" are also constrained by cgroups, because that's how the so-called Completely Fair Scheduler (CFS) works. Docker doesn't apply any extra constraints, but the process is still constrained with a default cpu.share (or more likely if you're on a recent version of your distro with cgroups v2, cpu.weight). Remember when you're looking at the cgroups that they "roll up" to the parent cgroup. (I'll dig into this below if you're interested.)

Assuming you're on cgroups v2 (which is likely unless you're using a very old Ubuntu), to give this process as much cpu.weight as a typical un-containerized process on the system, you need to set the resources.cpu sufficiently high such that Docker maps that to an equivalent of cpu.weight = 100 (or more).
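To make that concrete: under contention, CFS splits CPU time between busy sibling cgroups roughly in proportion to their weights, so a container whose mapped weight is well below 100 only gets a small slice. A back-of-the-envelope with illustrative numbers (not a measurement):

```sh
# Approximate share under contention ≈ weight / sum of busy siblings' weights.
# A container mapped to cpu.weight=10 competing with a default-weight (100)
# process gets roughly:
$ awk 'BEGIN { printf "%.0f%%\n", 10 / (10 + 100) * 100 }'
9%
```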

The catch with that is the Nomad scheduler then assumes those CPU resources aren't available. Which probably means you're really looking for "CPU oversubscription" here, which would have to be implemented in the scheduler and not the task driver. This is all an unfortunate consequence of having a generic "CPU resource" that different task drivers handle in different ways. Or even different ways for the same task driver, depending on kernel options, as we see with Docker!


Additional context about CPU resources...

For example, let's compare running a process by itself, running a process with Docker directly but without --cpu-shares set, and running a process with Docker under Nomad. This is a machine using cgroups v2, so we'll be looking at cpu.weight rather than cpu.shares.

First I'll run busybox httpd as a normal "unconstrained" process and can see that the process has been given a weight of 100 (the default).

```sh
$ ps afx
...
  44410 pts/2    S+     0:00  |   \_ busybox httpd -vv -f -p 8001 -h /srv/

$ cat /proc/44410/cgroup
0::/user.slice/user-1000.slice/session-c1.scope

$ cat /sys/fs/cgroup/user.slice/cpu.weight
100
```

Next I'll run the same process in a container:

```sh
$ docker run -it --rm busybox:1 httpd -vv -f -p 8001 -h /var/www
```

In another terminal, I find the PID for the process and look up its cgroup, and see that it also has the default cpu.weight of 100, even though we didn't explicitly set a CPU weight in the Docker invocation.

```sh
$ ps afx
...
  37297 ?        Sl     0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 60fa9e7d45eeac0288efda19bf0d00313b74573a3757f6dfdaa931f52f52c83d -address
  37317 pts/0    Ss+    0:00  \_ httpd -vv -f -p 8001 -h /var/www

$ cat /proc/37317/cgroup
0::/system.slice/docker-60fa9e7d45eeac0288efda19bf0d00313b74573a3757f6dfdaa931f52f52c83d.scope

$ cat /sys/fs/cgroup/system.slice/docker-60fa9e7d45eeac0288efda19bf0d00313b74573a3757f6dfdaa931f52f52c83d.scope/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

$ cat /sys/fs/cgroup/system.slice/docker-60fa9e7d45eeac0288efda19bf0d00313b74573a3757f6dfdaa931f52f52c83d.scope/cpu.weight
100
```

Lastly, we'll run a minimal Nomad jobspec with resources.cpu = 200.

jobspec

```hcl
job "example" {
  group "group" {
    task "task" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
      }

      resources {
        cpu    = 200
        memory = 50
      }
    }
  }
}
```

When we look up the cgroup, we can see a cpu.weight = 8.

```sh
$ ps afx
...
  49026 ?        Sl     0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id dd8b806512c419b3ad90ac4ab820e2e6219a9f3fd0c64df7dd9b3406d0be6bb3 -address
  49046 ?        Ss     0:00  \_ httpd -vv -f -p 8001 -h /local

$ cat /proc/49046/cgroup
0::/system.slice/docker-dd8b806512c419b3ad90ac4ab820e2e6219a9f3fd0c64df7dd9b3406d0be6bb3.scope

$ cat /sys/fs/cgroup/system.slice/docker-dd8b806512c419b3ad90ac4ab820e2e6219a9f3fd0c64df7dd9b3406d0be6bb3.scope/cpu.weight
8
```

Why 8? That's Docker mapping the HostConfig.CpuShares to an equivalent cpu.weight. Frankly, that math smells a little dubious, as I'd expect this to be something more like 20, but that decision making is upstream of Nomad. It might be worth us looking at moving away from setting HostConfig.CpuShares when cgroups v2 is in play and switching to setting CpuWeight directly, but that needs some further thinking.

```sh
$ docker inspect dd8b806512c4 | jq '.[0].HostConfig.CpuShares'
200
```
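Assuming the shares-to-weight conversion runc applies on cgroups v2 is weight = 1 + ((shares - 2) * 9999) / 262142 with integer division (that formula is an assumption about runc's plumbing, not something Nomad controls or documents), a quick sanity check does reproduce the 8 above, and suggests a resources.cpu of roughly 2600 is where the default-equivalent weight of 100 is reached:

```sh
# 200 shares -> 1 + (198 * 9999) / 262142 = 1 + 7 = 8 (matches the cpu.weight above)
$ echo $(( 1 + ((200 - 2) * 9999) / 262142 ))
8

# ~2600 shares is roughly where the mapping reaches the default weight of 100
$ echo $(( 1 + ((2600 - 2) * 9999) / 262142 ))
100
```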

I'm also realizing here that the documentation in https://developer.hashicorp.com/nomad/docs/drivers/docker#cpu is stale and doesn't account for cgroups v2.

noffke commented 2 months ago

@tgross Thanks again for your very detailed explanation!