hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Backport of scheduler: take all assigned cpu cores into account instead of only those part of the largest lifecycle into release/1.9.x #24530

Closed by hc-github-team-nomad-core 1 day ago

hc-github-team-nomad-core commented 1 day ago

Backport

This PR is auto-generated from #24304 to be assessed for backporting due to the inclusion of the label backport/1.9.x.

The below text is copied from the body of the original PR.


In our production environment, where we run Nomad v1.8.2, we noticed overlapping cpusets and the Nomad reserve/share cpuset slices being out of sync. Specifically, this happened in a setup where we have various tasks in prestart and poststart hooks running alongside the main lifecycle tasks.

I managed to reproduce it with the job spec below on the latest main (v1.9.1) in my sandbox environment:

job "redis-job-{{SOME_SED_MAGIC}}" {
  type = "service"
  group "cache" {
    count = 1
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
      }
      resources {
        cores = 4
      }
    }

    task "redis-start-side" {
      lifecycle {
        hook    = "poststart"
        sidecar = true
      }
      driver = "docker"
      config {
        image = "redis:3.2"
      }
      resources {
        cores = 4
      }
    }
  }
}

Spinning up two jobs from this spec resulted in the following overlap:

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/6e06a9ed1631/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/a52a46cfa489/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/c9049b1b3f2c/cpuset.effective_cpus  8-11
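
The overlap is visible above: cores 4-7 are claimed by two different containers (a52a46cfa489 and 6e06a9ed1631), which, as the full output below shows, belong to different allocations. For checking a node programmatically rather than eyeballing the cgroup files, here is a small Go sketch; it is a hypothetical helper, not part of Nomad, and the container IDs and ranges are hard-coded from the output above.

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUSet expands a cpuset string such as "0-3" or "4,5,6,7"
// into the individual core IDs it covers. Errors are ignored since
// this sketch only feeds it well-formed cgroup values.
func parseCPUSet(s string) []int {
	var cores []int
	for _, part := range strings.Split(s, ",") {
		if lo, hi, ok := strings.Cut(part, "-"); ok {
			start, _ := strconv.Atoi(lo)
			end, _ := strconv.Atoi(hi)
			for c := start; c <= end; c++ {
				cores = append(cores, c)
			}
		} else {
			c, _ := strconv.Atoi(part)
			cores = append(cores, c)
		}
	}
	return cores
}

func main() {
	// cpuset.effective_cpus values observed above, keyed by container ID.
	cpusets := map[string]string{
		"ec9220fbe2d0": "0-3",
		"6e06a9ed1631": "4-7",
		"a52a46cfa489": "4-7",
		"c9049b1b3f2c": "8-11",
	}

	// Invert the mapping: for each core, record who claims it.
	owners := map[int][]string{}
	for id, set := range cpusets {
		for _, core := range parseCPUSet(set) {
			owners[core] = append(owners[core], id)
		}
	}

	// Any core with more than one owner is an overlap.
	for core, ids := range owners {
		if len(ids) > 1 {
			fmt.Printf("core %d claimed by %v\n", core, ids)
		}
	}
}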

Full output

[sandbox@nomad-dev nomad]$ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED          STATUS          PORTS      NAMES
a52a46cfa489   redis:3.2   "docker-entrypoint.s…"   19 seconds ago   Up 18 seconds   6379/tcp   redis-start-side-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0
ec9220fbe2d0   redis:3.2   "docker-entrypoint.s…"   19 seconds ago   Up 18 seconds   6379/tcp   redis-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0

[sandbox@nomad-dev nomad]$ grep -H . /sys/fs/cgroup/cpuset/nomad/{reserve,share}/cpuset.effective_cpus
/sys/fs/cgroup/cpuset/nomad/reserve/cpuset.effective_cpus:0-7
/sys/fs/cgroup/cpuset/nomad/share/cpuset.effective_cpus:8-123

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084/cpuset.effective_cpus  4-7

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs docker inspect | egrep '(CpusetCpus|NOMAD_CPU_LIMIT|Id)'
        "Id": "a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213",
            "CpusetCpus": "0,1,2,3",
                "NOMAD_CPU_LIMIT=8980",
[sandbox@nomad-dev nomad]$ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED          STATUS          PORTS      NAMES
c9049b1b3f2c   redis:3.2   "docker-entrypoint.s…"   16 seconds ago   Up 15 seconds   6379/tcp   redis-start-side-50ef4e44-0e41-b273-7915-bfd0c2fc2ec2
6e06a9ed1631   redis:3.2   "docker-entrypoint.s…"   16 seconds ago   Up 16 seconds   6379/tcp   redis-50ef4e44-0e41-b273-7915-bfd0c2fc2ec2
a52a46cfa489   redis:3.2   "docker-entrypoint.s…"   3 minutes ago    Up 3 minutes    6379/tcp   redis-start-side-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0
ec9220fbe2d0   redis:3.2   "docker-entrypoint.s…"   3 minutes ago    Up 3 minutes    6379/tcp   redis-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0

[sandbox@nomad-dev nomad]$ grep -H . /sys/fs/cgroup/cpuset/nomad/{reserve,share}/cpuset.effective_cpus
/sys/fs/cgroup/cpuset/nomad/reserve/cpuset.effective_cpus:0-11
/sys/fs/cgroup/cpuset/nomad/share/cpuset.effective_cpus:12-123

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/6e06a9ed1631758827aa4136690818d04c050c55559fb9f74b780b6ff8d33728/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/c9049b1b3f2c2bbfebc6ec8e2f3aa280a9ab23b86322452a54575b1cba3ae179/cpuset.effective_cpus  8-11

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs docker inspect | egrep '(CpusetCpus|NOMAD_CPU_LIMIT|Id)'
        "Id": "c9049b1b3f2c2bbfebc6ec8e2f3aa280a9ab23b86322452a54575b1cba3ae179",
            "CpusetCpus": "8,9,10,11",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "6e06a9ed1631758827aa4136690818d04c050c55559fb9f74b780b6ff8d33728",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213",
            "CpusetCpus": "0,1,2,3",
                "NOMAD_CPU_LIMIT=8980",
Fixes a bug in the BinPackIterator.Next method, where the scheduler would only take the cpusets of the tasks in the largest lifecycle into account. This could result in overlapping cgroup cpusets. By using Allocation.ReservedCores, the scheduler now works from the same cpuset view as Partition.Reserve. Logging was also added so that future regressions can be caught without manually inspecting cgroup files.
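
To make the accounting concrete: in the repro above, each allocation actually reserves 8 cores (4 for the main task plus 4 for the poststart sidecar), but the pre-fix scheduler only counted the cores of the largest lifecycle, i.e. 4 per allocation. That is why the second allocation's main task landed on cores 4-7, which the first allocation's sidecar already held, and why the reserve slice grew to 0-11 instead of 0-15. The Go sketch below models that difference; the Task type, its field names, and both functions are simplified stand-ins for illustration, not Nomad's actual scheduler code.

package main

import "fmt"

// Task is a simplified stand-in for a Nomad task's scheduling view.
type Task struct {
	Lifecycle string // "" for main, or "prestart", "poststart", ...
	Cores     []int  // cpuset cores reserved for this task
}

// buggyCoreCount mirrors the pre-fix logic: only the cores of the
// largest lifecycle group are counted, so prestart/poststart tasks
// with their own reserved cores are invisible to the bin packer.
func buggyCoreCount(tasks []Task) int {
	byLifecycle := map[string]int{}
	for _, t := range tasks {
		byLifecycle[t.Lifecycle] += len(t.Cores)
	}
	max := 0
	for _, n := range byLifecycle {
		if n > max {
			max = n
		}
	}
	return max
}

// fixedCoreCount mirrors the post-fix logic: every reserved core on
// the allocation counts, matching the view Partition.Reserve uses.
func fixedCoreCount(tasks []Task) int {
	total := 0
	for _, t := range tasks {
		total += len(t.Cores)
	}
	return total
}

func main() {
	// The repro job: a 4-core main task plus a 4-core poststart sidecar.
	alloc := []Task{
		{Lifecycle: "", Cores: []int{0, 1, 2, 3}},
		{Lifecycle: "poststart", Cores: []int{4, 5, 6, 7}},
	}
	fmt.Println("buggy:", buggyCoreCount(alloc)) // 4: cores 4-7 look free to the next placement
	fmt.Println("fixed:", fixedCoreCount(alloc)) // 8: all reserved cores are accounted for
}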

Overview of commits - 997da25cdb49c634749be97874955024492b9d43