hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

nomad client agent incorrectly indicates that it has zero cpu #23811

Closed josh-m-sharpe closed 1 month ago

josh-m-sharpe commented 1 month ago

Nomad version

nomad --version
Nomad v1.8.3
BuildDate 2024-08-13T07:37:30Z
Revision 63b636e5cbaca312cf6ea63e040f445f05f00478

Operating system and Environment details

amazonlinux 2023

Issue

The client detects 0 CPU and therefore can't place any allocations.

Reproduction steps

Run an agent with this config as root:

datacenter = "dc1"
data_dir   = "/mnt/nomad"

log_level = "TRACE"

client {
  enabled = true

  node_class = "es"

  host_volume "docker-sock-ro" {
    path = "/var/run/docker.sock"
    read_only = true
  }

  # cpu_total_compute = 10000
}

plugin "docker" {
  config {
    auth {
      helper = "ecr-login"
    }

    # extra Docker labels to be set by Nomad on each Docker container with the appropriate value
    extra_labels = ["job_name", "task_group_name", "task_name", "namespace", "node_name"]

    allow_privileged = true
  }
}

telemetry {
  collection_interval = "1s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

Expected Result

Jobs would deploy, and the client would report its available CPU.

Actual Result

Visit the client UI (/ui/clients/7ecfa3d3-91ff-71a7-b7a1-8a51c87a495d) and see: 0 MHz / 0 MHz Total

Job file (if appropriate)

Not really important, other than that it specifies some CPU resource requirement, e.g. cpu = 100.

Notes

It would appear that uncommenting cpu_total_compute = 10000 in the client config above "fixes" this. This is not a real fix: the agent then reports that it has 10,000 MHz of CPU (4% 388 MHz / 10,000 MHz Total), which it doesn't really have, but this at least allows me to allocate jobs to these clients.
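
For reference, the workaround amounts to pinning the node's compute in the client stanza (the 10000 here is the arbitrary figure from my config, not a measured value):

```hcl
client {
  enabled = true

  # Workaround only: tells the scheduler this node has 10,000 MHz of
  # compute, regardless of what the fingerprinter actually detects.
  cpu_total_compute = 10000
}
```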

Other context

I upgraded from v1.5.8 to v1.8.3. I didn't test any versions in between.

pkazmierczak commented 1 month ago

Hi @josh-m-sharpe, thanks for reporting the issue. Sadly, I can't reproduce.

Here's what I did:

- used this agent config:

data_dir   = "/home/ubuntu/nomad_tmp"
datacenter = "dc1"

log_level = "TRACE"

plugin "raw_exec" {
  config {
    enabled = true
  }
}

plugin "docker" {
  config {
    auth {
      helper = "ecr-login"
    }

    extra_labels = ["job_name", "task_group_name", "task_name", "namespace", "node_name"]

    allow_privileged = true
  }
}

telemetry {
  collection_interval        = "1s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}

- ran the client with `sudo nomad agent -config=nomad.hcl`

$ nomad version
Nomad v1.8.3
BuildDate 2024-08-13T07:37:30Z
Revision 63b636e5cbaca312cf6ea63e040f445f05f00478

$ nomad node status -self -verbose
ID              = c42cb411-5dfa-1608-fc63-a513b8c3e893
Name            = ip-10-0-1-160
Node Pool       = default
Class           = es
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 48m10s

Host Volumes
Name            ReadOnly  Source
docker-sock-ro  true      /var/run/docker.sock

Drivers
Driver    Detected  Healthy  Message  Time
docker    true      true     Healthy  2024-08-15T10:13:29+02:00
exec      true      true     Healthy  2024-08-15T10:13:29+02:00
java      false     false    <none>   2024-08-15T10:13:29+02:00
qemu      false     false    <none>   2024-08-15T10:13:29+02:00
raw_exec  true      true     Healthy  2024-08-15T10:13:29+02:00

Node Events
Time                       Subsystem  Message          Details
2024-08-15T10:13:29+02:00  Cluster    Node registered  <none>

Allocated Resources
CPU         Memory       Disk
0/4600 MHz  0 B/7.8 GiB  0 B/90 GiB

Allocation Resource Utilization
CPU         Memory
0/4600 MHz  0 B/7.8 GiB

Host Resource Utilization
CPU          Memory           Disk
22/4600 MHz  395 MiB/7.8 GiB  (/dev/root)

Allocations
No allocations placed

Attributes
cpu.arch                                 = amd64
cpu.frequency                            = 2300
cpu.modelname                            = Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
cpu.numcores                             = 2
cpu.reservablecores                      = 2
cpu.totalcompute                         = 4600
cpu.usablecompute                        = 4600
driver.docker                            = 1
driver.docker.bridge_ip                  = 172.17.0.1
driver.docker.os_type                    = linux
driver.docker.privileged.enabled         = true
driver.docker.runtimes                   = io.containerd.runc.v2,runc
driver.docker.version                    = 24.0.7
driver.exec                              = 1
driver.raw_exec                          = 1
kernel.arch                              = x86_64
kernel.landlock                          = v4
kernel.name                              = linux
kernel.version                           = 6.8.0-1008-aws
memory.totalbytes                        = 8323854336
nomad.advertise.address                  = 10.0.1.160:4646
nomad.bridge.hairpin_mode                = false
nomad.revision                           = 63b636e5cbaca312cf6ea63e040f445f05f00478
nomad.service_discovery                  = true
nomad.version                            = 1.8.3
numa.node.count                          = 1
numa.node0.cores                         = 0-1
os.cgroups.version                       = 2
os.name                                  = ubuntu
os.signals                               = SIGWINCH,SIGXCPU,SIGFPE,SIGILL,SIGIO,SIGPROF,SIGTSTP,SIGCONT,SIGHUP,SIGTTOU,SIGINT,SIGIOT,SIGSTOP,SIGQUIT,SIGSEGV,SIGBUS,SIGALRM,SIGPIPE,SIGSYS,SIGUSR2,SIGXFSZ,SIGNULL,SIGKILL,SIGTERM,SIGTRAP,SIGTTIN,SIGABRT,SIGUSR1
os.version                               = 24.04
platform.aws.ami-id                      = ami-01e444924a2233b07
platform.aws.instance-life-cycle         = on-demand
platform.aws.instance-type               = m4.large
platform.aws.placement.availability-zone = eu-central-1b
unique.hostname                          = ip-10-0-1-160
unique.network.ip-address                = 10.0.1.160
unique.platform.aws.hostname             = ip-10-0-1-160.eu-central-1.compute.internal
unique.platform.aws.instance-id          = i-05a5c3592ad22b67b
unique.platform.aws.local-hostname       = ip-10-0-1-160.eu-central-1.compute.internal
unique.platform.aws.local-ipv4           = 10.0.1.160
unique.platform.aws.mac                  = 06:2a:0d:82:87:b5
unique.platform.aws.public-ipv4          = 52.28.111.27
unique.storage.bytesfree                 = 97064935424
unique.storage.bytestotal                = 102888095744
unique.storage.volume                    = /dev/root

Meta
connect.gateway_image                           = docker.io/envoyproxy/envoy:v${NOMAD_envoy_version}
connect.log_level                               = info
connect.proxy_concurrency                       = 1
connect.sidecar_image                           = docker.io/envoyproxy/envoy:v${NOMAD_envoy_version}
connect.transparent_proxy.default_outbound_port = 15001
connect.transparent_proxy.default_uid           = 101



Can you provide more details about how you're running Nomad?
pkazmierczak commented 1 month ago

Ah, I see you're running amazonlinux, my bad. Still:

[ec2-user@ip-10-0-1-14 ~]$ cat /etc/amazon-linux-release
Amazon Linux release 2023.5.20240805 (Amazon Linux)
[ec2-user@ip-10-0-1-14 ~]$ nomad node status -self -verbose|grep cpu
cpu.arch                                 = amd64
cpu.frequency                            = 2299
cpu.modelname                            = Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
cpu.numcores                             = 2
cpu.reservablecores                      = 2
cpu.totalcompute                         = 4598
cpu.usablecompute                        = 4598
[ec2-user@ip-10-0-1-14 ~]$ nomad version
Nomad v1.8.3
BuildDate 2024-08-13T07:37:30Z
Revision 63b636e5cbaca312cf6ea63e040f445f05f00478

(Screenshot, 2024-08-15 at 10:34:48: client UI reporting non-zero CPU.)

Same agent config.

josh-m-sharpe commented 1 month ago

One notable difference is I'm running on arm64 (graviton) ec2 instances.

I removed my cpu_total_compute and restarted my agent (using systemctl).

$ nomad node status -self -verbose
ID              = cd71fdc6-73f7-e46e-cbb8-8bf8faf71d3f
Name            = ip-10-80-11-152.eu-central-1.compute.internal
Node Pool       = default
Class           = og
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 17h11m15s

Host Volumes
Name            ReadOnly  Source
acme            false     /acme
docker-sock-ro  true      /var/run/docker.sock
grafana         false     /grafana
loki            false     /loki
maxmind         false     /maxmind
mongodb         false     /mongodb
prometheus      false     /prometheus

Drivers
Driver    Detected  Healthy  Message   Time
docker    true      true     Healthy   2024-08-15T13:29:08Z
exec      true      true     Healthy   2024-08-15T13:29:08Z
java      false     false    <none>    2024-08-15T13:29:08Z
qemu      false     false    <none>    2024-08-15T13:29:08Z
raw_exec  false     false    disabled  2024-08-15T13:29:08Z

Node Events
Time                  Subsystem  Message          Details
2024-08-14T20:18:40Z  Cluster    Node registered  <none>

Allocated Resources
CPU         Memory          Disk
1250/0 MHz  2.3 GiB/15 GiB  600 MiB/94 GiB

Allocation Resource Utilization
CPU      Memory
0/0 MHz  67 MiB/15 GiB

Host Resource Utilization
CPU      Memory          Disk
0/0 MHz  644 MiB/15 GiB  6.2 GiB/100 GiB

Allocations
ID                                    Eval ID                               Node ID                               Task Group  Version  Desired  Status   Created               Modified
9f0332a2-1da1-a884-fa9a-013607d7551e  29bd5a98-f61f-cf7a-2beb-7482774a6be7  cd71fdc6-73f7-e46e-cbb8-8bf8faf71d3f  pdf         0        run      running  2024-08-14T20:24:40Z  2024-08-14T20:25:14Z
013bff22-3396-bfc4-72ca-a4dc2818cdab  825eac70-4e44-648a-43af-61054124e5a9  cd71fdc6-73f7-e46e-cbb8-8bf8faf71d3f  traefik     0        run      running  2024-08-14T20:19:18Z  2024-08-14T20:19:22Z

Attributes
consul.connect                           = true
consul.datacenter                        = dc1
consul.dns.port                          = 8600
consul.ft.namespaces                     = false
consul.grpc                              = -1
consul.revision                          = 9f62fb41
consul.server                            = false
consul.sku                               = oss
consul.version                           = 1.19.1
cpu.arch                                 = arm64
cpu.numcores                             = 4
cpu.reservablecores                      = 4
cpu.totalcompute                         = 0
cpu.usablecompute                        = 0
driver.docker                            = 1
driver.docker.bridge_ip                  = 172.17.0.1
driver.docker.os_type                    = linux
driver.docker.privileged.enabled         = true
driver.docker.runtimes                   = io.containerd.runc.v2,runc
driver.docker.version                    = 25.0.3
driver.exec                              = 1
kernel.arch                              = aarch64
kernel.name                              = linux
kernel.version                           = 6.1.82-99.168.amzn2023.aarch64
memory.totalbytes                        = 16440000512
nomad.advertise.address                  = 10.80.11.152:4646
nomad.bridge.hairpin_mode                = false
nomad.revision                           = 63b636e5cbaca312cf6ea63e040f445f05f00478
nomad.service_discovery                  = true
nomad.version                            = 1.8.3
numa.node.count                          = 1
numa.node0.cores                         = 0-3
os.cgroups.version                       = 2
os.name                                  = amazon
os.signals                               = SIGNULL,SIGIO,SIGKILL,SIGTTIN,SIGABRT,SIGALRM,SIGCONT,SIGSYS,SIGUSR1,SIGFPE,SIGIOT,SIGSEGV,SIGPROF,SIGSTOP,SIGTERM,SIGTRAP,SIGXFSZ,SIGHUP,SIGTTOU,SIGINT,SIGTSTP,SIGUSR2,SIGBUS,SIGWINCH,SIGILL,SIGPIPE,SIGQUIT,SIGXCPU
os.version                               = 2023.4.20240401
platform.aws.ami-id                      = ami-094cddbb0d4ca86ee
platform.aws.instance-life-cycle         = on-demand
platform.aws.instance-type               = m7g.xlarge
platform.aws.placement.availability-zone = eu-central-1a
unique.consul.name                       = ip-10-80-11-152.eu-central-1.compute.internal
unique.hostname                          = ip-10-80-11-152.eu-central-1.compute.internal
unique.network.ip-address                = 10.80.11.152
unique.platform.aws.hostname             = ip-10-80-11-152.eu-central-1.compute.internal
unique.platform.aws.instance-id          = i-069cf800c39c95672
unique.platform.aws.local-hostname       = ip-10-80-11-152.eu-central-1.compute.internal
unique.platform.aws.local-ipv4           = 10.80.11.152
unique.platform.aws.mac                  = 02:ca:ae:11:c2:fb
unique.storage.bytesfree                 = 100610465792
unique.storage.bytestotal                = 107295518720
unique.storage.volume                    = /dev/nvme0n1p1

Meta
connect.gateway_image                           = docker.io/envoyproxy/envoy:v${NOMAD_envoy_version}
connect.log_level                               = info
connect.proxy_concurrency                       = 1
connect.sidecar_image                           = docker.io/envoyproxy/envoy:v${NOMAD_envoy_version}
connect.transparent_proxy.default_outbound_port = 15001
connect.transparent_proxy.default_uid           = 101
pkazmierczak commented 1 month ago

thanks for the info @josh-m-sharpe

Sadly, we require dmidecode to be installed on the host if it has an arm64 CPU. See https://github.com/hashicorp/nomad/issues/23710 and https://github.com/hashicorp/nomad/issues/18272.

Can you confirm your host has dmidecode installed? If not, does installing it fix your issue?

Our documentation mentions this, but it's not really prominent. I'll update installation instructions to highlight it.
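
On an Amazon Linux 2023 host, the remedy would look something like this (a sketch: it assumes the agent runs under systemd as a unit named `nomad`, which may differ on your setup):

```shell
# Install dmidecode if it's missing, then restart the agent so the
# CPU fingerprinter runs again with dmidecode available.
if ! command -v dmidecode >/dev/null 2>&1; then
  sudo dnf install -y dmidecode
fi
sudo systemctl restart nomad

# Confirm the fingerprinter now reports non-zero compute:
nomad node status -self -verbose | grep cpu.totalcompute
```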

josh-m-sharpe commented 1 month ago

Sadly? Best news I've heard all day. `dnf install -y dmidecode`, done -> fixed.

nomad node status -self -verbose | grep cpu
cpu.arch                                 = arm64
cpu.frequency.efficiency                 = 2600
cpu.frequency.performance                = 0
cpu.numcores                             = 4
cpu.numcores.efficiency                  = 4
cpu.numcores.performance                 = 0
cpu.reservablecores                      = 4
cpu.totalcompute                         = 10400
cpu.usablecompute                        = 10400
josh-m-sharpe commented 1 month ago

I might suggest Nomad throw an error on startup if it detects 0 CPU, or at minimum emit an error-level log. A node with no usable CPU seems like that level of a problem.

I did look in the logs for problems before reporting this, so that would have helped.

I got all the way to migrating jobs to this new infrastructure before I noticed. Had this not been staging, this would have been a site outage.

pkazmierczak commented 1 month ago

I updated the documentation and added an ERROR-level log message if the fingerprinter detects 0 total CPU.
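
The guard described above could be sketched roughly like this (a hypothetical Go example, not Nomad's actual fingerprinter code; the function name and message are made up):

```go
package main

import (
	"fmt"
	"log"
)

// checkTotalCompute illustrates the kind of guard discussed above: surface a
// loud error when the fingerprinted total compute is zero, instead of letting
// the node register silently with no schedulable CPU.
func checkTotalCompute(totalMHz int) error {
	if totalMHz <= 0 {
		return fmt.Errorf("fingerprinter detected %d MHz total compute; "+
			"set client.cpu_total_compute or install dmidecode on arm64 hosts", totalMHz)
	}
	return nil
}

func main() {
	// A node that fingerprinted 0 MHz would produce an ERROR-level log line.
	if err := checkTotalCompute(0); err != nil {
		log.Printf("ERROR: %v", err)
	}
}
```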