Closed: josh-m-sharpe closed this issue 1 month ago.
Hi @josh-m-sharpe, thanks for reporting the issue. Sadly, I can't reproduce.
Here's what I did:
- created a Nomad client config:

client {
  enabled = true

  server_join {
    retry_join     = ["provider=aws tag_key=Nomad_role tag_value=nomad-workstation_server"]
    retry_max      = 5
    retry_interval = "15s"
  }

  node_class = "es"

  host_volume "docker-sock-ro" {
    path      = "/var/run/docker.sock"
    read_only = true
  }
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}

data_dir   = "/home/ubuntu/nomad_tmp"
datacenter = "dc1"
log_level  = "TRACE"

plugin "docker" {
  config {
    auth {
      helper = "ecr-login"
    }
    extra_labels     = ["job_name", "task_group_name", "task_name", "namespace", "node_name"]
    allow_privileged = true
  }
}

telemetry {
  collection_interval        = "1s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
- ran the client with `sudo nomad agent -config=nomad.hcl`
$ nomad version
Nomad v1.8.3
BuildDate 2024-08-13T07:37:30Z
Revision 63b636e5cbaca312cf6ea63e040f445f05f00478
$ nomad node status -self -verbose
ID              = c42cb411-5dfa-1608-fc63-a513b8c3e893
Name            = ip-10-0-1-160
Node Pool       = default
Class           = es
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers =

Host Volumes
Name            ReadOnly  Source
docker-sock-ro  true      /var/run/docker.sock

Drivers
Driver  Detected  Healthy  Message  Time
docker  true      true     Healthy  2024-08-15T10:13:29+02:00
exec    true      true     Healthy  2024-08-15T10:13:29+02:00
java    false     false

Node Events
Time                       Subsystem  Message          Details
2024-08-15T10:13:29+02:00  Cluster    Node registered

Allocated Resources
CPU         Memory       Disk
0/4600 MHz  0 B/7.8 GiB  0 B/90 GiB

Allocation Resource Utilization
CPU         Memory
0/4600 MHz  0 B/7.8 GiB

Host Resource Utilization
CPU          Memory           Disk
22/4600 MHz  395 MiB/7.8 GiB  (/dev/root)

Allocations
No allocations placed

Attributes
cpu.arch = amd64
cpu.frequency = 2300
cpu.modelname = Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
cpu.numcores = 2
cpu.reservablecores = 2
cpu.totalcompute = 4600
cpu.usablecompute = 4600
driver.docker = 1
driver.docker.bridge_ip = 172.17.0.1
driver.docker.os_type = linux
driver.docker.privileged.enabled = true
driver.docker.runtimes = io.containerd.runc.v2,runc
driver.docker.version = 24.0.7
driver.exec = 1
driver.raw_exec = 1
kernel.arch = x86_64
kernel.landlock = v4
kernel.name = linux
kernel.version = 6.8.0-1008-aws
memory.totalbytes = 8323854336
nomad.advertise.address = 10.0.1.160:4646
nomad.bridge.hairpin_mode = false
nomad.revision = 63b636e5cbaca312cf6ea63e040f445f05f00478
nomad.service_discovery = true
nomad.version = 1.8.3
numa.node.count = 1
numa.node0.cores = 0-1
os.cgroups.version = 2
os.name = ubuntu
os.signals = SIGWINCH,SIGXCPU,SIGFPE,SIGILL,SIGIO,SIGPROF,SIGTSTP,SIGCONT,SIGHUP,SIGTTOU,SIGINT,SIGIOT,SIGSTOP,SIGQUIT,SIGSEGV,SIGBUS,SIGALRM,SIGPIPE,SIGSYS,SIGUSR2,SIGXFSZ,SIGNULL,SIGKILL,SIGTERM,SIGTRAP,SIGTTIN,SIGABRT,SIGUSR1
os.version = 24.04
platform.aws.ami-id = ami-01e444924a2233b07
platform.aws.instance-life-cycle = on-demand
platform.aws.instance-type = m4.large
platform.aws.placement.availability-zone = eu-central-1b
unique.hostname = ip-10-0-1-160
unique.network.ip-address = 10.0.1.160
unique.platform.aws.hostname = ip-10-0-1-160.eu-central-1.compute.internal
unique.platform.aws.instance-id = i-05a5c3592ad22b67b
unique.platform.aws.local-hostname = ip-10-0-1-160.eu-central-1.compute.internal
unique.platform.aws.local-ipv4 = 10.0.1.160
unique.platform.aws.mac = 06:2a:0d:82:87:b5
unique.platform.aws.public-ipv4 = 52.28.111.27
unique.storage.bytesfree = 97064935424
unique.storage.bytestotal = 102888095744
unique.storage.volume = /dev/root

Meta
connect.gateway_image = docker.io/envoyproxy/envoy:v${NOMAD_envoy_version}
connect.log_level = info
connect.proxy_concurrency = 1
connect.sidecar_image = docker.io/envoyproxy/envoy:v${NOMAD_envoy_version}
connect.transparent_proxy.default_outbound_port = 15001
connect.transparent_proxy.default_uid = 101
Can you provide more details about how you're running Nomad?
Ah, I see you're running amazonlinux, my bad. Still:
[ec2-user@ip-10-0-1-14 ~]$ cat /etc/amazon-linux-release
Amazon Linux release 2023.5.20240805 (Amazon Linux)
[ec2-user@ip-10-0-1-14 ~]$ nomad node status -self -verbose|grep cpu
cpu.arch = amd64
cpu.frequency = 2299
cpu.modelname = Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
cpu.numcores = 2
cpu.reservablecores = 2
cpu.totalcompute = 4598
cpu.usablecompute = 4598
[ec2-user@ip-10-0-1-14 ~]$ nomad version
Nomad v1.8.3
BuildDate 2024-08-13T07:37:30Z
Revision 63b636e5cbaca312cf6ea63e040f445f05f00478
Same agent config.
One notable difference is that I'm running on arm64 (Graviton) EC2 instances. I removed my cpu_total_compute setting and restarted my agent (using systemctl).
$ nomad node status -self -verbose
ID = cd71fdc6-73f7-e46e-cbb8-8bf8faf71d3f
Name = ip-10-80-11-152.eu-central-1.compute.internal
Node Pool = default
Class = og
DC = dc1
Drain = false
Eligibility = eligible
Status = ready
CSI Controllers = <none>
CSI Drivers = <none>
Uptime = 17h11m15s
Host Volumes
Name ReadOnly Source
acme false /acme
docker-sock-ro true /var/run/docker.sock
grafana false /grafana
loki false /loki
maxmind false /maxmind
mongodb false /mongodb
prometheus false /prometheus
Drivers
Driver Detected Healthy Message Time
docker true true Healthy 2024-08-15T13:29:08Z
exec true true Healthy 2024-08-15T13:29:08Z
java false false <none> 2024-08-15T13:29:08Z
qemu false false <none> 2024-08-15T13:29:08Z
raw_exec false false disabled 2024-08-15T13:29:08Z
Node Events
Time Subsystem Message Details
2024-08-14T20:18:40Z Cluster Node registered <none>
Allocated Resources
CPU Memory Disk
1250/0 MHz 2.3 GiB/15 GiB 600 MiB/94 GiB
Allocation Resource Utilization
CPU Memory
0/0 MHz 67 MiB/15 GiB
Host Resource Utilization
CPU Memory Disk
0/0 MHz 644 MiB/15 GiB 6.2 GiB/100 GiB
Allocations
ID Eval ID Node ID Task Group Version Desired Status Created Modified
9f0332a2-1da1-a884-fa9a-013607d7551e 29bd5a98-f61f-cf7a-2beb-7482774a6be7 cd71fdc6-73f7-e46e-cbb8-8bf8faf71d3f pdf 0 run running 2024-08-14T20:24:40Z 2024-08-14T20:25:14Z
013bff22-3396-bfc4-72ca-a4dc2818cdab 825eac70-4e44-648a-43af-61054124e5a9 cd71fdc6-73f7-e46e-cbb8-8bf8faf71d3f traefik 0 run running 2024-08-14T20:19:18Z 2024-08-14T20:19:22Z
Attributes
consul.connect = true
consul.datacenter = dc1
consul.dns.port = 8600
consul.ft.namespaces = false
consul.grpc = -1
consul.revision = 9f62fb41
consul.server = false
consul.sku = oss
consul.version = 1.19.1
cpu.arch = arm64
cpu.numcores = 4
cpu.reservablecores = 4
cpu.totalcompute = 0
cpu.usablecompute = 0
driver.docker = 1
driver.docker.bridge_ip = 172.17.0.1
driver.docker.os_type = linux
driver.docker.privileged.enabled = true
driver.docker.runtimes = io.containerd.runc.v2,runc
driver.docker.version = 25.0.3
driver.exec = 1
kernel.arch = aarch64
kernel.name = linux
kernel.version = 6.1.82-99.168.amzn2023.aarch64
memory.totalbytes = 16440000512
nomad.advertise.address = 10.80.11.152:4646
nomad.bridge.hairpin_mode = false
nomad.revision = 63b636e5cbaca312cf6ea63e040f445f05f00478
nomad.service_discovery = true
nomad.version = 1.8.3
numa.node.count = 1
numa.node0.cores = 0-3
os.cgroups.version = 2
os.name = amazon
os.signals = SIGNULL,SIGIO,SIGKILL,SIGTTIN,SIGABRT,SIGALRM,SIGCONT,SIGSYS,SIGUSR1,SIGFPE,SIGIOT,SIGSEGV,SIGPROF,SIGSTOP,SIGTERM,SIGTRAP,SIGXFSZ,SIGHUP,SIGTTOU,SIGINT,SIGTSTP,SIGUSR2,SIGBUS,SIGWINCH,SIGILL,SIGPIPE,SIGQUIT,SIGXCPU
os.version = 2023.4.20240401
platform.aws.ami-id = ami-094cddbb0d4ca86ee
platform.aws.instance-life-cycle = on-demand
platform.aws.instance-type = m7g.xlarge
platform.aws.placement.availability-zone = eu-central-1a
unique.consul.name = ip-10-80-11-152.eu-central-1.compute.internal
unique.hostname = ip-10-80-11-152.eu-central-1.compute.internal
unique.network.ip-address = 10.80.11.152
unique.platform.aws.hostname = ip-10-80-11-152.eu-central-1.compute.internal
unique.platform.aws.instance-id = i-069cf800c39c95672
unique.platform.aws.local-hostname = ip-10-80-11-152.eu-central-1.compute.internal
unique.platform.aws.local-ipv4 = 10.80.11.152
unique.platform.aws.mac = 02:ca:ae:11:c2:fb
unique.storage.bytesfree = 100610465792
unique.storage.bytestotal = 107295518720
unique.storage.volume = /dev/nvme0n1p1
Meta
connect.gateway_image = docker.io/envoyproxy/envoy:v${NOMAD_envoy_version}
connect.log_level = info
connect.proxy_concurrency = 1
connect.sidecar_image = docker.io/envoyproxy/envoy:v${NOMAD_envoy_version}
connect.transparent_proxy.default_outbound_port = 15001
connect.transparent_proxy.default_uid = 101
Thanks for the info @josh-m-sharpe.

Sadly, we require dmidecode to be installed on the host if it has an arm64 CPU. See https://github.com/hashicorp/nomad/issues/23710 and https://github.com/hashicorp/nomad/issues/18272.

Can you confirm your host has dmidecode installed? If not, does installing it fix your issue?

Our documentation mentions this, but it's not very prominent. I'll update the installation instructions to highlight it.
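A quick way to check is something like this (a sketch assuming a Linux host; per the linked issues, the SMBIOS processor table that `dmidecode -t processor` prints is where the CPU speed comes from on arm64):

```shell
# Is dmidecode present at all?
if command -v dmidecode >/dev/null 2>&1; then
  echo "dmidecode: installed"
  # The processor table carries the CPU speed fields needed on arm64
  sudo dmidecode -t processor 2>/dev/null | grep -i 'speed' || true
else
  echo "dmidecode: missing (install with: dnf install -y dmidecode)"
fi
```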
Sadly? Best news I've heard all day.

dnf install -y dmidecode

done -> fixed.
$ nomad node status -self -verbose | grep cpu
cpu.arch = arm64
cpu.frequency.efficiency = 2600
cpu.frequency.performance = 0
cpu.numcores = 4
cpu.numcores.efficiency = 4
cpu.numcores.performance = 0
cpu.reservablecores = 4
cpu.totalcompute = 10400
cpu.usablecompute = 10400
I might suggest that Nomad throw an error on startup if it detects 0 CPU, or at minimum emit an error-level log. A client with no usable CPU seems like that level of problem. I did look in the logs for problems before reporting this, so that would have helped.

I got all the way to migrating jobs to this new infrastructure before I noticed. Had this not been staging, this would have been a site outage.
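In the meantime, a stopgap version of that check can live outside Nomad, e.g. in a post-start script (a sketch; the awk pattern assumes the `cpu.totalcompute = N` attribute format shown in the output above):

```shell
# Warn loudly if the local client fingerprinted 0 total CPU
total=$(nomad node status -self -verbose 2>/dev/null |
  awk '$1 == "cpu.totalcompute" {print $3}')
if [ "${total:-0}" -eq 0 ]; then
  echo "ERROR: client reports 0 total CPU (is dmidecode installed?)" >&2
fi
```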
I updated the documentation and added an ERROR-level log message if the fingerprinter detects 0 total CPU.
Nomad version
Operating system and Environment details
amazonlinux 2023
Issue
The client detects 0 CPU and therefore can't allocate jobs.
Reproduction steps
run this agent as root:
Expected Result
jobs would deploy, client would report that it has cpu
Actual Result
Visit the client UI at /ui/clients/7ecfa3d3-91ff-71a7-b7a1-8a51c87a495d and see:
0 MHz / 0 MHz Total
Job file (if appropriate)
Not really important, other than specifying some CPU resource requirement, e.g.:
cpu = 100
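For completeness, that requirement sits in a task's resources stanza, along these lines (a minimal illustrative fragment; the task and driver names are made up):

```hcl
task "app" {
  driver = "docker"

  resources {
    cpu = 100  # MHz; unsatisfiable when the client fingerprints 0 total CPU
  }
}
```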
Notes
It would appear that uncommenting

cpu_total_compute = 10000

in that client config above "fixes" this. This is not a real fix: the agent then reports that it has 10,000 MHz of CPU (the UI shows 4%, 388 MHz / 10,000 MHz Total), which it doesn't really have, but it at least allows me to allocate jobs to these clients.

Other context
I upgraded from v1.5.8 to v1.8.3. I didn't test any versions in between.