Closed codingJWilliams closed 10 months ago
The problem also happens with lxd in Ubuntu 23.10 host and Ubuntu 22.04 guest
Hey @codingJWilliams and @livioribeiro, thanks for reporting this. Even though we don't technically support running Nomad inside containers, we shouldn't panic. To better understand the issue, can you dump the contents of /sys/devices/system and share them with us?
Hi @pkazmierczak ,
Thank you for taking a look into this :)
I've attached two tar.gz files: one is from an unprivileged LXC container and one is from a privileged container. Both produce the same error when starting Nomad.
system.tar.gz priv_system.tar.gz
To note, I have tried to manually define the amount of available CPU resources in the client {} block of my configuration file, but Nomad still seems to try to autodetect it. Maybe a configuration option to disable this CPU probing and define the values manually could be an easy fix? Although I'm not experienced with the Nomad code base, so I could be wrong about that.
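For reference, this is roughly what I tried (a sketch; cpu_total_compute is the documented client option for overriding detected CPU compute, but treat the exact values as illustrative):

```hcl
client {
  enabled = true

  # Manually declare total CPU capacity (in MHz) instead of relying on
  # fingerprinting. In 1.7.x the topology scan still seems to run and
  # panic before this override takes effect.
  cpu_total_compute = 8000
}
```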
Thanks :)
I was able to verify that nomad v1.6.4 runs in a Proxmox LXC container. Versions 1.7.0 and 1.7.1 always panic with the above error message.
See the output:
$ wget https://releases.hashicorp.com/nomad/1.6.4/nomad_1.6.4_linux_amd64.zip
$ unzip nomad_1.6.4_linux_amd64.zip
$ ./nomad agent -dev
==> No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:
Advertise Addrs: HTTP: 127.0.0.1:4646; RPC: 127.0.0.1:4647; Serf: 127.0.0.1:4648
Bind Addrs: HTTP: [127.0.0.1:4646]; RPC: 127.0.0.1:4647; Serf: 127.0.0.1:4648
Client: true
Log Level: DEBUG
Node Id: 7696abf3-79c8-a777-6ff2-afaabc5e4a6e
Region: global (DC: dc1)
Server: true
Version: 1.6.4
==> Nomad agent started! Log data will stream in below:
2023-12-09T12:05:36.195+0100 [INFO] nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:bfeca3bf-792e-9cac-aa45-f4ed2deb8959 Address:127.0.0.1:4647}]"
2023-12-09T12:05:36.195+0100 [INFO] nomad: serf: EventMemberJoin: nomad1.global 127.0.0.1
2023-12-09T12:05:36.195+0100 [INFO] nomad: starting scheduling worker(s): num_workers=1 schedulers=["service", "batch", "system", "sysbatch", "_core"]
2023-12-09T12:05:36.195+0100 [DEBUG] nomad: started scheduling worker: id=ae97cccf-4c9c-4086-2d8e-1b2107c56d15 index=1 of=1
2023-12-09T12:05:36.195+0100 [INFO] nomad: started scheduling worker(s): num_workers=1 schedulers=["service", "batch", "system", "sysbatch", "_core"]
2023-12-09T12:05:36.195+0100 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=""
2023-12-09T12:05:36.196+0100 [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2023-12-09T12:05:36.196+0100 [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
2023-12-09T12:05:36.196+0100 [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2023-12-09T12:05:36.196+0100 [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2023-12-09T12:05:36.196+0100 [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2023-12-09T12:05:36.196+0100 [INFO] client: using state directory: state_dir=/tmp/NomadClient1146077166
2023-12-09T12:05:36.196+0100 [INFO] client: using alloc directory: alloc_dir=/tmp/NomadClient735748202
2023-12-09T12:05:36.196+0100 [INFO] client: using dynamic ports: min=20000 max=32000 reserved=""
2023-12-09T12:05:36.196+0100 [DEBUG] client.cpuset.v2: initializing with: cores=11
2023-12-09T12:05:36.203+0100 [INFO] nomad.raft: entering follower state: follower="Node at 127.0.0.1:4647 [Follower]" leader-address= leader-id=
2023-12-09T12:05:36.203+0100 [DEBUG] worker: running: worker_id=ae97cccf-4c9c-4086-2d8e-1b2107c56d15
2023-12-09T12:05:36.203+0100 [INFO] nomad: adding server: server="nomad1.global (Addr: 127.0.0.1:4647) (DC: dc1)"
2023-12-09T12:05:36.203+0100 [DEBUG] nomad.keyring.replicator: starting encryption key replication
2023-12-09T12:05:36.204+0100 [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=["arch", "bridge", "cgroup", "cni", "consul", "cpu", "host", "landlock", "memory", "network", "nomad", "plugins_cni", "signal", "storage", "vault", "env_aws", "env_gce", "env_azure", "env_digitalocean"]
2023-12-09T12:05:36.205+0100 [INFO] client.fingerprint_mgr.cgroup: cgroups are available
2023-12-09T12:05:36.205+0100 [DEBUG] client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config
2023-12-09T12:05:36.205+0100 [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup initial_period=15s
2023-12-09T12:05:36.206+0100 [DEBUG] client.fingerprint_mgr.cpu: detected CPU model: name="AMD Ryzen 5 5600G with Radeon Graphics"
2023-12-09T12:05:36.206+0100 [DEBUG] client.fingerprint_mgr.cpu: detected CPU frequency: mhz=4464
2023-12-09T12:05:36.206+0100 [DEBUG] client.fingerprint_mgr.cpu: detected CPU core count: EXTRA_VALUE_AT_END=1
2023-12-09T12:05:36.206+0100 [DEBUG] client.fingerprint_mgr.cpu: client configuration reserves these cores for node: cores=[]
2023-12-09T12:05:36.206+0100 [DEBUG] client.fingerprint_mgr.cpu: set of reservable cores available for tasks: cores=[11]
2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul initial_period=15s
2023-12-09T12:05:36.213+0100 [WARN] client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="operation not supported"
2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr.network: link speed could not be detected and no speed specified by user, falling back to default speed: interface=lo mbits=1000
2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=lo IP=127.0.0.1
2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=lo IP=::1
2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=lo mbits=1000
2023-12-09T12:05:36.213+0100 [WARN] client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: no such file or directory"
2023-12-09T12:05:36.214+0100 [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault initial_period=15s
2023-12-09T12:05:36.218+0100 [DEBUG] client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type error="Get \"http://169.254.169.254/computeMetadata/v1/instance/machine-type\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2023-12-09T12:05:36.218+0100 [DEBUG] client.fingerprint_mgr.env_gce: error querying GCE Metadata URL, skipping
2023-12-09T12:05:36.219+0100 [DEBUG] client.fingerprint_mgr.env_azure: could not read value for attribute: attribute=compute/azEnvironment error="Get \"http://169.254.169.254/metadata/instance/compute/azEnvironment?api-version=2019-06-04&format=text\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2023-12-09T12:05:36.220+0100 [DEBUG] client.fingerprint_mgr.env_digitalocean: failed to request metadata: attribute=region error="Get \"http://169.254.169.254/metadata/v1/region\": dial tcp 169.254.169.254:80: i/o timeout (Client.Timeout exceeded while awaiting headers)"
2023-12-09T12:05:36.220+0100 [DEBUG] client.fingerprint_mgr: detected fingerprints: node_attrs=["arch", "bridge", "cgroup", "cpu", "host", "network", "nomad", "signal", "storage"]
2023-12-09T12:05:36.220+0100 [INFO] client.plugin: starting plugin manager: plugin-type=csi
2023-12-09T12:05:36.220+0100 [INFO] client.plugin: starting plugin manager: plugin-type=driver
2023-12-09T12:05:36.220+0100 [INFO] client.plugin: starting plugin manager: plugin-type=device
2023-12-09T12:05:36.220+0100 [DEBUG] client.device_mgr: exiting since there are no device plugins
2023-12-09T12:05:36.220+0100 [DEBUG] client.driver_mgr.docker: using client connection initialized from environment: driver=docker
2023-12-09T12:05:36.220+0100 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=java health=undetected description=""
2023-12-09T12:05:36.220+0100 [ERROR] client.driver_mgr.docker: failed to list pause containers for recovery: driver=docker error="Get \"http://unix.sock/containers/json?filters=%7B%22label%22%3A%5B%22com.hashicorp.nomad.alloc_id%22%5D%7D\": dial unix /var/run/docker.sock: connect: no such file or directory"
2023-12-09T12:05:36.220+0100 [DEBUG] client.driver_mgr.docker: could not connect to docker daemon: driver=docker endpoint=unix:///var/run/docker.sock error="Get \"http://unix.sock/version\": dial unix /var/run/docker.sock: connect: no such file or directory"
2023-12-09T12:05:36.220+0100 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=docker health=undetected description="Failed to connect to docker daemon"
2023-12-09T12:05:36.221+0100 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=raw_exec health=healthy description=Healthy
2023-12-09T12:05:36.221+0100 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=exec health=healthy description=Healthy
2023-12-09T12:05:36.221+0100 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=""
2023-12-09T12:05:36.221+0100 [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
2023-12-09T12:05:36.221+0100 [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
2023-12-09T12:05:36.221+0100 [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
2023-12-09T12:05:36.221+0100 [DEBUG] client.driver_mgr: detected drivers: drivers="map[healthy:[exec raw_exec] undetected:[qemu java docker]]"
2023-12-09T12:05:36.221+0100 [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
2023-12-09T12:05:36.221+0100 [DEBUG] client.server_mgr: new server list: new_servers=[127.0.0.1:4647] old_servers=[]
2023-12-09T12:05:36.221+0100 [INFO] client: started client: node_id=a983781e-a587-0c45-8db7-c735d0a07d81
2023-12-09T12:05:36.221+0100 [DEBUG] http: UI is enabled
2023-12-09T12:05:36.221+0100 [DEBUG] http: UI is enabled
See also https://github.com/hashicorp/nomad/issues/19407.
I also wonder if https://github.com/hashicorp/nomad/issues/19406 could be related, given that we're seeing an odd mismatch between CPU values and the number of cores there.
Thanks for checking and looking into this <3 I don't know enough about the codebase to meaningfully contribute, but I'll downgrade to 1.6.4 for now whilst this gets looked into.
Hi folks! Just wanted to give an update on this and the related set of bugs around panics and fingerprinting (#19407, #19372, #19412). @pkazmierczak @shoenig @lgfa29 and I have been working on reproducing all of these issues; they turn out to be interrelated, and which one you hit depends a bit on your particular environment.
Reliable and accurate CPU fingerprinting is highly platform-specific. In Nomad 1.7.0, as part of the work for NUMA support (and to reduce other problems in our CPU accounting), we introduced a new CPU fingerprinting scheme for Linux, where we go through a series of "scanner" options (ref detect_linux.go#L20-L27) of decreasing accuracy until we get a valid CPU topology. Unfortunately, there are a lot of platforms where this scanning still comes up with no valid topology.
Depending on the shape of your particular environment, you might end up with 0 CPUs reported, or you might hit the code path where the scheduler tries to read the nil CPU topology and you get a panic.
Our fix is going to be to reintroduce the less accurate pre-1.7.0 fingerprinting as a final fallback for platforms where we can't make sense of the CPU topology. This means NUMA support won't work in those environments, but those platforms largely appear to be ones where NUMA isn't meaningful (e.g. containers, or hypervisors that split a NUMA node between VMs).
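The scanner-chain-with-fallback idea described above can be sketched like this (a hypothetical illustration; the real implementation lives in Nomad's detect_linux.go, and the names here are not Nomad's actual API):

```go
package main

import "fmt"

// Topology is a stand-in for Nomad's CPU topology type.
type Topology struct {
	Cores int
}

// scanner returns a detected topology, or nil if this method fails
// (e.g. sysfs or cgroup views are unusable inside a container).
type scanner func() *Topology

// detect tries each scanner in decreasing order of accuracy, then falls
// back to a coarse, pre-1.7.0-style fingerprint so callers never receive
// a nil topology (reading the nil topology was the source of the panic).
func detect(scanners []scanner, fallback Topology) *Topology {
	for _, scan := range scanners {
		if top := scan(); top != nil && top.Cores > 0 {
			return top
		}
	}
	fb := fallback
	return &fb
}

func main() {
	// In a container, both of the accurate scans may come up empty.
	sysfsScan := func() *Topology { return nil }
	cgroupScan := func() *Topology { return nil }

	top := detect([]scanner{sysfsScan, cgroupScan}, Topology{Cores: 1})
	fmt.Println(top.Cores) // prints 1: falls back rather than panicking
}
```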
We're working on the patch for this now and will ship it as soon as it's available. Thanks for all the detailed reports, and for your patience while we work through this.
Our patches for this issue have landed and will be shipped in Nomad 1.7.2 shortly (likely tomorrow).
Nomad 1.7.2 has shipped, so I'm going to close this issue out. If you're still seeing this after 1.7.2, please let us know and we can reopen. Thanks!
Hey there 🖖 Seems like it's still not as smooth as it should be, using v1.7.7:
Error starting agent: client setup failed: failed to initialize process manager: failed to write root partition cpuset: write /sys/fs/cgroup/nomad.slice/cpuset.cpus: device or resource busy
client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="operation not supported"
Is this related at all?
Cheers,
@guyteubbe please open a new issue with all the appropriate details. Thanks!
@tgross hey 🖖 Sorry, that was a weird situation where I needed to reboot the LXC containers after an upgrade from 1.6.4 to 1.7.7. Nomad was restarted, but a reboot was still needed for the errors to go away... weird... Don't pay attention to my message :)
Cheers,
Nomad version
Nomad v1.7.0 BuildDate 2023-12-07T08:28:54Z Revision e4150e9703f3be6ee2339f0e45ff0801186e022b
Operating system and Environment details
Issue
When attempting to run Nomad inside an LXC container hosted on Proxmox VE, I receive a panic relating to numalib; I assume it is unable to fetch some data about the CPU.
Reproduction steps
nomad agent -dev
Expected Result
Nomad starts as expected
Actual Result
Job file (if appropriate)
N/A