hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Panic when running Nomad inside of Proxmox LXC container #19372

Closed codingJWilliams closed 10 months ago

codingJWilliams commented 10 months ago

Nomad version

Nomad v1.7.0 BuildDate 2023-12-07T08:28:54Z Revision e4150e9703f3be6ee2339f0e45ff0801186e022b

Operating system and Environment details

root@services01:~# uname -a
Linux services01 6.5.11-6-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-6 (2023-11-29T08:32Z) x86_64 GNU/Linux
root@services01:~# cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Issue

When attempting to run Nomad inside an LXC container hosted on Proxmox VE, I receive a panic from numalib; I assume it is unable to fetch some data about the CPU.

Reproduction steps

  1. Install Proxmox VE following all default options.
  2. Create an LXC container with the latest Debian 12.
  3. Install the HashiCorp apt repo & GPG key.
  4. Install Nomad with apt.
  5. Run nomad agent -dev.

Expected Result

Nomad starts as expected

Actual Result

root@services01:~# nomad agent -dev
==> No configuration files loaded
==> Starting Nomad agent...
panic: runtime error: index out of range [2] with length 2

goroutine 1 [running]:
github.com/hashicorp/nomad/client/lib/numalib.(*Topology).insert(...)
        github.com/hashicorp/nomad/client/lib/numalib/topology.go:156
github.com/hashicorp/nomad/client/lib/numalib.(*Sysfs).discoverCores.func2.1(0x2)
        github.com/hashicorp/nomad/client/lib/numalib/detect_linux.go:128 +0x233
github.com/hashicorp/nomad/client/lib/idset.(*Set[...]).ForEach(0xc000be7230?, 0xc000b9eba0?)
        github.com/hashicorp/nomad/client/lib/idset/idset.go:181 +0x18c
github.com/hashicorp/nomad/client/lib/numalib.(*Sysfs).discoverCores.func2(0x0)
        github.com/hashicorp/nomad/client/lib/numalib/detect_linux.go:122 +0xd5
github.com/hashicorp/nomad/client/lib/idset.(*Set[...]).ForEach(0x30de02a?, 0xc000b9ed08?)
        github.com/hashicorp/nomad/client/lib/idset/idset.go:181 +0x18b
github.com/hashicorp/nomad/client/lib/numalib.(*Sysfs).discoverCores(0x37c2f60, 0xc0005480f0)
        github.com/hashicorp/nomad/client/lib/numalib/detect_linux.go:115 +0xcd
github.com/hashicorp/nomad/client/lib/numalib.(*Sysfs).ScanSystem(0xc000b9ee58?, 0xc0005480f0)
        github.com/hashicorp/nomad/client/lib/numalib/detect_linux.go:55 +0x7f
github.com/hashicorp/nomad/client/lib/numalib.Scan(...)
        github.com/hashicorp/nomad/client/lib/numalib/detect.go:23
github.com/hashicorp/nomad/client/fingerprint.(*CPUFingerprint).initialize(0xc00083f240, 0xc000c073c0)
        github.com/hashicorp/nomad/client/fingerprint/cpu.go:90 +0x487
github.com/hashicorp/nomad/client/fingerprint.(*CPUFingerprint).Fingerprint(0xc00083f240, 0x380d150?, 0xc000c50270)
        github.com/hashicorp/nomad/client/fingerprint/cpu.go:40 +0x25
github.com/hashicorp/nomad/client.(*FingerprintManager).fingerprint(0xc000b78d20, {0x3?, 0x380d150?}, {0x37daa50, 0xc00083f240})
        github.com/hashicorp/nomad/client/fingerprint_manager.go:200 +0xd2
github.com/hashicorp/nomad/client.(*FingerprintManager).setupFingerprinters(0xc000b78d20, {0xc0009bae00?, 0x13, 0x10?})
        github.com/hashicorp/nomad/client/fingerprint_manager.go:143 +0xfb
github.com/hashicorp/nomad/client.(*FingerprintManager).Run(0xc000b78d20)
        github.com/hashicorp/nomad/client/fingerprint_manager.go:107 +0x49b
github.com/hashicorp/nomad/client.NewClient(0xc000aba000, {0x37d7ff8?, 0xc0009de018}, 0xc0009c40e0, {0x37ecd00?, 0xc00007bd80}, 0x1?)
        github.com/hashicorp/nomad/client/client.go:467 +0x1196
github.com/hashicorp/nomad/command/agent.(*Agent).setupClient(0xc000ad0400)
        github.com/hashicorp/nomad/command/agent/agent.go:1131 +0x2cc
github.com/hashicorp/nomad/command/agent.NewAgent(0xc00056afc0, {0x38140e8?, 0xc000abe9c0}, {0x37c7340?, 0xc000a88300}, 0xc0000aad20)
        github.com/hashicorp/nomad/command/agent/agent.go:160 +0x1f3
github.com/hashicorp/nomad/command/agent.(*Command).setupAgent(0xc000544e00, 0xc00056afc0, {0x38140e8, 0xc000abe9c0}, {0x37c7340, 0xc000a88300}, 0x0?)
        github.com/hashicorp/nomad/command/agent/command.go:596 +0xa5
github.com/hashicorp/nomad/command/agent.(*Command).Run(0xc000544e00, {0xc0000527d0, 0x1, 0x1})
        github.com/hashicorp/nomad/command/agent/command.go:809 +0x61b
github.com/mitchellh/cli.(*CLI).Run(0xc0000cb2c0)
        github.com/mitchellh/cli@v1.1.5/cli.go:262 +0x5b8
main.Run({0xc0000527c0, 0x2, 0x2})
        github.com/hashicorp/nomad/main.go:110 +0x228
main.main()
        github.com/hashicorp/nomad/main.go:80 +0x45
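
For illustration, here is a minimal, hypothetical Go sketch of the failure class the trace points at (this is not Nomad's actual code): a table is sized from the container's restricted view of the CPUs but indexed by IDs taken from the host-wide view, which reproduces the same runtime error.

package main

import "fmt"

// Hypothetical sketch, not Nomad's actual code: inside an LXC container
// the cpuset-restricted view and the host-wide sysfs view of the CPUs
// can disagree, so an index derived from one runs off a slice sized
// from the other.
func main() {
    visibleCores := 2 // e.g. the container's cpuset allows two cores
    table := make([]int, visibleCores)

    hostCoreIDs := []int{0, 2} // host core IDs are sparse, not 0..N-1
    for _, id := range hostCoreIDs {
        table[id]++ // panics at id=2: index out of range [2] with length 2
    }
    fmt.Println(table) // never reached
}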

Job file (if appropriate)

N/A

livioribeiro commented 10 months ago

The problem also happens with LXD on an Ubuntu 23.10 host with an Ubuntu 22.04 guest.

pkazmierczak commented 10 months ago

Hey @codingJWilliams and @livioribeiro, thanks for reporting this. Even though we don't technically support running Nomad inside containers, we shouldn't panic. To better understand the issue, can you dump the contents of /sys/devices/system and share it with us?
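
For anyone wanting to spot-check the same data without a full dump, a rough Go sketch along these lines (not Nomad's scanner; the paths are standard sysfs locations) prints the files most relevant to CPU/NUMA fingerprinting:

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// Rough sketch, not Nomad's scanner: print the standard sysfs files
// that CPU/NUMA detection reads; a tarball of /sys/devices/system
// captures the same data.
func main() {
    show := func(path string) {
        b, err := os.ReadFile(path)
        if err != nil {
            fmt.Printf("%s: %v\n", path, err)
            return
        }
        fmt.Printf("%s: %s", path, b)
    }

    show("/sys/devices/system/cpu/online")
    show("/sys/devices/system/node/online")

    // Per-node core lists; these can be absent or surprising in containers.
    paths, _ := filepath.Glob("/sys/devices/system/node/node*/cpulist")
    for _, p := range paths {
        show(p)
    }
}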

codingJWilliams commented 10 months ago

Hi @pkazmierczak ,

Thank you for taking a look into this :)

I've attached two tar.gz files: one is from an 'unprivileged' LXC container and one is from a privileged container. Both produce the same error when booting Nomad.

system.tar.gz priv_system.tar.gz

Of note: I have tried to manually define the amount of available CPU resources in the client {} block of my configuration file, but Nomad still seems to try to autodetect it. Maybe a configuration option to disable this CPU probing and define the values manually could be an easy fix? Although I'm not experienced with the Nomad codebase, so I could be wrong about that.

Thanks :)

maveonair commented 10 months ago

I was able to verify that Nomad v1.6.4 runs in a Proxmox LXC container, while versions 1.7.0 and 1.7.1 always panic with the error message above.

See the output message:

$  wget https://releases.hashicorp.com/nomad/1.6.4/nomad_1.6.4_linux_amd64.zip
$ unzip nomad_1.6.4_linux_amd64.zip
$ ./nomad agent -dev
==> No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 127.0.0.1:4646; RPC: 127.0.0.1:4647; Serf: 127.0.0.1:4648
            Bind Addrs: HTTP: [127.0.0.1:4646]; RPC: 127.0.0.1:4647; Serf: 127.0.0.1:4648
                Client: true
             Log Level: DEBUG
               Node Id: 7696abf3-79c8-a777-6ff2-afaabc5e4a6e
                Region: global (DC: dc1)
                Server: true
               Version: 1.6.4

==> Nomad agent started! Log data will stream in below:

    2023-12-09T12:05:36.195+0100 [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:bfeca3bf-792e-9cac-aa45-f4ed2deb8959 Address:127.0.0.1:4647}]"
    2023-12-09T12:05:36.195+0100 [INFO]  nomad: serf: EventMemberJoin: nomad1.global 127.0.0.1
    2023-12-09T12:05:36.195+0100 [INFO]  nomad: starting scheduling worker(s): num_workers=1 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2023-12-09T12:05:36.195+0100 [DEBUG] nomad: started scheduling worker: id=ae97cccf-4c9c-4086-2d8e-1b2107c56d15 index=1 of=1
    2023-12-09T12:05:36.195+0100 [INFO]  nomad: started scheduling worker(s): num_workers=1 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2023-12-09T12:05:36.195+0100 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=""
    2023-12-09T12:05:36.196+0100 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2023-12-09T12:05:36.196+0100 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2023-12-09T12:05:36.196+0100 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2023-12-09T12:05:36.196+0100 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2023-12-09T12:05:36.196+0100 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2023-12-09T12:05:36.196+0100 [INFO]  client: using state directory: state_dir=/tmp/NomadClient1146077166
    2023-12-09T12:05:36.196+0100 [INFO]  client: using alloc directory: alloc_dir=/tmp/NomadClient735748202
    2023-12-09T12:05:36.196+0100 [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
    2023-12-09T12:05:36.196+0100 [DEBUG] client.cpuset.v2: initializing with: cores=11
    2023-12-09T12:05:36.203+0100 [INFO]  nomad.raft: entering follower state: follower="Node at 127.0.0.1:4647 [Follower]" leader-address= leader-id=
    2023-12-09T12:05:36.203+0100 [DEBUG] worker: running: worker_id=ae97cccf-4c9c-4086-2d8e-1b2107c56d15
    2023-12-09T12:05:36.203+0100 [INFO]  nomad: adding server: server="nomad1.global (Addr: 127.0.0.1:4647) (DC: dc1)"
    2023-12-09T12:05:36.203+0100 [DEBUG] nomad.keyring.replicator: starting encryption key replication
    2023-12-09T12:05:36.204+0100 [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=["arch", "bridge", "cgroup", "cni", "consul", "cpu", "host", "landlock", "memory", "network", "nomad", "plugins_cni", "signal", "storage", "vault", "env_aws", "env_gce", "env_azure", "env_digitalocean"]
    2023-12-09T12:05:36.205+0100 [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
    2023-12-09T12:05:36.205+0100 [DEBUG] client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config
    2023-12-09T12:05:36.205+0100 [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup initial_period=15s
    2023-12-09T12:05:36.206+0100 [DEBUG] client.fingerprint_mgr.cpu: detected CPU model: name="AMD Ryzen 5 5600G with Radeon Graphics"
    2023-12-09T12:05:36.206+0100 [DEBUG] client.fingerprint_mgr.cpu: detected CPU frequency: mhz=4464
    2023-12-09T12:05:36.206+0100 [DEBUG] client.fingerprint_mgr.cpu: detected CPU core count: EXTRA_VALUE_AT_END=1
    2023-12-09T12:05:36.206+0100 [DEBUG] client.fingerprint_mgr.cpu: client configuration reserves these cores for node: cores=[]
    2023-12-09T12:05:36.206+0100 [DEBUG] client.fingerprint_mgr.cpu: set of reservable cores available for tasks: cores=[11]
    2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul initial_period=15s
    2023-12-09T12:05:36.213+0100 [WARN]  client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="operation not supported"
    2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
    2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr.network: link speed could not be detected and no speed specified by user, falling back to default speed: interface=lo mbits=1000
    2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=lo IP=127.0.0.1
    2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=lo IP=::1
    2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
    2023-12-09T12:05:36.213+0100 [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=lo mbits=1000
    2023-12-09T12:05:36.213+0100 [WARN]  client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: no such file or directory"
    2023-12-09T12:05:36.214+0100 [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault initial_period=15s
    2023-12-09T12:05:36.218+0100 [DEBUG] client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type error="Get \"http://169.254.169.254/computeMetadata/v1/instance/machine-type\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2023-12-09T12:05:36.218+0100 [DEBUG] client.fingerprint_mgr.env_gce: error querying GCE Metadata URL, skipping
    2023-12-09T12:05:36.219+0100 [DEBUG] client.fingerprint_mgr.env_azure: could not read value for attribute: attribute=compute/azEnvironment error="Get \"http://169.254.169.254/metadata/instance/compute/azEnvironment?api-version=2019-06-04&format=text\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2023-12-09T12:05:36.220+0100 [DEBUG] client.fingerprint_mgr.env_digitalocean: failed to request metadata: attribute=region error="Get \"http://169.254.169.254/metadata/v1/region\": dial tcp 169.254.169.254:80: i/o timeout (Client.Timeout exceeded while awaiting headers)"
    2023-12-09T12:05:36.220+0100 [DEBUG] client.fingerprint_mgr: detected fingerprints: node_attrs=["arch", "bridge", "cgroup", "cpu", "host", "network", "nomad", "signal", "storage"]
    2023-12-09T12:05:36.220+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
    2023-12-09T12:05:36.220+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
    2023-12-09T12:05:36.220+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=device
    2023-12-09T12:05:36.220+0100 [DEBUG] client.device_mgr: exiting since there are no device plugins
    2023-12-09T12:05:36.220+0100 [DEBUG] client.driver_mgr.docker: using client connection initialized from environment: driver=docker
    2023-12-09T12:05:36.220+0100 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=java health=undetected description=""
    2023-12-09T12:05:36.220+0100 [ERROR] client.driver_mgr.docker: failed to list pause containers for recovery: driver=docker error="Get \"http://unix.sock/containers/json?filters=%7B%22label%22%3A%5B%22com.hashicorp.nomad.alloc_id%22%5D%7D\": dial unix /var/run/docker.sock: connect: no such file or directory"
    2023-12-09T12:05:36.220+0100 [DEBUG] client.driver_mgr.docker: could not connect to docker daemon: driver=docker endpoint=unix:///var/run/docker.sock error="Get \"http://unix.sock/version\": dial unix /var/run/docker.sock: connect: no such file or directory"
    2023-12-09T12:05:36.220+0100 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=docker health=undetected description="Failed to connect to docker daemon"
    2023-12-09T12:05:36.221+0100 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=raw_exec health=healthy description=Healthy
    2023-12-09T12:05:36.221+0100 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=exec health=healthy description=Healthy
    2023-12-09T12:05:36.221+0100 [DEBUG] client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=""
    2023-12-09T12:05:36.221+0100 [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
    2023-12-09T12:05:36.221+0100 [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
    2023-12-09T12:05:36.221+0100 [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
    2023-12-09T12:05:36.221+0100 [DEBUG] client.driver_mgr: detected drivers: drivers="map[healthy:[exec raw_exec] undetected:[qemu java docker]]"
    2023-12-09T12:05:36.221+0100 [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
    2023-12-09T12:05:36.221+0100 [DEBUG] client.server_mgr: new server list: new_servers=[127.0.0.1:4647] old_servers=[]
    2023-12-09T12:05:36.221+0100 [INFO]  client: started client: node_id=a983781e-a587-0c45-8db7-c735d0a07d81
    2023-12-09T12:05:36.221+0100 [DEBUG] http: UI is enabled
    2023-12-09T12:05:36.221+0100 [DEBUG] http: UI is enabled

tgross commented 10 months ago

See also https://github.com/hashicorp/nomad/issues/19407.

I also wonder if https://github.com/hashicorp/nomad/issues/19406 could be related, given that we're seeing an odd mismatch between CPU values and the number of cores there.

codingJWilliams commented 10 months ago

Thanks for checking and looking into this <3 I don't know enough about the codebase to meaningfully contribute, but I'll downgrade to 1.6.4 for now whilst this gets looked into.

tgross commented 10 months ago

Hi folks! Just wanted to give an update on this and the related set of bugs around panics and fingerprinting (#19407, #19372, #19412). @pkazmierczak @shoenig @lgfa29 and I have been working on reproducing all of these issues; they turn out to be interrelated, and which one you hit depends a bit on your particular environment.

Reliable and accurate CPU fingerprinting is highly platform-specific. In Nomad 1.7.0, as part of the work for NUMA support (and to reduce other problems in our CPU accounting), we introduced a new CPU fingerprinting scheme for Linux, where we go through a series of "scanner" options (ref detect_linux.go#L20-L27) of decreasing accuracy until we get a valid CPU topology. Unfortunately, there are a lot of platforms where this scanning still comes up with no valid topology.
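
The shape of that scheme is roughly the following (a simplified sketch; names and types are illustrative, not Nomad's exact API):

package main

import "fmt"

// Each scanner fills in whatever part of the topology it can, in
// decreasing order of accuracy; detection can still come up empty.
type Topology struct{ Cores, MHz int }

func (t *Topology) valid() bool { return t.Cores > 0 && t.MHz > 0 }

type scanner interface{ scan(*Topology) }

func detect(scanners ...scanner) *Topology {
    top := &Topology{}
    for _, s := range scanners {
        s.scan(top)
    }
    if !top.valid() {
        return nil // the container case: no usable topology detected
    }
    return top
}

func main() {
    // With no scanner succeeding, detection yields nil; downstream code
    // that reads the topology without checking is where the panics
    // come from.
    fmt.Println(detect() == nil) // true
}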

Depending on the shape of your particular environment, you might end up with 0 CPU reported, or you might hit the code path where the scheduler tries to read the nil CPU topology and you get a panic.

Our fix for this is going to be to reintroduce the less accurate pre-1.7.0 fingerprinting as a final fallback for those platforms where we can't make sense of the CPU topology. This means NUMA support won't work in those environments, but largely those appear to be platforms where NUMA isn't meaningful (e.g. containers, or hypervisors that cut up a NUMA core between VMs).
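
Sketched out, the fallback idea looks something like this (illustrative only, not the actual patch; Topology is the same hypothetical type as in the sketch above):

package main

import (
    "fmt"
    "runtime"
)

type Topology struct{ Cores, MHz int }

func (t *Topology) valid() bool { return t.Cores > 0 && t.MHz > 0 }

// fallbackScanner only acts when the more accurate detectors produced
// nothing usable, filling in coarse values with no NUMA layout so
// clients in containers can still start.
type fallbackScanner struct{}

func (fallbackScanner) scan(t *Topology) {
    if t.valid() {
        return // a more accurate scanner already succeeded
    }
    t.Cores = runtime.NumCPU() // coarse: logical CPU count only
    t.MHz = 1000               // placeholder; e.g. parsed from /proc/cpuinfo
}

func main() {
    top := &Topology{}
    fallbackScanner{}.scan(top)
    fmt.Printf("cores=%d mhz=%d\n", top.Cores, top.MHz)
}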

We're working on the patch for this now and will ship it as soon as it's available. Thanks for all the detailed reports, and for your patience while we work through this.

tgross commented 10 months ago

Our patches for this issue have landed and will be shipped in Nomad 1.7.2 shortly (likely tomorrow).

tgross commented 10 months ago

Nomad 1.7.2 has shipped. I'm going to close this issue out as shipped. If you're still seeing this after 1.7.2, please let us know and we can reopen. Thanks!

guyteubbe commented 4 months ago

Hey there 🖖 It seems like this is still not as smooth as it should be. Using v1.7.7:

Error starting agent: client setup failed: failed to initialize process manager: failed to write root partition cpuset: write /sys/fs/cgroup/nomad.slice/cpuset.cpus: device or resource busy

client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="operation not supported"

Is this related, or not at all?

Thanks,

Cheers,

tgross commented 4 months ago

@guyteubbe please open a new issue with all the appropriate details. Thanks!

guyteubbe commented 4 months ago

@tgross hey 🖖 Sorry, that was a weird situation where I needed to reboot the LXC containers after an upgrade from 1.6.4 to 1.7.7. Nomad was restarted, but the reboot was still needed for the errors to go away... weird. Don't pay attention to my message :)

Cheers,