hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad depends on gnu implementation of df #12440

Open finwo opened 2 years ago

finwo commented 2 years ago

Nomad version

1.2.6, freshly downloaded from https://releases.hashicorp.com/nomad/1.2.6/nomad_1.2.6_linux_amd64.zip

Operating system and Environment details

Minimal linux kernel with custom init, built & run using make test from https://github.com/finwo/nomados/

Start command:

Config:

{
    "data_dir": "/var/lib/nomad/data",
    "datacenter": "dc1",
    "client": {
        "enabled": true,
        "options": {
            "user.denylist": "",
            "user.checked_drivers": "exec,raw_exec"
        },
        "server_join": {
            "retry_join": ["nomad", "10.1.1.80"],
            "retry_max": 10,
            "retry_interval": "1s"
        },
        "state_dir": "/var/lib/nomad/client"
    },
    "server": {
        "enabled": false
    },
    "log_level": "TRACE",
    "bind_addr": "0.0.0.0"
}

Issue

agent: error starting agent: error="client setup failed: fingerprinting failed: failed to determine disk space for /var/lib/nomad/data/alloc: failed to determine mount point"

Nomad throws an error early during startup as shown above, stating it can't determine the available disk space or mount point, when running in a minimal environment on ramdisk (as in the linked repository).

Reproduction steps

Clone the linked repository, run make test in the root of the repository, and wait for it to build, start, and crash.

Expected Result

A virtual machine running as a nomad client which joins the server that's already running in my network

Actual Result

After getting an IP and starting nomad, nomad throws an error (agent: error starting agent: error="client setup failed: fingerprinting failed: failed to determine disk space for /var/lib/nomad/data/alloc: failed to determine mount point") stating it can't determine the mount point for its alloc directory.

The problem arises both with and without the -dev flag, although without the flag it depends on /sbin/ip to be present before the error is thrown.

Job file (if appropriate)

n/a

Nomad Server logs (if appropriate)

n/a

Nomad Client logs (if appropriate)

==> Loaded configuration from /etc/nomad/init.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: fingerprinting failed: failed to determine disk space for /var/lib/nomad/data/alloc: failed to determine mount point for /var/lib/nomad/data/alloc
    2022-04-02T17:15:16.255Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/var/lib/nomad/data/plugins
    2022-04-02T17:15:16.256Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/var/lib/nomad/data/plugins
    2022-04-02T17:15:16.256Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/var/lib/nomad/data/plugins
    2022-04-02T17:15:16.256Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2022-04-02T17:15:16.256Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2022-04-02T17:15:16.256Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2022-04-02T17:15:16.256Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2022-04-02T17:15:16.256Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2022-04-02T17:15:16.256Z [INFO]  client: using state directory: state_dir=/var/lib/nomad/client
    2022-04-02T17:15:16.256Z [INFO]  client: using alloc directory: alloc_dir=/var/lib/nomad/data/alloc
    2022-04-02T17:15:16.256Z [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
    2022-04-02T17:15:16.256Z [WARN]  client: could not initialize cpuset cgroup subsystem, cpuset management disabled: error="not implemented for cgroup v2 unified hierarchy"
    2022-04-02T17:15:16.256Z [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=["arch", "bridge", "cgroup", "cni", "consul", "cpu", "host", "memory", "network", "nomad", "signal", "storage",]
    2022-04-02T17:15:16.257Z [WARN]  client.fingerprint_mgr: failed to detect bridge kernel module, bridge network mode disabled:
  error=
  | 3 errors occurred:
  |     * failed to open /proc/modules: open /proc/modules: no such file or directory
  |     * failed to open /lib/modules/5.17.0+/modules.builtin: open /lib/modules/5.17.0+/modules.builtin: no such file or directory
  |     * failed to open /lib/modules/5.17.0+/modules.dep: open /lib/modules/5.17.0+/modules.dep: no such file or directory
  | 

    2022-04-02T17:15:16.257Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
    2022-04-02T17:15:16.257Z [DEBUG] client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config
    2022-04-02T17:15:16.257Z [TRACE] agent.plugin_loader.exec: task event loop shutdown: plugin_dir=/var/lib/nomad/data/plugins
    2022-04-02T17:15:16.257Z [TRACE] agent.plugin_loader.qemu: task event loop shutdown: plugin_dir=/var/lib/nomad/data/plugins
    2022-04-02T17:15:16.257Z [TRACE] agent.plugin_loader.java: task event loop shutdown: plugin_dir=/var/lib/nomad/data/plugins
    2022-04-02T17:15:16.257Z [TRACE] agent.plugin_loader.docker: task event loop shutdown: plugin_dir=/var/lib/nomad/data/plugins
    2022-04-02T17:15:16.257Z [TRACE] agent.plugin_loader.raw_exec: task event loop shutdown: plugin_dir=/var/lib/nomad/data/plugins
    2022-04-02T17:15:16.257Z [TRACE] agent.plugin_loader.java: task event loop shutdown: plugin_dir=/var/lib/nomad/data/plugins
    2022-04-02T17:15:16.257Z [TRACE] agent.plugin_loader.docker: task event loop shutdown: plugin_dir=/var/lib/nomad/data/plugins
    2022-04-02T17:15:16.257Z [TRACE] agent.plugin_loader.raw_exec: task event loop shutdown: plugin_dir=/var/lib/nomad/data/plugins
    2022-04-02T17:15:16.257Z [TRACE] agent.plugin_loader.exec: task event loop shutdown: plugin_dir=/var/lib/nomad/data/plugins
    2022-04-02T17:15:16.257Z [TRACE] agent.plugin_loader.qemu: task event loop shutdown: plugin_dir=/var/lib/nomad/data/plugins
    2022-04-02T17:15:16.257Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup period=15s
    2022-04-02T17:15:16.257Z [DEBUG] client.fingerprint_mgr.cpu: detected cpu frequency: MHz=1608
    2022-04-02T17:15:16.257Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=1
    2022-04-02T17:15:16.257Z [WARN]  client.fingerprint_mgr.cpu: failed to detect set of reservable cores: error="not implemented for cgroup v2 unified hierarchy"
    2022-04-02T17:15:16.258Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul period=15s
    2022-04-02T17:15:16.258Z [DEBUG] client.fingerprint_mgr.network: link speed detected: interface=eth0 mbits=1000
    2022-04-02T17:15:16.258Z [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=eth0 IP=10.0.2.15
    2022-04-02T17:15:16.258Z [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=eth0 IP=fec0::5054:ff:fe12:3456
    2022-04-02T17:15:16.258Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
    2022-04-02T17:15:16.258Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=lo mbits=1000
    2022-04-02T17:15:16.261Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/sit0/speed device=sit0
    2022-04-02T17:15:16.261Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=sit0 mbits=1000
    2022-04-02T17:15:16.261Z [ERROR] agent: error starting agent: error="client setup failed: fingerprinting failed: failed to determine disk space for /var/lib/nomad/data/alloc: failed to determine mount point"
shoenig commented 2 years ago

Hi @finwo thanks for the report, nomados looks like an interesting project!

The root cause here is likely the lack of a df executable, which is used for finding the mount point of the data directory:

https://github.com/hashicorp/nomad/blob/v1.2.6/client/fingerprint/storage_unix.go#L32

If there's a better way to get the same information without resorting to exec, we can definitely look into incorporating it.
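For readers following along, here is a minimal sketch of what parsing `df -kP` output involves. It is not Nomad's actual code: `parseDfPortable` is a hypothetical name, and the column layout is an assumption taken from the POSIX portable (`-P`) format (Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on). A df implementation that prints a different layout would fail this kind of parse.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDfPortable is a hypothetical helper (not Nomad's actual code). It reads
// the first data row of `df -kP <path>` output and returns the filesystem name
// plus total and available space in bytes.
func parseDfPortable(out string) (volume string, total, free uint64, err error) {
	lines := strings.Split(strings.TrimSpace(out), "\n")
	if len(lines) < 2 {
		return "", 0, 0, fmt.Errorf("expected a header and a data row, got %d line(s)", len(lines))
	}
	// Assumed POSIX -P columns:
	// Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on.
	fields := strings.Fields(lines[1])
	if len(fields) < 6 {
		return "", 0, 0, fmt.Errorf("expected at least 6 columns, got %d", len(fields))
	}
	totalKB, err := strconv.ParseUint(fields[1], 10, 64)
	if err != nil {
		return "", 0, 0, fmt.Errorf("bad total %q: %w", fields[1], err)
	}
	freeKB, err := strconv.ParseUint(fields[3], 10, 64)
	if err != nil {
		return "", 0, 0, fmt.Errorf("bad available %q: %w", fields[3], err)
	}
	// -k reports 1024-byte blocks, so convert to bytes.
	return fields[0], totalKB * 1024, freeKB * 1024, nil
}

func main() {
	// Made-up sample output, for illustration only.
	sample := "Filesystem 1024-blocks Used Available Capacity Mounted on\n" +
		"tmpfs 1024000 2048 1021952 1% /var/lib/nomad\n"
	vol, total, free, err := parseDfPortable(sample)
	fmt.Println(vol, total, free, err)
}
```

The sample row in `main` is invented; on a real system the input would come from running df against the alloc directory.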

danishprakash commented 2 years ago

If, as the function name suggests, we want free space, we can maybe try the statfs syscall? statx, on the other hand, can give us the mount ID, but it seems like we needed that above just for parsing the output from df.
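A rough sketch of the statfs idea, assuming Linux and using only the standard library's `syscall.Statfs`. The helper name `diskSpace` is hypothetical and this is not a patch against Nomad. It reports total and available bytes without running any external binary. It does not return a device or mount name, so whether that is enough for the storage fingerprinter is an open question.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// diskSpace is a hypothetical helper that asks the kernel for filesystem
// statistics via statfs(2) instead of shelling out to df. Linux-only sketch.
func diskSpace(path string) (total, free uint64, err error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, 0, fmt.Errorf("statfs %s: %w", path, err)
	}
	// The type of Bsize differs between architectures, so convert explicitly.
	bsize := uint64(st.Bsize)
	// Bavail counts blocks available to unprivileged users.
	return st.Blocks * bsize, st.Bavail * bsize, nil
}

func main() {
	path := "/"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	total, free, err := diskSpace(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s: total=%d bytes, available=%d bytes\n", path, total, free)
}
```

Passing a directory as the first argument (for example the alloc directory from the report) checks that path instead of `/`.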

What do you think @shoenig?

shoenig commented 2 years ago

Sounds reasonable @danishprakash. Of course the best way to know would be to try it :slightly_smiling_face:

finwo commented 2 years ago

@shoenig Just tested if adding a df binary worked.

When using the df binary from ubase, the error remained. Using the df binary from my host system (void linux), nomad started up and registered successfully as a client with my previously-existing server.

fosslinux commented 2 years ago

I can confirm this issue when using busybox's df. It appears that the functionality of busybox's df is not equivalent in some way to coreutils df.

finwo commented 2 years ago

Is someone internally already working on it or should I give it a go with statfs/statx?

shoenig commented 2 years ago

Hey @finwo feel free to work on this!

ryanbreed commented 1 year ago

hackety hack config in client stanza:

  options {
    "fingerprint.denylist" = "storage"
  }

got my agent starting.

For some reason, fingerprinting only failed when starting under systemd on EL8. Didn't bother digging deeper.

bjconlan commented 5 months ago

I'm not sure if this will help your case, but I'm building this using busybox (for the moment, for debugging) and noticed that the call to df was actually failing because os/exec could not resolve it from the PATH.

(Note that in busybox, echo $PATH does show /sbin:/usr/sbin:/bin:/usr/bin, but this isn't set as the path for other processes and won't be seen when running env, so you need to explicitly export it in the shell so it is available in the environment.)
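A quick way to check for this kind of PATH problem is a small Go program that uses the standard `exec.LookPath`, which resolves a bare command name against the PATH in the process's environment. The helper name `resolveTool` is made up for this sketch; run it from the same environment that starts the agent.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// resolveTool is a hypothetical helper: it reports where os/exec would find a
// bare command name, and includes the process's PATH in the error if it cannot.
func resolveTool(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("cannot resolve %q with PATH=%q: %w", name, os.Getenv("PATH"), err)
	}
	return path, nil
}

func main() {
	path, err := resolveTool("df")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("df resolves to", path)
}
```

If the lookup fails and the printed PATH is empty, exporting PATH in the init script before starting the agent is the first thing to try.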

In any case, this does look to work as expected when using busybox's df implementation. Also note that I've called mount -t tmpfs tmpfs /tmp (among other things), as the code calls something like df -kP /tmp/NomadClient9823174204 when using nomad agent -dev. If you instead use the provided /etc/config/init.json, the state_dir specified in the original config is used, so /var/lib/nomad will need to be mounted as appropriate for the df call in storage fingerprinting to resolve.