hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.87k stars 1.95k forks source link

Nomad Upgrade Error: network interface fingerprinting failed [CLIENT] #20068

Closed ctmiller8 closed 6 months ago

ctmiller8 commented 7 months ago

Nomad version

Upgraded from v1.6.1 to v1.6.2 successfully but we experienced the issue with all recent releases up to v1.7.4 (minus 1.6.2).

Operating system and Environment details

Ubuntu 20.04.6 LTS (Nomad servers and Linux clients) Windows clients (Version 10.0.19045.4046) Docker version 24.0.7

Issue

When upgrading both Linux and Windows clients to any release beyond v1.6.2 the client fails to register with the cluster due to the below error message:

[ERROR] agent: error starting agent: error="client setup failed: fingerprinting failed: Error while detecting network interface during fingerprinting: fork/exec /sbin/ip: no such file or directory"

The above error DOES NOT impact Nomad server nodes during upgrade to any recent release ONLY client nodes are impacted.

Reproduction steps

  1. Upgrade Nomad Linux/Windows clients to any recent release beyond v1.6.2

We've reproduced this issue on the below releases:

1.7.4 1.7.3 1.7.2 1.7.1 1.7.0 1.6.7 1.6.6 1.6.5 1.6.4 1.6.3

Expected Result

Client(s) are upgraded to the target version and register successfully with the cluster without error.

Actual Result

[ERROR] agent: error starting agent: error="client setup failed: fingerprinting failed: Error while detecting network interface during fingerprinting: fork/exec /sbin/ip: no such file or directory"

Job file (if appropriate)

N/A

Nomad Server logs (if appropriate)

2024-03-05T13:20:03.905Z [DEBUG] http: request complete: method=GET path=/v1/metrics?format=prometheus duration=1.795404ms 2024-03-05T13:20:07.132Z [DEBUG] worker: dequeued evaluation: worker_id=f07e9b4f-cc31-bf8a-2a38-5294f8830c0a eval_id=0b080961-1c39-7e0f-8a26-72053aefc772 type=service namespace=npe-redis job_id=redis-replica-05 node_id="" triggered_by=queued-allocs 2024-03-05T13:20:07.132Z [TRACE] worker: changed workload status: worker_id=f07e9b4f-cc31-bf8a-2a38-5294f8830c0a from=WaitingToDequeue to=WaitingForRaft 2024-03-05T13:20:07.132Z [TRACE] worker: changed workload status: worker_id=f07e9b4f-cc31-bf8a-2a38-5294f8830c0a from=WaitingForRaft to=Scheduling 2024-03-05T13:20:07.132Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=0b080961-1c39-7e0f-8a26-72053aefc772 job_id=redis-replica-05 namespace=npe-redis worker_id=f07e9b4f-cc31-bf8a-2a38-5294f8830c0a results= | Total changes: (place 1) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0) | Desired Changes for "redis-replica-05": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)

2024-03-05T13:20:07.132Z [DEBUG] worker: dequeued evaluation: worker_id=6473fc41-8019-e42c-921c-1d2bd1bfb371 eval_id=654ce6c4-7623-873c-ef58-191d69bd396f type=service namespace=npe-trans job_id=trans-web-05 node_id="" triggered_by=queued-allocs 2024-03-05T13:20:07.132Z [TRACE] worker: changed workload status: worker_id=6473fc41-8019-e42c-921c-1d2bd1bfb371 from=WaitingToDequeue to=WaitingForRaft 2024-03-05T13:20:07.132Z [TRACE] worker: changed workload status: worker_id=6473fc41-8019-e42c-921c-1d2bd1bfb371 from=WaitingForRaft to=Scheduling 2024-03-05T13:20:07.133Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=654ce6c4-7623-873c-ef58-191d69bd396f job_id=trans-web-05 namespace=npe-trans worker_id=6473fc41-8019-e42c-921c-1d2bd1bfb371 results= | Total changes: (place 1) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0) | Desired Changes for "trans-web": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)

2024-03-05T13:20:07.133Z [DEBUG] worker: dequeued evaluation: worker_id=bf061121-492b-3964-a063-1033336a8e5b eval_id=16d90d47-af9c-b6bf-a3e9-f4210af004e1 type=service namespace=npe-vm job_id=vm-back-web-npe node_id="" triggered_by=queued-allocs 2024-03-05T13:20:07.133Z [TRACE] worker: changed workload status: worker_id=bf061121-492b-3964-a063-1033336a8e5b from=WaitingToDequeue to=WaitingForRaft 2024-03-05T13:20:07.133Z [TRACE] worker: changed workload status: worker_id=bf061121-492b-3964-a063-1033336a8e5b from=WaitingForRaft to=Scheduling 2024-03-05T13:20:07.133Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=16d90d47-af9c-b6bf-a3e9-f4210af004e1 job_id=vm-back-web-npe namespace=npe-vm worker_id=bf061121-492b-3964-a063-1033336a8e5b results= | Total changes: (place 1) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0) | Desired Changes for "vm-back-web-npe": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)

2024-03-05T13:20:07.133Z [DEBUG] worker: reblocked evaluation: worker_id=f07e9b4f-cc31-bf8a-2a38-5294f8830c0a eval="<Eval \"0b080961-1c39-7e0f-8a26-72053aefc772\" JobID: \"redis-replica-05\" Namespace: \"npe-redis\">" 2024-03-05T13:20:07.133Z [DEBUG] worker: reblocked evaluation: worker_id=6473fc41-8019-e42c-921c-1d2bd1bfb371 eval="<Eval \"654ce6c4-7623-873c-ef58-191d69bd396f\" JobID: \"trans-web-05\" Namespace: \"npe-trans\">" 2024-03-05T13:20:07.133Z [DEBUG] worker: reblocked evaluation: worker_id=bf061121-492b-3964-a063-1033336a8e5b eval="<Eval \"16d90d47-af9c-b6bf-a3e9-f4210af004e1\" JobID: \"vm-back-web-npe\" Namespace: \"npe-vm\">" 2024-03-05T13:20:07.133Z [DEBUG] worker: ack evaluation: worker_id=f07e9b4f-cc31-bf8a-2a38-5294f8830c0a eval_id=0b080961-1c39-7e0f-8a26-72053aefc772 type=service namespace=npe-redis job_id=redis-replica-05 node_id="" triggered_by=queued-allocs 2024-03-05T13:20:07.133Z [TRACE] worker: changed workload status: worker_id=f07e9b4f-cc31-bf8a-2a38-5294f8830c0a from=Scheduling to=WaitingToDequeue 2024-03-05T13:20:07.133Z [DEBUG] worker: ack evaluation: worker_id=6473fc41-8019-e42c-921c-1d2bd1bfb371 eval_id=654ce6c4-7623-873c-ef58-191d69bd396f type=service namespace=npe-trans job_id=trans-web-05 node_id="" triggered_by=queued-allocs 2024-03-05T13:20:07.133Z [TRACE] worker: changed workload status: worker_id=6473fc41-8019-e42c-921c-1d2bd1bfb371 from=Scheduling to=WaitingToDequeue 2024-03-05T13:20:07.134Z [DEBUG] worker: ack evaluation: worker_id=bf061121-492b-3964-a063-1033336a8e5b eval_id=16d90d47-af9c-b6bf-a3e9-f4210af004e1 type=service namespace=npe-vm job_id=vm-back-web-npe node_id="" triggered_by=queued-allocs 2024-03-05T13:20:07.134Z [TRACE] worker: changed workload status: worker_id=bf061121-492b-3964-a063-1033336a8e5b from=Scheduling to=WaitingToDequeue 2024-03-05T13:20:07.135Z [DEBUG] worker: dequeued evaluation: worker_id=e5b4d9ba-acb4-8d92-17a9-dc02780535d9 eval_id=415f3d30-2291-71dd-bbf6-1b2d838ef07e type=service namespace=npe-vm job_id=vm-back-nginx-npe node_id="" triggered_by=queued-allocs 2024-03-05T13:20:07.135Z [TRACE] worker: changed workload status: worker_id=e5b4d9ba-acb4-8d92-17a9-dc02780535d9 from=WaitingToDequeue to=WaitingForRaft 2024-03-05T13:20:07.135Z [TRACE] worker: changed workload status: worker_id=e5b4d9ba-acb4-8d92-17a9-dc02780535d9 from=WaitingForRaft to=Scheduling 2024-03-05T13:20:07.135Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=415f3d30-2291-71dd-bbf6-1b2d838ef07e job_id=vm-back-nginx-npe namespace=npe-vm worker_id=e5b4d9ba-acb4-8d92-17a9-dc02780535d9 results= | Total changes: (place 1) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0) | Desired Changes for "vm-back-nginx-npe": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)

Nomad Client logs (if appropriate)

==> Loaded configuration from /etc/nomad/local.json ==> Starting Nomad agent... ==> Error starting agent: client setup failed: fingerprinting failed: Error while detecting network interface during fingerprinting: fork/exec /sbin/ip: no such file or directory 2024-03-05T13:19:45.186Z [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/nomad/data/plugins 2024-03-05T13:19:45.188Z [DEBUG] agent.plugin_loader.docker: using standard client connection: plugin_dir=/nomad/data/plugins endpoint=unix:///var/run/docker.sock 2024-03-05T13:19:45.188Z [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0 2024-03-05T13:19:45.188Z [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0 2024-03-05T13:19:45.188Z [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0 2024-03-05T13:19:45.188Z [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0 2024-03-05T13:19:45.188Z [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0 2024-03-05T13:19:45.189Z [INFO] client: using state directory: state_dir=/nomad/data/client 2024-03-05T13:19:45.189Z [INFO] client: using alloc directory: alloc_dir=/nomad/data/alloc 2024-03-05T13:19:45.189Z [INFO] client: using dynamic ports: min=20000 max=32000 reserved="" 2024-03-05T13:19:45.193Z [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=["arch", "bridge", "cgroup", "cni", "consul", "cpu", "host", "landlock", "memory", "network", "nomad", "plugins_cni", "signal", "storage", "vault", "env_aws", "env_gce", "env_azure", "env_digitalocean"] 2024-03-05T13:19:45.193Z [DEBUG] client.fingerprint_mgr.cgroup: detected cgroups: version=1 2024-03-05T13:19:45.193Z [DEBUG] client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config 2024-03-05T13:19:45.195Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul initial_period=15s 2024-03-05T13:19:45.196Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU model: name="Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz" 2024-03-05T13:19:45.196Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU frequency: mhz=1995 2024-03-05T13:19:45.196Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU core count: cores=8 2024-03-05T13:19:45.197Z [WARN] client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="function not implemented" 2024-03-05T13:19:45.197Z [ERROR] agent: error starting agent: error="client setup failed: fingerprinting failed: Error while detecting network interface during fingerprinting: fork/exec /sbin/ip: no such file or directory"

49-3 commented 6 months ago

i got the same error on 1.7.6

ctmiller8 commented 6 months ago

i got the same error on 1.7.6

Did you find any workarounds?

tgross commented 6 months ago

Hi folks, the logs are telling you what the issue is here. The go-sockaddr we're using for network fingerprinting uses the ip command to read the host's network environment. Although that was upgraded in https://github.com/hashicorp/nomad/commit/23e4b7c9d23350f9d3bd2707b0d79f413767c438 that update won't ship until 1.8.0, so we're still using the version of go-sockaddr we've been using for a long time (it's used ip since 7+ years ago).

It's definitely worth verifying that your client host has the ip command and that Nomad (which should be running as root) has access to read it.

Windows

I suspect the error message is different on Windows? It uses netstat.

ctmiller8 commented 6 months ago

Hi @tgross - thanks for the reply.

I quickly tested v1.7.6 with a Windows and Linux client - on Windows, the v1.7.6 client starts up normally and connects to the cluster without issue.

On Linux, the same network fingerprint issue persists with v1.7.6 and below is the client log output:

2024-04-16T19:29:49.858875098Z ==> Loaded configuration from /etc/nomad/local.json 2024-04-16T19:29:49.873377054Z 2024-04-16T19:29:49.872Z [ERROR] agent: error starting agent: error="client setup failed: fingerprinting failed: Error while detecting network interface during fingerprinting: fork/exec /sbin/ip: no such file or directory" 2024-04-16T19:29:49.859011549Z ==> Starting Nomad agent... 2024-04-16T19:29:49.873186911Z ==> Error starting agent: client setup failed: fingerprinting failed: Error while detecting network interface during fingerprinting: fork/exec /sbin/ip: no such file or directory 2024-04-16T19:29:49.873272729Z 2024-04-16T19:29:49.860Z [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/nomad/data/plugins 2024-04-16T19:29:49.873303092Z 2024-04-16T19:29:49.862Z [DEBUG] agent.plugin_loader.docker: using standard client connection: plugin_dir=/nomad/data/plugins endpoint=unix:///var/run/docker.sock 2024-04-16T19:29:49.873308690Z 2024-04-16T19:29:49.862Z [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0 2024-04-16T19:29:49.873312989Z 2024-04-16T19:29:49.862Z [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0 2024-04-16T19:29:49.873317156Z 2024-04-16T19:29:49.862Z [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0 2024-04-16T19:29:49.873321009Z 2024-04-16T19:29:49.862Z [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0 2024-04-16T19:29:49.873324886Z 2024-04-16T19:29:49.862Z [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0 2024-04-16T19:29:49.873328857Z 2024-04-16T19:29:49.863Z [INFO] client: using state directory: state_dir=/nomad/data/client 2024-04-16T19:29:49.873333091Z 2024-04-16T19:29:49.863Z [INFO] client: using alloc directory: alloc_dir=/nomad/data/alloc 2024-04-16T19:29:49.873337199Z 2024-04-16T19:29:49.863Z [INFO] client: using dynamic ports: min=20000 max=32000 reserved="" 2024-04-16T19:29:49.873341527Z 2024-04-16T19:29:49.868Z [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=["arch", "bridge", "cgroup", "cni", "consul", "cpu", "host", "landlock", "memory", "network", "nomad", "plugins_cni", "signal", "storage", "vault", "env_aws", "env_gce", "env_azure", "env_digitalocean"] 2024-04-16T19:29:49.873347717Z 2024-04-16T19:29:49.868Z [DEBUG] client.fingerprint_mgr.cgroup: detected cgroups: version=1 2024-04-16T19:29:49.873351713Z 2024-04-16T19:29:49.868Z [DEBUG] client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config 2024-04-16T19:29:49.873355684Z 2024-04-16T19:29:49.870Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul initial_period=15s 2024-04-16T19:29:49.873359607Z 2024-04-16T19:29:49.872Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU model: name="Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz" 2024-04-16T19:29:49.873363864Z 2024-04-16T19:29:49.872Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU frequency: mhz=1995 2024-04-16T19:29:49.873368385Z 2024-04-16T19:29:49.872Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU core count: cores=16 2024-04-16T19:29:49.873372606Z 2024-04-16T19:29:49.872Z [WARN] client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="function not implemented"

As you mentioned in your reply, the fingerprint method has not changed for 7+ years which is exactly what makes this issue peculiar. We started with Nomad v1.3.1 up to v1.6.2 (NPE) and v1.6.1 (PRD) and haven't seen this issue with any previous releases and our server/client configs have not changed.

I believe if the root cause were a permissions or dependency issue with executing the ip command the client should break with any version that we test. This is not the case, and the client starts up normally as soon as we revert back to v1.6.2.

Also just to reiterate, running a higher version image as server starts without issue - the go-sockadd is also used when server is being passed, correct?

Please advise - TIA!

49-3 commented 6 months ago

i got it working using this:

IP=`hostname -i`

echo '
datacenter = "dc1"
data_dir = "/var/lib/nomad"
bind_addr = "'${IP}'"

on my entry point who configure the > /etc/nomad.d/nomad.hcl on a debian:12.2-slim container

tgross commented 6 months ago

@ctmiller8 wrote:

on Windows, the v1.7.6 client starts up normally and connects to the cluster without issue.

:+1: Ok, good to hear. If it were the same error on both platforms that would be worrisome given they use different commands to fingerprint!

I believe if the root cause were a permissions or dependency issue with executing the ip command the client should break with any version that we test. This is not the case, and the client starts up normally as soon as we revert back to v1.6.2.

Yeah, I can't argue with your logic there, for sure. Assuming that you're keeping the same host and same systemd unit file, and not replacing a VM from a machine image or something (which might be different). Any chance I could get you to strace -ff the agent during startup to see if that reveals anything helpful here? You can email that output to nomad-oss-debug@hashicorp.com (just in case there's anything slightly sensitive in there). Just refer to this issue in the email body / subject line.

Also just to reiterate, running a higher version image as server starts without issue - the go-sockadd is also used when server is being passed, correct?

No, only the client does any "fingerprinting". The client needs to survey its environment so it can inform the server what resources it has available for scheduling (ex. CPU, memory, disk, network interfaces, CNI plugins, task drivers, etc., etc.). The error you're seeing is in that fingerprinting component.


@49-3 wrote:

on a debian:12.2-slim container

Please note that running clients in a container is explicitly unsupported. The container image we ship has an entrypoint script that announces this loudly. :grinning:

ctmiller8 commented 6 months ago

i got it working using this:

IP=`hostname -i`

echo '
datacenter = "dc1"
data_dir = "/var/lib/nomad"
bind_addr = "'${IP}'"

on my entry point who configure the > /etc/nomad.d/nomad.hcl on a debian:12.2-slim container

Thanks for the update @49-3 and suggestion.

I don't think this workaround would apply to our environment(s) because we are explicitly passing a value with bind_addr for each cluster member (server and client). We use static IPs and static DNS host names for our nodes.

ctmiller8 commented 6 months ago

Thanks for the reply @tgross and suggestions.

Yeah, I can't argue with your logic there, for sure. Assuming that you're keeping the same host and same systemd unit file, and not replacing a VM from a machine image or something (which might be different).

Yes - as I mentioned to @49-3 we're using static IPs and DNS host names for all of our nodes in the cluster.

Any chance I could get you to strace -ff the agent during startup to see if that reveals anything helpful here? You can email that output to nomad-oss-debug@hashicorp.com (just in case there's anything slightly sensitive in there). Just refer to this issue in the email body / subject line.

Yes I'll check on adding strace -ff to the agent startup and follow up with the results. In full transparency, our Linux clients are running in container since v1.3.1 without issue. This is why I listed Docker version 24.0.7 in the issue details. Again, I find this issue very peculiar given our history (albeit brief) and successes with running cross-platform and solving our initial problem statement by using Nomad.

No, only the client does any "fingerprinting". The client needs to survey its environment so it can inform the server what resources it has available for scheduling (ex. CPU, memory, disk, network interfaces, CNI plugins, task drivers, etc., etc.).

Thanks for the info and clarification - this helps us isolate and troubleshoot as a client (Linux) specific issue.

tgross commented 6 months ago

In full transparency, our Linux clients are running in container since v1.3.1 without issue. This is why I listed Docker version 24.0.7 in the issue details.

Could the container image just not have /sbin/ip in it? You haven't mentioned whether you were using the official image HashiCorp publishes or some other image.

ctmiller8 commented 6 months ago

@tgross apologies for the delay.

I tested the base nomad image I was using [multani] with v1.6.2 and v1.7.6 using your suggestion of strace -ff, compared both outputs, and noticed that the diff was /sbin/ip was missing which impacted the network fingerprinting on client startup.

I then rewrote the client compose, using the official hashicorp/nomad:1.7.6 as the base image, and the client started up without issue.

I'm still doing end-to-end testing, as the client I chose to test with has quite a few allocations assigned to it, but so far so good. I'll let this change bake for a day or so before calling it good - but I will follow up with the end results so we can close the loop.

ctmiller8 commented 6 months ago

@tgross quick update - testing was completed successfully.

Thanks everyone again for the assist!