hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

bug: failed to parse CNI plugin versions in ubuntu-latest in github-actions #23229

Open Kamilcuk opened 1 month ago

Kamilcuk commented 1 month ago

Nomad version

Nomad v1.8.0 BuildDate 2024-05-28T17:38:17Z Revision 28b82e4b2259fae5a62e2ed47395334bea5a24c4

Operating system and Environment details

github-actions ubuntu-latest

Issue

Nomad does not detect the CNI plugins, even though they are installed in GitHub Actions.

Reproduction steps

Run the following in github actions:

---
on:
  - push
jobs:
  test:
    name: Test
    runs-on: ubuntu-latest
    steps:
      - name: test
        run: |
          set -x
          sudo mkdir -vp /opt/cni
          sudo ln -sv /usr/lib/cni /opt/cni/bin
          v=https://releases.hashicorp.com/nomad/1.8.0/nomad_1.8.0_linux_amd64.zip
          wget -q "$v"
          unzip *.zip
          export PATH=$PWD:$PATH
          chmod +x ./nomad
          nomad --version
          sudo ./nomad agent -dev &
          sleep 10
          exec 0<&-
          nomad operator api /v1/nodes |
            jq -r '.[].ID' |
            xargs -t -i nomad operator api /v1/node/{} |
            jq
          sudo killall nomad
          sleep 10
          exit 1

Expected Result

Nomad should have access to the CNI plugins, or report their versions.

Actual Result

Nomad has no access to the CNI plugins, even though they are available at /opt/cni/bin.

Is this expected? Are the CNI plugins in github-actions ubuntu-latest too old, or is this a bug in version detection?

Thanks.

Nomad Server logs (if appropriate)

https://github.com/Kamilcuk/nomad-tools/actions/runs/9397590511/job/25881118293

+ sudo ./nomad agent -dev
==> No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:
       Advertise Addrs: HTTP: 127.0.0.1:4646; RPC: 127.0.0.1:4647; Serf: 127.0.0.1:4648
            Bind Addrs: HTTP: [127.0.0.1:4646]; RPC: 127.0.0.1:4647; Serf: 127.0.0.1:4648
                Client: true
             Log Level: DEBUG
               Node Id: f564c710-6332-6ed4-d14e-823ae545361e
                Region: global (DC: dc1)
                Server: true
               Version: 1.8.0
==> Nomad agent started! Log data will stream in below:
    2024-06-06T08:26:17.046Z [DEBUG] nomad: issuer not set; OIDC Discovery endpoint for workload identities disabled
    2024-06-06T08:26:17.048Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:58ad8c5d-4158-c7db-c873-5e7cd8606998 Address:127.0.0.1:4647}]"
    2024-06-06T08:26:17.048Z [INFO]  nomad.raft: entering follower state: follower="Node at 127.0.0.1:4647 [Follower]" leader-address= leader-id=
    2024-06-06T08:26:17.048Z [INFO]  nomad: serf: EventMemberJoin: fv-az695-642.global 127.0.0.1
    2024-06-06T08:26:17.048Z [INFO]  nomad: starting scheduling worker(s): num_workers=4 schedulers=["system", "sysbatch", "service", "batch", "_core"]
    2024-06-06T08:26:17.048Z [DEBUG] nomad: started scheduling worker: id=435c571c-7766-1239-c25f-ad6c4116953d index=1 of=4
    2024-06-06T08:26:17.048Z [DEBUG] nomad: started scheduling worker: id=c4135637-3bb5-72f1-dad8-2739685efa2b index=2 of=4
    2024-06-06T08:26:17.048Z [DEBUG] nomad: started scheduling worker: id=26328977-f958-d72d-08bd-8663e0c13786 index=3 of=4
    2024-06-06T08:26:17.048Z [DEBUG] nomad: started scheduling worker: id=e335869b-1e60-42b7-37ef-d4b9a71e33d3 index=4 of=4
    2024-06-06T08:26:17.048Z [INFO]  nomad: started scheduling worker(s): num_workers=4 schedulers=["system", "sysbatch", "service", "batch", "_core"]
    2024-06-06T08:26:17.048Z [DEBUG] worker: running: worker_id=435c571c-7766-1239-c25f-ad6c4116953d
    2024-06-06T08:26:17.048Z [DEBUG] worker: running: worker_id=c4135637-3bb5-72f1-dad8-2739685efa2b
    2024-06-06T08:26:17.048Z [INFO]  nomad: adding server: server="fv-az695-642.global (Addr: 127.0.0.1:4647) (DC: dc1)"
    2024-06-06T08:26:17.048Z [DEBUG] worker: running: worker_id=26328977-f958-d72d-08bd-8663e0c13786
    2024-06-06T08:26:17.048Z [DEBUG] worker: running: worker_id=e335869b-1e60-42b7-37ef-d4b9a71e33d3
    2024-06-06T08:26:17.048Z [DEBUG] nomad.keyring.replicator: starting encryption key replication
    2024-06-06T08:26:17.049Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=""
    2024-06-06T08:26:17.049Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2024-06-06T08:26:17.049Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2024-06-06T08:26:17.049Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2024-06-06T08:26:17.049Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2024-06-06T08:26:17.049Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2024-06-06T08:26:17.049Z [INFO]  client: using state directory: state_dir=/tmp/NomadClient419424903
    2024-06-06T08:26:17.049Z [INFO]  client: using alloc directory: alloc_dir=/tmp/NomadClient334035025
    2024-06-06T08:26:17.049Z [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
    2024-06-06T08:26:17.050Z [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=["arch", "bridge", "cgroup", "cni", "consul", "cpu", "host", "landlock", "memory", "network", "nomad", "plugins_cni", "signal", "storage", "vault", "env_aws", "env_gce", "env_azure", "env_digitalocean"]
    2024-06-06T08:26:17.050Z [DEBUG] client.fingerprint_mgr.cgroup: detected cgroups: version=2
    2024-06-06T08:26:17.050Z [DEBUG] client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config
    2024-06-06T08:26:17.051Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul initial_period=15s
    2024-06-06T08:26:17.055Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU model: name="AMD EPYC 7763 64-Core Processor"
    2024-06-06T08:26:17.055Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU frequency: mhz=2450
    2024-06-06T08:26:17.055Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU core count: cores=4
    2024-06-06T08:26:17.057Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
    2024-06-06T08:26:17.058Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
    2024-06-06T08:26:17.058Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected and no speed specified by user, falling back to default speed: interface=lo mbits=1000
    2024-06-06T08:26:17.058Z [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=lo IP=127.0.0.1
    2024-06-06T08:26:17.058Z [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=lo IP=::1
    2024-06-06T08:26:17.059Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
    2024-06-06T08:26:17.059Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
    2024-06-06T08:26:17.059Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=lo mbits=1000
    2024-06-06T08:26:17.061Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=docker0
    2024-06-06T08:26:17.061Z [DEBUG] client.fingerprint_mgr.network: unable to parse link speed: path=/sys/class/net/docker0/speed device=docker0
    2024-06-06T08:26:17.061Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=docker0 mbits=1000
    2024-06-06T08:26:17.360Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=bandwidth
    2024-06-06T08:26:17.363Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=bridge
    2024-06-06T08:26:17.882Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=dhcp
    2024-06-06T08:26:18.163Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=firewall
    2024-06-06T08:26:18.380Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=flannel
    2024-06-06T08:26:18.402Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
    2024-06-06T08:26:18.402Z [INFO]  nomad.raft: entering candidate state: node="Node at 127.0.0.1:4647 [Candidate]" term=2
    2024-06-06T08:26:18.402Z [DEBUG] nomad.raft: voting for self: term=2 id=58ad8c5d-4158-c7db-c873-5e7cd8606998
    2024-06-06T08:26:18.402Z [DEBUG] nomad.raft: calculated votes needed: needed=1 term=2
    2024-06-06T08:26:18.402Z [DEBUG] nomad.raft: vote granted: from=58ad8c5d-4158-c7db-c873-5e7cd8606998 term=2 tally=1
    2024-06-06T08:26:18.402Z [INFO]  nomad.raft: election won: term=2 tally=1
    2024-06-06T08:26:18.402Z [INFO]  nomad.raft: entering leader state: leader="Node at 127.0.0.1:4647 [Leader]"
    2024-06-06T08:26:18.402Z [INFO]  nomad: cluster leadership acquired
    2024-06-06T08:26:18.404Z [DEBUG] nomad.autopilot: autopilot is now running
    2024-06-06T08:26:18.404Z [DEBUG] nomad.autopilot: state update routine is now running
    2024-06-06T08:26:18.404Z [INFO]  nomad.core: established cluster id: cluster_id=89d8b91d-fe90-4df9-2333-3fc6cd0e5f10 create_time=1717662378404341877
    2024-06-06T08:26:18.404Z [INFO]  nomad: eval broker status modified: paused=false
    2024-06-06T08:26:18.404Z [INFO]  nomad: blocked evals status modified: paused=false
    2024-06-06T08:26:18.520Z [INFO]  nomad.keyring: initialized keyring: id=f9ad8e94-563c-0aa8-1dfb-e55ba7de91a6
    2024-06-06T08:26:18.680Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=host-device
    2024-06-06T08:26:18.860Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=host-local
    2024-06-06T08:26:19.014Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=ipvlan
    2024-06-06T08:26:19.155Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=loopback
    2024-06-06T08:26:19.372Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=macvlan
    2024-06-06T08:26:19.374Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=portmap
    2024-06-06T08:26:19.526Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=ptp
    2024-06-06T08:26:19.689Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=sbr
    2024-06-06T08:26:19.794Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=static
    2024-06-06T08:26:19.796Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=tuning
    2024-06-06T08:26:20.005Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=vlan
    2024-06-06T08:26:20.131Z [DEBUG] client.fingerprint_mgr.cni_plugins: failed to parse CNI plugin version: name=vrf
    2024-06-06T08:26:20.133Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault initial_period=15s
    2024-06-06T08:26:20.135Z [DEBUG] client.fingerprint_mgr.env_digitalocean: failed to request metadata: attribute=region error="Get \"http://169.254.169.254/metadata/v1/region\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2024-06-06T08:26:20.139Z [DEBUG] client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type error="Get \"http://169.254.169.254/computeMetadata/v1/instance/machine-type\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2024-06-06T08:26:20.139Z [DEBUG] client.fingerprint_mgr.env_gce: error querying GCE Metadata URL, skipping
    2024-06-06T08:26:20.141Z [DEBUG] client.fingerprint_mgr.env_azure: could not read value for attribute: attribute=compute/azEnvironment error="Get \"http://169.254.169.254/metadata/instance/compute/azEnvironment?api-version=2019-06-04&format=text\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2024-06-06T08:26:20.141Z [DEBUG] client.fingerprint_mgr: detected fingerprints: node_attrs=["arch", "bridge", "cpu", "host", "network", "nomad", "plugins_cni", "signal", "storage"]
    2024-06-06T08:26:20.141Z [INFO]  client.proclib.cg2: initializing nomad cgroups: cores=0-3
    2024-06-06T08:26:20.141Z [DEBUG] client.proclib.cg2: top level partition root nomad.slice cgroup initialized
    2024-06-06T08:26:20.141Z [DEBUG] client.proclib.cg2: partition member nomad.slice/share cgroup initialized
    2024-06-06T08:26:20.141Z [DEBUG] client.proclib.cg2: partition member nomad.slice/reserve cgroup initialized
    2024-06-06T08:26:20.141Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
    2024-06-06T08:26:20.141Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
    2024-06-06T08:26:20.141Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
    2024-06-06T08:26:20.141Z [DEBUG] client.device_mgr: exiting since there are no device plugins
    2024-06-06T08:26:20.141Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=exec health=healthy description=Healthy
    2024-06-06T08:26:20.141Z [DEBUG] client.driver_mgr.docker: using client connection initialized from environment: driver=docker
    2024-06-06T08:26:20.141Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=raw_exec health=healthy description=Healthy
    2024-06-06T08:26:20.141Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
    2024-06-06T08:26:20.141Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
    2024-06-06T08:26:20.141Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
    2024-06-06T08:26:20.142Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=""
    2024-06-06T08:26:20.142Z [DEBUG] client.server_mgr: new server list: new_servers=[127.0.0.1:4647] old_servers=[]
    2024-06-06T08:26:20.155Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=docker health=healthy description=Healthy
    2024-06-06T08:26:20.239Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=java health=healthy description=Healthy
    2024-06-06T08:26:20.239Z [DEBUG] client.driver_mgr: detected drivers: drivers="map[healthy:[raw_exec exec docker java] undetected:[qemu]]"
    2024-06-06T08:26:20.239Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
    2024-06-06T08:26:20.239Z [INFO]  client: started client: node_id=ab79e196-4ab8-931f-327f-ae30203e91ab
    2024-06-06T08:26:20.239Z [DEBUG] http: UI is enabled
    2024-06-06T08:26:20.239Z [DEBUG] http: UI is enabled
    2024-06-06T08:26:20.240Z [INFO]  client: node registration complete
    2024-06-06T08:26:20.240Z [DEBUG] client: state updated: node_status=ready
    2024-06-06T08:26:20.241Z [DEBUG] client: updated allocations: index=1 total=0 pulled=0 filtered=0
    2024-06-06T08:26:20.241Z [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=0
    2024-06-06T08:26:20.241Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=0 errors=0
    2024-06-06T08:26:21.241Z [DEBUG] client: state changed, updating node and re-registering
    2024-06-06T08:26:21.241Z [INFO]  client: node registration complete

jrasell commented 1 month ago

Hi @Kamilcuk, do you know which version of CNI ships in this setup? https://github.com/hashicorp/nomad/issues/20263 details problems in upstream CNI that caused detection errors. Those have been fixed in subsequent CNI releases and did not need any intervention or changes within Nomad.

Kamilcuk commented 1 month ago

+ apt list -a containernetworking-plugins
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Listing...
containernetworking-plugins/jammy,now 0.9.1+ds1-1 amd64 [installed,automatic]

As I understand from https://github.com/containernetworking/plugins/releases/tag/v0.8.0, support for CNI spec v0.4.0 was added in v0.8.0 of containernetworking-plugins, so I think protocol 0.4.0 should be detected.

Here is the --help and --version output of each executable in the directory:

+ /opt/cni/bin/bandwidth --help
CNI bandwidth plugin version unknown
+ /opt/cni/bin/bandwidth --version
CNI bandwidth plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/bridge --help
CNI bridge plugin version unknown
+ /opt/cni/bin/bridge --version
CNI bridge plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/dhcp --help
CNI dhcp plugin version unknown
+ /opt/cni/bin/dhcp --version
CNI dhcp plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/dnsname --help
CNI dnsname plugin
version: 1.3.1
commit: unknown
+ /opt/cni/bin/dnsname --version
CNI dnsname plugin
version: 1.3.1
commit: unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/firewall --help
CNI firewall plugin version unknown
+ /opt/cni/bin/firewall --version
CNI firewall plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/flannel --help
CNI flannel plugin version unknown
+ /opt/cni/bin/flannel --version
CNI flannel plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/host-device --help
CNI host-device plugin version unknown
+ /opt/cni/bin/host-device --version
CNI host-device plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/host-local --help
CNI host-local plugin version unknown
+ /opt/cni/bin/host-local --version
CNI host-local plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/ipvlan --help
CNI ipvlan plugin version unknown
+ /opt/cni/bin/ipvlan --version
CNI ipvlan plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/loopback --help
CNI loopback plugin version unknown
+ /opt/cni/bin/loopback --version
CNI loopback plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/macvlan --help
CNI macvlan plugin version unknown
+ /opt/cni/bin/macvlan --version
CNI macvlan plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/portmap --help
CNI portmap plugin version unknown
+ /opt/cni/bin/portmap --version
CNI portmap plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/ptp --help
CNI ptp plugin version unknown
+ /opt/cni/bin/ptp --version
CNI ptp plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/sbr --help
CNI sbr plugin version unknown
+ /opt/cni/bin/sbr --version
CNI sbr plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/static --help
CNI static plugin version unknown
+ /opt/cni/bin/static --version
CNI static plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/tuning --help
CNI tuning plugin version unknown
+ /opt/cni/bin/tuning --version
CNI tuning plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/vlan --help
CNI vlan plugin version unknown
+ /opt/cni/bin/vlan --version
CNI vlan plugin version unknown
+ for i in /opt/cni/bin/*
+ /opt/cni/bin/vrf --help
CNI vrf plugin version unknown
+ /opt/cni/bin/vrf --version
CNI vrf plugin version unknown
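The output above hints at why fingerprinting reports a parse failure: most of these binaries print "version unknown", so there is no version string to extract. The following is a minimal sketch of that idea, not Nomad's actual parser. The function name `cni_version_from_output` is hypothetical, and the `v1.4.0` line is an assumed example of what a newer upstream build might print.

```shell
# Hypothetical helper (not Nomad's actual parser): print the first
# MAJOR.MINOR.PATCH token found in a plugin's version output.
cni_version_from_output() {
  local version
  version=$(printf '%s\n' "$1" | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
  # Fail when nothing version-like is present, e.g. "version unknown".
  [ -n "$version" ] && printf '%s\n' "$version"
}

# Strings as they appear in this thread:
cni_version_from_output 'CNI bridge plugin version unknown' || echo 'bridge: no version found'
cni_version_from_output 'CNI dnsname plugin
version: 1.3.1
commit: unknown'

# Assumed output format of a newer upstream release:
cni_version_from_output 'CNI bridge plugin v1.4.0'
```

Under this sketch, only dnsname from the listing above would yield a version; every plugin from the 0.9.1 Ubuntu package would come back empty.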

Kamilcuk commented 1 month ago

The XY issue is that Nomad < 1.8 did not automatically add '${attr.plugins.cni.bridge}' semver '>= 0.4.0' constraints to jobs, so Docker jobs with bridge networking just worked. Nomad 1.8 does add the constraint, but now the version is not detected and the job will not run, even though it would work if it were started. Bottom line: this looks like a regression, because some users may no longer be able to run jobs that used to work when the check was absent. Thanks.
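To make the effect of such a constraint concrete, here is a rough sketch of a ">= minimum" check in which an undetected version can never match. It is an illustration only, not Nomad's scheduler logic: `meets_min_version` is a hypothetical name, and `sort -V` (GNU coreutils version sort) is used as an approximation of semver ordering.

```shell
# Hypothetical check: does a fingerprinted version satisfy ">= minimum"?
# An empty version models an attribute that was never fingerprinted.
meets_min_version() {
  local version="$1" minimum="$2" lowest
  # Missing attribute: the constraint cannot be satisfied.
  [ -n "$version" ] || return 1
  # With version sort, the minimum must come first (or be equal) to pass.
  lowest=$(printf '%s\n%s\n' "$minimum" "$version" | sort -V | head -n 1)
  [ "$lowest" = "$minimum" ]
}

meets_min_version "0.9.1" "0.4.0" && echo "0.9.1 satisfies >= 0.4.0"
meets_min_version "" "0.4.0" || echo "undetected version: constraint not met"
```

The second call is the situation described in this comment: the packaged plugins are 0.9.1, but because no version was detected, the constraint is evaluated against nothing and filters the node out.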

tgross commented 1 month ago

@Kamilcuk these are very old versions of the CNI plugins. You should be using more current versions that will report their fingerprint correctly (and fix a ton of bugs!)
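One way to act on this in a CI job is to unpack an upstream release of the plugins instead of symlinking the distribution package. The sketch below is an assumption-laden example rather than an official procedure: the helper names are hypothetical, `v1.4.0` is only an example tag (pick a current release), and the URL layout follows the asset naming used on the containernetworking/plugins releases page.

```shell
# Hypothetical helper: build the download URL for an upstream CNI plugins release.
cni_release_url() {
  local version="$1" arch="$2"
  printf 'https://github.com/containernetworking/plugins/releases/download/%s/cni-plugins-linux-%s-%s.tgz\n' \
    "$version" "$arch" "$version"
}

# Hypothetical helper: download a release and unpack it into a plugin directory.
# Needs curl, tar, and sudo; it is defined here but not invoked automatically.
install_cni_plugins() {
  local version="$1" arch="$2" dest="${3:-/opt/cni/bin}"
  local tarball status
  tarball=$(mktemp) || return 1
  curl -fsSL -o "$tarball" "$(cni_release_url "$version" "$arch")" || { rm -f "$tarball"; return 1; }
  sudo mkdir -p "$dest" && sudo tar -C "$dest" -xzf "$tarball"
  status=$?
  rm -f "$tarball"
  return "$status"
}

# Example call (assumed version tag):
# install_cni_plugins v1.4.0 amd64
```

In the reproduction workflow this would replace the `ln -sv /usr/lib/cni /opt/cni/bin` step, so that /opt/cni/bin is a real directory holding the newer binaries.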

tgross commented 1 month ago

I'm going to update our docs to establish a minimum version of the plugins and add a deprecation warning on the 1.8.x release notes.

Kamilcuk commented 1 month ago

Hi. Would it be possible to remove the check from the job specification, or to force Nomad to assume that CNI plugins of a specific version exist? It would be nice to have a configuration option such as:

plugin "docker" {
  config {
    cni_version = "4.0.0" # overrides cni detection
  }
}

tgross commented 1 month ago

@Kamilcuk I'm not sure why we'd want to do that; the reason we added the constraint is because users were running into problems where they were deployed on old versions of CNI plugins (or no CNI plugins at all!). Wouldn't this just open up users to having incorrect behavior? And in any case, CNI plugins aren't associated with the docker plugin at all.