hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Upgrade to docker-ce-cli 5:27.0.3 breaks nomad #23523

Open ebarriosjr opened 1 month ago

ebarriosjr commented 1 month ago

Nomad version

Nomad v1.8.1
BuildDate 2024-06-19T06:43:57Z
Revision 5022543e4b7b8dcec9df123f86630ae3fdcffbe6

Operating system and Environment details

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy

Issue

After upgrading docker-ce-cli from 5:27.0.2 to 5:27.0.3, Nomad breaks: no containers were deployed. Some of them failed with Constraint "missing network": 1 nodes excluded by filter; others were trying to use IPv6 instead of IPv4.

Reproduction steps

Update docker-ce-cli to version 5:27.0.3 and reboot.

Expected Result

Nomad should be able to spawn Docker containers without issue.

Actual Result

No container could be started.

tgross commented 1 month ago

Hi @ebarriosjr! Nomad doesn't use the Docker CLI. From the package version number you've got there, I'm assuming you're using a downstream distribution and not Docker's own package? If I look at https://github.com/docker/cli/compare/v27.0.2...v27.0.3 I see that they vendored the main moby/moby project at v27.0.3. And then if I look at the release notes for v27.0.3 I see some interesting suspects. So my guess is that dockerd itself was also upgraded by your package update? Before we go digging further, can you confirm that by providing the output of docker version?

tgross commented 1 month ago

For what it's worth, I've upgraded my local environment to 27.0.3 and tested out a Nomad job with networking and wasn't able to reproduce any problems. Maybe there's something specific to your client configuration or job that you could share?

output of docker version

```
$ docker version
Client: Docker Engine - Community
 Version:           27.0.3
 API version:       1.46
 Go version:        go1.21.11
 Git commit:        7d4bcd8
 Built:             Sat Jun 29 00:03:03 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.0.3
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       662f78c
  Built:            Sat Jun 29 00:03:03 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.7.18
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
```

The other weird item here is this error Constraint "missing network": 1 nodes excluded by filter that you reported, because that suggests that there's something wrong with host fingerprinting of the network. And that doesn't involve Docker at all.
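For anyone gathering the fingerprint data, here is a minimal sketch of listing the networks a client has fingerprinted. It assumes the node JSON comes from `nomad node status -self -json` and that the fingerprinted networks sit under `NodeResources.Networks` with `Device`, `IP`, and `Mode` keys; treat those field names as assumptions and check them against your own output. The helper name `fingerprinted_networks` is made up for this example.

```python
import json


def fingerprinted_networks(node_json: str) -> list[dict]:
    """Summarize the networks found in a Nomad node JSON document.

    Assumption: networks are listed under NodeResources.Networks.
    Missing keys are tolerated and reported as empty strings.
    """
    node = json.loads(node_json)
    resources = node.get("NodeResources") or {}
    networks = resources.get("Networks") or []
    return [
        {
            "device": net.get("Device", ""),
            "ip": net.get("IP", ""),
            "mode": net.get("Mode", ""),
        }
        for net in networks
    ]


if __name__ == "__main__":
    import sys

    # Usage (assumed): nomad node status -self -json | python3 this_script.py
    found = fingerprinted_networks(sys.stdin.read())
    if not found:
        print("no networks fingerprinted")
    for net in found:
        print(f"{net['device']}\t{net['ip']}\t{net['mode']}")
```

An empty list here would be consistent with the "missing network" constraint failure, although it does not by itself explain why the fingerprint came back empty.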

MatthewJohn commented 1 month ago

Yesterday, after building a new Nomad client, I found that the Connect Envoy sidecar ports are not being published correctly. Nothing has changed in the setup except that newer packages have been installed.

From what I can see, the other clients were running 26.X of docker-ce and the new one is running 27.X. The other clients had package updates (mostly kernel and Docker to 27.X), and they've also started failing in the same way.

Happy to supply any info - from what I can see, iptables has the entries for the allocations/ports, but I'm getting connection refused.

The client was running 1.7.7. I have upgraded it to 1.8.1, but I'm still seeing the same issue.

I'm going to try to downgrade Docker to see if it helps and will get back.

Matt

tgross commented 1 month ago

Any chance you upgraded the host distro at the same time? There's an open issue around the bridge module having been baked-in rather than a DKM https://github.com/hashicorp/nomad/issues/23583 and that's hitting a known issue in our network fingerprinting. (Which previously only impacted niche OS distros.)

ebarriosjr commented 1 month ago

Hi @tgross, the output of my docker version command is:


```
Client: Docker Engine - Community
 Version:           27.0.2
 API version:       1.46
 Go version:        go1.21.11
 Git commit:        912c1dd
 Built:             Wed Jun 26 18:48:01 2024
 OS/Arch:           linux/arm64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.0.3
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       662f78c
  Built:            Sat Jun 29 00:02:44 2024
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.7.18
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
```

tgross commented 1 month ago

Weird that your client and server don't match. But the server looks identical to what I've posted above. Any thoughts about the networking discussion above?
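A client/server mismatch like this can be checked mechanically. The following is a minimal sketch, assuming the Docker CLI is on `PATH` and that `docker version --format '{{json .}}'` reports `Client.Version` and `Server.Version`; the helper names are made up for this example.

```python
import json
import subprocess


def version_mismatch(docker_version_json: str) -> tuple[str, str, bool]:
    """Return (client_version, server_version, mismatch) from `docker version` JSON."""
    data = json.loads(docker_version_json)
    client = data["Client"]["Version"]
    # "Server" may be null or missing if the daemon could not be reached.
    server = (data.get("Server") or {}).get("Version", "")
    return client, server, client != server


def read_docker_version() -> str:
    """Run the Docker CLI and return its version report as a JSON string."""
    result = subprocess.run(
        ["docker", "version", "--format", "{{json .}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    client, server, mismatch = version_mismatch(read_docker_version())
    print(f"client={client} server={server} mismatch={mismatch}")
```

A mismatch is not an error on its own, since the CLI and the engine are packaged separately, but it is useful context when a report names only one of the two packages.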

ebarriosjr commented 1 month ago

That's because I reverted the version of docker-ce-cli to 27.0.2. On 27.0.3, all the jobs that I have running on Nomad stop working with the missing network error.

MatthewJohn commented 1 month ago

> Any chance you upgraded the host distro at the same time? There's an open issue around the bridge module having been baked-in rather than a DKM #23583 and that's hitting a known issue in our network fingerprinting. (Which previously only impacted niche OS distros.)

Assuming this was aimed at me: I'm running Debian bookworm, which definitely hasn't changed. As I say, it could be something completely unrelated, but a port-forwarding issue would presumably be related to the Nomad client (as opposed to the Nomad servers, Consul, etc.). All the clients started failing after they were rebooted, and the only thing that had changed was package updates (plus a re-install, which included the latest Docker version).

I'm just following up on the downgrade to see if it helped :)

Matt

Edit: No, the downgrade didn't help - so probably completely unrelated. Apologies, I'll continue my investigation.

Edit edit: Yes, please completely ignore me - mine was actually the Connect PKI root CA expiring (but it happened during a powerdown, so the effect was quite different - Envoy would start "happily" without any errors/warnings, but just didn't listen on any of the service ports!)

tgross commented 1 month ago

Ok, thanks @MatthewJohn. So @ebarriosjr that leaves the networking, as I mentioned earlier:

> The other weird item here is this error Constraint "missing network": 1 nodes excluded by filter that you reported, because that suggests that there's something wrong with host fingerprinting of the network. And that doesn't involve Docker at all.

https://github.com/hashicorp/nomad/issues/23583 suggests that something may have changed in the environment where the bridge kernel module is unavailable, but I'd expect to see a network still. For us to make further progress on this we'll need information from you on the network fingerprint (and/or client logs from the network fingerprinting), whether the distro has been updated, whether the kernel module is present, etc.
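To answer the kernel module question, here is a minimal sketch that classifies the `bridge` module as loaded, built into the kernel, or not found. It reads `/proc/modules` and `/lib/modules/<release>/modules.builtin`, which are standard Linux locations. It is not a description of how Nomad's own fingerprinter decides, and the function names are made up for this example.

```python
import platform
from pathlib import Path


def bridge_module_state(proc_modules: str, modules_builtin: str) -> str:
    """Classify the bridge module as 'loaded', 'builtin', or 'not_found'."""
    # /proc/modules: one loaded module per line, module name in the first column.
    loaded = {line.split()[0] for line in proc_modules.splitlines() if line.strip()}
    if "bridge" in loaded:
        return "loaded"
    # modules.builtin: one path per line, e.g. kernel/net/bridge/bridge.ko
    builtin = {
        Path(line.strip()).stem
        for line in modules_builtin.splitlines()
        if line.strip()
    }
    if "bridge" in builtin:
        return "builtin"
    # The module may still be installed and loadable on demand.
    return "not_found"


def read_text_if_exists(path: Path) -> str:
    """Return the file contents, or an empty string if the file is absent."""
    return path.read_text() if path.is_file() else ""


if __name__ == "__main__":
    release = platform.release()
    state = bridge_module_state(
        read_text_if_exists(Path("/proc/modules")),
        read_text_if_exists(Path("/lib/modules") / release / "modules.builtin"),
    )
    print(f"kernel={release} bridge={state}")
```

Posting the printed line along with the client logs would cover the "whether the kernel module is present" part of the request.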