hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Bridge Network Module Fails to Load Automatically #23583

Closed mr-karan closed 1 month ago

mr-karan commented 1 month ago

Nomad version

Output from nomad version: Nomad v1.7.7

Operating system and Environment details

OS: Linux (AWS EC2)
Kernel version: 6.8.0-1008-aws
Environment: AWS EC2 instance

Issue

After upgrading Nomad from 1.6.1 to 1.7.7, the bridge network is not functioning. The Nomad agent logs indicate that the bridge module is not found or loaded in the system kernel.

Reproduction steps

  1. Upgrade Nomad from version 1.6.1 to 1.7.7.
  2. Restart Nomad services.
  3. Check Nomad client logs for network-related errors.

Expected Result

The bridge network functions correctly without any errors, similar to the behavior seen in version 1.6.1.

Actual Result

The following errors are observed in Nomad client logs indicating issues with the bridge module:

Nomad Client logs (if appropriate)

Jul 13 02:24:33 app nomad[4191]:   error=
Jul 13 02:24:33 app nomad[4191]:   | 4 errors occurred:
Jul 13 02:24:33 app nomad[4191]:   | \t* failed to find /sys/module/bridge: stat /sys/module/bridge: no su>
Jul 13 02:24:33 app nomad[4191]:   | \t* module bridge not in /proc/modules
Jul 13 02:24:33 app nomad[4191]:   | \t* module bridge not in /lib/modules/6.8.0-1008-aws/modules.builtin
Jul 13 02:24:33 app nomad[4191]:   | \t* module bridge not in /lib/modules/6.8.0-1008-aws/modules.dep
Jul 13 02:24:33 app nomad[4191]:   |
Jul 13 02:24:33 app nomad[4191]:
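
The four errors above suggest four separate lookups for the module: sysfs, /proc/modules, modules.builtin, and modules.dep. A rough Python sketch that mirrors those lookups for debugging on a host is below. It is not Nomad's actual code; the function names, the `root` parameter, and the way index entries are matched are assumptions made for illustration.

```python
from pathlib import Path


def _entry_matches(line: str, name: str) -> bool:
    # Index lines look like "kernel/net/bridge/bridge.ko.zst: deps...";
    # compare the file name before ".ko" with the module name.
    module_path = line.split(":", 1)[0].strip()
    filename = module_path.rsplit("/", 1)[-1]
    return filename.split(".ko", 1)[0] == name


def find_module(name: str, kernel_release: str, root: Path = Path("/")) -> list[str]:
    """Return the places where kernel module `name` was found (hypothetical helper)."""
    found = []

    # 1. A loaded module appears as a directory under /sys/module.
    if (root / "sys/module" / name).is_dir():
        found.append("sysfs")

    # 2. /proc/modules lists loaded modules, one per line, name first.
    proc_modules = root / "proc/modules"
    if proc_modules.is_file():
        for line in proc_modules.read_text().splitlines():
            if line.split(" ", 1)[0] == name:
                found.append("proc")
                break

    # 3 and 4. modules.builtin and modules.dep index the module files
    # shipped for one specific kernel release.
    lib = root / "lib/modules" / kernel_release
    for index in ("modules.builtin", "modules.dep"):
        path = lib / index
        if path.is_file() and any(
            _entry_matches(line, name) for line in path.read_text().splitlines()
        ):
            found.append(index)

    return found
```

Calling `find_module("bridge", platform.release())` (with `import platform`) on the affected host should return an empty list when all four lookups fail, as in the log. A result containing only `modules.dep` would mean the module is installed but not loaded.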

Note

I worked around it by manually loading the module with sudo modprobe bridge. After doing this, I restarted Nomad and started my job, which had network.mode="bridge", and it worked fine. I think this is a regression introduced by the 1.6 -> 1.7 upgrade.
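
A manual modprobe does not survive a reboot. The sketch below loads the module and also registers it to load at boot. It assumes a systemd-based host where systemd-modules-load reads `/etc/modules-load.d/*.conf`; the helper name `ensure_module` and its parameters are hypothetical, and it must run as root.

```python
import subprocess
from pathlib import Path


def ensure_module(name, conf_dir=Path("/etc/modules-load.d"), run=subprocess.run):
    """Load kernel module `name` now and register it for boot (hypothetical helper)."""
    # Load immediately; same effect as running `modprobe <name>` by hand.
    # check=True raises CalledProcessError if modprobe fails.
    run(["modprobe", name], check=True)

    # Persist across reboots: one module name per line in a .conf file
    # (assumes systemd-modules-load is in use on this host).
    conf = Path(conf_dir) / f"{name}.conf"
    conf.write_text(f"{name}\n")
    return conf
```

Running `ensure_module("bridge")` as root and then restarting the Nomad client would reproduce the workaround described above. The `run` and `conf_dir` parameters exist only so the helper can be exercised without touching the real system.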

tgross commented 1 month ago

Hi @mr-karan! It's a little strange that you're seeing a regression here, as Nomad doesn't load the kernel module on its own. That's actually a known issue: https://github.com/hashicorp/nomad/issues/10902 (and also https://github.com/hashicorp/nomad/issues/17311, sort of). We also have the recently opened https://github.com/hashicorp/nomad/issues/23523, where someone wasn't seeing network fingerprinting happen correctly but seems to think it's Docker-related.

Is there any chance you updated infrastructure components other than Nomad around the time you saw this regression, @mr-karan? Specifically the kernel, distro, Docker version, etc.?

mr-karan commented 1 month ago

Ah yes, indeed we upgraded to Ubuntu Minimal 24.04 (we were on 22.04 Minimal in 1.6.1). I'll try to reproduce this in a VM and get back. Thanks for taking a look!

tgross commented 1 month ago

Similar issue reported here: https://github.com/hashicorp/nomad/issues/23700

We'll need to update the install documentation on this, as lots of folks are unfortunately going to get caught out by it.

tgross commented 1 month ago

Docs are being updated in #23707