Containers w/ upstreams lose network when jobs are started/stopped too often

Background

[Filed as a request because I'm not sure whether this really is a bug]

Lately, we've seen that we completely lose connection to an important container of ours, and Consul is unable to health-check it. After some digging, we could see that the ARP cache within the container (/proc/net/arp) held a totally alien MAC address that did not correspond to the MAC address of the nomad interface on the host, meaning the container had no way of communicating with the outside.
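The check described above can be sketched roughly as follows. This is only an illustration: the function names `arp_mac_for` and `arp_entry_matches` are hypothetical helpers, not part of Nomad or Consul, and the optional file argument exists only so the parsing can be tried against a saved copy of /proc/net/arp.

```shell
#!/bin/sh
# Print the MAC address cached for IP $1 in ARP table file $2
# (default /proc/net/arp; columns: IP address, HW type, Flags,
# HW address, Mask, Device). Hypothetical helper for illustration.
arp_mac_for() {
    ip_addr=$1
    arp_file=${2:-/proc/net/arp}
    # Skip the header line, match on the IP column, print the MAC column.
    awk -v ip="$ip_addr" 'NR > 1 && $1 == ip { print tolower($4); exit }' "$arp_file"
}

# Succeed only if the cached MAC for gateway IP $1 equals the expected
# bridge MAC $2. Optional $3 is the ARP table file to read.
arp_entry_matches() {
    cached=$(arp_mac_for "$1" "${3:-/proc/net/arp}")
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ -n "$cached" ] && [ "$cached" = "$expected" ]
}
```

On the host, the bridge's current address can be read from /sys/class/net/nomad/address; running `arp_entry_matches <gateway-ip> <bridge-mac>` inside the container's network namespace would then show whether the cached entry is stale. The gateway IP depends on the configured bridge subnet.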

The reason seems to be that we have another job, with an upstream, that starts and stops a couple of times per hour. From what we understand, a Linux bridge takes the lowest MAC address among the interfaces attached to it. This means that every once in a while the bridge nomad changes its MAC address, and normally all containers should have their ARP caches updated as a result. But for some reason that doesn't happen every time for us, and then we're toast. Even though the cache should be refreshed after 60 seconds, it isn't, for reasons we don't really have an answer to.
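To make the "lowest MAC wins" behaviour concrete, here is a small sketch that compares the addresses of the bridge's ports. The helper names `lowest_mac` and `bridge_port_macs` are hypothetical, and the bridge name `nomad` is assumed as the default.

```shell
#!/bin/sh
# Print the numerically lowest of the MAC addresses given as arguments.
# Lowercasing first makes a byte-wise (C locale) sort agree with numeric
# order for fixed-width, colon-separated hex strings.
lowest_mac() {
    printf '%s\n' "$@" | tr 'A-F' 'a-f' | LC_ALL=C sort | head -n 1
}

# Print the MAC of every port currently attached to bridge $1
# (default: nomad), one per line, by reading sysfs.
bridge_port_macs() {
    for port in /sys/class/net/"${1:-nomad}"/brif/*; do
        [ -e "$port" ] || continue   # bridge missing or has no ports
        cat "/sys/class/net/$(basename "$port")/address"
    done
}
```

Comparing `lowest_mac $(bridge_port_macs nomad)` with the contents of /sys/class/net/nomad/address before and after the short-lived job starts should show whether the bridge address follows the lowest port address.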

I'm not sure whether this is a bug in Linux (kernel 5.14, Ubuntu 20.04 LTS) or something else. One way around the problem is to run arping -U on the bridge, after which the container updates its cache. Another way would be to pre-create the nomad bridge during boot and set it to have a low MAC address (00:00:..), preventing it from flipping over to something else, as no other interface added to the bridge will have a lower address. Yet another way would be a bridge_network_mac_address option in Nomad's config, but that is not supported as of v1.2.6; only a name and a subnet can be configured.
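The first two workarounds might look something like the sketch below. It rests on several assumptions: the iputils version of `arping`, root privileges, and Nomad/CNI reusing a bridge that already exists at boot. The function names are hypothetical, and the MAC address is deliberately left as a required argument rather than a default.

```shell
#!/bin/sh
# Return success if $1 looks like a full six-octet MAC address.
is_valid_mac() {
    printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# Workaround 1: send unsolicited ARP announcements from the bridge so
# that containers refresh their caches.
# $1 = bridge name, $2 = bridge IPv4 address, $3 = count (default 3).
announce_bridge() {
    arping -U -c "${3:-3}" -I "$1" "$2"
}

# Workaround 2: create the bridge up front with a fixed, low MAC address.
# $1 = bridge name, $2 = MAC address chosen by the operator.
precreate_bridge() {
    bridge=$1
    mac=$2
    is_valid_mac "$mac" || { echo "invalid MAC: $mac" >&2; return 1; }
    ip link add name "$bridge" type bridge &&
        ip link set dev "$bridge" address "$mac" &&
        ip link set dev "$bridge" up
}
```

`precreate_bridge` would have to run before the Nomad client starts (for example from a boot-time unit), while `announce_bridge` could be run by hand or periodically as a stopgap.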

Proposal

From what we've seen, others have similar problems without solutions, so please treat this issue as an informational post and, if possible, also as a request for a new configuration parameter that lets Nomad set the address on the bridge it creates.

Hi @dozepih! Your analysis seems sound to me.

> Another way would be to pre-make the nomad bridge during boot, and set it to have a low mac address (00:00:..), thus preventing it flipping over to something else as no other interface added to the bridge will have a lower address.

Yeah, I proposed that in https://github.com/hashicorp/nomad/issues/6618 to address unrelated issues in CNI, but we've never gotten around to doing it. It looks like https://github.com/hashicorp/nomad/issues/10915 is another, more recent issue that would be fixed by the same solution.

As a workaround in the meantime, you may also be able to improve the situation by expiring ARP entries more frequently, but I understand that may not be a super desirable solution, depending on how persistent the connections between processes are:

echo 1 > /proc/sys/net/ipv4/neigh/nomad/gc_stale_time
echo 1 > /proc/sys/net/ipv4/neigh/nomad/delay_first_probe_time
echo 1 > /proc/sys/net/ipv4/neigh/nomad/retrans_time
echo 1 > /proc/sys/net/ipv4/neigh/nomad/base_reachable_time
echo 1 > /proc/sys/net/ipv4/neigh/nomad/locktime

Thanks for opening this issue, @dozepih. I'm going to leave this open and marked as an enhancement for roadmapping.

hashicorp / nomad

Containers w/ upstreams lose network when jobs are started/stopped too often #12103
