moby / libnetwork

networking for containers

Failing to attach containers to encrypted overlay network since linux 5.15.17 #2653

Open arnegroskurth opened 2 years ago

arnegroskurth commented 2 years ago

(I phrased this issue for moby/moby before realizing that libnetwork is a separate component, so apologies for the Docker-based description.)

Description

It's currently not possible to communicate over encrypted overlay networks with kernel 5.15.17, because the interface ID is left unset when the IPsec tunnel is configured.

Downstream issue: https://github.com/coreos/fedora-coreos-tracker/issues/1111

Steps to reproduce the issue:

With two Linux 5.15.17 hosts: create an encrypted overlay network in a swarm, attach one container on each node to it, and try to communicate between the two containers.

Additional information you deem important (e.g. issue happens only occasionally):

Related Linux change: https://github.com/torvalds/linux/commit/68ac0f3810e7

Potential workaround in the netlink library: https://github.com/vishvananda/netlink/pull/727

Missing Ifid for the netlink.XfrmPolicy struct (there may be more places): https://github.com/moby/libnetwork/blob/64b7a4574d1426139437d20e81c0b6d391130ec8/drivers/overlay/encryption.go#L343
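
To make the failure mode concrete, here is a minimal, self-contained Go sketch. It does not use the real netlink library: `xfrmPolicy`, `attr`, `buildAttrs` and `kernelValidate` are hypothetical stand-ins, and the sketch assumes that the kernel now rejects an explicit IF_ID attribute of 0 and that the workaround amounts to not sending the attribute when Ifid is unset.

```go
package main

import (
	"errors"
	"fmt"
)

// xfrmPolicy is a hypothetical stand-in for the part of netlink.XfrmPolicy
// that matters here; it is not the real struct.
type xfrmPolicy struct {
	Ifid int // 0 means "no XFRM interface ID set"
}

// attr models one netlink attribute as it would appear in a request.
type attr struct {
	name  string
	value uint32
}

// buildAttrs models request construction. With omitZero=false the IF_ID
// attribute is always sent (the assumed behaviour before the workaround);
// with omitZero=true it is only sent when Ifid is non-zero.
func buildAttrs(p xfrmPolicy, omitZero bool) []attr {
	var attrs []attr
	if p.Ifid != 0 || !omitZero {
		attrs = append(attrs, attr{"XFRMA_IF_ID", uint32(p.Ifid)})
	}
	return attrs
}

var errInvalid = errors.New("invalid argument")

// kernelValidate models the stricter check assumed above: a request that
// carries an explicit IF_ID of 0 is rejected.
func kernelValidate(attrs []attr) error {
	for _, a := range attrs {
		if a.name == "XFRMA_IF_ID" && a.value == 0 {
			return errInvalid
		}
	}
	return nil
}

func main() {
	p := xfrmPolicy{} // Ifid left unset, as this issue reports for the overlay driver
	fmt.Println("always send IF_ID:", kernelValidate(buildAttrs(p, false)))
	fmt.Println("omit zero IF_ID:  ", kernelValidate(buildAttrs(p, true)))
}
```

In this model the policy itself is unchanged in both cases; only the presence of the zero-valued attribute in the request decides whether validation passes.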

arnegroskurth commented 2 years ago

Also: does it really make sense to log the failure to create the xfrm policies only as a warning? The network (attachment) does not seem usable without those policies, so I would rather expect an error to appear in the Docker client when creating or starting a container.
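
The difference between the two behaviours can be sketched as follows. This is only an illustration, not libnetwork code: `addPolicy`, `programWarnOnly` and `programStrict` are hypothetical names, and `addPolicy` always fails to mimic the kernel rejecting the request.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// addPolicy is a hypothetical stand-in for the call that installs an xfrm
// policy; here it always fails, to mimic the kernel rejecting the request.
func addPolicy() error {
	return errors.New("invalid argument")
}

// programWarnOnly mirrors the behaviour questioned above: the failure is
// logged as a warning and the caller still sees success.
func programWarnOnly(add func() error) error {
	if err := add(); err != nil {
		log.Printf("warning: failed to add xfrm policy: %v", err)
	}
	return nil
}

// programStrict returns the failure so that a caller can abort the
// attachment and report the error to the user.
func programStrict(add func() error) error {
	if err := add(); err != nil {
		return fmt.Errorf("programming xfrm policy: %w", err)
	}
	return nil
}

func main() {
	fmt.Println("warn-only:", programWarnOnly(addPolicy))
	fmt.Println("strict:   ", programStrict(addPolicy))
}
```

With the strict variant, the wrapped error keeps the underlying cause available through `errors.Unwrap`, so a caller could both fail the container start and show why.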

jsmouret commented 2 years ago

Similar issue on Debian Buster, with the same logs as https://github.com/coreos/fedora-coreos-tracker/issues/1111#issuecomment-1049171739

Working with linux-image-4.19.0-18-amd64

Broken with linux-image-4.19.0-19-amd64

Nowheresly commented 2 years ago

Related to this issue:

https://github.com/moby/moby/issues/43359#issue-1166547264

smin commented 2 years ago

The Ubuntu kernels don't seem to have reverted the validation that requires the XFRM IF_ID to be > 0. Corrections to the original patch have been included in the latest linux-aws-5.13, which could be read as an indication that the validation is staying (https://launchpad.net/ubuntu/+source/linux-aws-5.13/5.13.0-1023.25~20.04.1 and https://launchpad.net/bugs/1968591).

What's the appropriate change in Moby or libnetwork?

  1. Pass a non-zero Ifid to the netlink call?
  2. Patch the netlink library to include the changes in https://github.com/vishvananda/netlink/pull/727?
  3. Update the netlink library in vendor.conf to a newer release that includes the PR above?