newsnowlabs / runcvm

RunCVM (Run Container VM) is an experimental open-source Docker container runtime for launching standard container workloads (as well as Systemd, Docker, and even OpenWrt) in VMs using 'docker run'.
Apache License 2.0

DNS lookups fail in RunCVM in Google Cloud #16

Closed struanb closed 5 months ago

struanb commented 5 months ago

DNS lookups fail in RunCVM container/VMs launched in Google Cloud instances (from images with vmx enabled).

This may be because Google instances' own DNS uses the same IP as the DNS server RunCVM provides to its VMs, 169.254.169.254.

Test this by assigning a different IP to RunCVM DNS, and test on AWS and Azure.

Consider changing the RunCVM default, and adding a RUNCVM_DNS option to override the default.

struanb commented 5 months ago

This turns out not to be (strictly) a clash between the Google DNS and the IP used to route VM DNS to dnsmasq within the container.

Instead, the issue is caused by Google's default network security settings (specifically, the rp_filter settings) within some of its Linux images (at least, the Debian image):

# grep rp_filter /etc/sysctl.d/60-gce-network-security.conf 
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.default.rp_filter=1

See https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt for details of this setting.
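To see which mode actually applies on a given host, the per-interface values can be combined with the "all" value. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function names effective_rp_filter and show_rp_filter are hypothetical helpers, not part of RunCVM. It relies on the rule in the kernel documentation linked above that the maximum of the conf/all and conf/{interface} values is used for source validation.

```shell
#!/bin/sh
# Effective rp_filter for an interface is max(conf/all, conf/<iface>),
# per the kernel's ip-sysctl documentation.
effective_rp_filter() {
    # $1 = value of conf/all/rp_filter, $2 = value of conf/<iface>/rp_filter
    if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Print configured and effective rp_filter for every real interface.
show_rp_filter() {
    all_file=/proc/sys/net/ipv4/conf/all/rp_filter
    [ -r "$all_file" ] || { echo "rp_filter sysctls not readable" >&2; return 1; }
    all=$(cat "$all_file")
    for f in /proc/sys/net/ipv4/conf/*/rp_filter; do
        [ -r "$f" ] || continue
        iface=$(basename "$(dirname "$f")")
        # "all" and "default" are templates, not interfaces
        case "$iface" in all|default) continue ;; esac
        val=$(cat "$f")
        printf '%-20s configured=%s effective=%s\n' \
            "$iface" "$val" "$(effective_rp_filter "$all" "$val")"
    done
}

show_rp_filter || true
```

On an affected Google Cloud Debian instance one would expect the Docker bridge to show an effective value of 1 (strict mode), given the two settings quoted above.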

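The problem arises because Docker DNS appears to issue DNS UDP packets with a source address of the container bridge IP, which in RunCVM is overridden from its default value (of the container interface IP) with 169.254.1.1. Under strict reverse-path filtering (rp_filter=1), the host only accepts a packet on an interface if the route back to the packet's source address uses that same interface; 169.254.1.1 is not an address within the Docker network's subnet, so these packets presumably fail that check and are dropped.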

Observing diagnostics is hard without also changing the RUNCVM_DNS_IP in runcvm-ctr-entrypoint from 169.254.169.254 (which is also Google Cloud's internal DNS IP configured in the instance's /etc/resolv.conf) to, say, 169.254.2.2 (since RunCVM really doesn't mind). Having done this, the failure of nslookup abc.com. can be seen in the logs as follows (where br-86ec12c40838 is the bridge associated with the Docker custom network to which the launched RunCVM container-VM is connected):

docker run --rm -it --runtime=runcvm --network=test --name=test alpine ash -c 'echo Starting...; echo; nslookup abc.com.'
Starting...

Server:     169.254.2.2
Address:    169.254.2.2:53

;; connection timed out; no servers could be reached

$ journalctl -u docker -g resolver -f
Feb 02 23:34:10 instance-1 dockerd[901]: time="2024-02-02T23:34:10.709427720Z" level=error msg="[resolver] failed to query DNS server: 169.254.169.254:53, query: ;qwerty.com.\tIN\t AAAA" error="read udp 169.254.1.1:60562->169.254.169.254:53: i/o timeout"

# tcpdump -n udp -i br-86ec12c40838 port 53
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on br-86ec12c40838, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:33:52.652540 IP 169.254.2.2.53 > 172.21.0.2.60525: 47457 ServFail 0/0/0 (25)
23:33:52.691150 IP 169.254.2.2.53 > 172.21.0.2.60525: 47805 ServFail 0/0/0 (25)

Compare with a runc container:

# tcpdump -n udp -i br-86ec12c40838 port 53
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
23:44:23.365393 IP 172.21.0.2.34698 > 169.254.169.254.53: 36575+ A? abc.com. (25)
23:44:23.407471 IP 169.254.169.254.53 > 172.21.0.2.34698: 36575 4/0/0 A 54.230.18.46, A 54.230.18.128, A 54.230.18.11, A 54.230.18.23 (89)

The workaround is to set rp_filter to 2 (loose mode) or 0 (no source validation) on, at minimum, the Docker custom network bridge. (Indeed, on RunCVM development VMs and test servers, the rp_filter settings are currently all 0, as per standard Debian defaults.)

This can be done at varying levels of generality, e.g.:

  1. sysctl net.ipv4.conf.$BRIDGE.rp_filter=2 or sysctl net.ipv4.conf.$BRIDGE.rp_filter=0 (where $BRIDGE is the Docker custom network bridge that the container is connected to)
  2. sysctl net.ipv4.conf.default.rp_filter=2 or sysctl net.ipv4.conf.default.rp_filter=0 (before dockerd starts and creates any bridges)
  3. sysctl net.ipv4.conf.all.rp_filter=2 (at any time, since "The max value from conf/{all,interface}/rp_filter is used when doing source validation on the {interface}.")
  4. Comment out any net.ipv4.conf.all.rp_filter=1 or net.ipv4.conf.default.rp_filter=1 lines in /etc/sysctl.conf or /etc/sysctl.d/* (e.g. /etc/sysctl.d/60-gce-network-security.conf), or replace such lines with net.ipv4.conf.all.rp_filter=2.

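
As an illustration of option 1, the bridge name can be derived from the Docker network rather than looked up by hand. This is a rough sketch, not RunCVM code: it assumes the network was created without a custom com.docker.network.bridge.name option, so that Docker names the bridge "br-" followed by the first 12 characters of the network ID (consistent with br-86ec12c40838 above). The function names are hypothetical, and loosen_rp_filter must be run as root.

```shell
#!/bin/sh
# Derive the default Linux bridge name for a Docker user-defined bridge
# network: "br-" followed by the first 12 characters of the network ID.
# Assumption: no custom com.docker.network.bridge.name option was set.
bridge_name_from_id() {
    printf 'br-%s\n' "$(printf '%s' "$1" | cut -c1-12)"
}

# Set loose-mode reverse-path filtering (2) on the bridge of a named network.
# $1 is the Docker network name (e.g. "test"); requires root and a running dockerd.
loosen_rp_filter() {
    network=$1
    id=$(docker network inspect -f '{{.Id}}' "$network") || return 1
    bridge=$(bridge_name_from_id "$id")
    sysctl -w "net.ipv4.conf.$bridge.rp_filter=2"
}

# Example (not run automatically): loosen_rp_filter test
```

The sketch sets 2 rather than 0 because, by the max rule quoted in option 3, a value of 0 on the bridge alone would still be overridden by net.ipv4.conf.all.rp_filter=1 on an instance with the Google defaults. The setting does not persist across reboots or bridge re-creation; option 4 is the persistent variant.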
struanb commented 5 months ago

Closed by https://github.com/newsnowlabs/runcvm/pull/17