@zypriafl Please see the debug section in the readme and run some tests at the node and cluster level. Also look at the Cilium logs, and see the Cilium example in the examples section, also in the readme.
@M4t7e Any ideas on this DNS failure for an Egress node?
Could it be some kind of deadlock? "NetworkPluginNotReady: cni plugin not initialized" because the Cilium container cannot be pulled, which in turn happens because the network is not ready?
@zypriafl You are affected by the infamous blocks on Iranian IPs (or IPs perceived as such), which are prevented from pulling from GCR and every service that depends on it. So, two solutions:
Either use k3s_registries to proxy all container pull requests through another, unblocked registry, or, probably the easiest solution:
Cordon the node and brutally delete it with hcloud server delete xxx; that will free up its IP. Then, with hcloud as well, or via the UI, register a floating IP (just temporarily); it will be assigned the liberated IP. Then terraform apply again to deploy the missing node with a new IP that is not blocked from GCR, and afterwards release the temporarily reserved IP.
The latter trick is a 5-minute op and should work! Good luck.
It looks like the node itself has no working DNS...
Regarding #830: That is not valid anymore. Cilium CNI is no longer limited to a single interface; it detects the needed interfaces automatically, which is mandatory for BPF-based NAT scenarios.
@zypriafl Can you execute the following commands in bash and provide me the output please? Just copy & paste all of it and hit enter.
set -o xtrace
ip a show eth0
ip a show eth1
nmcli device show eth0
nmcli device show eth1
ip route
ip route get 1.1.1.1
ping -c 5 1.1.1.1
cat /etc/resolv.conf
cat /etc/NetworkManager/conf.d/dns.conf
dig google.com @185.12.64.1 +short
dig google.com @185.12.64.2 +short
dig google.com @2a01:4ff:ff00::add:1 +short
dig google.com @2a01:4ff:ff00::add:2 +short
dig google.com @1.1.1.1 +short
dig google.com @8.8.8.8 +short
dig google.com @9.9.9.9 +short
curl -v google.com
set +o xtrace
thanks @M4t7e, here are the outputs:
app-scaler-egress-pool-nfe:~ # set -o xtrace
+ set -o xtrace
app-scaler-egress-pool-nfe:~ # ip a show eth0
+ ip --color=auto a show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 96:00:02:b7:b2:4f brd ff:ff:ff:ff:ff:ff
altname enp0s3
altname ens3
inet 5.75.209.253/32 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet 91.107.195.41/32 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet6 fe80::352f:e891:5704:18f9/64 scope link noprefixroute
valid_lft forever preferred_lft forever
app-scaler-egress-pool-nfe:~ # ip a show eth1
+ ip --color=auto a show eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP group default qlen 1000
link/ether 86:00:00:67:45:19 brd ff:ff:ff:ff:ff:ff
altname enp0s10
altname ens10
inet 172.16.48.101/32 scope global dynamic noprefixroute eth1
valid_lft 86233sec preferred_lft 86233sec
inet6 fe80::b3a2:e260:cd40:5d04/64 scope link noprefixroute
valid_lft forever preferred_lft forever
app-scaler-egress-pool-nfe:~ # nmcli device show eth0
+ nmcli device show eth0
GENERAL.DEVICE: eth0
GENERAL.TYPE: ethernet
GENERAL.HWADDR: 96:00:02:B7:B2:4F
GENERAL.MTU: 1500
GENERAL.STATE: 100 (connected)
GENERAL.CONNECTION: eth0
GENERAL.CON-PATH: /org/freedesktop/NetworkManager/ActiveConnection/4
WIRED-PROPERTIES.CARRIER: on
IP4.ADDRESS[1]: 91.107.195.41/32
IP4.ADDRESS[2]: 5.75.209.253/32
IP4.GATEWAY: 172.31.1.1
IP4.ROUTE[1]: dst = 172.31.1.1/32, nh = 0.0.0.0, mt = 20100
IP4.ROUTE[2]: dst = 0.0.0.0/0, nh = 172.31.1.1, mt = 20100
IP6.ADDRESS[1]: fe80::352f:e891:5704:18f9/64
IP6.GATEWAY: --
IP6.ROUTE[1]: dst = fe80::/64, nh = ::, mt = 1024
app-scaler-egress-pool-nfe:~ # nmcli device show eth1
GENERAL.DEVICE: eth1
GENERAL.TYPE: ethernet
GENERAL.HWADDR: 86:00:00:67:45:19
GENERAL.MTU: 1450
GENERAL.STATE: 100 (connected)
GENERAL.CONNECTION: eth1
GENERAL.CON-PATH: /org/freedesktop/NetworkManager/ActiveConnection/3
WIRED-PROPERTIES.CARRIER: on
IP4.ADDRESS[1]: 172.16.48.101/32
IP4.GATEWAY: 172.16.0.1
IP4.ROUTE[1]: dst = 0.0.0.0/0, nh = 172.16.0.1, mt = 20101
IP4.ROUTE[2]: dst = 172.16.0.0/12, nh = 172.16.0.1, mt = 101
IP4.ROUTE[3]: dst = 172.16.0.1/32, nh = 0.0.0.0, mt = 101
IP6.ADDRESS[1]: fe80::b3a2:e260:cd40:5d04/64
IP6.GATEWAY: --
IP6.ROUTE[1]: dst = fe80::/64, nh = ::, mt = 1024
app-scaler-egress-pool-nfe:~ #
app-scaler-egress-pool-nfe:~ # ip route
+ ip --color=auto route
default via 172.31.1.1 dev eth0 proto static metric 20100
default via 172.16.0.1 dev eth1 proto dhcp src 172.16.48.101 metric 20101
172.16.0.0/12 via 172.16.0.1 dev eth1 proto dhcp src 172.16.48.101 metric 101
172.16.0.1 dev eth1 proto dhcp scope link src 172.16.48.101 metric 101
172.31.1.1 dev eth0 proto static scope link metric 20100
app-scaler-egress-pool-nfe:~ # ip route get 1.1.1.1
+ ip --color=auto route get 1.1.1.1
1.1.1.1 via 172.31.1.1 dev eth0 src 5.75.209.253 uid 0
cache
app-scaler-egress-pool-nfe:~ # ping -c 5 1.1.1.1
+ ping -c 5 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=55 time=6.61 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=55 time=6.45 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=55 time=5.64 ms
64 bytes from 1.1.1.1: icmp_seq=4 ttl=55 time=5.73 ms
64 bytes from 1.1.1.1: icmp_seq=5 ttl=55 time=5.84 ms
--- 1.1.1.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4007ms
rtt min/avg/max/mdev = 5.638/6.055/6.608/0.396 ms
app-scaler-egress-pool-nfe:~ # cat /etc/resolv.conf
+ cat /etc/resolv.conf
# Generated by NetworkManager
app-scaler-egress-pool-nfe:~ # cat /etc/NetworkManager/conf.d/dns.conf
+ cat /etc/NetworkManager/conf.d/dns.conf
cat: /etc/NetworkManager/conf.d/dns.conf: No such file or directory
app-scaler-egress-pool-nfe:~ # dig google.com @185.12.64.1 +short
+ dig google.com @185.12.64.1 +short
216.58.212.142
app-scaler-egress-pool-nfe:~ # dig google.com @185.12.64.2 +short
+ dig google.com @185.12.64.2 +short
216.58.206.46
app-scaler-egress-pool-nfe:~ # dig google.com @2a01:4ff:ff00::add:1 +short
+ dig google.com @2a01:4ff:ff00::add:1 +short
;; UDP setup with 2a01:4ff:ff00::add:1#53(2a01:4ff:ff00::add:1) for google.com failed: network unreachable.
;; UDP setup with 2a01:4ff:ff00::add:1#53(2a01:4ff:ff00::add:1) for google.com failed: network unreachable.
;; UDP setup with 2a01:4ff:ff00::add:1#53(2a01:4ff:ff00::add:1) for google.com failed: network unreachable.
app-scaler-egress-pool-nfe:~ # dig google.com @2a01:4ff:ff00::add:2 +short
+ dig google.com @2a01:4ff:ff00::add:2 +short
;; UDP setup with 2a01:4ff:ff00::add:2#53(2a01:4ff:ff00::add:2) for google.com failed: network unreachable.
;; UDP setup with 2a01:4ff:ff00::add:2#53(2a01:4ff:ff00::add:2) for google.com failed: network unreachable.
;; UDP setup with 2a01:4ff:ff00::add:2#53(2a01:4ff:ff00::add:2) for google.com failed: network unreachable.
app-scaler-egress-pool-nfe:~ # dig google.com @1.1.1.1 +short
+ dig google.com @1.1.1.1 +short
142.250.185.110
app-scaler-egress-pool-nfe:~ # dig google.com @8.8.8.8 +short
+ dig google.com @8.8.8.8 +short
216.58.206.46
app-scaler-egress-pool-nfe:~ # dig google.com @9.9.9.9 +short
+ dig google.com @9.9.9.9 +short
142.250.186.142
app-scaler-egress-pool-nfe:~ # curl -v google.com
+ curl -v google.com
* Could not resolve host: google.com
* Closing connection 0
curl: (6) Could not resolve host: google.com
app-scaler-egress-pool-nfe:~ # set +o xtrace
+ set +o xtrace
app-scaler-egress-pool-nfe:~ #
@mysticaltech the issue persists even with another IP.
Thx for the output @zypriafl
That's strange, you have no DNS servers configured as far as I can see. Was that a fresh server installation, or is there some history to that node?
Please reboot the node and try to reach google.com again. If that does not help, please destroy and redeploy the node, or try to configure the DNS servers explicitly in kube.tf with either dns_servers = ["1.1.1.1", "8.8.8.8", "9.9.9.10"] (Cloudflare, Google & Quad9) or dns_servers = ["185.12.64.1", "185.12.64.2"] (Hetzner DNS).
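For reference, here is a minimal sketch of where that setting goes in the kube.tf module block (most other arguments are omitted, and the module source shown is the usual registry path; adjust to your own setup):

module "kube-hetzner" {
  source = "kube-hetzner/kube-hetzner/hcloud"

  # ... hcloud token, network region, node pools, etc. ...

  # Explicitly configured DNS servers for all nodes, instead of relying on
  # the DHCP-provided ones. Either public resolvers:
  dns_servers = ["1.1.1.1", "8.8.8.8", "9.9.9.10"]
  # or the Hetzner resolvers:
  # dns_servers = ["185.12.64.1", "185.12.64.2"]
}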
Btw, it looks like your IPv6 configuration is not okay: you have no publicly routable IPv6 addresses on your egress node. Operating systems nowadays usually prefer IPv6 over IPv4, but this only explains why you can't resolve DNS records via IPv6.
Thank you. The node was freshly created. Setting the DNS servers explicitly in kube.tf fixed it, and the Cilium image can now be pulled: dns_servers = ["1.1.1.1", "8.8.8.8", "9.9.9.10"]
For me the issue is resolved. However, maybe there is still something to be fixed so that it works without setting dns_servers explicitly...
@mysticaltech maybe we have an unfortunate combination here... My theory:
If a floating IP is used, ipv4.method manual disables DHCP for IPv4 entirely, and therefore also prevents obtaining the IPv4 DNS servers:
https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/blob/9bf9edc1a579c097b10954ddab2d5a5aea7067e4/agents.tf#L210-L215
If IPv6 is enabled, it can solve the problem, because you still have the option of resolving DNS via IPv6. But here IPv6 was not working either. This needs to be tested, but so far it is the only explanation I have.
A solution could be to explicitly pre-configure the DNS servers in variables.tf with the Hetzner resolvers; then we have full redundancy without depending on NetworkManager + DHCP.
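Something along these lines in variables.tf, using the Hetzner resolvers the nodes can already reach (a rough sketch only; the variable name matches the existing dns_servers option, but the exact defaults and description are assumptions, not the shipped implementation):

variable "dns_servers" {
  type        = list(string)
  description = "DNS servers to configure on the nodes."
  # Hetzner's own recursive resolvers (IPv4 + IPv6) as redundant defaults,
  # so name resolution keeps working even when NetworkManager/DHCP does not
  # hand out any DNS servers (e.g. with ipv4.method manual for floating IPs).
  default = [
    "185.12.64.1",
    "185.12.64.2",
    "2a01:4ff:ff00::add:1",
    "2a01:4ff:ff00::add:2",
  ]
}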
@M4t7e It makes sense, let's add those default DNS resolvers, will do tomorrow but don't hesitate to PR 🙏
@zypriafl It should be fixed by default in v2.10.0. I followed @M4t7e's fix suggestion.
Description
I am trying to use egress with Cilium. However, the egress node seems to have a DNS issue and is therefore not able to pull the Cilium image (see screenshots).
This might be the same issue as discussed here: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/830. Unfortunately, that discussion doesn't make it clear how to resolve it.
Thank you for your help.
Kube.tf file
Screenshots
Platform
Linux