@liyimeng can you ensure the br_netfilter module is loaded. The agent is supposed to load this module but it seems to not always work. I'm troubleshooting that now.
@liyimeng FYI, if you are running in a container you need to bind mount /lib/modules/$(uname -r):/lib/modules/$(uname -r):ro
so that modules can be loaded
I do have it loaded
alpine:/home/alpine/k3s/dist/artifacts# lsmod | grep netfilter
br_netfilter           20480  0
bridge                163840  1 br_netfilter
When I do troubleshooting like this: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#running-commands-in-a-pod
/ # wget -O- hostnames
Connecting to hostnames (10.43.129.144:80)
hostnames-85bc9c579-rtr4r
** server can't find hostnames.default.svc.cluster.local: NXDOMAIN
Can't find hostnames.svc.cluster.local: No answer
Can't find hostnames.cluster.local: No answer
Can't find hostnames.default.svc.cluster.local: No answer
Can't find hostnames.svc.cluster.local: No answer
*** Can't find hostnames.cluster.local: No answer
/ # nslookup hostnames.default
;; connection timed out; no servers could be reached
/ # cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.42.0.7 busybox
/ # arp
10-42-0-9.hostnames.default.svc.cluster.local (10.42.0.9) at 96:12:b9:85:cf:5c [ether] on eth0
10-42-0-8.hostnames.default.svc.cluster.local (10.42.0.8) at 1e:87:4b:df:77:2a [ether] on eth0
? (10.42.0.1) at 4e:b8:bd:7b:10:7b [ether] on eth0
10-42-0-5.kube-dns.kube-system.svc.cluster.local (10.42.0.5) at ba:e7:34:61:0a:bc [ether] on eth0
10-42-0-10.hostnames.default.svc.cluster.local (10.42.0.10) at 4e:2b:cf:de:e2:9e [ether] on eth0
How strange: I can actually reach the hostnames service with wget, but nslookup fails. I guess something is wrong with forwarding packets from pod to service, or vice versa.
Do we have kube-proxy or ipvs to map between services and pods?
OK, I see that we use kube-proxy, at least iptables, for this. Strangely enough, nslookup works on the host, just not inside the pods!
nslookup www.google.com 10.43.0.10
Server:    10.43.0.10
Address 1: 10.43.0.10

Name:      www.google.com
Address 1: 216.58.207.196 arn11s04-in-f4.1e100.net
Address 2: 2a00:1450:400e:809::2004 ams15s32-in-x04.1e100.net

alpine:/home/alpine/k3s/dist/artifacts# nslookup hostnames.default.svc.cluster.local 10.43.0.10
Server:    10.43.0.10
Address 1: 10.43.0.10

nslookup: can't resolve 'hostnames.default.svc.cluster.local': Name does not resolve
BTW, ip forwarding is on
alpine:/home/alpine/k3s/dist/artifacts# cat /proc/sys/net/ipv4/ip_forward
1
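A related sysctl worth checking alongside ip_forward (an assumption about this setup, not something shown above) is whether bridged traffic is handed to iptables at all; Kubernetes networking generally expects this to be 1 once br_netfilter is loaded:

# both should print 1 (equivalently: sysctl net.bridge.bridge-nf-call-iptables)
cat /proc/sys/net/bridge/bridge-nf-call-iptables
cat /proc/sys/net/bridge/bridge-nf-call-ip6tables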
@liyimeng is there any way I can reproduce your setup?
@ibuildthecloud Here is what I have done:
Seen the same issue when installing on an existing system. When running on a clean install, there are no issues.
After some testing, the issue appears to be having existing iptables rules with a default DROP policy on the INPUT chain. After setting the INPUT policy to ACCEPT, the issue appears to be resolved. Therefore, either a note needs to be added to the docs, or k3s needs to set up its own iptables rules on the INPUT chain to ensure that traffic does not hit the default policy, which is usually DROP or REJECT for security reasons.
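A minimal way to confirm this is the cause (assuming you can briefly relax the firewall on a test machine) is to check the current default policy and flip it to ACCEPT just long enough to see whether CoreDNS recovers:

# show the INPUT chain's default policy (first line of output)
iptables -L INPUT -n | head -1
# temporarily set it to ACCEPT for testing only; restore your DROP policy afterwards
iptables -P INPUT ACCEPT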
In a fresh installation over CentOS 7.5 I'm getting the same issue:
# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
default ds4m-0 1/1 Running 0 25m
kube-system coredns-7748f7f6df-6j8kh 0/1 CrashLoopBackOff 9 25m
kube-system helm-install-traefik-bprt6 0/1 CrashLoopBackOff 9 25m
# kubectl logs coredns-7748f7f6df-6j8kh --namespace kube-system
E0305 16:36:17.678604 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://10.43.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.43.0.1:443: connect: no route to host
E0305 16:36:19.680598 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://10.43.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.43.0.1:443: connect: no route to host
.:53
2019-03-05T16:36:21.676Z [INFO] CoreDNS-1.3.0
2019-03-05T16:36:21.676Z [INFO] linux/amd64, go1.11.4, c8f0e94
CoreDNS-1.3.0
linux/amd64, go1.11.4, c8f0e94
2019-03-05T16:36:21.676Z [INFO] plugin/reload: Running configuration MD5 = 3ef0d797df417f2c0375a4d1531511fb
E0305 16:36:23.686625 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://10.43.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.43.0.1:443: connect: no route to host
E0305 16:36:23.690587 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://10.43.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.43.0.1:443: connect: no route to host
# kubectl logs helm-install-traefik-bprt6 --namespace kube-system
+ export HELM_HOST=127.0.0.1:44134
+ HELM_HOST=127.0.0.1:44134
+ helm init --client-only
+ tiller --listen=127.0.0.1:44134 --storage=secret
[main] 2019/03/05 16:35:55 Starting Tiller v2.12.3 (tls=false)
[main] 2019/03/05 16:35:55 GRPC listening on 127.0.0.1:44134
[main] 2019/03/05 16:35:55 Probes listening on :44135
[main] 2019/03/05 16:35:55 Storage driver is Secret
[main] 2019/03/05 16:35:55 Max history per release is 0
Creating /root/.helm
Creating /root/.helm/repository
Creating /root/.helm/repository/cache
Creating /root/.helm/repository/local
Creating /root/.helm/plugins
Creating /root/.helm/starters
Creating /root/.helm/cache/archive
Creating /root/.helm/repository/repositories.yaml
Adding stable repo with URL: https://kubernetes-charts.storage.googleapis.com
Error: Looks like "https://kubernetes-charts.storage.googleapis.com" is not a valid chart repository or cannot be reached: Get https://kubernetes-charts.storage.googleapis.com/index.yaml: dial tcp: lookup kubernetes-charts.storage.googleapis.com on 10.43.0.10:53: read udp 10.42.1.2:48204->10.43.0.10:53: read: no route to host
My firewalld configuration:
# firewall-cmd --list-all
public (active)
target: default
icmp-block-inversion: no
interfaces: eth0
sources:
services: dhcpv6-client ssh
ports: 4789/udp 6443/tcp
protocols:
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
br_netfilter module is loaded:
# lsmod | grep netfilter
br_netfilter 22256 0
bridge 151336 2 br_netfilter,ebtable_broute
Which extra rules do I have to configure to get it working?
I've not used firewalld before, but essentially you need to add a rule equivalent to this iptables rule:
iptables -I INPUT 1 -i cni0 -s 10.42.0.0/16 -j ACCEPT
This rule says: permit incoming packets from interface (bridge) cni0, with source in range 10.42.0.0/16. If you wanted to be more granular, you could individually open each port, rather than accept any.
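As a sketch of that more granular approach (the ports below are assumptions about what a typical k3s node needs, e.g. the apiserver on 6443 and the kubelet on 10250, not an exhaustive list), you would accept only specific destination ports from the pod network:

iptables -I INPUT 1 -i cni0 -s 10.42.0.0/16 -p tcp --dport 6443 -j ACCEPT
iptables -I INPUT 2 -i cni0 -s 10.42.0.0/16 -p tcp --dport 10250 -j ACCEPT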
I've added the rule and coredns logs far fewer errors (although there are still some), but helm-install-traefik keeps crashing with the same error. Do I need another rule for it?
Does firewalld have a log somewhere of what packets it is blocking or a way to enable such a log? If so, look there to see what might still be getting dropped.
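If it helps, one way to get such a log (assuming a reasonably recent firewalld) is to enable logging of denied packets and then watch the kernel log for the drops:

firewall-cmd --set-log-denied=all
journalctl -kf
# rejected/dropped packets show up with their source and destination, which tells you which rule is still missing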
I'm seeing this with VMWare Photon 3.0. Adding @aaliddell's snippet to /etc/systemd/scripts/ip4save
has done the trick.
@briandealwis can you please point out which exact snippet you are referring to?
iptables -I INPUT 1 -i cni0 -s 10.42.0.0/16 -j ACCEPT
I'm having the same problems.
From inside a pod (busybox), the DNS server is configured as 10.43.x.x, but no interface with that address exists. I start the server with
/usr/local/bin/k3s server --cluster-cidr 10.10.0.0/16
without disabling the coredns service. But my machine shows no interface in the 10.43.x.x range:
cni0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.10.0.1 netmask 255.255.255.0 broadcast 10.10.0.255
ether b6:05:a1:65:5e:49 txqueuelen 1000 (Ethernet)
RX packets 1990 bytes 181178 (176.9 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1312 bytes 131217 (128.1 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.42.2.112 netmask 255.255.0.0 broadcast 10.42.255.255
ether 00:15:5d:01:d2:21 txqueuelen 1000 (Ethernet)
RX packets 744576 bytes 380910420 (363.2 MiB)
RX errors 0 dropped 6 overruns 0 frame 0
TX packets 71389 bytes 49850951 (47.5 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.10.0.0 netmask 255.255.255.255 broadcast 10.10.0.0
ether 1e:41:71:d1:92:ff txqueuelen 0 (Ethernet)
RX packets 26 bytes 1944 (1.8 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 14 bytes 888 (888.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
loop txqueuelen 1000 (Lokale Schleife)
RX packets 632239 bytes 225761419 (215.3 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 632239 bytes 225761419 (215.3 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
The 10.43.0.0/16 range is the default service ClusterIP range, which isn't actually bound to any interface but is instead a 'virtual' ip that is routed by iptables to a pod backing the service: https://kubernetes.io/docs/concepts/services-networking/service/#proxy-mode-iptables
You can change the service IP range with --service-cidr: https://github.com/rancher/k3s/pull/171
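For example, a sketch of starting the server with both ranges overridden (the CIDRs here are only placeholders; pick ranges that don't collide with your existing networks):

/usr/local/bin/k3s server --cluster-cidr 10.10.0.0/16 --service-cidr 10.20.0.0/16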
Thanks for your responses!
I added my CIDR network to the firewall (the accept rules show up there).
Still, after uninstalling k3s and reinstalling with
./install-k3s.sh --cluster-cidr=10.10.0.0/16
I get errors related to DNS.
After startup I tail the CoreDNS logs with kubectl logs -f coredns-7748f7f6df-6klwg -n kube-system:
[root@h20181152922 ~]# kubectl logs -f coredns-7748f7f6df-6klwg -n kube-system
.:53
2019-03-13T07:45:33.023Z [INFO] CoreDNS-1.3.0
2019-03-13T07:45:33.023Z [INFO] linux/amd64, go1.11.4, c8f0e94
CoreDNS-1.3.0
linux/amd64, go1.11.4, c8f0e94
2019-03-13T07:45:33.023Z [INFO] plugin/reload: Running configuration MD5 = 3ef0d797df417f2c0375a4d1531511fb
2019-03-13T07:45:54.026Z [ERROR] plugin/errors: 2 7066956279340311327.7933564162591032888. HINFO: unreachable backend: read udp 10.10.0.20:47160->1.1.1.1:53: i/o timeout
2019-03-13T07:45:57.025Z [ERROR] plugin/errors: 2 7066956279340311327.7933564162591032888. HINFO: unreachable backend: read udp 10.10.0.20:46236->1.1.1.1:53: i/o timeout
From the server logs:
Mar 13 08:50:47 docker2 k3s: time="2019-03-13T08:50:47.568418341+01:00" level=info msg="Running kubelet --healthz-bind-address 127.0.0.1 --read-only-port 0 --allow-privileged=true --cluster-domain cluster.local --kubeconfig /var/lib/rancher/k3s/agent/kubeconfig.yaml --eviction-hard imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --cgroup-driver cgroupfs --root-dir /var/lib/rancher/k3s/agent/kubelet --cert-dir /var/lib/rancher/k3s/agent/kubelet/pki --seccomp-profile-root /var/lib/rancher/k3s/agent/kubelet/seccomp --cni-conf-dir /var/lib/rancher/k3s/agent/etc/cni/net.d --cni-bin-dir /var/lib/rancher/k3s/data/e44f7a46cadac4cec9a759756f2a27fdb25e705a83d8d563207c6a6c5fa368b4/bin --cluster-dns 10.43.0.10 --container-runtime remote --container-runtime-endpoint unix:///run/k3s/containerd/containerd.sock --address 127.0.0.1 --anonymous-auth=false --client-ca-file /var/lib/rancher/k3s/agent/client-ca.pem --hostname-override h20181152922 --cpu-cfs-quota=false --runtime-cgroups /systemd/system.slice --kubelet-cgroups /systemd/system.slice"
When I try to do an nslookup from busybox:
[root@h20181152922 k3s]# /usr/local/bin/k3s kubectl run -i --tty busybox --image=busybox --restart=Never -- sh
If you don't see a command prompt, try pressing enter.
/ # nslookup www.google.de
;; connection timed out; no servers could be reached
/ # cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local haba.int
nameserver 10.43.0.10
options ndots:5
/ # ping 10.43.0.10
PING 10.43.0.10 (10.43.0.10): 56 data bytes
^C
--- 10.43.0.10 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss
So it seems DNS is not working properly...
Pings to 10.43.0.0/16 addresses aren't going to respond, due to them being 'virtual' and only really existing within iptables. If your DNS requests are getting to the CoreDNS pod, then cluster networking looks like it's working. Your issue may be related to https://github.com/rancher/k3s/issues/53 (how on earth did a DNS problem get issue number 53...)
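One way to separate "service IP broken" from "DNS pod broken" (a sketch; substitute whatever pod IP kubectl reports for CoreDNS) is to query the CoreDNS pod directly instead of the 10.43.0.10 service IP:

kubectl get pods -n kube-system -o wide | grep coredns
# then, from inside the busybox pod, point nslookup at that pod IP:
nslookup kubernetes.default.svc.cluster.local <coredns-pod-ip>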
In fact, CoreDNS isn't able to reach 1.1.1.1:53. I changed it according to #53 and now it's working!
Thanks!
BTW: you're right on the issue number. What a nice coincidence!
To anyone running into this issue on Fedora, the proper command to add the iptables rule is:
firewall-cmd --permanent --direct --add-rule ipv4 filter INPUT 1 -i cni0 -s 10.42.0.0/16 -j ACCEPT
and then a firewall-cmd --reload fixed it for me.
Still having issues with DNS resolving though.
On Fedora 29, these fixed both the CoreDNS and Traefik install for me:
firewall-cmd --permanent --direct --add-rule ipv4 filter INPUT 1 -i cni0 -s 10.42.0.0/16 -j ACCEPT
firewall-cmd --permanent --direct --add-rule ipv4 filter FORWARD 1 -s 10.42.0.0/15 -j ACCEPT
firewall-cmd --reload
Might be possible to further narrow down or optimise the /15.
On Fedora 29, these fixed both the CoreDNS and Traefik install for me:
Perfect timing! This worked like a charm for me on CentOS 7. :beers:
Getting the same issue with multi-node k3s. With only one node, everything works like a charm. When adding a new node:
- it shows up with kubectl get nodes
- when running new pods, they start properly on this node
I'm playing with this command kubectl run -it --rm --restart=Never busybox --image=busybox sh
When the CoreDNS and busybox pods are not on the same host, they can't talk. But when they are on the same node, they can...
Two fresh CentOS 7 instances launched on GCP with no firewall filtering between them.
k3s cluster launched with these start commands:
server: /usr/local/bin/k3s server --docker
node: /usr/local/bin/k3s agent --docker --server https://server:6443 --token "TOKEN"
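If only cross-node pod traffic fails, the flannel overlay between the nodes is a likely suspect (assumption: k3s is using its default flannel VXLAN backend, which encapsulates pod-to-pod traffic in UDP on port 8472). A quick check is to watch for that traffic on both nodes while the pods try to talk; if packets leave one node but never arrive on the other, something in between is dropping them:

tcpdump -ni eth0 udp port 8472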
This issue topic is very broad and each person's setup is different and unique. I'd like to close the original issue and if you are still having networking issues, can you open a new issue.
Ideally the subject is something that indicates what OS you are using, what version, and something specific about how the networking is broken.
Thanks for understanding!
On Fedora 29, these fixed both the CoreDNS and Traefik install for me:
firewall-cmd --permanent --direct --add-rule ipv4 filter INPUT 1 -i cni0 -s 10.42.0.0/16 -j ACCEPT
firewall-cmd --permanent --direct --add-rule ipv4 filter FORWARD 1 -s 10.42.0.0/15 -j ACCEPT
firewall-cmd --reload
Might be possible to further narrow down or optimise the /15.
The narrower solution would be sudo iptables -A KUBE-FORWARD -s 10.42.0.0/16 -d 10.42.0.0/16 -m state --state NEW -j ACCEPT
However, the KUBE-FORWARD chain is rewritten quickly, so the previous command only works briefly, if you are quick enough. Instead, you can use sudo firewall-cmd --direct --add-rule ipv4 filter FORWARD 1 -s 10.42.0.0/16 -d 10.42.0.0/16 -m state --state NEW -j ACCEPT
I had the same error and spent some time trying to resolve it; I even reinstalled my OS and tried different Kubernetes versions. In the end, the issue was firewalld. I disabled it, tried again with a fresh installation, and now it is working fine.
@deniseschannon it seems the same issue on k3os.
I've not used firewalld before, but essentially you need to add a rule equivalent to this iptables rule:
iptables -I INPUT 1 -i cni0 -s 10.42.0.0/16 -j ACCEPT
This rule says: permit incoming packets from interface (bridge) cni0, with source in range 10.42.0.0/16. If you wanted to be more granular, you could individually open each port, rather than accept any.
Is this correct for ufw, or do I have it the other way around:
sudo ufw allow in on cni0 from 10.42.0.0/16 comment "K3s rule : https://github.com/rancher/k3s/issues/24#issuecomment-469759329"
Getting the same issue with multi-node k3s. With only one node, everything works like a charm. When adding a new node:
- it shows up with kubectl get nodes
- when running new pods, they start properly on this node
I'm playing with this command
kubectl run -it --rm --restart=Never busybox --image=busybox sh
When the CoreDNS and busybox pods are not on the same host, they can't talk. But when they are on the same node, they can...
My config: two fresh CentOS 7 instances launched on GCP with no firewall filtering between them. k3s cluster launched with these start commands:
server: /usr/local/bin/k3s server --docker
node: /usr/local/bin/k3s agent --docker --server https://server:6443 --token "TOKEN"
@Lunik did you end up finding a solution for this? I am having the same problem.
In case you are having a similar issue: I noticed there were rules related to Docker in my chains (I was using containerd). The steps I followed:
On Fedora 31 I found the simplest thing to do was:
firewall-cmd --permanent --direct --add-rule ipv4 filter FORWARD 0 -i cni0 -j ACCEPT
firewall-cmd --reload
(edit: fix random space in -i)
@kbrowder would you mind reviewing that command? I'm on Fedora 31 and seeing this issue, but when I try your command it says it's not a valid ipv4 filter command.
@kbrowder would you mind reviewing that command? I'm on Fedora 31 and seeing this issue, but when I try your command it says it's not a valid ipv4 filter command.
There was just a typo in his command: it says - i cni0 instead of -i cni0. Besides that, it works for me on CentOS 8.
@maci0, whoops, you're right. I edited my response above. Sorry for the delay, @jbutler992.
I would just like to summarize this post by clearly stating the two iptables rules, taken from above, which fixed my broken fresh install of k3s in a fraction of a second, after several days of struggling with it:
sudo iptables -I INPUT 1 -i cni0 -s 10.42.0.0/16 -j ACCEPT
sudo iptables -I FORWARD 1 -s 10.42.0.0/15 -j ACCEPT
Many many thanks to you all for your contribution!
This is an old thread, but I still want to share this to potentially save someone days of frustration.
I had a private network of 10.42.42.0/24, which overlaps the default 10.42.0.0/16 cluster CIDR, and could not figure out why k3s was not working. Using non-default cluster/service CIDRs fixed the networking issues for me.
This is an old thread, but I still want to share this to potentially save someone days of frustration.
I had a private network of 10.42.42.0/24 and could not figure out why k3s was not working. Using non-default cluster/service CIDRs fixed networking issues for me.
Thank you, kind sir. That hint saved me a ton of time!
I would just like to summarize this post by clearly stating the two iptables rules, taken from above, which fixed my broken fresh install of k3s in a fraction of a second, after several days of struggling with it:
sudo iptables -I INPUT 1 -i cni0 -s 10.42.0.0/16 -j ACCEPT
sudo iptables -I FORWARD 1 -s 10.42.0.0/15 -j ACCEPT
Many many thanks to you all for your contribution!
Thank you!
The helm install job never succeeds; it seems it is not possible to reach the DNS server. Verify by running a busybox pod, as shown below.
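A minimal version of that check, along the lines already used earlier in this thread (assuming the goal is just to confirm in-cluster DNS resolution):

kubectl run -it --rm --restart=Never busybox --image=busybox -- sh
# then, inside the pod:
nslookup kubernetes.default.svc.cluster.local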