kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

CoreDNS service does not work, but the endpoint is OK; all other SVCs are normal except DNS #63900

Closed yinwenqin closed 6 years ago

yinwenqin commented 6 years ago

here is the status right now:

root:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T11:55:20Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

root:~# kubectl get svc -n kube-system | grep dns
kube-dns               ClusterIP   10.96.0.10       <none>         53/UDP,53/TCP   68d

root:~# kubectl get pod -n kube-system -o wide | grep dns 
coredns-657fc9d5b8-85mfb                1/1       Running     0          28m       172.26.2.164   yksp009029

root:~# kubectl get ep kube-dns -n kube-system
NAME       ENDPOINTS                         AGE
kube-dns   172.26.2.164:53,172.26.2.164:53   68d

#here is the svc info
root:~# kubectl get svc -o wide -n kube-system
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP    PORT(S)         AGE       SELECTOR
heapster               ClusterIP   10.111.110.137   <none>         80/TCP          67d       k8s-app=heapster
kube-dns               ClusterIP   10.96.0.10       <none>         53/UDP,53/TCP   68d       k8s-app=kube-dns
kubernetes-dashboard   NodePort    10.104.130.140   <none>         443:30843/TCP   67d       k8s-app=kubernetes-dashboard
monitoring-grafana     ClusterIP   10.109.146.115   192.168.9.60   80/TCP          67d       k8s-app=grafana
monitoring-influxdb    ClusterIP   10.101.110.69    <none>         8086/TCP        67d       k8s-app=influxdb
traefik-web-ui         ClusterIP   10.98.101.187    <none>         80/TCP          67d       k8s-app=traefik-ingress-lb

#all SVCs work normally except kube-dns
root:~# telnet 10.111.110.137 80
Trying 10.111.110.137...
Connected to 10.111.110.137.
Escape character is '^]'.
^C
Connection closed by foreign host.

#The kube-dns svc port appears not to be listening, but in fact the endpoint pod is listening
root:~# telnet 10.96.0.10 53
Trying 10.96.0.10...
telnet: Unable to connect to remote host: No route to host

#ping is ok
root:~# ping 10.96.0.10
PING 10.96.0.10 (10.96.0.10) 56(84) bytes of data.
64 bytes from 10.96.0.10: icmp_seq=1 ttl=64 time=0.055 ms
64 bytes from 10.96.0.10: icmp_seq=2 ttl=64 time=0.038 ms

#the kube-dns endpoint pod is in listening state
root:~# telnet 172.26.2.164 53
Trying 172.26.2.164...
Connected to 172.26.2.164.
Escape character is '^]'.
Connection closed by foreign host.

Everything looks fine: the svc/pod/endpoint status, logs, iptables, and network all appear normal. But this one service, kube-dns, does not work, so pods cannot resolve domain names, even though pinging IPs works. I have been stuck on this for a long time. Can somebody help me?
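One detail worth checking when a ClusterIP answers ping but telnet reports "No route to host": that error corresponds to an ICMP host-unreachable/host-prohibited reject, which often comes from a host firewall (e.g. firewalld's default REJECT rules) rather than from kube-proxy itself. A hedged diagnostic to run on a node (the IP is the kube-dns ClusterIP from the output above; the kube-proxy label assumes a kubeadm-style cluster):

```shell
# Look for any iptables rules mentioning the kube-dns ClusterIP; a REJECT
# target here would explain "No route to host" despite healthy endpoints.
iptables-save | grep '10\.96\.0\.10' || echo "no rules found for 10.96.0.10"

# Also confirm kube-proxy is running and healthy on this node:
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
```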

yinwenqin commented 6 years ago

/area dns

yinwenqin commented 6 years ago

/kind bug

dims commented 6 years ago

/sig network

mritd commented 6 years ago

I guess the selector of the kube-dns svc may not match the coredns pod labels.

You can delete the kube-dns svc and then use this script to create the coredns.yaml file (note the changes to the script parameters). Finally, execute the kubectl create -f coredns.yaml command to use CoreDNS (this was tested on my 1.10.1 Ubuntu 16 cluster).
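For reference, a sketch of that flow using the CoreDNS project's deployment helper (exact flags may differ by version; check the script's usage output — here -i sets the Service ClusterIP, which should stay 10.96.0.10 so existing pods keep resolving):

```shell
# Generate a coredns.yaml from the upstream template, then swap it in.
git clone https://github.com/coredns/deployment.git
cd deployment/kubernetes
./deploy.sh -i 10.96.0.10 > coredns.yaml

kubectl delete svc kube-dns -n kube-system
kubectl create -f coredns.yaml
```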

chrisohaver commented 6 years ago

I believe the endpoints would not list IPs if the selector did not match the pod labels. Per my understanding, the fact that the endpoints exist and have ready IPs means the selectors are selecting something.

FYI, the official CoreDNS deployment script is here.

@yinwenqin, can you make a DNS query from a pod? For example, by spinning up a client pod, e.g. kubectl run -it --rm --image=infoblox/dnstools dns-client, and then executing a dig query from the pod, such as dig kubernetes.default.svc.cluster.local.
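The suggested test, as copy-pasteable commands (the dig query is interactive, so it is shown as a comment):

```shell
# Start a throwaway client pod with DNS tools installed (removed on exit):
kubectl run -it --rm --image=infoblox/dnstools dns-client

# Then, inside the pod's shell, query the cluster DNS:
#   dig kubernetes.default.svc.cluster.local
# A timeout here, while querying the CoreDNS pod IP directly works,
# points at the Service VIP path (kube-proxy/iptables), not CoreDNS itself.
```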

thockin commented 6 years ago

https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/

thockin commented 6 years ago

@johnbelamaric

chrisohaver commented 6 years ago

@yinwenqin, still an issue?

yinwenqin commented 6 years ago

@chrisohaver After I redeployed CoreDNS twice, the problem was solved, which is very strange, but so far it has not recurred. I will close this issue. Thank you so much!

ngocson2vn commented 6 years ago

Same issue, I also had to redeploy the coredns deployment:

$ kubectl get deploy -n kube-system
NAME                   DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
coredns                2         2         2            2           29h
$ wget https://raw.githubusercontent.com/zlabjp/kubernetes-scripts/master/force-update-deployment
$ chmod +x force-update-deployment
$ force-update-deployment coredns -n kube-system
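For context, the linked script forces a rolling update by patching a throwaway annotation into the pod template — nothing else. A minimal sketch of the same idea (the annotation key "force-update" is arbitrary; the kubectl commands require a cluster, so they are shown as comments):

```shell
# Build the strategic-merge patch; changing any pod-template field makes
# the Deployment roll all of its pods.
PATCH="{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"force-update\":\"$(date +%s)\"}}}}}"
echo "$PATCH"

# Apply it:
#   kubectl -n kube-system patch deployment coredns -p "$PATCH"
# Or, on kubectl 1.15+, the same effect is built in:
#   kubectl -n kube-system rollout restart deployment coredns
```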
GregSilverman commented 6 years ago

Thanks! This problem was driving me nuts! I had to redeploy CoreDNS three times before it started working. I also had to upgrade kubeadm on my master node. Insanity!

I'm wondering, though, whether this needs to be done on all nodes. I did it on the master node with success, but am still having issues on the other nodes.

GregSilverman commented 6 years ago

I rebuilt my cluster, and now this works on two nodes (the master and a worker), but I cannot get it to work on the other two nodes. Plus, I have to force a redeploy with alarming frequency, otherwise I lose the network. This is nuts.

GregSilverman commented 6 years ago

And now it died... no matter how many redeploys I issue, CoreDNS is not working with Kubernetes. I'll try rebuilding the cluster later with Calico as the CNI; I'm using Flannel now.

johnbelamaric commented 6 years ago

Did you take a look at the link above? https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/

You can probably jump to the section

https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#does-the-service-work-by-ip

and proceed from there.

GregSilverman commented 6 years ago

Thanks @johnbelamaric. Before trying this, I tried @mritd's solution. So far, it appears to be working across all nodes. I'll do some heavy testing later to verify, but so far, so good.

johnbelamaric commented 6 years ago

Ok, good. If you are using the add-on manager and have kube-dns enabled instead of CoreDNS, it will revert the service and deployment resources, which may result in something like this (depending on the labels on your coredns deployment).
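If the add-on manager is suspected, the giveaway is the addonmanager.kubernetes.io/mode label on the DNS resources: mode Reconcile means the manager will keep overwriting them with the manifests shipped on the master. A hedged check (resource names assume the defaults seen in this thread):

```shell
# Resources labeled addonmanager.kubernetes.io/mode=Reconcile are
# periodically reverted to the on-disk addon manifests.
kubectl -n kube-system get svc/kube-dns deploy/coredns --show-labels
```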

xiaomj commented 5 years ago

Same issue, I also had to redeploy the coredns deployment:

$ kubectl get deploy -n kube-system
NAME                   DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
coredns                2         2         2            2           29h
$ wget https://raw.githubusercontent.com/zlabjp/kubernetes-scripts/master/force-update-deployment
$ chmod +x force-update-deployment
$ force-update-deployment coredns -n kube-system

It worked. But why did redeploying twice solve it? I checked the route entries using ip route and found nothing different.

johnbelamaric commented 5 years ago

It is certainly strange. I would want to see:

1) Your service, deployment, and configmap resources, along with some event history (kubectl describe or kubectl get events), before and after the issue
2) Logs from the CoreDNS containers, both working and not working
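The requested information, as concrete commands (resource names and labels assume the defaults seen earlier in this thread):

```shell
# 1) Resources and event history, before and after the issue:
kubectl -n kube-system describe svc kube-dns
kubectl -n kube-system describe deploy coredns
kubectl -n kube-system get configmap coredns -o yaml
kubectl -n kube-system get events --sort-by=.lastTimestamp

# 2) CoreDNS container logs (the default deployment labels pods
# k8s-app=kube-dns for compatibility):
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100
```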

zhangguanzhang commented 5 years ago

Same issue. First, the svc could be reached with telnet:

[root@k8s-m1 ~]# telnet 10.96.0.10 53
Trying 10.96.0.10...
Connected to 10.96.0.10.
Escape character is '^]'.
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^CConnection closed by foreign host.

Now, test the DNS pod IP directly:

[root@k8s-m1 ~]# kubectl -n kube-system get pod  -o wide
NAME                              READY   STATUS    RESTARTS   AGE    IP           NODE          NOMINATED NODE   READINESS GATES
coredns-5dc6f95498-m6c5n          1/1     Running   0          60s    10.244.1.5   172.16.2.10   <none>           <none>
metrics-server-7c7c88f4d4-8l9dd   1/1     Running   0          7h5m   10.244.4.5   172.16.2.4    <none>           <none>
[root@k8s-m1 ~]# dig @10.244.1.5 baidu.com +short

; <<>> DiG 9.9.4-RedHat-9.9.4-73.el7_6 <<>> @10.244.1.5 baidu.com +short
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
[root@k8s-m1 ~]# telnet 10.244.1.5 53
Trying 10.244.1.5...
Connected to 10.244.1.5.
Escape character is '^]'.
^C^C^C^C^C^CConnection closed by foreign host.
[root@k8s-m1 ~]# ^C
[root@k8s-m1 ~]# curl 10.244.1.5:8181/ready
OK

Now, SSH to the node and test:

[root@k8s-m1 ~]# ssh 172.16.2.10
[root@k8s-n1 ~]# dig @10.244.1.5 baidu.com +short
39.156.69.79
220.181.38.148

johnbelamaric commented 5 years ago

Telnet and curl use TCP. dig is using UDP. Try +tcp option with dig. That will tell you if it's some UDP transport issue in your network.
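To make that concrete (pod IP taken from the report above):

```shell
# Same query over TCP vs UDP. If only the TCP one answers, something is
# dropping or mangling UDP/53 between this host and the pod (MTU, overlay
# network, or firewall), and CoreDNS itself is fine.
dig @10.244.1.5 +tcp baidu.com
dig @10.244.1.5 +notcp baidu.com
```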

zhangguanzhang commented 5 years ago

I rebuilt my work environment. I will try the method you suggested if I encounter this again.

comsky commented 3 years ago

@ngocson2vn It worked. This solution saved my day.

ikingye commented 3 years ago

I cleared all the firewall rules, and it works fine now.

iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -t nat -F
iptables -t mangle -F
iptables -F
iptables -X
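Note that flushing the nat table this way also deletes every rule kube-proxy installed for Services; they are normally re-created on their own, but restarting kube-proxy forces an immediate rebuild. A sketch, assuming a kubeadm-style kube-proxy DaemonSet:

```shell
# Delete the kube-proxy pods; the DaemonSet recreates them, and the new
# pods reprogram the Service iptables rules from scratch.
kubectl -n kube-system delete pod -l k8s-app=kube-proxy
```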
cyb3rfist commented 10 months ago

Traffic can be routed to the pods via a Kubernetes Service, or directly to the pods. When traffic is routed via a Service, Kubernetes uses a built-in mechanism called kube-proxy to load-balance traffic between the pods.

An outdated kube-proxy image can cause routing issues. In my case, kube-proxy 1.23 was running on a 1.27 cluster. Upgrading the image fixed the issue for me.
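A quick way to compare the deployed kube-proxy image against the server version (the DaemonSet name and registry path below assume a kubeadm-provisioned cluster, and the v1.27.x tag is only an example):

```shell
# What kube-proxy image is actually deployed?
kubectl -n kube-system get ds kube-proxy \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# What is the control plane running?
kubectl version

# If the image lags the server, bump it, e.g.:
#   kubectl -n kube-system set image ds/kube-proxy \
#     kube-proxy=registry.k8s.io/kube-proxy:v1.27.3
```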