kubernetes-retired / cluster-api-provider-nested

Cluster API Provider for Nested Clusters

Etcd is not created #203

Closed krasimirdermendzhiev closed 3 years ago

krasimirdermendzhiev commented 3 years ago

Hello, folks,

I have a problem with etcd when trying to create a tenant master. I'm following this guide: https://github.com/kubernetes-sigs/cluster-api-provider-nested/blob/main/virtualcluster/doc/demo.md

I'm trying this on a Kubernetes cluster, version 1.19.12.

kubectl get ns

NAME                         STATUS   AGE
default                      Active   21h
default-0994e6-vc-sample-1   Active   18h
kube-node-lease              Active   21h
kube-public                  Active   21h
kube-system                  Active   21h
vc-manager                   Active   21h

kubectl get all -n vc-manager

NAME                              READY   STATUS    RESTARTS   AGE
pod/vc-manager-76c5878465-kvhg2   1/1     Running   0          5h37m
pod/vc-syncer-55c5bc5898-tq989    1/1     Running   0          5h37m
pod/vn-agent-hd4gl                1/1     Running   0          13m
pod/vn-agent-xqbp6                1/1     Running   0          10m

NAME                                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/virtualcluster-webhook-service   ClusterIP                 <none>        9443/TCP   21h

NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/vn-agent   2         2         2       2            2           <none>          21h

NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/vc-manager   1/1     1            1           21h
deployment.apps/vc-syncer    1/1     1            1           21h

NAME                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/vc-manager-76c5878465   1         1         1       21h
replicaset.apps/vc-syncer-55c5bc5898    1         1         1       21h

kubectl get clusterversion

NAME           AGE
cv-sample-lb   18h

kubectl get VirtualCluster

NAME          AGE
vc-sample-1   18h

I decided to deploy dnsutils to try to find where the problem is, and I can see that DNS resolution works fine for a normal pod in the same tenant master namespace: kubectl exec -i -t dnsutils -n default-0994e6-vc-sample-1 -- nslookup kubernetes.default

Server:         xxx.xxx.xxx.10
Address:        xxx.xxx.xxx.10#53

Name:   kubernetes.default.svc.cluster.local
Address: xxx.xxx.xxx.1

kubectl exec -i -t dnsutils -n default-0994e6-vc-sample-1 -- cat /etc/resolv.conf

nameserver xxx.xxx.xxx.10
search default-0994e6-vc-sample-1.svc.cluster.local svc.cluster.local cluster.local 
options ndots:5

kubectl logs pod/etcd-0 -n default-0994e6-vc-sample-1

[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-08-07 08:52:42.485084 I | etcdmain: etcd Version: 3.4.0
2021-08-07 08:52:42.485112 I | etcdmain: Git SHA: 898bd1351
2021-08-07 08:52:42.485117 I | etcdmain: Go Version: go1.12.9
2021-08-07 08:52:42.485121 I | etcdmain: Go OS/Arch: linux/amd64
2021-08-07 08:52:42.485126 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-08-07 08:52:42.485170 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/tls.crt, key = /etc/kubernetes/pki/etcd/tls.key, trusted-ca = /etc/kubernetes/pki/root/tls.crt, client-cert-auth = true, crl-file = 
2021-08-07 08:52:42.486045 I | embed: name = etcd-0
2021-08-07 08:52:42.486056 I | embed: data dir = /var/lib/etcd/data
2021-08-07 08:52:42.486060 I | embed: member dir = /var/lib/etcd/data/member
2021-08-07 08:52:42.486064 I | embed: heartbeat = 100ms
2021-08-07 08:52:42.486068 I | embed: election = 1000ms
2021-08-07 08:52:42.486072 I | embed: snapshot count = 100000
2021-08-07 08:52:42.486079 I | embed: advertise client URLs = https://etcd-0.etcd:2379
{"level":"warn","ts":1628152774.606514,"caller":"netutil/netutil.go:121","msg":"failed to resolve URL Host","url":"https://etcd-0.etcd:2380","host":"etcd-0.etcd:2380","retry-interval":1,"error":"lookup etcd-0.etcd on xxx.xxx.xxx:10:53: no such host"}

CoreDNS log:

[INFO] "A IN etcd-0.etcd. udp 29 false 512" NXDOMAIN qr,rd,ra 29 0.000281613s

krasimirdermendzhiev commented 3 years ago

I'm trying to fix the problem. When I configure the CoreDNS ConfigMap with a rewrite rule like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        # the existing plugins (errors, health, kubernetes, forward, cache, ...) stay as they are
        rewrite name etcd-0.etcd kubernetes.default.svc.cluster.local.
    }

I no longer see these log lines:

2021-08-07 08:52:42.486079 I | embed: advertise client URLs = https://etcd-0.etcd:2379
{"level":"warn","ts":1628152774.606514,"caller":"netutil/netutil.go:121","msg":"failed to resolve URL Host","url":"https://etcd-0.etcd:2380","host":"etcd-0.etcd:2380","retry-interval":1,"error":"lookup etcd-0.etcd on xxx.xxx.xxx.10:53: no such host"}

in my etcd pods, but I still have a problem. When I try to create the virtual cluster I get the message: cannot find sts/etcd in ns default-8e1cb1-vc-sample-1: default-8e1cb1-vc-sample-1/etcd is not ready in 120 seconds

These are the logs from the pod: kubectl logs pod/etcd-0 -n default-8e1cb1-vc-sample-1

[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-08-09 06:03:18.695492 I | etcdmain: etcd Version: 3.4.0
2021-08-09 06:03:18.695537 I | etcdmain: Git SHA: 898bd1351
2021-08-09 06:03:18.695541 I | etcdmain: Go Version: go1.12.9
2021-08-09 06:03:18.695545 I | etcdmain: Go OS/Arch: linux/amd64
2021-08-09 06:03:18.695549 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-08-09 06:03:18.695611 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/tls.crt, key = /etc/kubernetes/pki/etcd/tls.key, trusted-ca = /etc/kubernetes/pki/root/tls.crt, client-cert-auth = true, crl-file = 
2021-08-09 06:03:18.696125 I | embed: name = etcd-0
2021-08-09 06:03:18.696136 I | embed: data dir = /var/lib/etcd/data
2021-08-09 06:03:18.696141 I | embed: member dir = /var/lib/etcd/data/member
2021-08-09 06:03:18.696144 I | embed: heartbeat = 100ms
2021-08-09 06:03:18.696148 I | embed: election = 1000ms
2021-08-09 06:03:18.696152 I | embed: snapshot count = 100000
2021-08-09 06:03:18.696160 I | embed: advertise client URLs = https://etcd-0.etcd:2379
{"level":"info","ts":1628488998.7060077,"caller":"netutil/netutil.go:112","msg":"resolved URL Host","url":"https://etcd-0.etcd:2380","host":"etcd-0.etcd:2380","resolved-addr":"xxx.xxx.xxx.1:2380"}
{"level":"info","ts":1628488998.7071345,"caller":"netutil/netutil.go:112","msg":"resolved URL Host","url":"https://etcd-0.etcd:2380","host":"etcd-0.etcd:2380","resolved-addr":"xxx.xxx.xxx.1:2380"}
2021-08-09 06:03:18.711994 I | etcdserver: starting member 1252090b999e74b4 in cluster e47539242bb46ea
raft2021/08/09 06:03:18 INFO: 1252090b999e74b4 switched to configuration voters=()
raft2021/08/09 06:03:18 INFO: 1252090b999e74b4 became follower at term 0
raft2021/08/09 06:03:18 INFO: newRaft 1252090b999e74b4 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
raft2021/08/09 06:03:18 INFO: 1252090b999e74b4 became follower at term 1
raft2021/08/09 06:03:18 INFO: 1252090b999e74b4 switched to configuration voters=(1320127586199565492)
2021-08-09 06:03:18.716165 W | auth: simple token is not cryptographically signed
2021-08-09 06:03:18.719014 I | etcdserver: starting server... [version: 3.4.0, cluster version: to_be_decided]
2021-08-09 06:03:18.719883 I | etcdserver: 1252090b999e74b4 as single-node; fast-forwarding 9 ticks (election ticks 10)
raft2021/08/09 06:03:18 INFO: 1252090b999e74b4 switched to configuration voters=(1320127586199565492)
2021-08-09 06:03:18.720543 I | etcdserver/membership: added member 1252090b999e74b4 [https://etcd-0.etcd:2380] to cluster e47539242bb46ea
2021-08-09 06:03:18.721387 I | embed: ClientTLS: cert = /etc/kubernetes/pki/etcd/tls.crt, key = /etc/kubernetes/pki/etcd/tls.key, trusted-ca = /etc/kubernetes/pki/root/tls.crt, client-cert-auth = true, crl-file = 
2021-08-09 06:03:18.721504 I | embed: listening for peers on [::]:2380
raft2021/08/09 06:03:19 INFO: 1252090b999e74b4 is starting a new election at term 1
raft2021/08/09 06:03:19 INFO: 1252090b999e74b4 became candidate at term 2
raft2021/08/09 06:03:19 INFO: 1252090b999e74b4 received MsgVoteResp from 1252090b999e74b4 at term 2
raft2021/08/09 06:03:19 INFO: 1252090b999e74b4 became leader at term 2
raft2021/08/09 06:03:19 INFO: raft.node: 1252090b999e74b4 elected leader 1252090b999e74b4 at term 2
2021-08-09 06:03:19.313001 I | etcdserver: setting up the initial cluster version to 3.4
2021-08-09 06:03:19.313587 N | etcdserver/membership: set the initial cluster version to 3.4
2021-08-09 06:03:19.313635 I | etcdserver/api: enabled capabilities for version 3.4
2021-08-09 06:03:19.313650 I | embed: ready to serve client requests
2021-08-09 06:03:19.313676 I | etcdserver: published {Name:etcd-0 ClientURLs:[https://etcd-0.etcd:2379]} to cluster e47539242bb46ea
2021-08-09 06:03:19.314749 I | embed: serving client requests on [::]:2379
krasimirdermendzhiev commented 3 years ago

I'm stuck at this point:

Warning  Unhealthy  1s  kubelet  Readiness probe failed: {"level":"warn","ts":"2021-08-09T08:17:15.307Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-7acea29d-a801-4282-8894-ef334a299146/etcd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
https://etcd:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
Fei-Guo commented 3 years ago

Can you list everything in the default-0994e6-vc-sample-1 namespace? Make sure you have configured a headless Service for the etcd Pod; see the sketch after the listing below.

This is my local setup:

kubectl get all -n tenant1admin-f7ea3a-vc-sample-1
NAME                       READY   STATUS    RESTARTS   AGE
pod/apiserver-0            1/1     Running   0          86d
pod/controller-manager-0   1/1     Running   0          86d
pod/etcd-0                 1/1     Running   0          86d

NAME                    TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
service/apiserver-svc   NodePort    10.98.56.6   <none>        6443:30015/TCP   86d
service/etcd            ClusterIP   None         <none>        <none>           86d

NAME                                  READY   AGE
statefulset.apps/apiserver            1/1     86d
statefulset.apps/controller-manager   1/1     86d
statefulset.apps/etcd                 1/1     86d
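
The headless Service itself is just a Service with clusterIP: None that selects the etcd Pods, roughly like the sketch below. The selector label here is only illustrative and has to match whatever labels your etcd StatefulSet puts on its Pods:

apiVersion: v1
kind: Service
metadata:
  name: etcd                        # the etcd StatefulSet's serviceName must reference this name
  namespace: <tenant-namespace>     # e.g. default-0994e6-vc-sample-1
spec:
  clusterIP: None                   # headless: DNS returns Pod records, so etcd-0.etcd resolves
  selector:
    component-name: etcd            # illustrative label; must match the etcd Pod labels
  # no ports are strictly required; the Service only has to provide DNS records for the Pods
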
Fei-Guo commented 3 years ago

I am not sure whether you can hardcode the URL to "localhost:2379" as a temporary workaround. @charleszheng44 may have more ideas.

krasimirdermendzhiev commented 3 years ago

Can you list everything in the default-0994e6-vc-sample-1 namespace? Make sure you have configured a headless Service for the etcd Pod.

This is my local setup:

kubectl get all -n tenant1admin-f7ea3a-vc-sample-1
NAME                       READY   STATUS    RESTARTS   AGE
pod/apiserver-0            1/1     Running   0          86d
pod/controller-manager-0   1/1     Running   0          86d
pod/etcd-0                 1/1     Running   0          86d

NAME                    TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
service/apiserver-svc   NodePort    10.98.56.6   <none>        6443:30015/TCP   86d
service/etcd            ClusterIP   None         <none>        <none>           86d

NAME                                  READY   AGE
statefulset.apps/apiserver            1/1     86d
statefulset.apps/controller-manager   1/1     86d
statefulset.apps/etcd                 1/1     86d

kubectl get all -n default-080686-vc-sample-1
NAME         READY   STATUS    RESTARTS   AGE
pod/etcd-0   0/1     Running   1          3m1s

NAME           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
service/etcd   ClusterIP   None         <none>        <none>    14h

NAME                    READY   AGE
statefulset.apps/etcd   0/1     14h

Fei-Guo commented 3 years ago

This makes me think it is a CoreDNS problem. Can you kubectl exec into the etcd Pod and check whether the DNS service in the super cluster is actually working?

krasimirdermendzhiev commented 3 years ago
kubectl exec -it pod/etcd-0 -n default-5e4fe0-vc-sample-1 -- sh                         
/ # cat /etc/resolv.conf 
nameserver xxx.xxx.xxx.10
search default-5e4fe0-vc-sample-1.svc.cluster.local svc.cluster.local cluster.local 
options ndots:5
kubectl get all -n kube-system | grep coredns                   
pod/coredns-688ff95595-c8fmk                              1/1     Running   0          62m
pod/coredns-688ff95595-ln4nv                              1/1     Running   0          27m
deployment.apps/coredns                              2/2     2            2           14h
replicaset.apps/coredns-688ff95595                              2         2         2       6h48m
krasimirdermendzhiev commented 3 years ago

I am not sure whether you can hardcode the URL to "localhost:2379" as a temporary workaround. @charleszheng44 may have more ideas.

I tried with 127.0.0.1 for

--advertise-client-urls=https://127.0.0.1:2379
--endpoints=https://127.0.0.1:2379

and etcd is created!
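
For anyone else hitting this, the edit amounts to roughly the following in the etcd StatefulSet that gets generated for the tenant. This is only a sketch: the image, labels, and the exact readiness probe command are assumptions rather than the real ClusterVersion contents, and the TLS flags are omitted for brevity.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
  namespace: <tenant-namespace>
spec:
  serviceName: etcd                 # must match the headless Service so Pods get etcd-0.etcd DNS names
  replicas: 1
  selector:
    matchLabels:
      component-name: etcd          # illustrative label
  template:
    metadata:
      labels:
        component-name: etcd
    spec:
      containers:
      - name: etcd
        image: <etcd-image>         # placeholder
        command:
        - etcd
        - --advertise-client-urls=https://127.0.0.1:2379   # was https://etcd-0.etcd:2379
        # ...the remaining flags (peer URLs, data dir, TLS certs) stay unchanged...
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            # assuming the probe runs `etcdctl endpoint health`; cert flags omitted here
            - ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 endpoint health
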

charleszheng44 commented 3 years ago

Hi @krasimirdermendzhiev, I ran into the same problem before, but I can't remember what the root cause was. I suspect the problem is caused by the super master version (1.19); downgrading to 1.18 should make things work. If my memory is correct, the problem happened because the etcd Service (etcd-0.etcd) is not accessible before the etcd Pod is ready, while the etcd Pod itself needs to reach the etcd Service.
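
If that is indeed the cause, one thing that might be worth trying (I have not verified it for this setup) is letting the headless Service publish DNS records for not-ready Pods, so etcd-0.etcd resolves before the etcd Pod passes its readiness probe. A minimal sketch, assuming the same illustrative label as in the Service example above:

apiVersion: v1
kind: Service
metadata:
  name: etcd
  namespace: <tenant-namespace>
spec:
  clusterIP: None
  publishNotReadyAddresses: true    # expose the Pod's DNS record even while it is not Ready
  selector:
    component-name: etcd            # illustrative; must match the etcd Pod labels
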

krasimirdermendzhiev commented 3 years ago

Hi @krasimirdermendzhiev, I ran into the same problem before, but I can't remember what the root cause was. I suspect the problem is caused by the super master version (1.19); downgrading to 1.18 should make things work. If my memory is correct, the problem happened because the etcd Service (etcd-0.etcd) is not accessible before the etcd Pod is ready, while the etcd Pod itself needs to reach the etcd Service.

Yes, I think the same as you: "the problem happened because the etcd Service (etcd-0.etcd) is not accessible before the etcd Pod is ready, while the etcd Pod itself needs to reach the etcd Service."

Thank you, folks! P.S. I will close this; if somebody needs it, they can reopen it.

mghlaiel commented 2 years ago

Hi, @krasimirdermendzhiev @Fei-Guo @charleszheng44
I ran into the same issue with Rancher Kubernetes v1.21, CoreDNS, and Calico. I tried hardcoding the URLs to localhost, and I get the same error in the container logs: "netutil/netutil.go:121","msg":"failed to resolve URL Host"

PS: I can't exec into the container as it doesn't start.

Thank you