kubernetes-sigs / cluster-api-provider-vsphere


While creating a TKG cluster using Cluster API on vSphere 6.7, only the LB and control plane VMs are created, not the other VMs for workers #827

Closed justinmurray closed 4 years ago

justinmurray commented 4 years ago

/kind bug

What steps did you take and what happened:

I followed the guidelines at https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/master/docs/getting_started.md closely, but when the workload cluster is created, no VMs are created for the worker nodes; only the initial LB and control plane VMs appear. I had gone through the entire cycle once before: I deleted the workload cluster using kubectl, deleted the management cluster using `kind delete cluster`, and then started from the beginning to build each one in turn.

What did you expect to happen: Expected to see the LB and control plane VMs being created, along with a set of 3 worker VMs. The 3 worker VMs never get created.

Anything else you would like to add: When issuing this command on the management cluster (generated using `kind create cluster`):

```
kubectl logs capi-kubeadm-bootstrap-controller-manager-54bf6747bf-89n85 -n capi-kubeadm-bootstrap-system --all-containers | more
```

I see an error with the creation of a secret that is, I believe, for the first of the new VMs:

```
I0311 18:01:17.814415 1 kubeadmconfig_controller.go:298] controllers/KubeadmConfig "msg"="Creating BootstrapData for the init control plane" "kind"="Machine" "kubeadmconfig"={"Namespace":"default","Name":"vsphere-tkg-7sprc"} "name"="vsphere-tkg-kx2mq" "version"="17253"
E0311 18:01:17.821438 1 kubeadmconfig_controller.go:374] controllers/KubeadmConfig "msg"="failed to store bootstrap data" "error"="failed to create bootstrap data secret for KubeadmConfig default/vsphere-tkg-7sprc: secrets \"vsphere-tkg-7sprc\" already exists" "kind"="Machine" "kubeadmconfig"={"Namespace":"default","Name":"vsphere-tkg-7sprc"} "name"="vsphere-tkg-kx2mq" "version"="17253"
```

This is the first time I have created a workload cluster with the "vsphere-tkg" prefix from this management cluster, so it is not likely that the secret mentioned was left over from a previous cluster creation.

Set of log files attached from the cap* pods in the management cluster (kind cluster) logbundle-cap.tar.zip
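One way to confirm whether the secret in that error is genuinely left over is to inspect it and the related KubeadmConfig objects directly. A rough sketch, using the names from the log above (adjust the namespace and names for your environment):

```shell
# Run against the management (kind) cluster.
# Show the bootstrap secret the controller says already exists,
# including its owner references.
kubectl get secret vsphere-tkg-7sprc -n default -o yaml

# List the KubeadmConfig objects and Machines so the secret can be
# matched to a current (or stale) object.
kubectl get kubeadmconfigs -n default
kubectl get machines -n default
```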

Environment:

justinmurray commented 4 years ago

logbundle-cap.tar.zip

detiber commented 4 years ago

@justinmurray how many replicas are configured for the associated MachineDeployment?

Has a CNI provider been deployed to the workload cluster yet? The control plane will not pass readiness checks if the CNI provider is not running, which will block the creation of workers from the MachineDeployment.
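A quick way to check the CNI question, as a rough sketch (assumes you have already retrieved a kubeconfig for the workload cluster):

```shell
# Run against the workload cluster.
# Nodes remain NotReady until a CNI provider (e.g. Calico) is installed,
# and the CNI pods normally show up in kube-system.
kubectl get nodes -o wide
kubectl get pods -n kube-system
```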

justinmurray commented 4 years ago

Thank you @detiber: I used the command from the Quickstart Guide as follows, so I expected three VMs to be assigned the worker role:

```
clusterctl config cluster vsphere-quickstart \
  --infrastructure vsphere \
  --kubernetes-version v1.17.3 \
  --control-plane-machine-count 1 \
  --worker-machine-count 3 > cluster.yaml
```

How would I check on your CNI question above? Would CNI appear as one of the pods in the workload cluster? Here is what I get from `kubectl get pods -A`:

```
[root@capvm1 .cluster-api]# kubectl get pods -A
NAMESPACE     NAME                                        READY   STATUS             RESTARTS   AGE
kube-system   calico-kube-controllers-68dc4cf88f-5jpfs    1/1     Running            0          27h
kube-system   calico-node-szx24                           1/1     Running            0          27h
kube-system   coredns-6955765f44-7vlv8                    1/1     Running            0          27h
kube-system   coredns-6955765f44-8scgz                    1/1     Running            0          27h
kube-system   etcd-vsphere-tkg-kx2mq                      1/1     Running            0          27h
kube-system   kube-apiserver-vsphere-tkg-kx2mq            1/1     Running            0          27h
kube-system   kube-controller-manager-vsphere-tkg-kx2mq   1/1     Running            0          27h
kube-system   kube-proxy-l8tmq                            0/1     ImagePullBackOff   0          27h
kube-system   kube-scheduler-vsphere-tkg-kx2mq            1/1     Running            0          27h
kube-system   vsphere-cloud-controller-manager-q5j7n      1/1     Running            0          27h
kube-system   vsphere-csi-controller-0                    4/5     CrashLoopBackOff   473        27h
kube-system   vsphere-csi-node-mx5rj                      3/3     Running            0          27h
[root@capvm1 .cluster-api]#
```
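To answer the replica question, the MachineDeployment and Machine objects can be read from the management cluster; a sketch, assuming everything was created in the default namespace:

```shell
# Run against the management (kind) cluster, where the Cluster API objects live.
# Shows the requested worker replica count and whether any worker Machines
# progressed past the provisioning phase.
kubectl get machinedeployments -n default
kubectl get machines -n default -o wide
```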

justinmurray commented 4 years ago

Also seeing this error condition when I do `kubectl describe pod kube-proxy-l8tmq -n kube-system` (I also tried the docker command separately to see whether I could get to the image in question):

```
Events:
  Type     Reason   Age                    From                        Message
  Normal   BackOff  11m (x7269 over 27h)   kubelet, vsphere-tkg-kx2mq  Back-off pulling image "k8s.gcr.io/kube-proxy:1.17.3"
  Warning  Failed   107s (x7313 over 27h)  kubelet, vsphere-tkg-kx2mq  Error: ImagePullBackOff
[root@capvm1 .cluster-api]
```

```
docker pull k8s.gcr.io/kube-proxy:1.17.3
Trying to pull repository k8s.gcr.io/kube-proxy ...
Pulling repository k8s.gcr.io/kube-proxy
unauthorized: authentication required
```
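For reference, the image the kubelet is trying to pull comes from the kube-proxy DaemonSet; a sketch of how to read it back (assumes the standard kubeadm-managed DaemonSet name):

```shell
# Run against the workload cluster. Prints the exact image reference the
# kube-proxy DaemonSet is configured with; here it shows the tag without
# the "v" prefix.
kubectl -n kube-system get daemonset kube-proxy \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```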

detiber commented 4 years ago

Hmm, that is odd; I would expect the kube-proxy image to have a tag of `v1.17.3`, not `1.17.3`.

detiber commented 4 years ago

@justinmurray what does `kubectl get kubeadmcontrolplane -o yaml` show for your environment? I'm mostly curious about the value of Spec.Version.

justinmurray commented 4 years ago

Here is the output for that command from the management cluster:

```
[root@capvm1 logs]# kubectl get kubeadmcontrolplane -o yaml
apiVersion: v1
items:
```
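Since the full YAML is lengthy, a more targeted query for the field in question might look like this (a sketch, assuming the default namespace):

```shell
# Run against the management cluster; prints the name and spec.version
# of each KubeadmControlPlane object.
kubectl get kubeadmcontrolplane -n default \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.version}{"\n"}{end}'
```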

justinmurray commented 4 years ago

Yes, I can manually docker pull the k8s.gcr.io/kube-proxy image when the "v" comes before the version (v1.17.3), but I get the error when that "v" is missing. I guess this is built into the creation of the kind cluster somewhere; is that correct?

yastij commented 4 years ago

`version: 1.17.3` should be `version: v1.17.3`
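A simple way to catch this before applying the manifest, assuming the output of `clusterctl config cluster` was saved to cluster.yaml as in the quickstart:

```shell
# Every version field in the generated manifest should carry the "v" prefix
# (v1.17.3, not 1.17.3); grep makes a mismatch easy to spot.
grep -n "version:" cluster.yaml
```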

justinmurray commented 4 years ago

Where should that change be made?

yastij commented 4 years ago

When running `clusterctl config`, you need to pass v1.17.3 as the Kubernetes version; see https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/master/docs/getting_started.md#creating-a-vsphere-based-workload-cluster

justinmurray commented 4 years ago

Yes, I used the full clusterctl command with the parameters as specified there just now - and the worker VMs now appear in vCenter. Thank you.

There was an earlier instruction on the Cluster API page that did not show the option to create the cluster.yaml file and instead piped the output from `clusterctl config cluster` straight into `kubectl apply -f -`. I think the latter method was not set up with the correct version prefix for kube-proxy, and that caused the above issue. Thanks again.
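For anyone following that older instruction, the piped variant works the same way as long as the version carries the "v" prefix; a sketch of that form:

```shell
# Generate the workload cluster manifest and apply it directly to the
# management cluster, without an intermediate cluster.yaml file.
clusterctl config cluster vsphere-quickstart \
  --infrastructure vsphere \
  --kubernetes-version v1.17.3 \
  --control-plane-machine-count 1 \
  --worker-machine-count 3 | kubectl apply -f -
```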

justinmurray commented 4 years ago

The cluster is now created in the vSphere 6.7 U2 lab (Lab 1). However, I am still seeing issues in the vSphere 6.7 Update 3 lab (different hardware, different location, same Cluster API release), where the CAPV controller pod in the capv-system namespace is producing the messages below and the creation of the cluster goes no further than the LB VM:

```
E0313 19:12:16.028134 1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="unexpected error while reconciling load balancer config for infrastructure.cluster.x-k8s.io/v1alpha3, Kind=HAProxyLoadBalancer default/tkg1: failed to get hapi global config for infrastructure.cluster.x-k8s.io/v1alpha3, Kind=HAProxyLoadBalancer default/tkg1: Get https://10.196.180.156:5556/v1/services/haproxy/configuration/global: remote error: tls: bad certificate" "controller"="haproxyloadbalancer" "request"={"Namespace":"default","Name":"tkg1"}
I0313 19:12:40.401764 1 vspherecluster_controller.go:219] capv-controller-manager/vspherecluster-controller/default/tkg1 "msg"="Reconciling VSphereCluster"
I0313 19:12:40.414612 1 vspherecluster_controller.go:366] capv-controller-manager/vspherecluster-controller/default/tkg1 "msg"="status.ready not found for load balancer" "load-balancer-gvk"="infrastructure.cluster.x-k8s.io/v1alpha3, Kind=HAProxyLoadBalancer" "load-balancer-name"="tkg1" "load-balancer-namespace"="default"
I0313 19:12:40.414632 1 vspherecluster_controller.go:230] capv-controller-manager/vspherecluster-controller/default/tkg1 "msg"="load balancer is not reconciled"
I0313 19:12:40.414780 1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled" "controller"="vspherecluster" "request"={"Namespace":"default","Name":"tkg1"}
I0310 19:33:45.832802 1 main.go:209] Generating self signed cert as no cert is provided
I0310 19:33:45.903220 1 main.go:242] Listening securely on 0.0.0.0:8443
```
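Not a fix, but two checks that can help narrow down the `tls: bad certificate` error; a rough sketch using the endpoint from the log above (assumes openssl is available and the HAProxy dataplane API is reachable from where you run it):

```shell
# Run against the management cluster: the HAProxyLoadBalancer status and events
# often carry more detail than the single controller log line.
kubectl describe haproxyloadbalancer tkg1 -n default

# Inspect the certificate actually served by the HAProxy dataplane API endpoint
# named in the error (IP/port taken from the log; adjust for your environment).
openssl s_client -connect 10.196.180.156:5556 -showcerts </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
```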

yastij commented 4 years ago

@justinmurray - are you facing the same issue with a newer version of CAPV ?

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/827#issuecomment-691713709):

>Rotten issues close after 30d of inactivity.
>Reopen the issue with `/reopen`.
>Mark the issue as fresh with `/remove-lifecycle rotten`.
>
>Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.