kubernetes-sigs / cluster-api-provider-vsphere


Add troubleshooting information to Docs #277

Closed james65535 closed 5 years ago

james65535 commented 5 years ago

I aggregated troubleshooting notes gleaned from the GitHub issues. I thought it may be helpful to place them in docs/self-service/README.md so new users can quickly find troubleshooting methods.

Draft:

Troubleshooting Resources to Aid in Resolving Provisioning Failures

  1. Ensure provider-components.yaml specifies a container image for the vsphere-provider-controller-manager statefulset that contains ClusterAPI Provider vSphere v0.2.0 or later. If such an image is not available, one can be built using the make dev-build process.
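    A quick way to confirm which image the statefulset references (a sketch; the file name is assumed to be the provider-components.yaml generated for your environment):
    $> grep -n "image:" provider-components.yaml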
  2. After running clusterctl create cluster..., verify that the two controller pods, vsphere-provider-controller-manager-0 and cluster-api-controller-manager-0, are running in the bootstrap cluster:
    $> kubectl get pods --all-namespaces
    NAMESPACE                 NAME                                         READY   STATUS    RESTARTS   AGE
    cluster-api-system        cluster-api-controller-manager-0             1/1     Running   0          5d1h
    kube-system               coredns-fb8b8dccf-bs2v9                      1/1     Running   0          5d1h
    kube-system               coredns-fb8b8dccf-hhc4b                      1/1     Running   0          5d1h
    kube-system               etcd-kind-control-plane                      1/1     Running   0          5d1h
    kube-system               ip-masq-agent-m5jkt                          1/1     Running   0          5d1h
    kube-system               kindnet-5s6tz                                1/1     Running   1          5d1h
    kube-system               kube-apiserver-kind-control-plane            1/1     Running   0          5d1h
    kube-system               kube-controller-manager-kind-control-plane   1/1     Running   0          5d1h
    kube-system               kube-proxy-x57n5                             1/1     Running   0          5d1h
    kube-system               kube-scheduler-kind-control-plane            1/1     Running   0          5d1h
    vsphere-provider-system   vsphere-provider-controller-manager-0        1/1     Running   0          5d1h
  3. If either pod is failing, view its logs to review errors:
    $> kubectl logs <pod name> --namespace <pod namespace>
  4. After the bootstrap pods have been created, vSphere will create one or more VMs in accordance with the machine YAML files specified in the clusterctl create cluster command.
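    The machine objects can be watched from the bootstrap cluster while the VMs are provisioned (a sketch; assumes the Machine CRDs were installed by the bootstrap process):
    $> kubectl get machines --all-namespaces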
  5. If the clusterctl create cluster command fails to retrieve the admin.conf file, the following steps can be used:
    1. Connect to the manager pod in the bootstrap cluster:
      $> kubectl exec -it vsphere-provider-controller-manager-0 --namespace vsphere-provider-system -- /bin/bash
    2. SSH to the provisioned master VM in vSphere from the manager pod:
      $> ssh -i ~/.ssh/vsphere_tmp ubuntu@<vm ip address>
    3. Verify that the following file exists: /etc/kubernetes/admin.conf. Please note it may take a couple of minutes for cloud-init to finish processing and create these files.
      $> ls /etc/kubernetes/
      admin.conf  cloud-config  controller-manager.conf  kubeadm_config.yaml  kubelet.conf  manifests  pki  scheduler.conf
    4. If either the file or the directory does not exist, check the following log files for failed commands: /var/log/cloud-init.log and /var/log/cloud-init-output.log.
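      To quickly scan both logs for failed commands (a sketch; the search pattern is only a starting point):
      $> sudo grep -iE "error|fail" /var/log/cloud-init.log /var/log/cloud-init-output.log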
    5. If the log files are still being appended to, cloud-init has not finished processing and may need more time to run.
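      To follow the output until cloud-init finishes (press Ctrl+C to stop):
      $> sudo tail -f /var/log/cloud-init-output.log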
    6. An example failure that may appear in /var/log/cloud-init.log is 2019-05-06 18:22:41,691 - util.py[WARNING]: Failed loading yaml blob. unacceptable character #xdccf: special characters are not allowed. This error indicates an incorrect entry in the machines.yaml or machineset.yaml specified in the clusterctl create cluster command. A common cause is leaving the placeholder - xxxx values in machines.yaml in sections such as DNS and trustedCerts.
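      A quick way to spot leftover placeholder entries (assuming they still contain the literal xxxx text):
      $> grep -n "xxxx" machines.yaml machineset.yaml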
  6. Once a kubeconfig file has been generated, check the status of the nodes from the location where the clusterctl command was run:
    $> kubectl --kubeconfig kubeconfig get nodes
    1. If the master never enters the Ready state, check whether any pods are failing:
      $> kubectl --kubeconfig kubeconfig get pods --all-namespaces
    2. Use the logs command to check the logs of a failing pod, for example:
      $> kubectl --kubeconfig kubeconfig logs weave-net-dl2bn -c weave --namespace kube-system
    3. If the weave-net pod is indeed failing, you may have specified a network range in cluster.yaml under the pods cidrBlocks that overlaps an existing network on the provisioned Kubernetes nodes. For example, if the VM IP addresses fall within 192.168.0.0/16, the default cidrBlock value will need to be changed; a quick check is shown below.
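      To review the configured pod network range and compare it with the VM addresses (a sketch; assumes cluster.yaml is in the current directory):
      $> grep -A 1 "cidrBlocks" cluster.yaml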
frapposelli commented 5 years ago

#278 fixed this

/close

k8s-ci-robot commented 5 years ago

@frapposelli: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/277#issuecomment-500819118):

> #278 fixed this
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.