I aggregated troubleshooting notes gleaned from the GitHub issues. It may be helpful to place them in docs/self-service/README.md so new users can quickly find troubleshooting methods.
Draft:
Troubleshooting Resources to Aid in Resolving Provisioning Failures
Ensure provider-components.yaml specifies a container image for the vsphere-provider-controller-manager statefulset that contains ClusterAPI Provider vSphere v0.2.0 or later. If such an image is not available, one can be built using the make dev-build process.
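One quick way to see which images the manifest references (assuming it uses the standard image: field):
$> grep 'image:' provider-components.yaml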
After running clusterctl create cluster..., verify that the two controller-manager pods, vsphere-provider-controller-manager-0 and cluster-api-controller-manager-0, are running in the bootstrap cluster:
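For example, with kubectl pointed at the bootstrap cluster (the bootstrap kubeconfig path may differ in your setup):
$> kubectl get pods --all-namespaces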
After the bootstrap pods are running, the provider will create one or more VMs in vSphere in accordance with the machine YAML files specified with the clusterctl create cluster command.
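You can watch for the new VMs in the vSphere UI, or, if you happen to have govc installed and configured (GOVC_URL, GOVC_USERNAME, and GOVC_PASSWORD set; govc is not required by the provider itself), list VMs from the command line:
$> govc find / -type m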
If the clusterctl create cluster command fails to retrieve the admin.conf file, the following steps can be used:
Connect to the manager pod in the bootstrap cluster:
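For example (the namespace below is an assumption and may differ in your deployment):
$> kubectl exec -it vsphere-provider-controller-manager-0 -n vsphere-provider-system -- /bin/bash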
SSH to the provisioned master VM in vSphere from the manager pod:
$> ssh -i ~/.ssh/vsphere_tmp ubuntu@<vm ip address>
Verify that the following file exists: /etc/kubernetes/admin.conf. Note that it may take a couple of minutes for cloud-init to finish and create this file.
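For example:
$> ls -l /etc/kubernetes/admin.conf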
If the file or its parent directory does not exist, check the following log files for failed commands: /var/log/cloud-init.log and /var/log/cloud-init-output.log.
If the log files are still being appended to, cloud-init has not finished and may simply need more time to run.
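One way to watch progress (the cloud-init status subcommand is only available on newer cloud-init versions):
$> tail -f /var/log/cloud-init-output.log
$> cloud-init status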
An example failure that may appear in /var/log/cloud-init.log is 2019-05-06 18:22:41,691 - util.py[WARNING]: Failed loading yaml blob. unacceptable character #xdccf: special characters are not allowed. This error indicates an incorrect entry in the machines.yaml or machineset.yaml specified with the clusterctl create cluster command. A common cause is leaving placeholder - xxxx values in machines.yaml for sections such as DNS and trustedCerts.
Once a kubeconfig file has been generated, check the status of the nodes from the location where the clusterctl command was run:
$> kubectl --kubeconfig kubeconfig get nodes
If the master never enters the Ready state, check whether any pods are failing:
$> kubectl --kubeconfig kubeconfig get pods --all-namespaces
Use the kubectl logs command to check the logs of a failing pod, for example (the pod name and namespace below are placeholders):
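$> kubectl --kubeconfig kubeconfig logs <pod-name> -n <namespace>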
If the weave-net pod is failing, you may have specified a network range in cluster.yaml under the pods cidrBlocks that overlaps an existing network on the provisioned Kubernetes nodes. For example, if the VM IP addresses fall within 192.168.0.0/16, the default cidrBlocks value will need to be changed.
In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/277#issuecomment-500819118):
>#278 fixed this
>
>/close