vmware / container-service-extension

Container Service for VMware vCloud Director
https://vmware.github.io/container-service-extension

Cluster upgrade from one rev to another failed in CSE 2.6.1 #870

Open sanjeevgorai opened 3 years ago

sanjeevgorai commented 3 years ago

Hello All,

We have upgraded CSE from 2.5.1 to 2.6.1. Now, when I try to upgrade one of the existing clusters (created from Ubuntu templates) from one revision to another, I get the errors below. The errors suggest a DNS problem, but it isn't one: when I wget the URLs from the error logs manually on the master server, they download fine. So I assume this is not a DNS error.

Please advise if anybody has any idea on this.

```
root@cse2p2h11 [ ~ ]# vcd cse cluster upgrade ESA20 ubuntu-16.04_k8-1.18_weave-2.6.5 1
cluster operation: Upgrading cluster 'ESA20' software to match template ubuntu-16.04_k8-1.18_weave-2.6.5 (revision 1): Kubernetes: 1.17.9 -> 1.18.6, Docker-CE: 19.03.5
cluster operation: Upgrading cluster 'ESA20' software to match template ubuntu-16.04_k8-1.18_weave-2.6.5 (revision 1): Kubernetes: 1.17.9 -> 1.18.6, Docker-CE: 19.03.5 -> 19.03.12, CNI: weave 2.6.0 -> 2.6.5
cluster operation: Draining master node ['mstr-5vm3']
cluster operation: Upgrading Kubernetes (1.17.9 -> 1.18.6) in master node ['mstr-5vm3']
task: 64daf4c3-de32-4c36-9e3a-56cc6b5317c1, result: error, message: Unexpected error while upgrading cluster 'ESA20': Script execution failed on node ['mstr-5vm3']
Errors:
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-updates/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-backports/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/xenial-security/InRelease  Temporary failure resolving 'security.ubuntu.com'
W: Failed to fetch https://download.docker.com/linux/ubuntu/dists/xenial/InRelease  Resolving timed out after 30535 milliseconds
W: Failed to fetch http://apt.kubernetes.io/dists/kubernetes-xenial/InRelease  Temporary failure resolving 'apt.kubernetes.io'
W: Some index files failed to download. They have been ignored, or old ones used instead.
E: Failed to fetch http://apt.kubernetes.io/pool/kubeadm_1.18.6-00_amd64_d4a4d123be4a196da5e34d7f8d95a224c431298ad18ab38edecbee6548d6236c.deb  Temporary failure resolving 'apt.kubernetes.io'
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
```
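One thing worth keeping in mind when interpreting the "wget works but apt fails" symptom: wget honors the `http_proxy`/`https_proxy` environment variables of the shell it runs in, while apt ignores them in non-interactive runs and reads only its own configuration under `/etc/apt/apt.conf.d/`. So a successful manual wget does not prove that apt can reach the mirrors. A tiny helper (hypothetical, not part of CSE) to extract the proxy apt would actually use from a `proxy.conf`-style file:

```shell
# Extract the proxy URL apt will use for a given scheme from an apt
# proxy config file (lines like: HTTP::proxy "http://host:port";).
apt_proxy_for() {
  # $1 = scheme keyword (HTTP/HTTPS/FTP), $2 = path to the config file
  sed -n "s/.*${1}::proxy \"\([^\"]*\)\".*/\1/p" "$2"
}

# Typical usage on a cluster node:
#   apt_proxy_for HTTP /etc/apt/apt.conf.d/proxy.conf
# Compare that with what wget sees:
#   env | grep -i _proxy
```

If the two views disagree, wget and apt are effectively on different networks as far as the proxy is concerned.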

ltimothy7 commented 3 years ago

Hi @sanjeevgorai,

Is your setup behind a proxy? Please let us know if so; if not we will see what else may be going on.

sanjeevgorai commented 3 years ago

Yes, this setup is behind a proxy server. In production we are not allowed direct internet connections.

ltimothy7 commented 3 years ago

We think you may have set up the apt proxy and then rebooted the VM. Is this the case? If so, your saved configuration will have been lost, and the apt proxy must be set up again for the cluster upgrade to work.

sanjeevgorai commented 3 years ago

Hello, we have set the apt proxy on both the master and worker nodes and then rebooted both, but the issue is the same. The apt proxy configuration is provided below.

```
mstr-6rvc:~# cat /etc/apt/apt.conf.d/proxy.conf
Acquire
{
  HTTP::proxy "http:/172.20.24.57:8080";
  HTTPS::proxy "http://172.20.24.57:8080";
  FTP::proxy "http://172.20.24.57:8080";
}
root@mstr-6rvc:~#
```
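One detail worth double-checking in the configuration above: the `HTTP::proxy` entry reads `http:/172.20.24.57:8080`, with a single slash after the scheme, unlike the other two entries. If that is not just a transcription artifact, apt's plain-HTTP fetches would be pointed at a malformed proxy URL. A minimal format check, as a sketch (`check_proxy_url` is a made-up helper, not an apt tool):

```shell
# Verify that a proxy value is a full URL with two slashes after the
# scheme; "http:/host:port" (one slash) is flagged as malformed.
check_proxy_url() {
  case "$1" in
    http://*|https://*) echo "ok" ;;
    *) echo "malformed: $1" ;;
  esac
}

check_proxy_url "http://172.20.24.57:8080"   # ok
check_proxy_url "http:/172.20.24.57:8080"    # malformed
```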

Errors:

```
root@cse2p2h11 [ ~ ]# vcd cse cluster upgrade ESA10 ubuntu-16.04_k8-1.17_weave-2.6.0 2
cluster operation: Upgrading cluster 'ESA10' software to match template ubuntu-16.04_k8-1.17_weave-2.6.0 (revision 2): Kubernetes: 1.16.13 -> 1.17.9, Docker-CE: 18.09.7
cluster operation: Upgrading cluster 'ESA10' software to match template ubuntu-16.04_k8-1.17_weave-2.6.0 (revision 2): Kubernetes: 1.16.13 -> 1.17.9, Docker-CE: 18.09.7 -> 19.03.5, CNI: weave 2.6.0 -> 2.6.0
cluster operation: Draining master node ['mstr-6rvc']
cluster operation: Upgrading Kubernetes (1.16.13 -> 1.17.9) in master node ['mstr-6rvc']
task: 22e47bff-d502-421d-a7c1-cbc8cb176cb9, result: error, message: Unexpected error while upgrading cluster 'ESA10': Script execution failed on node ['mstr-6rvc']
Errors:
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-updates/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-backports/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/xenial-security/InRelease  Temporary failure resolving 'security.ubuntu.com'
W: Failed to fetch http://apt.kubernetes.io/dists/kubernetes-xenial/InRelease  Temporary failure resolving 'apt.kubernetes.io'
W: Some index files failed to download. They have been ignored, or old ones used instead.
E: Failed to fetch http://apt.kubernetes.io/pool/kubeadm_1.17.9-00_amd64_572d520d47a06fee419b34c35cebf1f98307daae3a76c79da241245cc686d036.deb  Temporary failure resolving 'apt.kubernetes.io'
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
```

sanjeevgorai commented 3 years ago

The CSE clusters do get upgraded if we manually upgrade kubeadm, kubelet, and kubectl on the master and worker nodes, but when we try to do the same with the CSE client it fails with the errors above.
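For clarity, the manual path described here looks roughly like the following, using the standard kubeadm upgrade steps with the version from the error log. This is an illustration of the manual procedure, not CSE's actual upgrade script:

```shell
# Manually upgrade the Kubernetes control-plane packages on a master
# node (standard kubeadm flow; run worker nodes with
# `kubeadm upgrade node` instead of `kubeadm upgrade apply`).
manual_k8s_upgrade() {
  ver="$1"                                     # e.g. 1.18.6
  apt-get update
  apt-get install -y "kubeadm=${ver}-00"
  kubeadm upgrade apply "v${ver}"              # master node only
  apt-get install -y "kubelet=${ver}-00" "kubectl=${ver}-00"
  systemctl restart kubelet
}

# manual_k8s_upgrade 1.18.6
```

Since this succeeds from an interactive root shell, the packages, repositories, and proxy are all reachable from that shell's environment.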

rocknes commented 3 years ago

Hi Sanjeev,

The errors you are noticing are generally caused by network/internet connectivity issues. In the past I have seen that sometimes there is a mismatch between the state of the VM NIC as reported by vCD and by VC (especially right after a reboot), and that can lead to these sorts of errors. CSE is not doing anything special in these scripts; it might simply be a race condition between the guest tools being ready and the NIC becoming functional.

May I suggest that after the proxy details are set up in the VM and the VM is rebooted, you wait a few minutes and test internet connectivity before starting the upgrade process. If the proxy details are being injected via .bashrc or something similar, try adding a poll loop at the end to make sure the internet is reachable via the proxy.
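A minimal version of that poll loop might look like the following. This is a sketch: the proxy address and test URL come from this thread and would need adjusting to your environment.

```shell
# Poll until a connectivity check succeeds, or give up after N attempts.
# The check is passed in as a command so the loop itself is reusable;
# in practice it would be something like:
#   http_proxy=http://172.20.24.57:8080 wget -q --spider http://archive.ubuntu.com/ubuntu/
wait_for_proxy() {
  check="$1"; max="$2"; delay="$3"
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if $check; then
      echo "reachable after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "still unreachable after $max attempts" >&2
  return 1
}

# wait_for_proxy "wget -q --spider http://archive.ubuntu.com/ubuntu/" 30 10
```

Appending such a call to the proxy-injection script would hold off the upgrade until the proxy is actually usable.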

Let me know the outcome of the experiments.

Regards,
Aritra Sen

sanjeevgorai commented 3 years ago

Hello Aritra, thanks for your comments.

If there were an issue with internet or network connectivity, then the kubeadm, kubelet, and kubectl upgrades should also fail when we run them by logging directly into the master and worker nodes, yet they succeed. So if there were a proxy problem, the manual upgrade should fail too. The upgrade only fails when triggered from the CSE upgrade command on the CSE client. We need to understand how the upgrade is triggered on the master and worker node VMs when we execute the CSE upgrade command, and how to confirm that the CSE client is able to execute the upgrade scripts on the master and worker nodes of the cluster. We are not able to find any logs of this execution on the master and worker nodes.
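For what it's worth, one possible explanation for the difference (an assumption on our side, not confirmed in this thread): if CSE executes these scripts through VMware Tools guest operations rather than an interactive SSH login, then environment variables exported from ~/.bashrc, such as `http_proxy`, would not be visible to the script even though they are present when we log in manually. The difference is easy to reproduce locally, assuming bash (the variable name and address below are made up):

```shell
# Write a throwaway rc file exporting a proxy variable, then compare what
# a non-interactive shell (the way a remotely injected script runs) sees
# versus an interactive shell that sources the rc file.
rc=$(mktemp)
echo 'export DEMO_PROXY=http://proxy.example:3128' > "$rc"

bash --norc -c 'echo "non-interactive: ${DEMO_PROXY:-unset}"'
bash --rcfile "$rc" -i -c 'echo "interactive: ${DEMO_PROXY:-unset}"' 2>/dev/null

rm -f "$rc"
```

If this is the cause, putting the proxy settings in `/etc/apt/apt.conf.d/` and `/etc/environment` instead of shell profiles should make them visible to non-interactive runs as well.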

snmishra3008 commented 3 years ago

Hello Aritra,

There was no NIC misconfiguration, as confirmed by Vivek from the Orange Engineering team.

cse_logs.zip

Also, if you check the attached vCD logs, you will see that the API calls to the task initially complete with 200 OK, but after many iterations (around 20+) they return error 500. See the snippets below.

```
Request uri (GET): https://vcloud.lab.local/api/task/4e521934-9612-4f8c-96f8-b53a026bbab7
Request headers: {'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': 'application/*+xml;version=32.0', 'Connection': 'keep-alive', 'x-vcloud-authorization': '[REDACTED]'}
Response status code: 200
Response headers: {'Date': 'Thu, 04 Feb 2021 11:00:58 GMT', 'X-VMWARE-VCLOUD-REQUEST-ID': '8f61efd0-c920-4fdd-a111-f5efd8b1d479', 'X-VMWARE-VCLOUD-ACCESS-TOKEN': 'eyJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJvcmcwNSIsImlzcyI6ImE4Mzk4YWRkLWE0NGItNDZkOS04NTllLWEyYmE1ZTFkYWFmNUA4YWQ5MWIwZC03NjFjLTQ1MzctOTQzMi1iMmZlOGVjYWU1ODEiLCJleHAiOjE2MTI1MjA2NjgsInZlcnNpb24iOiJ2Y2xvdWRfMS4wIiwianRpIjoiOWY0ZDk3YmI1NjI4NDc5YjhjYzM0YjcxNzcwY2QwOTgifQ.PjIqIM_aPuHPMpBavhj2r3cGzhHKnWkbs94pOxnwm8v36A-R_KCli4cs0eAHgS3I_JQqvHSYw_NJW_fQ1oVso-cy_4ZQsQaPCrze5Uc84KfIjRgI0sR4Clh_AyoUS0LQnpcffIj253Lj7xebgA-WfqbSQvYEg5H_ttqpPjkkRIlNbyxQw-OVaFY2tGyA7vPTnPvI9KJbV_F6lQiFw7ZHf8njCjyMHtp7YVYN0PsWY0abf820XnsSasfuYopTweyQ8Q09AwUspddNWd965sGAO5q8aynjUh9rCvujEEazOgjAw08jhhC4mwhcFdTQ5Qd3MfYJkkyjru1JC0uppcQuhQ', 'X-VMWARE-VCLOUD-TOKEN-TYPE': 'Bearer', 'x-vcloud-authorization': '[REDACTED]', 'Content-Type': 'application/vnd.vmware.vcloud.task+xml;version=32.0', 'X-VMWARE-VCLOUD-REQUEST-EXECUTION-TIME': '53', 'Cache-Control': 'no-store, must-revalidate', 'Vary': 'Accept-Encoding, User-Agent', 'Content-Length': '1866'}
Response body: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
```

```
Request uri (GET): https://vcloud.lab.local/api/task/4e521934-9612-4f8c-96f8-b53a026bbab7
Request headers: {'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': 'application/*+xml;version=32.0', 'Connection': 'keep-alive', 'x-vcloud-authorization': '[REDACTED]'}
Response status code: 200
Response headers: {'Date': 'Thu, 04 Feb 2021 11:01:03 GMT', 'X-VMWARE-VCLOUD-REQUEST-ID': '3733b0e9-fc5d-412d-8391-218aa9768477', 'X-VMWARE-VCLOUD-ACCESS-TOKEN': 'eyJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJvcmcwNSIsImlzcyI6ImE4Mzk4YWRkLWE0NGItNDZkOS04NTllLWEyYmE1ZTFkYWFmNUA4YWQ5MWIwZC03NjFjLTQ1MzctOTQzMi1iMmZlOGVjYWU1ODEiLCJleHAiOjE2MTI1MjA2NjgsInZlcnNpb24iOiJ2Y2xvdWRfMS4wIiwianRpIjoiOWY0ZDk3YmI1NjI4NDc5YjhjYzM0YjcxNzcwY2QwOTgifQ.PjIqIM_aPuHPMpBavhj2r3cGzhHKnWkbs94pOxnwm8v36A-R_KCli4cs0eAHgS3I_JQqvHSYw_NJW_fQ1oVso-cy_4ZQsQaPCrze5Uc84KfIjRgI0sR4Clh_AyoUS0LQnpcffIj253Lj7xebgA-WfqbSQvYEg5H_ttqpPjkkRIlNbyxQw-OVaFY2tGyA7vPTnPvI9KJbV_F6lQiFw7ZHf8njCjyMHtp7YVYN0PsWY0abf820XnsSasfuYopTweyQ8Q09AwUspddNWd965sGAO5q8aynjUh9rCvujEEazOgjAw08jhhC4mwhcFdTQ5Qd3MfYJkkyjru1JC0uppcQuhQ', 'X-VMWARE-VCLOUD-TOKEN-TYPE': 'Bearer', 'x-vcloud-authorization': '[REDACTED]', 'Content-Type': 'application/vnd.vmware.vcloud.task+xml;version=32.0', 'X-VMWARE-VCLOUD-REQUEST-EXECUTION-TIME': '57', 'Cache-Control': 'no-store, must-revalidate', 'Vary': 'Accept-Encoding, User-Agent', 'Content-Length': '3178'}
Response body: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
```

Unexpected error while upgrading cluster 'ESACL20': Script execution failed on node ['mstr-tr5t']
Also, as communicated earlier by Sanjeev, when we run the script manually on the master node it is able to fetch the packages and successfully upgrade the k8s components. Details from the CSE server below.

1) Upgrade plan, and the list of scripts that we copy in order to upgrade manually:

```
root@cse2p2h11 [ ~ ]# vcd cse cluster upgrade-plan PHOCLS10
Template Name                     Template Revision    Kubernetes    Docker-CE    CNI
--------------------------------  -------------------  ------------  -----------  -----------
ubuntu-16.04_k8-1.17_weave-2.6.0  2                    1.17.9        19.03.5      weave 2.6.0

root@cse2p2h11 [ ~/.cse_scripts/ubuntu-16.04_k8-1.17_weave-2.6.0_rev2/cluster-upgrade ]# ls
docker-upgrade.sh  master-cni-apply.sh  master-k8s-upgrade.sh  vcd_cli_error.log  vcd.log  vcd_sdk.log  worker-k8s-upgrade.sh
```

2) When running the same master-k8s-upgrade.sh on the master node manually, it is able to upgrade to v1.17.9 successfully. See below.

```
root@mstr-ghx3:/tmp# sh master-k8s-upgrade.sh
upgrading packages to: kubeadm=1.17.9-00
Hit:1 https://download.docker.com/linux/ubuntu xenial InRelease
Hit:2 http://archive.ubuntu.com/ubuntu xenial InRelease
Hit:3 http://archive.ubuntu.com/ubuntu xenial-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu xenial-backports InRelease
Get:6 http://security.ubuntu.com/ubuntu xenial-security InRelease [109 kB]
Hit:4 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
Fetched 109 kB in 31s (3,481 B/s)
Reading package lists...
.
.
.
[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.17.9". Enjoy!
```

Hence, could you please explain, from the script point of view, why the CSE client is not able to run the upgrade script when it is triggered via "vcd cse cluster upgrade"? You can send the list of scripts that you think would help us understand the errors/issue. Any logs you need, we are happy to provide.

Thanks,