vmware / container-service-extension

Container Service for VMware vCloud Director
https://vmware.github.io/container-service-extension

K8 Cluster is not getting created #1394

Open SeeGill opened 2 years ago

SeeGill commented 2 years ago

Describe the bug

Hello, we recently upgraded CSE from 3.1.0 to 3.1.2. The upgrade completed successfully; however, since the upgrade we are unable to create new clusters or resize existing ones.

For new native clusters: the VCD task keeps running for hours and we have to cancel it manually. In the meantime, no task is started in vCenter either. The task is stuck as follows:

(python) root@CSE02 [ ~/.cse-logs ]# vcd task wait f552a61e-2bd8-408a-84b6-b0ae5df6a0e6
createDefinedEntity: Creating nativeClusterEntityType k8cluster02(urn:vcloud:entity:cse:nativeCluster:aa0a04f1-895e-4ad2-b666-50fa2399db5e), status: running, progress: 1% |
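The create task sat at 1% for hours before it was cancelled by hand. If you want to flag that condition automatically instead of watching `vcd task wait`, here is a minimal sketch, assuming you record (elapsed seconds, progress percent) samples yourself by some means not shown here. `ProgressSample` and `is_stalled` are hypothetical names, not part of vcd-cli or CSE.

```python
from dataclasses import dataclass


@dataclass
class ProgressSample:
    """One observation of a VCD task (hypothetical helper type)."""
    elapsed_seconds: float
    progress_percent: int


def is_stalled(samples, max_idle_seconds):
    """Return True if progress has not increased for at least max_idle_seconds,
    measured from the last increase to the most recent sample."""
    if not samples:
        return False
    last_change = samples[0].elapsed_seconds
    best = samples[0].progress_percent
    for sample in samples[1:]:
        # Only a strict increase counts as forward progress.
        if sample.progress_percent > best:
            best = sample.progress_percent
            last_change = sample.elapsed_seconds
    return samples[-1].elapsed_seconds - last_change >= max_idle_seconds
```

With samples taken every few minutes, a task that stays at 1% for two hours would be reported as stalled for any threshold up to two hours, while one that moves from 1% to 40% would not.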

We found the following log entries for the task ID:

(python) root@CSE02 [ ~/.cse-logs ]# more cse-server-debug.log | grep f552a61e-2bd8-408a-84b6-b0ae5df6a0e6

22-09-02 08:40:35 | request_dispatcher:1032 - process_request | Request Id: 77b31207-1613-4848-9878-7b0b1cdaf8df | DEBUG :: Outgoing response: {'status_code': 202, 'body': {'name': 'k8cluster02', 'entity': {'metadata': {'name': 'k8cluster02', 'orgName': 'ORGN02', 'virtualDataCenterName': 'ORGVDCN02', 'site': 'VCDCELL02.lab65.local'}, 'spec': {'settings': {'ovdcNetwork': 'CSE-STDN2', 'sshKey': None, 'rollbackOnFailure': True, 'network': {'cni': None, 'pods': None, 'services': None, 'expose': False}}, 'topology': {'controlPlane': {'sizingClass': 'VSP-CSE-VSP2', 'storageProfile': 'SP-BRONZE2', 'count': 1, 'cpu': None, 'memory': None}, 'workers': {'sizingClass': 'VSP-CSE-VSP2', 'storageProfile': 'SP-BRONZE2', 'count': 1, 'cpu': None, 'memory': None}, 'nfs': {'sizingClass': None, 'storageProfile': None, 'count': 0}}, 'distribution': {'templateName': 'ubuntu-16.04_k8-1.21_weave-2.8.1', 'templateRevision': 1}}, 'apiVersion': 'cse.vmware.com/v2.0', 'status': {'phase': None, 'cni': None, 'taskHref': 'https://VCDCELL02.lab65.local/api/task/f552a61e-2bd8-408a-84b6-b0ae5df6a0e6', 'kubernetes': None, 'dockerVersion': None, 'os': None, 'externalIp': None, 'nodes': None, 'uid': None, 'cloudProperties': {'site': None, 'orgName': None, 'virtualDataCenterName': None, 'ovdcNetworkName': None, 'distribution': {'templateName': '', 'templateRevision': 0}, 'sshKey': None, 'rollbackOnFailure': True, 'exposed': False}, 'persistentVolumes': None, 'virtual_IPs': None, 'private': None}, 'kind': 'native'}, 'id': 'urn:vcloud:entity:cse:nativeCluster:aa0a04f1-895e-4ad2-b666-50fa2399db5e', 'entityType': 'urn:vcloud:type:cse:nativeCluster:2.0.0', 'externalId': None, 'state': 'PRE_CREATED', 'owner': {'name': 'cseadmin', 'id': 'urn:vcloud:user:4048e1e2-2187-4055-a143-3f68ba49cb9c'}, 'org': {'name': 'ORGN02', 'id': 'urn:vcloud:org:308d7fe3-5ccb-4b7f-b597-faaaec479a24'}}}

22-09-02 08:40:35 | mqtt_consumer:85 - process_mqtt_message | Request Id: 77b31207-1613-4848-9878-7b0b1cdaf8df | DEBUG :: MQTT response: {'type': 'API_RESPONSE', 'headers': {'requestId': '77b31207-1613-4848-9878-7b0b1cdaf8df'}, 'httpResponse': {'statusCode': 202, 'headers': {'Content-Type': 'application/json', 'Content-Length': 1694, 'Location': '/api/task/f552a61e-2bd8-408a-84b6-b0ae5df6a0e6'}, 'body': 'eyJuYW1lIjogIms4Y2x1c3RlcjAyIiwgImVudGl0eSI6IHsibWV0YWRhdGEiOiB7Im5hbWUiOiAiazhjbHVzdGVyMDIiLCAib3JnTmFtZSI6ICJPUkdOMDIiLCAidmlydHVhbERhdGFDZW50ZXJOYW1lIjogIk9SR1ZEQ04wMiIsICJzaXRlIjogIlZDRENFTEwwMi5sYWI2NS5sb2NhbCJ9LCAic3BlYyI6IHsic2V0dGluZ3MiOiB7Im92ZGNOZXR3b3JrIjogIkNTRS1TVEROMiIsICJzc2hLZXkiOiBudWxsLCAicm9sbGJhY2tPbkZhaWx1cmUiOiB0cnVlLCAibmV0d29yayI6IHsiY25pIjogbnVsbCwgInBvZHMiOiBudWxsLCAic2VydmljZXMiOiBudWxsLCAiZXhwb3NlIjogZmFsc2V9fSwgInRvcG9sb2d5IjogeyJjb250cm9sUGxhbmUiOiB7InNpemluZ0NsYXNzIjogIlZTUC1DU0UtVlNQMiIsICJzdG9yYWdlUHJvZmlsZSI6ICJTUC1CUk9OWkUyIiwgImNvdW50IjogMSwgImNwdSI6IG51bGwsICJtZW1vcnkiOiBudWxsfSwgIndvcmtlcnMiOiB7InNpemluZ0NsYXNzIjogIlZTUC1DU0UtVlNQMiIsICJzdG9yYWdlUHJvZmlsZSI6ICJTUC1CUk9OWkUyIiwgImNvdW50IjogMSwgImNwdSI6IG51bGwsICJtZW1vcnkiOiBudWxsfSwgIm5mcyI6IHsic2l6aW5nQ2xhc3MiOiBudWxsLCAic3RvcmFnZVByb2ZpbGUiOiBudWxsLCAiY291bnQiOiAwfX0sICJkaXN0cmlidXRpb24iOiB7InRlbXBsYXRlTmFtZSI6ICJ1YnVudHUtMTYuMDRfazgtMS4yMV93ZWF2ZS0yLjguMSIsICJ0ZW1wbGF0ZVJldmlzaW9uIjogMX19LCAiYXBpVmVyc2lvbiI6ICJjc2Uudm13YXJlLmNvbS92Mi4wIiwgInN0YXR1cyI6IHsicGhhc2UiOiBudWxsLCAiY25pIjogbnVsbCwgInRhc2tIcmVmIjogImh0dHBzOi8vVkNEQ0VMTDAyLmxhYjY1LmxvY2FsL2FwaS90YXNrL2Y1NTJhNjFlLTJiZDgtNDA4YS04NGI2LWIwYWU1ZGY2YTBlNiIsICJrdWJlcm5ldGVzIjogbnVsbCwgImRvY2tlclZlcnNpb24iOiBudWxsLCAib3MiOiBudWxsLCAiZXh0ZXJuYWxJcCI6IG51bGwsICJub2RlcyI6IG51bGwsICJ1aWQiOiBudWxsLCAiY2xvdWRQcm9wZXJ0aWVzIjogeyJzaXRlIjogbnVsbCwgIm9yZ05hbWUiOiBudWxsLCAidmlydHVhbERhdGFDZW50ZXJOYW1lIjogbnVsbCwgIm92ZGNOZXR3b3JrTmFtZSI6IG51bGwsICJkaXN0cmlidXRpb24iOiB7InRlbXBsYXRlTmFtZSI6ICIiLCAidGVtcGxhdGVSZXZpc2lvbiI6IDB9LCAic3NoS2V5IjogbnVsbCwgInJvbGxiYWNrT25GYWlsdXJlIjogdHJ1ZSwgImV4cG9zZWQiOiBmYWxzZX0sICJwZXJzaXN0ZW50Vm9sdW1lcyI6IG51bGwsICJ2aXJ0dWFsX0lQcyI6IG51bGwsICJwcml2YXRlIjogbnVsbH0sICJraW5kIjogIm5hdGl2ZSJ9LCAiaWQiOiAidXJuOnZjbG91ZDplbnRpdHk6Y3NlOm5hdGl2ZUNsdXN0ZXI6YWEwYTA0ZjEtODk1ZS00YWQyLWI2NjYtNTBmYTIzOTlkYjVlIiwgImVudGl0eVR5cGUiOiAidXJuOnZjbG91ZDp0eXBlOmNzZTpuYXRpdmVDbHVzdGVyOjIuMC4wIiwgImV4dGVybmFsSWQiOiBudWxsLCAic3RhdGUiOiAiUFJFX0NSRUFURUQiLCAib3duZXIiOiB7Im5hbWUiOiAiY3NlYWRtaW4iLCAiaWQiOiAidXJuOnZjbG91ZDp1c2VyOjQwNDhlMWUyLTIxODctNDA1NS1hMTQzLTNmNjhiYTQ5Y2I5YyJ9LCAib3JnIjogeyJuYW1lIjogIk9SR04wMiIsICJpZCI6ICJ1cm46dmNsb3VkOm9yZzozMDhkN2ZlMy01Y2NiLTRiN2YtYjU5Ny1mYWFhZWM0NzlhMjQifX0='}}
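The MQTT response carries its HTTP body base64-encoded. Decoding its start gives `{"name": "k8cluster02", "entity": ...`, so it appears to be the same body that `request_dispatcher` logged as the outgoing response, with the entity still in `PRE_CREATED` state. A minimal Python sketch for splitting such a log line into fields and decoding that body, assuming the ` | `-separated layout seen in this excerpt holds for other lines of `cse-server-debug.log`. `parse_cse_log_line` and `decode_mqtt_body` are hypothetical helpers, not part of CSE.

```python
import ast
import base64
import json


def parse_cse_log_line(line):
    """Split 'timestamp | location | Request Id: <id> | LEVEL :: message'.

    The layout is inferred from the log excerpt in this issue (assumption)."""
    timestamp, location, request, rest = [p.strip() for p in line.split(" | ", 3)]
    level, message = rest.split(" :: ", 1)
    prefix = "Request Id: "
    request_id = request[len(prefix):] if request.startswith(prefix) else request
    return {
        "timestamp": timestamp,
        "location": location,
        "request_id": request_id,
        "level": level.strip(),
        "message": message,
    }


def decode_mqtt_body(message):
    """Return the JSON body carried inside an 'MQTT response: {...}' message."""
    prefix = "MQTT response: "
    if not message.startswith(prefix):
        raise ValueError("not an MQTT response message")
    # The logged payload is a Python dict repr (single quotes, None, True),
    # so it is parsed with ast.literal_eval rather than json.loads.
    payload = ast.literal_eval(message[len(prefix):])
    body_bytes = base64.b64decode(payload["httpResponse"]["body"])
    # The decoded body itself is JSON (null, true), so json.loads applies here.
    return json.loads(body_bytes)
```

To apply it to the real log, read `cse-server-debug.log` line by line, keep the lines containing the task ID, and pass the messages that start with `MQTT response:` to `decode_mqtt_body`.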

For updating an existing native cluster: the following error appears in the provider portal:

Cluster resize request failed. Please contact your provider if this problem persists. (Error: Http failure response for https://vcdcell02.lab65.local/api/cse/3.0/cluster/urn:vcloud:entity:cse:nativeCluster:7dc1ea12-b0d3-450c-a329-320c5887f9ee: 400 Bad Request)

Reproduction steps

1. Create a cluster.

Expected behavior

The cluster should be created.

Additional context

No response

lzichong commented 2 years ago

Hi @SeeGill, could you tell us which version of VCD you are using and how you performed the upgrade from CSE 3.1.0 to CSE 3.1.2?

SeeGill commented 2 years ago

Hi @lzichong, the VCD version is 10.3.3. I have attached a document with some screenshots showing the upgrade for reference: github issue refer.docx. Let me know if you need more details.

lzichong commented 2 years ago

Hi @SeeGill, the upgrade workflow looks normal, and the cluster k8u1 also appears to have been upgraded successfully. Could you also open an SR with GSS for this, if one does not already exist, and upload all the logs from ~/.cse-logs, such as cse-server-debug.log and cse-client-debug.log? We will need those additional logs to investigate what exactly is causing the 400 Bad Request on resize. Also, could you check, as a provider, whether the rights bundle 'cse:nativeCluster Entitlement' was published to the correct org (ORGN02) after the upgrade? Thanks!
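Collecting everything under ~/.cse-logs into one archive makes it easier to attach to the SR. A minimal Python sketch using the standard `tarfile` module, assuming every file in that directory (including rotated logs) should be included; `bundle_cse_logs` and the default archive name `cse-logs.tar.gz` are hypothetical choices, not a CSE or GSS requirement.

```python
import tarfile
from pathlib import Path


def bundle_cse_logs(log_dir="~/.cse-logs", archive_path="cse-logs.tar.gz"):
    """Pack every file under log_dir into a gzip-compressed tarball and return
    the archived names, relative to log_dir (hypothetical helper)."""
    log_dir = Path(log_dir).expanduser()
    archive_path = Path(archive_path).expanduser().resolve()
    # Collect the file list before creating the archive so that an archive
    # written inside log_dir never ends up containing itself.
    files = sorted(
        p for p in log_dir.rglob("*")
        if p.is_file() and p.resolve() != archive_path
    )
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in files:
            archive.add(str(path), arcname=str(path.relative_to(log_dir)))
    return [str(p.relative_to(log_dir)) for p in files]
```

Calling `bundle_cse_logs()` on the CSE server would write `cse-logs.tar.gz` in the current directory, with paths such as `cse-server-debug.log` stored relative to the log directory.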

SeeGill commented 2 years ago

SR 22360988009 has been raised. Yes, the logs have been uploaded, and the bundle is published.

lzichong commented 2 years ago

Thanks @SeeGill, please have the GSS engineer file the SR in Bugzilla, as we cannot see the logs until they have done so.