vmware / container-service-extension

Container Service for VMware vCloud Director
https://vmware.github.io/container-service-extension
Other
77 stars 52 forks source link

TKGm clusters fail to deploy on CSE 3.1.1 without VCS #1283

Closed an0nz closed 2 years ago

an0nz commented 2 years ago

Describe the bug

TKGm cluster deployment fails to create the control plane node and rolls back. Tests have been conducted with a variety of oVDC allocation pool types and different resource allocations from tenant admin and sys admin accounts with the below TKGm templates all failing the same way ubuntu-2004-kube-v1.21.2 ubuntu-2004-kube-v1.20.8 ubuntu-2004-kube-v1.19.12

The test routed network being deployed to has been configured with internet access and a completely open firewall just to eliminate connectivity issues with no luck.

The control plane VM is deployed, receives an IP then hangs on Post guest customization phase guestinfo.postcustomization.kubeinit.status is in_progress and eventually fails with the below error before rolling back.

If anyone can help either find my mistake, resolve the bug, or provide a workaround it would be greatly appreciated.

22-01-13 15:32:33 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: dd4012df-0b01-499b-b41d-f2b317c1fe4d | INFO :: Post guest customization phase guestinfo.postcustomization.networkconfiguration.status is successful 22-01-13 15:32:34 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: dd4012df-0b01-499b-b41d-f2b317c1fe4d | INFO :: Post guest customization phase guestinfo.postcustomization.store.sshkey.status is successful 22-01-13 15:32:35 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: dd4012df-0b01-499b-b41d-f2b317c1fe4d | INFO :: Post guest customization phase guestinfo.postcustomization.kubeinit.status is in_progress .... .... 22-01-13 15:37:02 | pyvcloud_utils:758 - wait_for_completion_of_post_customization_procedure | Request Id: dd4012df-0b01-499b-b41d-f2b317c1fe4d | INFO :: Post guest customization phase guestinfo.postcustomization.kubeinit.status is in_progress 22-01-13 15:37:02 | pyvcloud_utils:767 - wait_for_completion_of_post_customization_procedure | Request Id: dd4012df-0b01-499b-b41d-f2b317c1fe4d | ERROR :: VM Post guest customization script failed with error:Thu 13 Jan 2022 02:13:13 AM UTC 130 /root/.customization/customize.sh: kubeadm init --config $kubeadm_config_path --v=10 > /root/kubeadm-init.out 22-01-13 15:13:26 | cluster_service_2_x_tkgm:2163 - _add_control_plane_nodes | Request Id: ddc300db-36b0-4666-b8c5-900a73234035 | ERROR :: None Traceback (most recent call last): File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x_tkgm.py", line 2151, in _add_control_plane_nodes logger=LOGGER File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/common/utils/pyvcloud_utils.py", line 768, in wait_for_completion_of_post_customization_procedure raise exceptions.ScriptExecutionError container_service_extension.exception.exceptions.ScriptExecutionError: None 22-01-13 15:13:26 | cluster_service_2_x_tkgm:858 - _create_cluster_async | Request Id: ddc300db-36b0-4666-b8c5-900a73234035 | ERROR :: failure on creating nodes ['mstr-u3o4'] Error:None Traceback (most recent call last): File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x_tkgm.py", line 2151, in _add_control_plane_nodes logger=LOGGER File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/common/utils/pyvcloud_utils.py", line 768, in wait_for_completion_of_post_customization_procedure raise exceptions.ScriptExecutionError container_service_extension.exception.exceptions.ScriptExecutionError: None

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x_tkgm.py", line 855, in _create_cluster_async refresh_token=refresh_token File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x_tkgm.py", line 2171, in _add_control_plane_nodes raise exceptions.NodeCreationError(node_list, str(err)) container_service_extension.exception.exceptions.NodeCreationError: failure on creating nodes ['mstr-u3o4'] Error:None 22-01-13 15:13:26 | cluster_service_2_x_tkgm:954 - _create_cluster_async | Request Id: ddc300db-36b0-4666-b8c5-900a73234035 | ERROR :: Error creating cluster 'k8s-tkg-test' Traceback (most recent call last): File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x_tkgm.py", line 2151, in _add_control_plane_nodes logger=LOGGER File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/common/utils/pyvcloud_utils.py", line 768, in wait_for_completion_of_post_customization_procedure raise exceptions.ScriptExecutionError container_service_extension.exception.exceptions.ScriptExecutionError: None

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x_tkgm.py", line 855, in _create_cluster_async refresh_token=refresh_token File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x_tkgm.py", line 2171, in _add_control_plane_nodes raise exceptions.NodeCreationError(node_list, str(err)) container_service_extension.exception.exceptions.NodeCreationError: failure on creating nodes ['mstr-u3o4'] Error:None

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x_tkgm.py", line 860, in _create_cluster_async f"Error adding control plane node: {err}") container_service_extension.exception.exceptions.ControlPlaneNodeCreationError: Error adding control plane node: failure on creating nodes ['mstr-u3o4'] Error:None

Reproduction steps

CSE 3.1.1
vCD 10.3.1
NSX-T 3.1.3.1
NSX-ALB (Avi) 21.1.2

1. Fresh Installation of CSE 3.1.1 using Python 3.7.x venv on Ubuntu 20.04 LTS
2. CSE Configured to use MQTT without VCS config and `no_vc_communication_mode: true`, all others as per sample other than broker specifics
3. TKGm OVA's downloaded from VMware and imported with `cse template import -F`
4. CSE Installed using `cse install -t -k ~/.ssh/id_rsa.pub`, -t is required due to no VCS config
5. Additional rights added to rights bundle for TKGm as per https://vmware.github.io/container-service-extension/cse3_1/RBAC.html
6. CSE:NATIVECLUSTER rights added to the tenant admin role
7. oVDC setup with NSX-T edge gateway and a routed network
8. Go to the CSE UI Plugin and deploy a cluster using TKGm
9. Name: k8s-tkg-test, Template: ubuntu-2004-kube-v1.21.2, Control Plane 4vCPU 8GB RAM, Worker 2vCPU 4GB RAM, 192.168.1.1/24 routed network with allow external traffic to be routed to this cluster, k8s network left as default

Expected behavior

Cluster deploys or returns an error detailing what went wrong, in this case Error:None is returned.

Additional context

This also seems to cause some further errors in vCD as after the rollback completes the below errors appear along with the vCD cell throwing an email alert stating VMware Cloud Director cell with IP address x.x.x.x is successful in reconnecting to the database . The error is longer than below but is basically a loop of the same.

22-01-13 15:37:32 | cluster_service_2_x_tkgm:1714 - _update_task | Request Id: dd4012df-0b01-499b-b41d-f2b317c1fe4d | DEBUG :: Sending behavior response:{'type': 'BEHAVIOR_RESPONSE', 'headers': {'taskId': 'b0155be0-a2eb-4227-8eac-4b56410efe7e', 'entityId': 'urn:vcloud:entity:cse:nativeCluster:20e9c5d2-060d-49ab-a412-a06f7f442b98', 'contentType': 'application/vnd.vmware.vcloud.task+json'}, 'payload': '{"status": "error", "error": {"majorErrorCode": "500", "minorErrorCode": "Error creating cluster \'k8s-tkg-test\'. Deleting cluster (rollback=True)", "message": "Error adding control plane node: failure on creating nodes [\'mstr-kqss\']\nError:None"}}'} 22-01-13 15:37:32 | mqtt_publisher:116 - send_response | Request Id: dd4012df-0b01-499b-b41d-f2b317c1fe4d | DEBUG :: publish return (rc, msg_id): (0, 227) 22-01-13 15:37:33 | mqtt_consumer:117 - on_disconnect | INFO :: MQTT disconnect with reason: 1 22-01-13 15:37:33 | operation_context:134 - end | Request Id: dd4012df-0b01-499b-b41d-f2b317c1fe4d | DEBUG :: Failed to logout user: svc_cseat api_version: Default api version. Traceback (most recent call last): File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/security/context/operation_context.py", line 128, in end sysadmin_user_context.client.logout() File "/opt/cse/python/lib/python3.7/site-packages/pyvcloud/vcd/client.py", line 1140, in logout result = self._do_request('DELETE', uri) File "/opt/cse/python/lib/python3.7/site-packages/pyvcloud/vcd/client.py", line 1260, in _do_request _objectify_response(response, objectify_results)) File "/opt/cse/python/lib/python3.7/site-packages/pyvcloud/vcd/client.py", line 725, in _objectify_response return objectify.fromstring(response.content) File "src/lxml/objectify.pyx", line 1998, in lxml.objectify.fromstring File "src/lxml/etree.pyx", line 3252, in lxml.etree.fromstring File "src/lxml/parser.pxi", line 1912, in lxml.etree._parseMemoryDocument File "src/lxml/parser.pxi", line 1800, in lxml.etree._parseDoc File "src/lxml/parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError File "", line 11 lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: meta line 9 and head, line 11, column 8 22-01-13 15:37:34 | mqtt_consumer:100 - on_connect | INFO :: MQTT client connected with result code 0 and flags {'session present': 1} 22-01-13 15:37:34 | mqtt_consumer:114 - on_subscribe | INFO :: MQTT client subscribed with given_qos: (2,) 22-01-13 15:38:58 | mqtt_consumer:50 - process_behavior_message | Request Id: f6d9e84d-20c6-44a2-ab30-b363d8f020ae | DEBUG :: Received behavior invocation: urn:vcloud:behavior-interface:createCluster:cse:k8s:1.0.0 on entityId:urn:vcloud:entity:cse:nativeCluster:bafa37d7-a0e8-49af-b8db-7faa57635c06 with requestId: f846c470-3f22-49a2-a143-364d9045b88d 22-01-13 15:39:00 | cluster_service_2_x_tkgm:1714 - _update_task | Request Id: f6d9e84d-20c6-44a2-ab30-b363d8f020ae | DEBUG :: Sending behavior response:{'type': 'BEHAVIOR_RESPONSE', 'headers': {'taskId': '72bb7f6e-cad3-4e24-9b09-e32588a7219d', 'entityId': 'urn:vcloud:entity:cse:nativeCluster:bafa37d7-a0e8-49af-b8db-7faa57635c06', 'contentType': 'application/vnd.vmware.vcloud.task+json'}, 'payload': '{"status": "running", "operation": "Creating cluster \'k8s-tkg-test\' from template \'ubuntu-2004-kube-v1.19.12+vmware.1-tkg.2-17499219411874486722\' (revision 1)"}'} 22-01-13 15:39:00 | mqtt_publisher:116 - send_response | Request Id: f6d9e84d-20c6-44a2-ab30-b363d8f020ae | DEBUG :: publish return (rc, msg_id): (0, 229) 22-01-13 15:39:00 | entity_service:56 - exception_handler_wrapper | Request Id: f6d9e84d-20c6-44a2-ab30-b363d8f020ae | ERROR :: [ 1965f36f-1a1f-4686-be55-36c91eb7bb18 ] This operation is denied. Traceback (most recent call last): File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/common/entity_service.py", line 52, in exception_handler_wrapper result = func(*args, *kwargs) File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/common/entity_service.py", line 385, in get_entity resource_url_relative_path=f"{CloudApiResource.ENTITIES}/" File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/lib/cloudapi/cloudapi_client.py", line 139, in do_request response.raise_for_status() File "/opt/cse/python/lib/python3.7/site-packages/requests/models.py", line 960, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://vcd.mydomain.com/cloudapi/1.0.0/entities/urn:vcloud:entity:cse:nativeCluster:bafa37d7-a0e8-49af-b8db-7faa57635c06 22-01-13 15:39:00 | entity_service:61 - exception_handler_wrapper | Request Id: f6d9e84d-20c6-44a2-ab30-b363d8f020ae | ERROR :: [ 1965f36f-1a1f-4686-be55-36c91eb7bb18 ] This operation is denied. 22-01-13 15:39:00 | cluster_service_2_x_tkgm:289 - create_cluster | Request Id: f6d9e84d-20c6-44a2-ab30-b363d8f020ae | ERROR :: Error updating the cluster 'k8s-tkg-test' with the status 22-01-13 15:39:00 | cluster_service_2_x_tkgm:300 - create_cluster | Request Id: f6d9e84d-20c6-44a2-ab30-b363d8f020ae | ERROR :: Failed to create cluster k8s-tkg-test in org NOCTest-PMD and VDC NOC-AKLD-PMD Traceback (most recent call last): File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/common/entity_service.py", line 52, in exception_handler_wrapper result = func(args, **kwargs) File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/common/entity_service.py", line 385, in get_entity resource_url_relative_path=f"{CloudApiResource.ENTITIES}/" File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/lib/cloudapi/cloudapi_client.py", line 139, in do_request response.raise_for_status() File "/opt/cse/python/lib/python3.7/site-packages/requests/models.py", line 960, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://vcd.mydomain.com/cloudapi/1.0.0/entities/urn:vcloud:entity:cse:nativeCluster:bafa37d7-a0e8-49af-b8db-7faa57635c06

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x_tkgm.py", line 286, in create_cluster curr_rde = self._update_cluster_entity(entity_id, changes=changes) # noqa: E501 File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x_tkgm.py", line 1693, in _update_cluster_entity cluster_id, invoke_hooks=False, changes=changes File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/common/entity_service.py", line 62, in exception_handler_wrapper raise error File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/common/entity_service.py", line 52, in exception_handler_wrapper result = func(*args, **kwargs) File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/common/entity_service.py", line 325, in update_entity entity, etag = self._form_updated_entity(entity_id, changes) File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/common/entity_service.py", line 262, in _form_updated_entity entity = self.get_entity(entity_id) File "/opt/cse/python/lib/python3.7/site-packages/container_service_extension/rde/common/entity_service.py", line 59, in exception_handler_wrapper minor_error_code=MinorErrorCode.DEFAULT_ERROR_CODE) container_service_extension.exception.exceptions.DefEntityServiceError: [ 1965f36f-1a1f-4686-be55-36c91eb7bb18 ] This operation is denied. 22-01-13 15:39:00 | entity_service:56 - exception_handler_wrapper | Request Id: f6d9e84d-20c6-44a2-ab30-b363d8f020ae | ERROR :: [ 5422347b-124e-4698-b3bc-62b19be77684 ] This operation is denied.

an0nz commented 2 years ago

For anyone else stuck with ERROR :: None

After some more digging this one was all on me, there was no external internet access from the control plane so it was failing.

When I tested internet access it was from another VM within the same subnet but the SNAT rule on the Edge Gateway did not include the IP range being assigned to the k8s cluster and I had not spotted the smaller mask when checking.