vmware-tanzu / community-edition

VMware Tanzu Community Edition is no longer an actively maintained project. Code is available for historical purposes only.
https://tanzucommunityedition.io/
Apache License 2.0

Fix AWS E2E test pipeline failures #2699

karuppiah7890 closed this issue 2 years ago

karuppiah7890 commented 2 years ago

Recently TCE's AWS E2E test pipelines have been failing. I initially thought it might just be a flaky issue, but it looks like it's not.

Interestingly, the failure was initially only seen in recent AWS management + workload cluster E2E test runs -

https://github.com/vmware-tanzu/community-edition/actions/runs/1552245009

https://github.com/vmware-tanzu/community-edition/actions/runs/1550338695

https://github.com/vmware-tanzu/community-edition/actions/runs/1539924396

The crux of the problem is the quota for Elastic IPs (public IPs) needed for NAT Gateways (an AWS resource). This can be seen from an error message that shows up a lot in the diagnostics data (zip) from the failed pipelines -

failed to create one or more IP addresses for NAT gateways: failed to allocate Elastic IP: AddressLimitExceeded: The maximum number of addresses has been reached.\n\tstatus code: 400, request id: b9fa4619-3f8b-475d-a546-923075f6fb6a

which shows up in the pipeline logs as something like

Error: unable to wait for cluster and get the cluster kubeconfig: error waiting for cluster to be provisioned (this may take a few minutes): cluster creation failed, reason:'NatGatewaysReconciliationFailed', message:'3 of 8 completed'

Note the NatGatewaysReconciliationFailed error.

Getting more into the problem: by default, only 5 Elastic IPs are allowed per region - https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-limits.html#vpc-limits-eips . This is a quota, though, so it's adjustable and we can request more.

My guess at what went wrong: many pipelines ran in parallel - AWS standalone cluster pipelines and AWS management + workload cluster pipelines. Each cluster (standalone / management / workload) needs, I think, one NAT Gateway, which in turn needs one Elastic IP. So the parallelism across pipelines is effectively restricted by the Elastic IP quota above.

Interestingly, I noticed 3 Elastic IPs lying around in the AWS account we use for E2E test pipelines; I think they were not cleaned up. I was wondering how the others got cleaned up - something to check. The aws nuke config does not mention cleaning up Elastic IPs, and when I deleted a NAT Gateway (after associating it with an Elastic IP I created), only the NAT Gateway was deleted, not the Elastic IP. So, somehow, some Elastic IPs are left behind in the account.
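If we want leftover Elastic IPs to be released automatically, the aws nuke config could target the Elastic IP resource type. A minimal, hypothetical fragment - "EC2Address" is aws-nuke's resource type for Elastic IPs as far as I know, but the name and the surrounding structure should be verified against our actual aws-nuke version and config:

```yaml
# Hypothetical aws-nuke config fragment (not copied from our repo).
# "EC2Address" and this layout are assumptions to verify.
regions:
  - us-east-2
resource-types:
  targets:
    - EC2Address  # release leftover Elastic IPs during cleanup
```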

And all our E2E test pipelines use the same AWS region - us-east-2 - so the total number of Elastic IPs we can use across the pipelines is only 5 (as per the current quota) -

https://github.com/vmware-tanzu/community-edition/blob/73783c978aae5605e2e642697430c6befa9cbbda/test/aws/cluster-config.yaml#L5

https://github.com/vmware-tanzu/community-edition/blob/73783c978aae5605e2e642697430c6befa9cbbda/test/aws/cluster-config.yaml#L2

Given that 3 Elastic IPs are already lying around, I think the AWS pipelines fail when they try to allocate more IPs while creating NAT Gateways. When a commit is pushed, I think two AWS pipelines run for it - standalone cluster, and management + workload cluster. For example, for this commit -

https://github.com/vmware-tanzu/community-edition/commit/b5d53c19a3c8f67cabd8c078a640176cc792536e

the two AWS pipelines are -

standalone cluster - https://github.com/vmware-tanzu/community-edition/actions/runs/1552245007

management + workload cluster - https://github.com/vmware-tanzu/community-edition/actions/runs/1552245009

This is according to the AWS E2E GitHub workflows -

https://github.com/vmware-tanzu/community-edition/blob/73783c978aae5605e2e642697430c6befa9cbbda/.github/workflows/e2e-aws-standalone-cluster.yaml#L3-L6

https://github.com/vmware-tanzu/community-edition/blob/73783c978aae5605e2e642697430c6befa9cbbda/.github/workflows/e2e-aws-management-and-workload-cluster.yaml#L3-L6

The standalone cluster E2E test usually seems to run fast and gets its NAT Gateway for the cluster, with one Elastic (Public) IP.

The management + workload cluster E2E test also starts fast and gets the NAT Gateway for the management cluster, with one Elastic (Public) IP.

Now, with the 3 existing IPs lying around and the 2 new IPs (from the above tests), a total of 5 IPs have been created and 2 are in use.

Then, in the management + workload cluster E2E test, workload cluster creation fails: it cannot create a NAT Gateway, which needs an Elastic IP (public IP), and the quota has been reached. I think this has been happening for quite some time now. For the recent commits below, either the standalone cluster E2E test or the management + workload cluster E2E test fails with the NAT Gateway issue, depending on which one grabs first the 2 Elastic IPs still available within the quota - and if one uses them and cleans them up quickly enough, the other can pass -

https://github.com/vmware-tanzu/community-edition/actions/runs/1526817483 - fail , https://github.com/vmware-tanzu/community-edition/actions/runs/1526817480 - pass

https://github.com/vmware-tanzu/community-edition/actions/runs/1532904134 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1532904138 - pass

https://github.com/vmware-tanzu/community-edition/actions/runs/1533964631 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1533964621 - pass

https://github.com/vmware-tanzu/community-edition/actions/runs/1535806421 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1535806430 - pass

https://github.com/vmware-tanzu/community-edition/actions/runs/1535902764 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1535902766 - pass

https://github.com/vmware-tanzu/community-edition/actions/runs/1539924396 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1539924399 - pass

https://github.com/vmware-tanzu/community-edition/actions/runs/1550338695 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1550338699 - pass

https://github.com/vmware-tanzu/community-edition/actions/runs/1552245009 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1552245007 - pass
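The exhaustion sequence described above - 3 leaked IPs, then standalone and management clusters each taking one, then the workload cluster failing - can be sketched as a toy model (not real AWS calls; the quota of 5, the 3 leaked IPs, and one EIP per NAT Gateway are the numbers from this issue):

```shell
# Toy model of the Elastic IP quota exhaustion - not real AWS calls.
quota=5      # default per-region Elastic IP quota
allocated=3  # leaked Elastic IPs already sitting in the account

allocate_eip() {
  if [ "$allocated" -ge "$quota" ]; then
    echo "AddressLimitExceeded: The maximum number of addresses has been reached."
    return 1
  fi
  allocated=$((allocated + 1))
}

allocate_eip  # standalone cluster NAT Gateway -> 4 allocated
allocate_eip  # management cluster NAT Gateway -> 5 allocated
allocate_eip || echo "workload cluster NAT Gateway creation fails at the quota"
```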

karuppiah7890 commented 2 years ago

For now I have deleted the 3 Elastic IPs that were lying around, and the last 4 AWS E2E test pipeline runs succeeded, as the quota of 5 Elastic IPs was enough for them - at each point, 2 runs were in parallel and required 3 Elastic IPs in total, I think -

https://github.com/vmware-tanzu/community-edition/actions/runs/1559732466

https://github.com/vmware-tanzu/community-edition/actions/runs/1560829014

https://github.com/vmware-tanzu/community-edition/actions/runs/1559732476

https://github.com/vmware-tanzu/community-edition/actions/runs/1560829015

I have also requested an increase in the quota from 5 to 15 Elastic IPs for the us-east-2 region. I'll monitor the pipelines for some time and then close this issue.

karuppiah7890 commented 2 years ago

We need to request more quota here for us-east-2, which I have done now - https://us-east-2.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-0263D0A3 . Previously I had requested more quota at https://console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-0263D0A3 , which targets the us-east-1 region - not the region our pipelines use. In that link, the UI shows "Select a Region" at the top, but a default region of us-east-1 is already chosen for all quota requests, and the URL stays https://console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-0263D0A3 even when us-east-1 is explicitly chosen.
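One way to avoid the console's default-region pitfall is the AWS CLI, where the target region is explicit. A dry-run sketch - the echo just prints the command instead of submitting it (drop the echo to actually submit); the quota code L-0263D0A3 and the desired value of 15 are the ones from this issue:

```shell
# Dry run: echo prints the quota-increase request instead of submitting it.
echo aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-0263D0A3 \
  --desired-value 15 \
  --region us-east-2
```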

karuppiah7890 commented 2 years ago

Closing this, as there have been no further issues because of this.