karuppiah7890 closed 2 years ago
For now I have deleted the 3 elastic IPs that were lying around, and the last 4 AWS E2E test pipelines ran successfully, as the quota of 5 elastic IPs was enough to run them. At each point, 2 pipelines were running in parallel and required 3 elastic IPs, I think
https://github.com/vmware-tanzu/community-edition/actions/runs/1559732466 ✅
https://github.com/vmware-tanzu/community-edition/actions/runs/1560829014 ✅
https://github.com/vmware-tanzu/community-edition/actions/runs/1559732476 ✅
https://github.com/vmware-tanzu/community-edition/actions/runs/1560829015 ✅
I have also requested a quota increase from 5 to 15 elastic IPs for the us-east-2 region. I'll monitor the pipelines for some time and then close this issue
We need to request more quota for us-east-2 here - https://us-east-2.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-0263D0A3 - which I have done now. Previously I had requested more quota here - https://console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-0263D0A3 - which requests it for the us-east-1 region, which is not what our pipelines use. In that link, the UI shows Select a Region at the top, but a default region of us-east-1 is already chosen for all the quota requests, and the URL remains https://console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-0263D0A3 even when us-east-1 is explicitly chosen
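To avoid landing on the default region by accident, the region-scoped link can be built explicitly. This is just a small sketch that mirrors the two URLs above; the helper name `quota_console_url` is made up for illustration, and the URL pattern is inferred only from the us-east-2 link in this comment, not from any AWS documentation.

```python
def quota_console_url(region: str, service_code: str, quota_code: str) -> str:
    """Build a Service Quotas console URL pinned to one region.

    Hypothetical helper: the pattern is copied from the us-east-2 link
    above, so treat it as an assumption rather than a documented format.
    """
    if not region:
        # Refuse to fall back to the region-less URL, which defaulted to us-east-1.
        raise ValueError("region must be given explicitly, e.g. 'us-east-2'")
    return (
        f"https://{region}.console.aws.amazon.com/servicequotas/home"
        f"/services/{service_code}/quotas/{quota_code}"
    )


# L-0263D0A3 is the quota code that appears in the links above.
print(quota_console_url("us-east-2", "ec2", "L-0263D0A3"))
```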
Closing this as there have been no issues because of this
Recently TCE’s AWS E2E test pipelines have been failing. I thought it might just be a flaky issue, but it looks like it’s not
Interestingly it was also initially only seen in recent AWS management + workload cluster E2E test runs -
https://github.com/vmware-tanzu/community-edition/actions/runs/1552245009
https://github.com/vmware-tanzu/community-edition/actions/runs/1550338695
https://github.com/vmware-tanzu/community-edition/actions/runs/1539924396
The crux of the problem is quota for Elastic IPs (public IPs) needed for NAT Gateways (AWS resource), which can be understood from the error message popping up a lot in the diagnostics data (zip) from the failed pipelines -
which shows up in the pipeline logs as something like
Note the NatGatewaysReconciliationFailed error
Getting more into the problem - apparently only 5 elastic IPs are allowed per region by default - https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-limits.html#vpc-limits-eips . This is in terms of quotas. But it’s adjustable and we can get more quota
My guess on what went wrong is that many pipelines ran in parallel - AWS standalone cluster pipelines and AWS management + workload cluster pipelines. Given a cluster (standalone / management / workload), I think it needs one NAT Gateway, which needs one elastic IP. So I think the elastic IP quota above limits how many pipelines can run in parallel
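The guess above can be written down as simple arithmetic. This is only a sketch under the same assumption (one cluster needs one NAT Gateway, which needs one elastic IP); the pipeline names and the helper functions are made up for illustration and do not come from the TCE repo.

```python
DEFAULT_EIP_QUOTA = 5  # default elastic IPs per region, as per the AWS doc linked above

# Assumption from the guess above: one cluster -> one NAT Gateway -> one elastic IP.
CLUSTERS_PER_PIPELINE = {
    "standalone": 1,               # one standalone cluster
    "management-and-workload": 2,  # one management + one workload cluster
}


def eips_needed(running_pipelines):
    """Total elastic IPs needed if all given pipelines run at the same time."""
    return sum(CLUSTERS_PER_PIPELINE[name] for name in running_pipelines)


def fits_in_quota(running_pipelines, quota=DEFAULT_EIP_QUOTA, already_allocated=0):
    """True if the pipelines can all get their elastic IPs within the quota."""
    return already_allocated + eips_needed(running_pipelines) <= quota


one_commit = ["standalone", "management-and-workload"]
print(eips_needed(one_commit))        # elastic IPs needed for one commit's pipelines
print(fits_in_quota(one_commit))      # one commit at a time
print(fits_in_quota(one_commit * 2))  # two commits' pipelines overlapping
```

Under these assumptions one commit needs 3 elastic IPs and fits in the quota of 5, while pipelines for two commits running at once would need 6 and would not.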
Interestingly, I noticed 3 elastic IPs lying around in the AWS account we use for E2E test pipelines. I think they were not cleaned up. I was wondering how the others got cleaned up, which is something to check. The aws nuke config does not mention cleaning up elastic IPs, and when I deleted a NAT gateway (after associating it with an elastic IP I created), it deleted only the NAT gateway and not the elastic IP. But yeah, somehow some elastic IPs are left behind in the account
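Leftover elastic IPs like these could be spotted by looking for addresses that are not associated with anything. The sketch below is a plain function over a dictionary and does not call AWS. It assumes the input has the shape of EC2's DescribeAddresses output (an Addresses list where an address in use carries an AssociationId field); the sample data and the function name are made up for illustration.

```python
def unassociated_addresses(describe_addresses_response):
    """Return allocation IDs of elastic IPs that are not associated with anything.

    Assumption: the input looks like an EC2 DescribeAddresses response,
    where an in-use address has an "AssociationId" key and an idle one does not.
    """
    return [
        addr["AllocationId"]
        for addr in describe_addresses_response.get("Addresses", [])
        if "AssociationId" not in addr
    ]


# Hypothetical sample data, not taken from the real account.
sample = {
    "Addresses": [
        {"AllocationId": "eipalloc-aaa", "PublicIp": "203.0.113.10"},
        {
            "AllocationId": "eipalloc-bbb",
            "PublicIp": "203.0.113.11",
            "AssociationId": "eipassoc-111",
        },
        {"AllocationId": "eipalloc-ccc", "PublicIp": "203.0.113.12"},
    ]
}

print(unassociated_addresses(sample))
```

A cleanup step could release whatever this returns, but whether that is safe while other pipelines are still creating clusters is something to check first.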
And all our E2E test pipelines use the same AWS region, us-east-2, so the total number of elastic IPs we can use across the pipelines is only 5 (as per the current quota) -
https://github.com/vmware-tanzu/community-edition/blob/73783c978aae5605e2e642697430c6befa9cbbda/test/aws/cluster-config.yaml#L5
https://github.com/vmware-tanzu/community-edition/blob/73783c978aae5605e2e642697430c6befa9cbbda/test/aws/cluster-config.yaml#L2
Given that 3 elastic IPs are already lying around, I think when the AWS pipelines try to get more IPs while creating NAT gateways, they fail. Like, when a commit is pushed, I think two AWS pipelines run for each commit - standalone cluster and management + workload cluster, for example, for this commit -
https://github.com/vmware-tanzu/community-edition/commit/b5d53c19a3c8f67cabd8c078a640176cc792536e
the two AWS pipelines are -
standalone cluster - https://github.com/vmware-tanzu/community-edition/actions/runs/1552245007
management + workload cluster - https://github.com/vmware-tanzu/community-edition/actions/runs/1552245009
This is according to the AWS E2E GitHub workflows -
https://github.com/vmware-tanzu/community-edition/blob/73783c978aae5605e2e642697430c6befa9cbbda/.github/workflows/e2e-aws-standalone-cluster.yaml#L3-L6
https://github.com/vmware-tanzu/community-edition/blob/73783c978aae5605e2e642697430c6befa9cbbda/.github/workflows/e2e-aws-management-and-workload-cluster.yaml#L3-L6
The standalone cluster E2E usually seems to run fast and gets its NAT Gateway for the cluster with one Elastic (Public) IP
The management cluster + workload cluster E2E test runs fast too and also gets NAT Gateway for the management cluster with one Elastic (Public) IP
Now with existing 3 IPs lying around and 2 new IPs (from the above tests), total 5 IPs have been created and 2 are being used
Now in the management cluster + workload cluster E2E test, workload cluster creation happens and it fails, as it cannot create a NAT Gateway, which needs an Elastic IP (public IP), and the quota has been reached. I think this has been happening for quite some time now, at least for the last few runs. For the recent commits, you can see that either the standalone cluster E2E test fails with the NAT Gateway issue, or the management + workload cluster E2E test fails with the NAT Gateway issue, depending on which E2E test first uses up the 2 Elastic IPs still available in the quota. If one test uses them fast and cleans them up fast too, then the other one can use them
https://github.com/vmware-tanzu/community-edition/actions/runs/1526817483 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1526817480 - pass
https://github.com/vmware-tanzu/community-edition/actions/runs/1532904134 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1532904138 - pass
https://github.com/vmware-tanzu/community-edition/actions/runs/1533964631 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1533964621 - pass
https://github.com/vmware-tanzu/community-edition/actions/runs/1535806421 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1535806430 - pass
https://github.com/vmware-tanzu/community-edition/actions/runs/1535902764 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1535902766 - pass
https://github.com/vmware-tanzu/community-edition/actions/runs/1539924396 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1539924399 - pass
https://github.com/vmware-tanzu/community-edition/actions/runs/1550338695 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1550338699 - pass
https://github.com/vmware-tanzu/community-edition/actions/runs/1552245009 - fail, https://github.com/vmware-tanzu/community-edition/actions/runs/1552245007 - pass
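The sequence described above (3 leftover IPs, a quota of 5, and clusters asking for one elastic IP each) can be replayed with a tiny simulation. This is only a model of my understanding, not how Cluster API or AWS actually schedule anything; the function name and the cluster labels are made up for illustration, and releases of elastic IPs are not modelled.

```python
def simulate(quota, leftover, requests):
    """Replay elastic IP requests in order against a fixed quota.

    Each request asks for one elastic IP (one NAT Gateway per cluster, as
    assumed in this issue). Returns a list of (cluster, succeeded) pairs.
    """
    allocated = leftover  # IPs already counted against the quota
    results = []
    for cluster in requests:
        ok = allocated < quota
        if ok:
            allocated += 1
        results.append((cluster, ok))
    return results


# Order described above: standalone and management get their IPs first,
# so the workload cluster hits the quota.
print(simulate(5, 3, ["standalone", "management", "workload"]))

# If the management + workload pipeline gets there first, standalone fails instead.
print(simulate(5, 3, ["management", "workload", "standalone"]))

# With the 3 leftover IPs deleted, all three requests fit.
print(simulate(5, 0, ["standalone", "management", "workload"]))
```

In this model, whichever cluster asks third is the one that fails, which matches the pattern in the runs listed above where one of the two pipelines fails and the other passes.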