Most of the tests in the ECS acceptance test suite are flaky. Investigation showed that most of these failures occur when ECS mesh tasks are unable to reach the Consul server. The root cause is that Terraform deploys the server and the individual mesh tasks in parallel, and it generally takes some time before the server's CloudMap DNS name resolves to a private IP.
This PR makes two changes to fix this issue:
1. Associate the server with an ALB and query /v1/catalog/services to check its readiness.
2. Make the ECS controller and other application task submodules depend on the server module's completion. Client workload tasks are then deployed only after the server's ECS service and task are up and running, so the clients can reach the server as soon as they start.
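For reviewers, the module ordering described above could look roughly like the sketch below. The module names and source paths are hypothetical placeholders, not the ones used in this repo; only the `depends_on` meta-argument is the point.

```hcl
# Hypothetical module names and source paths, for illustration only.
module "consul_server" {
  source = "./modules/server" # assumed path
}

module "ecs_controller" {
  source = "./modules/controller" # assumed path

  # Terraform creates nothing in this module until every resource
  # in the server module has been created.
  depends_on = [module.consul_server]
}

module "mesh_task" {
  source = "./modules/mesh-task" # assumed path

  depends_on = [module.consul_server]
}
```

Module-level `depends_on` needs Terraform 0.13 or later. It orders resource creation only, which is why it is paired with a readiness check on the server.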
To attach the LB to the Consul server task, we need the public_subnet details of the VPC, so this PR also updates the individual tests' Terraform configs to consume them.
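As a rough sketch of how an ALB in the public subnets can gate on /v1/catalog/services, the fragment below uses a target group health check. This is an assumption about one possible wiring, not necessarily how this PR implements it; the resource names, variable names, and thresholds are illustrative, and port 8500 is Consul's default HTTP port.

```hcl
# Illustrative inputs; the real configs may name these differently.
variable "vpc_id" { type = string }
variable "public_subnets" { type = list(string) }

resource "aws_lb" "consul_server" {
  name               = "consul-server"
  internal           = false
  load_balancer_type = "application"
  subnets            = var.public_subnets
}

resource "aws_lb_target_group" "consul_server" {
  name        = "consul-server"
  port        = 8500
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip" # required for tasks using awsvpc networking

  # The server counts as ready once the catalog endpoint answers 200.
  health_check {
    path                = "/v1/catalog/services"
    protocol            = "HTTP"
    matcher             = "200"
    interval            = 10
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}

resource "aws_lb_listener" "consul_server" {
  load_balancer_arn = aws_lb.consul_server.arn
  port              = 8500
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.consul_server.arn
  }
}
```

Security groups are omitted here for brevity; a real config would restrict who can reach the listener.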
Changes proposed in this PR:
How I've tested this PR:
Manual
How I expect reviewers to test this PR:
Checklist: